Skip to content
Snippets Groups Projects
  • Christian Brauner's avatar
    0a0c4585
    binfmt_misc: cleanup on filesystem umount · 0a0c4585
    Christian Brauner authored
    [ Upstream commit 1c5976ef ]
    
    Currently, registering a new binary type pins the binfmt_misc
    filesystem. Specifically, this means that as long as there is at least
    one binary type registered the binfmt_misc filesystem survives all
    umounts, i.e. the superblock is not destroyed. Meaning that a umount
    followed by another mount will end up with the same superblock and the
    same binary type handlers. This is a behavior we tend to discourage for
    any new filesystems (apart from a few special filesystems such as e.g.
    configfs or debugfs). A umount operation without the filesystem being
    pinned - by e.g. someone holding a file descriptor to an open file -
    should usually result in the destruction of the superblock and all
    associated resources. This makes introspection easier and leads to
    clearly defined, simple and clean semantics. An administrator can rely
    on the fact that a umount will guarantee a clean slate making it
    possible to reinitialize a filesystem. Right now all binary types would
    need to be explicitly deleted before that can happen.
    
    This allows us to remove the heavy-handed calls to simple_pin_fs() and
    simple_release_fs() when creating and deleting binary types. This in
    turn allows us to replace the current brittle pinning mechanism abusing
    dget() which has caused a range of bugs judging from prior fixes in [2]
    and [3]. The additional dget() in load_misc_binary() pins the dentry but
    only does so for the sake to prevent ->evict_inode() from freeing the
    node when a user removes the binary type and kill_node() is run. Which
    would mean ->interpreter and ->interp_file would be freed causing a UAF.
    
    This isn't really nicely documented nor is it very clean because it
    relies on simple_pin_fs() pinning the filesystem as long as at least one
    binary type exists. Otherwise it would cause load_misc_binary() to hold
    on to a dentry belonging to a superblock that has been shutdown.
    Replace that implicit pinning with a clean and simple per-node refcount
    and get rid of the ugly dget() pinning. A similar mechanism exists for
    e.g. binderfs (cf. [4]). All the cleanup work can now be done in
    ->evict_inode().
    
    In a follow-up patch we will make it possible to use binfmt_misc in
    sandboxes. We will use the cleaner semantics where a umount for the
    filesystem will cause the superblock and all resources to be
    deallocated. In preparation for this apply the same semantics to the
    initial binfmt_misc mount. Note, that this is a user-visible change and
    as such a uapi change but one that we can reasonably risk. We've
    discussed this in earlier versions of this patchset (cf. [1]).
    
    The main user and provider of binfmt_misc is systemd. Systemd provides
    binfmt_misc via autofs since it is configurable as a kernel module and
    is used by a few exotic packages and users. As such a binfmt_misc mount
    is triggered when /proc/sys/fs/binfmt_misc is accessed and is only
    provided on demand. Other autofs on demand filesystems include EFI ESP
    which systemd umounts if the mountpoint stays idle for a certain amount
    of time. This doesn't apply to the binfmt_misc autofs mount which isn't
    touched once it is mounted meaning this change can't accidently wipe
    binary type handlers without someone having explicitly unmounted
    binfmt_misc. After speaking to systemd folks they don't expect this
    change to affect them.
    
    In line with our general policy, if we see a regression for systemd or
    other users with this change we will switch back to the old behavior for
    the initial binfmt_misc mount and have binary types pin the filesystem
    again. But while we touch this code let's take the chance and let's
    improve on the status quo.
    
    [1]: https://lore.kernel.org/r/20191216091220.465626-2-laurent@vivier.eu
    [2]: commit 43a4f261 ("exec: binfmt_misc: fix race between load_misc_binary() and kill_node()"
    [3]: commit 83f91827 ("exec: binfmt_misc: shift filp_close(interp_file) from kill_node() to bm_evict_inode()")
    [4]: commit f0fe2c0f ("binder: prevent UAF for binderfs devices II")
    
    Link: https://lore.kernel.org/r/20211028103114.2849140-1-brauner@kernel.org
    
     (v1)
    Cc: Sargun Dhillon <sargun@sargun.me>
    Cc: Serge Hallyn <serge@hallyn.com>
    Cc: Jann Horn <jannh@google.com>
    Cc: Henning Schild <henning.schild@siemens.com>
    Cc: Andrei Vagin <avagin@gmail.com>
    Cc: Al Viro <viro@zeniv.linux.org.uk>
    Cc: Laurent Vivier <laurent@vivier.eu>
    Cc: linux-fsdevel@vger.kernel.org
    Acked-by: default avatarSerge Hallyn <serge@hallyn.com>
    Signed-off-by: default avatarChristian Brauner <christian.brauner@ubuntu.com>
    Signed-off-by: default avatarChristian Brauner <brauner@kernel.org>
    Signed-off-by: default avatarKees Cook <keescook@chromium.org>
    Signed-off-by: default avatarSasha Levin <sashal@kernel.org>
    0a0c4585
    History
    binfmt_misc: cleanup on filesystem umount
    Christian Brauner authored
    [ Upstream commit 1c5976ef ]
    
    Currently, registering a new binary type pins the binfmt_misc
    filesystem. Specifically, this means that as long as there is at least
    one binary type registered the binfmt_misc filesystem survives all
    umounts, i.e. the superblock is not destroyed. Meaning that a umount
    followed by another mount will end up with the same superblock and the
    same binary type handlers. This is a behavior we tend to discourage for
    any new filesystems (apart from a few special filesystems such as e.g.
    configfs or debugfs). A umount operation without the filesystem being
    pinned - by e.g. someone holding a file descriptor to an open file -
    should usually result in the destruction of the superblock and all
    associated resources. This makes introspection easier and leads to
    clearly defined, simple and clean semantics. An administrator can rely
    on the fact that a umount will guarantee a clean slate making it
    possible to reinitialize a filesystem. Right now all binary types would
    need to be explicitly deleted before that can happen.
    
    This allows us to remove the heavy-handed calls to simple_pin_fs() and
    simple_release_fs() when creating and deleting binary types. This in
    turn allows us to replace the current brittle pinning mechanism abusing
    dget() which has caused a range of bugs judging from prior fixes in [2]
    and [3]. The additional dget() in load_misc_binary() pins the dentry but
    only does so for the sake to prevent ->evict_inode() from freeing the
    node when a user removes the binary type and kill_node() is run. Which
    would mean ->interpreter and ->interp_file would be freed causing a UAF.
    
    This isn't really nicely documented nor is it very clean because it
    relies on simple_pin_fs() pinning the filesystem as long as at least one
    binary type exists. Otherwise it would cause load_misc_binary() to hold
    on to a dentry belonging to a superblock that has been shutdown.
    Replace that implicit pinning with a clean and simple per-node refcount
    and get rid of the ugly dget() pinning. A similar mechanism exists for
    e.g. binderfs (cf. [4]). All the cleanup work can now be done in
    ->evict_inode().
    
    In a follow-up patch we will make it possible to use binfmt_misc in
    sandboxes. We will use the cleaner semantics where a umount for the
    filesystem will cause the superblock and all resources to be
    deallocated. In preparation for this apply the same semantics to the
    initial binfmt_misc mount. Note, that this is a user-visible change and
    as such a uapi change but one that we can reasonably risk. We've
    discussed this in earlier versions of this patchset (cf. [1]).
    
    The main user and provider of binfmt_misc is systemd. Systemd provides
    binfmt_misc via autofs since it is configurable as a kernel module and
    is used by a few exotic packages and users. As such a binfmt_misc mount
    is triggered when /proc/sys/fs/binfmt_misc is accessed and is only
    provided on demand. Other autofs on demand filesystems include EFI ESP
    which systemd umounts if the mountpoint stays idle for a certain amount
    of time. This doesn't apply to the binfmt_misc autofs mount which isn't
    touched once it is mounted meaning this change can't accidently wipe
    binary type handlers without someone having explicitly unmounted
    binfmt_misc. After speaking to systemd folks they don't expect this
    change to affect them.
    
    In line with our general policy, if we see a regression for systemd or
    other users with this change we will switch back to the old behavior for
    the initial binfmt_misc mount and have binary types pin the filesystem
    again. But while we touch this code let's take the chance and let's
    improve on the status quo.
    
    [1]: https://lore.kernel.org/r/20191216091220.465626-2-laurent@vivier.eu
    [2]: commit 43a4f261 ("exec: binfmt_misc: fix race between load_misc_binary() and kill_node()"
    [3]: commit 83f91827 ("exec: binfmt_misc: shift filp_close(interp_file) from kill_node() to bm_evict_inode()")
    [4]: commit f0fe2c0f ("binder: prevent UAF for binderfs devices II")
    
    Link: https://lore.kernel.org/r/20211028103114.2849140-1-brauner@kernel.org
    
     (v1)
    Cc: Sargun Dhillon <sargun@sargun.me>
    Cc: Serge Hallyn <serge@hallyn.com>
    Cc: Jann Horn <jannh@google.com>
    Cc: Henning Schild <henning.schild@siemens.com>
    Cc: Andrei Vagin <avagin@gmail.com>
    Cc: Al Viro <viro@zeniv.linux.org.uk>
    Cc: Laurent Vivier <laurent@vivier.eu>
    Cc: linux-fsdevel@vger.kernel.org
    Acked-by: default avatarSerge Hallyn <serge@hallyn.com>
    Signed-off-by: default avatarChristian Brauner <christian.brauner@ubuntu.com>
    Signed-off-by: default avatarChristian Brauner <brauner@kernel.org>
    Signed-off-by: default avatarKees Cook <keescook@chromium.org>
    Signed-off-by: default avatarSasha Levin <sashal@kernel.org>