  1. Jan 14, 2025
  2. Jul 11, 2024
  3. Dec 12, 2023
  4. Oct 31, 2023
  5. Oct 12, 2023
  6. Aug 17, 2023
  7. Feb 09, 2023
  8. Feb 01, 2023
  9. Jan 04, 2023
  10. Dec 31, 2022
  11. Sep 24, 2022
  12. Sep 01, 2022
  13. Aug 31, 2022
  14. Aug 17, 2022
    • Change calling conventions for filldir_t · 25885a35
      Al Viro authored
      
      filldir_t instances (directory iterators callbacks) used to return 0 for
      "OK, keep going" or -E... for "stop".  Note that it's *NOT* how the
      error values are reported - the rules for those are callback-dependent
      and ->iterate{,_shared}() instances only care about zero vs. non-zero
      (look at emit_dir() and friends).
      
      So let's just return bool ("should we keep going?") - it's less confusing
      that way.  The choice between "true means keep going" and "true means
      stop" is bikesheddable; we have two groups of callbacks -
      	do something for everything in directory, until we run into problem
      and
      	find an entry in directory and do something to it.
      
      The former tended to use 0/-E... conventions - -E<something> on failure.
      The latter tended to use 0/1, 1 being "stop, we are done".
      The callers treated anything non-zero as "stop", ignoring which
      non-zero value they got.
      
      "true means stop" would be more natural for the second group; "true
      means keep going" - for the first one.  I tried both variants and
      the things like
      	if allocation failed
      		something = -ENOMEM;
      		return true;
      just looked unnatural and asking for trouble.
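
The two callback groups described above can be sketched in userspace C. This is only an illustration of the "true means keep going" convention: `struct dir_ctx`, `fill_fn`, `iterate()` and the two callbacks are hypothetical stand-ins, not the kernel's `dir_context`, `filldir_t` or any real iterator.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for the kernel's dir_context and filldir_t. */
struct dir_ctx;
typedef bool (*fill_fn)(struct dir_ctx *ctx, const char *name);

struct dir_ctx {
	fill_fn actor;
};

/* Group 1: do something for every entry until we run into a problem. */
struct count_ctx {
	struct dir_ctx ctx;	/* first member, so the callback can cast back */
	size_t seen;
	size_t limit;
	int err;		/* the error itself is reported out of band */
};

bool count_entry(struct dir_ctx *ctx, const char *name)
{
	struct count_ctx *c = (struct count_ctx *)ctx;

	(void)name;
	if (c->seen == c->limit) {
		c->err = -1;	/* stands in for -ENOMEM and friends */
		return false;	/* stop */
	}
	c->seen++;
	return true;		/* keep going */
}

/* Group 2: find one entry and do something to it. */
struct find_ctx {
	struct dir_ctx ctx;
	const char *wanted;
	bool found;
};

bool match_entry(struct dir_ctx *ctx, const char *name)
{
	struct find_ctx *f = (struct find_ctx *)ctx;

	if (strcmp(name, f->wanted) != 0)
		return true;	/* not ours, keep going */
	f->found = true;
	return false;		/* done, stop */
}

/* The iterator only looks at the bool; returns how many entries it offered. */
size_t iterate(struct dir_ctx *ctx, const char *const *names, size_t n)
{
	size_t i;

	for (i = 0; i < n; i++)
		if (!ctx->actor(ctx, names[i]))
			return i + 1;
	return n;
}
```

In both groups the failure or success detail lives in the context structure, and the return value answers only "should we keep going?".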
      
      [folded suggestion from Matthew Wilcox <willy@infradead.org>]
      Acked-by: Christian Brauner (Microsoft) <brauner@kernel.org>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      25885a35
    • acl: handle idmapped mounts for idmapped filesystems · abfcf55d
      Christian Brauner authored
      Ensure that POSIX ACLs checking, getting, and setting works correctly
      for filesystems mountable with a filesystem idmapping ("fs_idmapping")
      that want to support idmapped mounts ("mnt_idmapping").
      
      Note that no filesystem mountable with an fs_idmapping supports idmapped
      mounts yet. This is required infrastructure work to unblock this.
      
      As we explained in detail in [1] the fs_idmapping is irrelevant for
      getxattr() and setxattr() when mapping the ACL_{GROUP,USER} {g,u}ids
      stored in the uapi struct posix_acl_xattr_entry in
      posix_acl_fix_xattr_{from,to}_user().
      
      But for acl_permission_check() and posix_acl_{g,s}etxattr_idmapped_mnt()
      the fs_idmapping matters.
      
      acl_permission_check():
        During lookup POSIX ACLs are retrieved directly via i_op->get_acl() and
        are returned via the kernel internal struct posix_acl which contains
        e_{g,u}id members of type k{g,u}id_t that already take the
        fs_idmapping into account.
      
        For example, a POSIX ACL stored with u4 on the backing store is mapped
        to k10000004 in the fs_idmapping. The mnt_idmapping remaps the POSIX ACL
        to k20000004. In order to do that the fs_idmapping needs to be taken
        into account, but that doesn't happen yet (again, this is a
        counterfactual, as fuse doesn't currently support idmapped mounts;
        it's just used as a convenient example):
      
        fs_idmapping:  u0:k10000000:r65536
        mnt_idmapping: u0:v20000000:r65536
        ACL_USER:      k10000004
      
        acl_permission_check()
        -> check_acl()
           -> get_acl()
              -> i_op->get_acl() == fuse_get_acl()
                 -> posix_acl_from_xattr(u0:k10000000:r65536 /* fs_idmapping */, ...)
                    {
                            k10000004 = make_kuid(u0:k10000000:r65536 /* fs_idmapping */,
                                                  u4 /* ACL_USER */);
                    }
           -> posix_acl_permission()
              {
                      -1 = make_vfsuid(u0:v20000000:r65536 /* mnt_idmapping */,
                                       &init_user_ns,
                                       k10000004);
                      vfsuid_eq_kuid(-1, k10000004 /* caller_fsuid */)
              }
      
        In order to correctly map from the fs_idmapping into mnt_idmapping we
        require the relevant fs_idmapping to be passed:
      
        acl_permission_check()
        -> check_acl()
           -> get_acl()
              -> i_op->get_acl() == fuse_get_acl()
                 -> posix_acl_from_xattr(u0:k10000000:r65536 /* fs_idmapping */, ...)
                    {
                            k10000004 = make_kuid(u0:k10000000:r65536 /* fs_idmapping */,
                                                  u4 /* ACL_USER */);
                    }
           -> posix_acl_permission()
              {
                      v20000004 = make_vfsuid(u0:v20000000:r65536 /* mnt_idmapping */,
                                              u0:k10000000:r65536 /* fs_idmapping */,
                                              k10000004);
                      vfsuid_eq_kuid(v20000004, k10000004 /* caller_fsuid */)
              }
      
        The initial_idmapping is only correct for the current situation because
        all filesystems that currently support idmapped mounts do not support
        being mounted with an fs_idmapping.
      
        Note that ovl_get_acl() is used to retrieve the POSIX ACLs from the
        relevant lower layer and the lower layer's mnt_idmapping needs to be
        taken into account and so does the fs_idmapping. See 0c5fd887 ("acl:
        move idmapped mount fixup into vfs_{g,s}etxattr()") for more details.
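
The arithmetic in the two call chains above can be reproduced with a small userspace model. This is a sketch under stated assumptions: each idmapping is a single extent in the `u<first>:k<lower_first>:r<count>` notation, and `struct idmap`, `map_down()`, `map_up()` and `model_make_vfsuid()` are hypothetical names that only imitate what `make_kuid()`, `from_kuid()` and `make_vfsuid()` do in the commit's example.

```c
#include <stdint.h>

#define INVALID_ID UINT32_MAX	/* models (uid_t)-1 */

/* One extent: ids first..first+count-1 map to lower_first..lower_first+count-1. */
struct idmap {
	uint32_t first;		/* first id on the upper side */
	uint32_t lower_first;	/* first id on the kernel side */
	uint32_t count;
};

/* Models make_kuid(): upper-side id -> kernel id. */
uint32_t map_down(const struct idmap *m, uint32_t id)
{
	if (id < m->first || id - m->first >= m->count)
		return INVALID_ID;
	return m->lower_first + (id - m->first);
}

/* Models from_kuid(): kernel id -> upper-side id. */
uint32_t map_up(const struct idmap *m, uint32_t kid)
{
	if (kid < m->lower_first || kid - m->lower_first >= m->count)
		return INVALID_ID;
	return m->first + (kid - m->lower_first);
}

/* Models make_vfsuid(mnt_idmapping, fs_idmapping, kuid). */
uint32_t model_make_vfsuid(const struct idmap *mnt, const struct idmap *fs,
			   uint32_t kid)
{
	uint32_t raw = map_up(fs, kid);		/* undo the fs_idmapping */

	if (raw == INVALID_ID)
		return INVALID_ID;
	return map_down(mnt, raw);		/* apply the mnt_idmapping */
}
```

With fs_idmapping `{0, 10000000, 65536}` and mnt_idmapping `{0, 20000000, 65536}`, `map_down()` turns u4 into 10000004. Passing the initial idmapping `{0, 0, UINT32_MAX}` as the filesystem side then yields `INVALID_ID`, while passing the real fs_idmapping yields 20000004, matching the two variants of the call chain.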
      
      For posix_acl_{g,s}etxattr_idmapped_mnt() it is not as obvious why the
      fs_idmapping matters as it is for acl_permission_check(), especially
      because it doesn't matter for posix_acl_fix_xattr_{from,to}_user() (see
      [1] for more context).
      
      posix_acl_{g,s}etxattr_idmapped_mnt() operate on the uapi
      struct posix_acl_xattr_entry, which contains {g,u}id_t values, and thus
      give the impression that the fs_idmapping is irrelevant, as at this point
      appropriate {g,u}id_t values have seemingly been generated.
      
      As we've stated multiple times this assumption is wrong, and in fact the
      uapi struct posix_acl_xattr_entry takes idmappings into account
      depending on where it is operated on.
      
      posix_acl_getxattr_idmapped_mnt()
        When posix_acl_getxattr_idmapped_mnt() is called the values stored in
        the uapi struct posix_acl_xattr_entry are mapped according to the
        fs_idmapping. This happened when they were read from the backing store
        and then translated from struct posix_acl into the uapi
        struct posix_acl_xattr_entry during posix_acl_to_xattr().
      
        In other words, the fs_idmapping matters as the values stored as
        {g,u}id_t in the uapi struct posix_acl_xattr_entry have been generated
        by it.
      
        So we need to take the fs_idmapping into account during make_vfsuid()
        in posix_acl_getxattr_idmapped_mnt().
      
      posix_acl_setxattr_idmapped_mnt()
        When posix_acl_setxattr_idmapped_mnt() is called the values stored as
        {g,u}id_t in uapi struct posix_acl_xattr_entry are intended to be the
        values that ultimately get turned back into a k{g,u}id_t in
        posix_acl_from_xattr() (which turns the uapi
        struct posix_acl_xattr_entry into the kernel internal struct posix_acl).
      
        In other words, the fs_idmapping matters as the values stored as
        {g,u}id_t in the uapi struct posix_acl_xattr_entry are intended to be
        the values that will be undone in the fs_idmapping when writing to the
        backing store.
      
        So we need to take the fs_idmapping into account during from_vfsuid()
        in posix_acl_setxattr_idmapped_mnt().
      
      Link: https://lore.kernel.org/all/20220801145520.1532837-1-brauner@kernel.org [1]
      Fixes: 0c5fd887 ("acl: move idmapped mount fixup into vfs_{g,s}etxattr()")
      Cc: Seth Forshee <sforshee@digitalocean.com>
      Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
      Reviewed-by: Seth Forshee <sforshee@digitalocean.com>
      Link: https://lore.kernel.org/r/20220816113514.43304-1-brauner@kernel.org
      abfcf55d
  15. Aug 02, 2022
  16. Jul 28, 2022
  17. Jul 27, 2022
    • ovl: fix some kernel-doc comments · 9c5dd803
      Yang Li authored
      
      Remove the following warnings reported by scripts/kernel-doc, which runs
      when building with 'make W=1':
      fs/overlayfs/super.c:311: warning: Function parameter or member 'dentry'
      not described in 'ovl_statfs'
      fs/overlayfs/super.c:311: warning: Excess function parameter 'sb'
      description in 'ovl_statfs'
      fs/overlayfs/super.c:357: warning: Function parameter or member 'm' not
      described in 'ovl_show_options'
      fs/overlayfs/super.c:357: warning: Function parameter or member 'dentry'
      not described in 'ovl_show_options'
      
      Reported-by: Abaci Robot <abaci@linux.alibaba.com>
      Signed-off-by: Yang Li <yang.lee@linux.alibaba.com>
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
      9c5dd803
    • ovl: warn if trusted xattr creation fails · b10b85fe
      Miklos Szeredi authored
      
      When mounting overlayfs in an unprivileged user namespace, trusted xattr
      creation will fail.  This will lead to failures in some file operations,
      e.g. in the following situation:
      
        mkdir lower upper work merged
        mkdir lower/directory
        mount -toverlay -olowerdir=lower,upperdir=upper,workdir=work none merged
        rmdir merged/directory
        mkdir merged/directory
      
      The last mkdir will fail:
      
        mkdir: cannot create directory 'merged/directory': Input/output error
      
      The cause for these failures is currently extremely non-obvious and hard to
      debug.  Hence, warn the user and suggest using the userxattr mount option,
      if it is not already supplied and xattr creation fails during the
      self-check.
      
      Reported-by: Alois Wohlschlager <alois1@gmx-topmail.de>
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
      b10b85fe
  18. Jul 16, 2022
  19. Jul 15, 2022
    • Revert "ovl: turn of SB_POSIXACL with idmapped layers temporarily" · 7c4d37c2
      Christian Brauner authored
      
      This reverts commit 4a47c638.
      
      Now that we have a proper fix for POSIX ACLs with overlayfs on top of
      idmapped layers, revert the temporary fix.
      
      Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
      7c4d37c2
    • ovl: handle idmappings in ovl_get_acl() · 1aa5fef5
      Christian Brauner authored
      During permission checking overlayfs will call
      
      ovl_permission()
      -> generic_permission()
         -> acl_permission_check()
            -> check_acl()
               -> get_acl()
                  -> inode->i_op->get_acl() == ovl_get_acl()
                     -> get_acl() /* on the underlying filesystem */
                        -> inode->i_op->get_acl() == /*lower filesystem callback */
               -> posix_acl_permission()
      
      passing through the get_acl() request to the underlying filesystem.
      
      Before returning these values to the VFS we need to take the idmapping of the
      relevant layer into account and translate any ACL_{GROUP,USER} values according
      to the idmapped mount.
      
      We cannot alter the ACLs returned from the relevant layer directly as that
      would alter the cached values filesystem wide for the lower filesystem. Instead
      we can clone the ACLs and then apply the relevant idmapping of the layer.
      
      This is obviously only relevant when idmapped layers are used.
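
The clone-then-translate idea can be sketched in userspace C. This is a model under stated assumptions, not the overlayfs implementation: `struct model_acl`, `enum model_tag` and `model_acl_clone_mapped()` are hypothetical, and the layer's idmapping is reduced to a single extent that maps ids `0..mnt_range-1` onto `mnt_first..`.

```c
#include <stdint.h>
#include <stdlib.h>

#define MODEL_ACL_MAX 8
#define MODEL_INVALID_ID UINT32_MAX

enum model_tag {
	MODEL_USER_OBJ,
	MODEL_USER,
	MODEL_GROUP_OBJ,
	MODEL_GROUP,
	MODEL_OTHER
};

struct model_acl_entry {
	enum model_tag tag;
	uint32_t id;	/* only meaningful for MODEL_USER and MODEL_GROUP */
	uint16_t perm;
};

struct model_acl {
	size_t count;
	struct model_acl_entry entries[MODEL_ACL_MAX];
};

/*
 * Return a translated private copy of a cached ACL, or NULL if the
 * allocation fails. The caller frees the copy with free().
 */
struct model_acl *model_acl_clone_mapped(const struct model_acl *cached,
					 uint32_t mnt_first, uint32_t mnt_range)
{
	struct model_acl *copy = malloc(sizeof(*copy));
	size_t i;

	if (!copy)
		return NULL;
	*copy = *cached;	/* the cached ACL itself is never written to */

	for (i = 0; i < copy->count; i++) {
		struct model_acl_entry *e = &copy->entries[i];

		if (e->tag != MODEL_USER && e->tag != MODEL_GROUP)
			continue;	/* no id to translate */
		e->id = e->id < mnt_range ? mnt_first + e->id : MODEL_INVALID_ID;
	}
	return copy;
}
```

Because only the copy is modified, other users of the lower filesystem keep seeing the untranslated cached values.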
      
      Link: https://lore.kernel.org/r/20220708090134.385160-4-brauner@kernel.org
      
      
      Cc: Seth Forshee <sforshee@digitalocean.com>
      Cc: Amir Goldstein <amir73il@gmail.com>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Aleksa Sarai <cyphar@cyphar.com>
      Cc: Miklos Szeredi <mszeredi@redhat.com>
      Cc: linux-unionfs@vger.kernel.org
      Reviewed-by: Seth Forshee <sforshee@digitalocean.com>
      Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
      1aa5fef5
    • acl: move idmapped mount fixup into vfs_{g,s}etxattr() · 0c5fd887
      Christian Brauner authored
      This cycle we added support for mounting overlayfs on top of idmapped mounts.
      Recently I've started looking into potential corner cases when trying to add
      additional tests and I noticed that reporting for POSIX ACLs is currently wrong
      when using idmapped layers with overlayfs mounted on top of them.
      
      I'm going to give a rather detailed explanation of both the origin of the
      problem and the solution.
      
      Let's assume the user has a rootfs at /var/lib/lxc/c2/rootfs. The files in
      this rootfs are owned as you would expect files on your host system to be
      owned. For example, ~/.bashrc for your regular user would be owned by
      1000:1000 and /root/.bashrc would be owned by 0:0. IOW, this is just a
      regular, boring filesystem tree on an ext4 or xfs filesystem.
      
      The user chooses to set POSIX ACLs using the setfacl binary granting the user
      with uid 4 read, write, and execute permissions for their .bashrc file:
      
              setfacl -m u:4:rwx /var/lib/lxc/c2/rootfs/home/ubuntu/.bashrc
      
      Now they want to expose the whole rootfs to a container using an idmapped
      mount. So they first create:
      
              mkdir -pv /vol/contpool/{ctrover,merge,lowermap,overmap}
              mkdir -pv /vol/contpool/ctrover/{over,work}
              chown 10000000:10000000 /vol/contpool/ctrover/{over,work}
      
      The user now creates an idmapped mount for the rootfs:
      
              mount-idmapped/mount-idmapped --map-mount=b:0:10000000:65536 \
                                            /var/lib/lxc/c2/rootfs \
                                            /vol/contpool/lowermap
      
      For example, this makes /var/lib/lxc/c2/rootfs/home/ubuntu/.bashrc, which
      is owned by uid and gid 1000, appear as owned by uid and gid 10001000 at
      /vol/contpool/lowermap/home/ubuntu/.bashrc.
      
      Assume the user wants to expose these idmapped mounts through an overlayfs
      mount to a container.
      
             mount -t overlay overlay                      \
                   -o lowerdir=/vol/contpool/lowermap,     \
                      upperdir=/vol/contpool/overmap/over, \
                      workdir=/vol/contpool/overmap/work   \
                   /vol/contpool/merge
      
      The user can do this in two ways:
      
      (1) Mount overlayfs in the initial user namespace and expose it to the
          container.
      (2) Mount overlayfs on top of the idmapped mounts inside of the container's
          user namespace.
      
      Let's assume the user chooses option (1) and mounts overlayfs on the host
      and then changes into a container which uses the idmapping 0:10000000:65536,
      which is the same one used for the two idmapped mounts.
      
      Now the user tries to retrieve the POSIX ACLs using the getfacl command
      
              getfacl -n /vol/contpool/merge/home/ubuntu/.bashrc
      
      and to their surprise they see:
      
              # file: vol/contpool/merge/home/ubuntu/.bashrc
              # owner: 1000
              # group: 1000
              user::rw-
              user:4294967295:rwx
              group::r--
              mask::rwx
              other::r--
      
      indicating that the uid wasn't correctly translated according to the idmapped
      mount. The problem is how we currently translate POSIX ACLs. Let's inspect the
      callchain in this example:
      
              idmapped mount /vol/contpool/merge:      0:10000000:65536
              caller's idmapping:                      0:10000000:65536
              overlayfs idmapping (ofs->creator_cred): 0:0:4k /* initial idmapping */
      
              sys_getxattr()
              -> path_getxattr()
                 -> getxattr()
                    -> do_getxattr()
                        |> vfs_getxattr()
                        |  -> __vfs_getxattr()
                        |     -> handler->get == ovl_posix_acl_xattr_get()
                        |        -> ovl_xattr_get()
                        |           -> vfs_getxattr()
                        |              -> __vfs_getxattr()
                        |                 -> handler->get() /* lower filesystem callback */
                        |> posix_acl_fix_xattr_to_user()
                           {
                                    4 = make_kuid(&init_user_ns, 4);
                                    4 = mapped_kuid_fs(&init_user_ns /* no idmapped mount */, 4);
                                    /* FAILURE */
                                   -1 = from_kuid(0:10000000:65536 /* caller's idmapping */, 4);
                           }
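
The `user:4294967295` entry in the getfacl output follows directly from the last step of this chain. A minimal sketch, assuming a single-extent idmapping; `struct idmap` and `map_up()` are hypothetical names that only imitate `from_kuid()`:

```c
#include <stdint.h>

#define INVALID_ID UINT32_MAX	/* (uid_t)-1, which getfacl -n prints as 4294967295 */

/* One extent: ids first..first+count-1 map to lower_first..lower_first+count-1. */
struct idmap {
	uint32_t first;
	uint32_t lower_first;
	uint32_t count;
};

/* Models from_kuid(): kernel id -> id as seen inside the user namespace. */
uint32_t map_up(const struct idmap *m, uint32_t kid)
{
	if (kid < m->lower_first || kid - m->lower_first >= m->count)
		return INVALID_ID;	/* kid has no mapping in this namespace */
	return m->first + (kid - m->lower_first);
}
```

For the caller's idmapping `{0, 10000000, 65536}`, kernel id 4 lies outside the range 10000000..10065535, so `map_up()` returns `INVALID_ID`, whereas 10000004 would map back to 4.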
      
      If the user chooses to use option (2) and mounts overlayfs on top of idmapped
      mounts inside the container things don't look that much better:
      
              idmapped mount /vol/contpool/merge:      0:10000000:65536
              caller's idmapping:                      0:10000000:65536
              overlayfs idmapping (ofs->creator_cred): 0:10000000:65536
      
              sys_getxattr()
              -> path_getxattr()
                 -> getxattr()
                    -> do_getxattr()
                        |> vfs_getxattr()
                        |  -> __vfs_getxattr()
                        |     -> handler->get == ovl_posix_acl_xattr_get()
                        |        -> ovl_xattr_get()
                        |           -> vfs_getxattr()
                        |              -> __vfs_getxattr()
                        |                 -> handler->get() /* lower filesystem callback */
                        |> posix_acl_fix_xattr_to_user()
                           {
                                    4 = make_kuid(&init_user_ns, 4);
                                    4 = mapped_kuid_fs(&init_user_ns, 4);
                                    /* FAILURE */
                                   -1 = from_kuid(0:10000000:65536 /* caller's idmapping */, 4);
                           }
      
      As is easily seen, the problem arises because the idmapping of the lower mount
      isn't taken into account as all of this happens in do_getxattr(). But
      do_getxattr() is always called on an overlayfs mount and inode and thus cannot
      possibly take the idmapping of the lower layers into account.
      
      This problem is similar for fscaps but there the translation happens as part of
      vfs_getxattr() already. Let's walk through an fscaps overlayfs callchain:
      
              setcap 'cap_net_raw+ep' /var/lib/lxc/c2/rootfs/home/ubuntu/.bashrc
      
      The expected outcome here is that we'll receive the cap_net_raw capability as
      we are able to map the uid associated with the fscap to 0 within our container.
      IOW, we want to see 0 as the result of the idmapping translations.
      
      If the user chooses option (1) we get the following callchain for fscaps:
      
              idmapped mount /vol/contpool/merge:      0:10000000:65536
              caller's idmapping:                      0:10000000:65536
              overlayfs idmapping (ofs->creator_cred): 0:0:4k /* initial idmapping */
      
              sys_getxattr()
              -> path_getxattr()
                 -> getxattr()
                    -> do_getxattr()
                         -> vfs_getxattr()
                            -> xattr_getsecurity()
                               -> security_inode_getsecurity()                                       ________________________________
                                  -> cap_inode_getsecurity()                                         |                              |
                                     {                                                               V                              |
                                              10000000 = make_kuid(0:0:4k /* overlayfs idmapping */, 10000000);                     |
                                              10000000 = mapped_kuid_fs(0:0:4k /* no idmapped mount */, 10000000);                  |
                                                     /* Expected result is 0 and thus that we own the fscap. */                     |
                                                     0 = from_kuid(0:10000000:65536 /* caller's idmapping */, 10000000);            |
                                     }                                                                                              |
                                     -> vfs_getxattr_alloc()                                                                        |
                                        -> handler->get == ovl_other_xattr_get()                                                    |
                                           -> vfs_getxattr()                                                                        |
                                              -> xattr_getsecurity()                                                                |
                                                 -> security_inode_getsecurity()                                                    |
                                                    -> cap_inode_getsecurity()                                                      |
                                                       {                                                                            |
                                                                      0 = make_kuid(0:0:4k /* lower s_user_ns */, 0);               |
                                                               10000000 = mapped_kuid_fs(0:10000000:65536 /* idmapped mount */, 0); |
                                                               10000000 = from_kuid(0:0:4k /* overlayfs idmapping */, 10000000);    |
                                                               |____________________________________________________________________|
                                                       }
                                                       -> vfs_getxattr_alloc()
                                                          -> handler->get == /* lower filesystem callback */
      
      And if the user chooses option (2) we get:
      
              idmapped mount /vol/contpool/merge:      0:10000000:65536
              caller's idmapping:                      0:10000000:65536
              overlayfs idmapping (ofs->creator_cred): 0:10000000:65536
      
              sys_getxattr()
              -> path_getxattr()
                 -> getxattr()
                    -> do_getxattr()
                         -> vfs_getxattr()
                            -> xattr_getsecurity()
                               -> security_inode_getsecurity()                                                _______________________________
                                  -> cap_inode_getsecurity()                                                  |                             |
                                     {                                                                        V                             |
                                             10000000 = make_kuid(0:10000000:65536 /* overlayfs idmapping */, 0);                           |
                                             10000000 = mapped_kuid_fs(0:0:4k /* no idmapped mount */, 10000000);                           |
                                                     /* Expected result is 0 and thus that we own the fscap. */                             |
                                                    0 = from_kuid(0:10000000:65536 /* caller's idmapping */, 10000000);                     |
                                     }                                                                                                      |
                                     -> vfs_getxattr_alloc()                                                                                |
                                        -> handler->get == ovl_other_xattr_get()                                                            |
                                          |-> vfs_getxattr()                                                                                |
                                              -> xattr_getsecurity()                                                                        |
                                                 -> security_inode_getsecurity()                                                            |
                                                    -> cap_inode_getsecurity()                                                              |
                                                       {                                                                                    |
                                                                       0 = make_kuid(0:0:4k /* lower s_user_ns */, 0);                      |
                                                                10000000 = mapped_kuid_fs(0:10000000:65536 /* idmapped mount */, 0);        |
                                                                       0 = from_kuid(0:10000000:65536 /* overlayfs idmapping */, 10000000); |
                                                                       |____________________________________________________________________|
                                                       }
                                                       -> vfs_getxattr_alloc()
                                                          -> handler->get == /* lower filesystem callback */
      
      We can see how the translation happens correctly in those cases as the
      conversion happens within the vfs_getxattr() helper.
      
      For POSIX ACLs we need to do something similar. However, in contrast to fscaps
      we cannot apply the fix directly to the kernel internal posix acl data
      structure as this would alter the cached values and would also require a rework
      of how we currently deal with POSIX ACLs in general, which almost never take the
      filesystem idmapping into account (the notable exception being FUSE, but even
      there the implementation is special) and instead retrieve the raw values based
      on the initial idmapping.
      
      The correct values are then generated right before returning to userspace. The
      fix for this is to move taking the mount's idmapping into account directly in
      vfs_getxattr() instead of having it be part of posix_acl_fix_xattr_to_user().
      
      To this end we split out two small and unexported helpers
      posix_acl_getxattr_idmapped_mnt() and posix_acl_setxattr_idmapped_mnt(). The
      former to be called in vfs_getxattr() and the latter to be called in
      vfs_setxattr().
      
      Let's go back to the original example. Assume the user chose option (1) and
      mounted overlayfs on top of idmapped mounts on the host:
      
              idmapped mount /vol/contpool/merge:      0:10000000:65536
              caller's idmapping:                      0:10000000:65536
              overlayfs idmapping (ofs->creator_cred): 0:0:4k /* initial idmapping */
      
              sys_getxattr()
              -> path_getxattr()
                 -> getxattr()
                    -> do_getxattr()
                        |> vfs_getxattr()
                        |  |> __vfs_getxattr()
                        |  |  -> handler->get == ovl_posix_acl_xattr_get()
                        |  |     -> ovl_xattr_get()
                        |  |        -> vfs_getxattr()
                        |  |           |> __vfs_getxattr()
                        |  |           |  -> handler->get() /* lower filesystem callback */
                        |  |           |> posix_acl_getxattr_idmapped_mnt()
                        |  |              {
                        |  |                              4 = make_kuid(&init_user_ns, 4);
                        |  |                       10000004 = mapped_kuid_fs(0:10000000:65536 /* lower idmapped mount */, 4);
                        |  |                       10000004 = from_kuid(&init_user_ns, 10000004);
                        |  |                       |_______________________
                        |  |              }                               |
                        |  |                                              |
                        |  |> posix_acl_getxattr_idmapped_mnt()           |
                        |     {                                           |
                        |                                                 V
                        |             10000004 = make_kuid(&init_user_ns, 10000004);
                        |             10000004 = mapped_kuid_fs(&init_user_ns /* no idmapped mount */, 10000004);
                        |             10000004 = from_kuid(&init_user_ns, 10000004);
                        |     }       |_________________________________________________
                        |                                                              |
                        |                                                              |
                        |> posix_acl_fix_xattr_to_user()                               |
                           {                                                           V
                                       10000004 = make_kuid(0:0:4k /* init_user_ns */, 10000004);
                                              /* SUCCESS */
                                              4 = from_kuid(0:10000000:65536 /* caller's idmapping */, 10000004);
                           }
      
      And similarly if the user chooses option (2) and mounts overlayfs on top of
      idmapped mounts inside the container:
      
              idmapped mount /vol/contpool/merge:      0:10000000:65536
              caller's idmapping:                      0:10000000:65536
              overlayfs idmapping (ofs->creator_cred): 0:10000000:65536
      
              sys_getxattr()
              -> path_getxattr()
                 -> getxattr()
                    -> do_getxattr()
                        |> vfs_getxattr()
                        |  |> __vfs_getxattr()
                        |  |  -> handler->get == ovl_posix_acl_xattr_get()
                        |  |     -> ovl_xattr_get()
                        |  |        -> vfs_getxattr()
                        |  |           |> __vfs_getxattr()
                        |  |           |  -> handler->get() /* lower filesystem callback */
                        |  |           |> posix_acl_getxattr_idmapped_mnt()
                        |  |              {
                        |  |                              4 = make_kuid(&init_user_ns, 4);
                        |  |                       10000004 = mapped_kuid_fs(0:10000000:65536 /* lower idmapped mount */, 4);
                        |  |                       10000004 = from_kuid(&init_user_ns, 10000004);
                        |  |                       |_______________________
                        |  |              }                               |
                        |  |                                              |
                        |  |> posix_acl_getxattr_idmapped_mnt()           |
                        |     {                                           V
                        |             10000004 = make_kuid(&init_user_ns, 10000004);
                        |             10000004 = mapped_kuid_fs(&init_user_ns /* no idmapped mount */, 10000004);
                        |             10000004 = from_kuid(&init_user_ns, 10000004);
                        |             |_________________________________________________
                        |     }                                                        |
                        |                                                              |
                        |> posix_acl_fix_xattr_to_user()                               |
                           {                                                           V
                                       10000004 = make_kuid(0:0:4k /* init_user_ns */, 10000004);
                                              /* SUCCESS */
                                              4 = from_kuid(0:10000000:65536 /* caller's idmapping */, 10000004);
                           }
      
      The last remaining problem we need to fix here is ovl_get_acl(). During
      ovl_permission() overlayfs will call:
      
              ovl_permission()
              -> generic_permission()
                 -> acl_permission_check()
                    -> check_acl()
                       -> get_acl()
                          -> inode->i_op->get_acl() == ovl_get_acl()
                             -> get_acl() /* on the underlying filesystem */
                                -> inode->i_op->get_acl() == /* lower filesystem callback */
                       -> posix_acl_permission()
      
      passing through the get_acl request to the underlying filesystem. This will
      retrieve the acls stored in the lower filesystem without taking the idmapping
      of the underlying mount into account as this would mean altering the cached
      values for the lower filesystem. So we block using ACLs for now until we
      have decided on a nice way to fix this. Note this limitation both in the
      documentation and in the code.
      
      The most straightforward solution would be to have ovl_get_acl() simply
      duplicate the ACLs, update the values according to the idmapped mount and
      return them to acl_permission_check() so they can be used in
      posix_acl_permission(), forgetting them afterwards. This is a bit
      heavy-handed but fairly straightforward otherwise.
      
      Link: https://github.com/brauner/mount-idmapped/issues/9
      Link: https://lore.kernel.org/r/20220708090134.385160-2-brauner@kernel.org
      
      
      Cc: Seth Forshee <sforshee@digitalocean.com>
      Cc: Amir Goldstein <amir73il@gmail.com>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Aleksa Sarai <cyphar@cyphar.com>
      Cc: Miklos Szeredi <mszeredi@redhat.com>
      Cc: linux-unionfs@vger.kernel.org
      Cc: linux-fsdevel@vger.kernel.org
      Reviewed-by: Seth Forshee <sforshee@digitalocean.com>
      Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
      0c5fd887
  20. Jul 08, 2022
    • ovl: turn of SB_POSIXACL with idmapped layers temporarily · 4a47c638
      Christian Brauner authored
      This cycle we added support for mounting overlayfs on top of idmapped
      mounts.  Recently I've started looking into potential corner cases when
      trying to add additional tests and I noticed that reporting for POSIX ACLs
      is currently wrong when using idmapped layers with overlayfs mounted on top
      of it.
      
      I have sent out a patch that fixes this and makes POSIX ACLs work
      correctly but the patch is a bit bigger and we're already at -rc5 so I
      recommend we simply don't raise SB_POSIXACL when idmapped layers are
      used. Then we can fix the VFS part described below for the next merge
      window so we can have good exposure in -next.
      
      I'm going to give a rather detailed explanation of both the origin of the
      problem and the solution so people know what's going on.
      
      Let's assume the user creates the following directory layout and they have
      a rootfs /var/lib/lxc/c2/rootfs. The files in this rootfs are owned as you
      would expect files on your host system to be owned. For example, ~/.bashrc
      for your regular user would be owned by 1000:1000 and /root/.bashrc would
      be owned by 0:0. IOW, this is just a regular boring filesystem tree on an
      ext4 or xfs filesystem.
      
      The user chooses to set POSIX ACLs using the setfacl binary granting the
      user with uid 4 read, write, and execute permissions for their .bashrc
      file:
      
              setfacl -m u:4:rwx /var/lib/lxc/c2/rootfs/home/ubuntu/.bashrc
      
      Now they want to expose the whole rootfs to a container using an idmapped
      mount. So they first create:
      
              mkdir -pv /vol/contpool/{ctrover,merge,lowermap,overmap}
              mkdir -pv /vol/contpool/ctrover/{over,work}
              chown 10000000:10000000 /vol/contpool/ctrover/{over,work}
      
      The user now creates an idmapped mount for the rootfs:
      
              mount-idmapped/mount-idmapped --map-mount=b:0:10000000:65536 \
                                            /var/lib/lxc/c2/rootfs \
                                            /vol/contpool/lowermap
      
      This for example makes it so that
      /var/lib/lxc/c2/rootfs/home/ubuntu/.bashrc, which is owned by uid and gid
      1000, shows up as being owned by uid and gid 10001000 at
      /vol/contpool/lowermap/home/ubuntu/.bashrc.
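
      The arithmetic behind such an idmapping can be sketched in userspace C.
      This is only a model under stated assumptions: the names idmap_model,
      model_make_kuid() and model_from_kuid() are hypothetical and merely
      mirror the direction of the make_kuid()/from_kuid() steps shown in the
      callchains in this message; they are not the kernel implementation.

```c
#include <stdint.h>

/* Marker for "no mapping exists", i.e. (uid_t)-1. */
#define MODEL_INVALID_ID ((uint32_t)-1)

/* One idmapping extent, written first:lower_first:count in the text. */
struct idmap_model {
    uint32_t first;       /* first id of the upper range */
    uint32_t lower_first; /* first id of the lower range */
    uint32_t count;       /* number of ids covered by the extent */
};

/* The b:0:10000000:65536 mapping passed to mount-idmapped above. */
const struct idmap_model MODEL_MNT = { 0, 10000000, 65536 };

/* Upper id -> lower id (the direction the traces label make_kuid()). */
uint32_t model_make_kuid(struct idmap_model m, uint32_t id)
{
    if (id >= m.first && id - m.first < m.count)
        return m.lower_first + (id - m.first);
    return MODEL_INVALID_ID;
}

/* Lower id -> upper id (the direction the traces label from_kuid()). */
uint32_t model_from_kuid(struct idmap_model m, uint32_t kid)
{
    if (kid >= m.lower_first && kid - m.lower_first < m.count)
        return m.first + (kid - m.lower_first);
    return MODEL_INVALID_ID;
}
```

      With MODEL_MNT, model_make_kuid() turns the on-disk uid 1000 into
      10001000 and model_from_kuid() reverses that; an id outside the extent
      has no mapping and yields MODEL_INVALID_ID.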
      
      Assume the user wants to expose these idmapped mounts through an overlayfs
      mount to a container.
      
             mount -t overlay overlay                      \
                   -o lowerdir=/vol/contpool/lowermap,     \
                      upperdir=/vol/contpool/overmap/over, \
                      workdir=/vol/contpool/overmap/work   \
                   /vol/contpool/merge
      
      The user can do this in two ways:
      
      (1) Mount overlayfs in the initial user namespace and expose it to the
          container.
      
      (2) Mount overlayfs on top of the idmapped mounts inside of the container's
          user namespace.
      
      Let's assume the user chooses the (1) option and mounts overlayfs on the
      host and then changes into a container which uses the idmapping
      0:10000000:65536 which is the same used for the two idmapped mounts.
      
      Now the user tries to retrieve the POSIX ACLs using the getfacl command:
      
              getfacl -n /vol/contpool/merge/home/ubuntu/.bashrc
      
      and to their surprise they see:
      
              # file: vol/contpool/merge/home/ubuntu/.bashrc
              # owner: 1000
              # group: 1000
              user::rw-
              user:4294967295:rwx
              group::r--
              mask::rwx
              other::r--
      
      indicating the uid wasn't correctly translated according to the idmapped
      mount. The problem is how we currently translate POSIX ACLs. Let's inspect
      the callchain in this example:
      
              idmapped mount /vol/contpool/merge:      0:10000000:65536
              caller's idmapping:                      0:10000000:65536
              overlayfs idmapping (ofs->creator_cred): 0:0:4k /* initial idmapping */
      
              sys_getxattr()
              -> path_getxattr()
                 -> getxattr()
                    -> do_getxattr()
                        |> vfs_getxattr()
                        |  -> __vfs_getxattr()
                        |     -> handler->get == ovl_posix_acl_xattr_get()
                        |        -> ovl_xattr_get()
                        |           -> vfs_getxattr()
                        |              -> __vfs_getxattr()
                        |                 -> handler->get() /* lower filesystem callback */
                        |> posix_acl_fix_xattr_to_user()
                           {
                                    4 = make_kuid(&init_user_ns, 4);
                                    4 = mapped_kuid_fs(&init_user_ns /* no idmapped mount */, 4);
                                    /* FAILURE */
                                   -1 = from_kuid(0:10000000:65536 /* caller's idmapping */, 4);
                           }
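
      The failing translation can be replayed with a small userspace model.
      It is a sketch, not kernel code: all model_* names are hypothetical,
      and the initial idmapping written 0:0:4k above is assumed to be
      0:0:4294967295.

```c
#include <stdint.h>

#define MODEL_INVALID_ID ((uint32_t)-1)

struct idmap_model {
    uint32_t first;       /* first id of the upper range */
    uint32_t lower_first; /* first id of the lower range */
    uint32_t count;       /* number of ids covered by the extent */
};

/* Assumed stand-in for the initial idmapping ("0:0:4k" in the trace). */
const struct idmap_model MODEL_INIT = { 0, 0, 4294967295u };
/* The caller's idmapping from the trace. */
const struct idmap_model MODEL_CALLER = { 0, 10000000, 65536 };

uint32_t model_make_kuid(struct idmap_model m, uint32_t id)
{
    if (id >= m.first && id - m.first < m.count)
        return m.lower_first + (id - m.first);
    return MODEL_INVALID_ID;
}

uint32_t model_from_kuid(struct idmap_model m, uint32_t kid)
{
    if (kid >= m.lower_first && kid - m.lower_first < m.count)
        return m.first + (kid - m.lower_first);
    return MODEL_INVALID_ID;
}

/*
 * Model of the three steps in the trace above. Only the overlayfs mount
 * is visible at this level, so the mount step uses the initial idmapping
 * and leaves the value unchanged.
 */
uint32_t model_acl_uid_before_fix(uint32_t stored_uid)
{
    uint32_t kuid = model_make_kuid(MODEL_INIT, stored_uid);
    uint32_t mapped = model_make_kuid(MODEL_INIT, kuid); /* no idmapped mount */

    return model_from_kuid(MODEL_CALLER, mapped);
}
```

      For the stored uid 4 the last step finds no mapping in the caller's
      idmapping and the model returns (uint32_t)-1, which is the 4294967295
      that getfacl -n printed.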
      
      If the user chooses to use option (2) and mounts overlayfs on top of
      idmapped mounts inside the container things don't look that much better:
      
              idmapped mount /vol/contpool/merge:      0:10000000:65536
              caller's idmapping:                      0:10000000:65536
              overlayfs idmapping (ofs->creator_cred): 0:10000000:65536
      
              sys_getxattr()
              -> path_getxattr()
                 -> getxattr()
                    -> do_getxattr()
                        |> vfs_getxattr()
                        |  -> __vfs_getxattr()
                        |     -> handler->get == ovl_posix_acl_xattr_get()
                        |        -> ovl_xattr_get()
                        |           -> vfs_getxattr()
                        |              -> __vfs_getxattr()
                        |                 -> handler->get() /* lower filesystem callback */
                        |> posix_acl_fix_xattr_to_user()
                           {
                                    4 = make_kuid(&init_user_ns, 4);
                                    4 = mapped_kuid_fs(&init_user_ns, 4);
                                    /* FAILURE */
                                   -1 = from_kuid(0:10000000:65536 /* caller's idmapping */, 4);
                           }
      
      As is easily seen, the problem arises because the idmapping of the lower
      mount isn't taken into account as all of this happens in do_getxattr(). But
      do_getxattr() is always called on an overlayfs mount and inode and thus
      cannot possibly take the idmapping of the lower layers into account.
      
      This problem is similar for fscaps but there the translation happens as
      part of vfs_getxattr() already. Let's walk through an fscaps overlayfs
      callchain:
      
              setcap 'cap_net_raw+ep' /var/lib/lxc/c2/rootfs/home/ubuntu/.bashrc
      
      The expected outcome here is that we'll receive the cap_net_raw capability
      as we are able to map the uid associated with the fscap to 0 within our
      container.  IOW, we want to see 0 as the result of the idmapping
      translations.
      
      If the user chooses option (1) we get the following callchain for fscaps:
      
              idmapped mount /vol/contpool/merge:      0:10000000:65536
              caller's idmapping:                      0:10000000:65536
              overlayfs idmapping (ofs->creator_cred): 0:0:4k /* initial idmapping */
      
              sys_getxattr()
              -> path_getxattr()
                 -> getxattr()
                    -> do_getxattr()
                         -> vfs_getxattr()
                            -> xattr_getsecurity()
                               -> security_inode_getsecurity()                                       ________________________________
                                  -> cap_inode_getsecurity()                                         |                              |
                                     {                                                               V                              |
                                              10000000 = make_kuid(0:0:4k /* overlayfs idmapping */, 10000000);                     |
                                              10000000 = mapped_kuid_fs(0:0:4k /* no idmapped mount */, 10000000);                  |
                                                     /* Expected result is 0 and thus that we own the fscap. */                     |
                                                     0 = from_kuid(0:10000000:65536 /* caller's idmapping */, 10000000);            |
                                     }                                                                                              |
                                     -> vfs_getxattr_alloc()                                                                        |
                                        -> handler->get == ovl_other_xattr_get()                                                    |
                                           -> vfs_getxattr()                                                                        |
                                              -> xattr_getsecurity()                                                                |
                                                 -> security_inode_getsecurity()                                                    |
                                                    -> cap_inode_getsecurity()                                                      |
                                                       {                                                                            |
                                                                      0 = make_kuid(0:0:4k /* lower s_user_ns */, 0);               |
                                                               10000000 = mapped_kuid_fs(0:10000000:65536 /* idmapped mount */, 0); |
                                                               10000000 = from_kuid(0:0:4k /* overlayfs idmapping */, 10000000);    |
                                                               |____________________________________________________________________|
                                                       }
                                                       -> vfs_getxattr_alloc()
                                                          -> handler->get == /* lower filesystem callback */
      
      And if the user chooses option (2) we get:
      
              idmapped mount /vol/contpool/merge:      0:10000000:65536
              caller's idmapping:                      0:10000000:65536
              overlayfs idmapping (ofs->creator_cred): 0:10000000:65536
      
              sys_getxattr()
              -> path_getxattr()
                 -> getxattr()
                    -> do_getxattr()
                         -> vfs_getxattr()
                            -> xattr_getsecurity()
                               -> security_inode_getsecurity()                                                _______________________________
                                  -> cap_inode_getsecurity()                                                  |                             |
                                     {                                                                        V                             |
                                             10000000 = make_kuid(0:10000000:65536 /* overlayfs idmapping */, 0);                           |
                                             10000000 = mapped_kuid_fs(0:0:4k /* no idmapped mount */, 10000000);                           |
                                                     /* Expected result is 0 and thus that we own the fscap. */                             |
                                                    0 = from_kuid(0:10000000:65536 /* caller's idmapping */, 10000000);                     |
                                     }                                                                                                      |
                                     -> vfs_getxattr_alloc()                                                                                |
                                        -> handler->get == ovl_other_xattr_get()                                                            |
                                           -> vfs_getxattr()                                                                                |
                                              -> xattr_getsecurity()                                                                        |
                                                 -> security_inode_getsecurity()                                                            |
                                                    -> cap_inode_getsecurity()                                                              |
                                                       {                                                                                    |
                                                                       0 = make_kuid(0:0:4k /* lower s_user_ns */, 0);                      |
                                                                10000000 = mapped_kuid_fs(0:10000000:65536 /* idmapped mount */, 0);        |
                                                                       0 = from_kuid(0:10000000:65536 /* overlayfs idmapping */, 10000000); |
                                                                       |____________________________________________________________________|
                                                       }
                                                       -> vfs_getxattr_alloc()
                                                          -> handler->get == /* lower filesystem callback */
      
      We can see how the translation happens correctly in those cases as the
      conversion happens within the vfs_getxattr() helper.
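
      Both fscaps callchains can be replayed with a userspace model. This is
      a sketch under assumptions: the model_* names are hypothetical, 0:0:4k
      is modelled as 0:0:4294967295, and each nested cap_inode_getsecurity()
      call is reduced to the three translations shown in the traces.

```c
#include <stdint.h>

#define MODEL_INVALID_ID ((uint32_t)-1)

struct idmap_model {
    uint32_t first;       /* first id of the upper range */
    uint32_t lower_first; /* first id of the lower range */
    uint32_t count;       /* number of ids covered by the extent */
};

const struct idmap_model MODEL_INIT = { 0, 0, 4294967295u };
const struct idmap_model MODEL_CONTAINER = { 0, 10000000, 65536 };

uint32_t model_make_kuid(struct idmap_model m, uint32_t id)
{
    if (id >= m.first && id - m.first < m.count)
        return m.lower_first + (id - m.first);
    return MODEL_INVALID_ID;
}

uint32_t model_from_kuid(struct idmap_model m, uint32_t kid)
{
    if (kid >= m.lower_first && kid - m.lower_first < m.count)
        return m.first + (kid - m.lower_first);
    return MODEL_INVALID_ID;
}

/*
 * Inner call: runs on the lower, idmapped mount on behalf of overlayfs.
 * Outer call: runs on the overlayfs mount on behalf of the caller.
 */
uint32_t model_fscap_rootid(struct idmap_model ovl,
                            struct idmap_model lower_mnt,
                            struct idmap_model caller,
                            uint32_t stored_rootid)
{
    /* Inner cap_inode_getsecurity() */
    uint32_t kuid = model_make_kuid(MODEL_INIT, stored_rootid);
    uint32_t mapped = model_make_kuid(lower_mnt, kuid);
    uint32_t seen_by_ovl = model_from_kuid(ovl, mapped);

    /* Outer cap_inode_getsecurity(); overlayfs itself is not idmapped */
    kuid = model_make_kuid(ovl, seen_by_ovl);
    mapped = model_make_kuid(MODEL_INIT, kuid);
    return model_from_kuid(caller, mapped);
}
```

      Passing MODEL_INIT as the overlayfs idmapping models option (1) and
      passing MODEL_CONTAINER models option (2). In both cases a stored
      rootid of 0 comes back as 0 for the caller, matching the expected
      result stated above.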
      
      For POSIX ACLs we need to do something similar. However, in contrast to
      fscaps we cannot apply the fix directly to the kernel internal posix acl
      data structure as this would alter the cached values and would also require
      a rework of how we currently deal with POSIX ACLs in general which almost
      never take the filesystem idmapping into account (the notable exception
      being FUSE but even there the implementation is special) and instead
      retrieve the raw values based on the initial idmapping.
      
      The correct values are then generated right before returning to
      userspace. The fix for this is to move taking the mount's idmapping into
      account directly in vfs_getxattr() instead of having it be part of
      posix_acl_fix_xattr_to_user().
      
      To this end we simply move the idmapped mount translation into a separate
      step performed in vfs_{g,s}etxattr() instead of in
      posix_acl_fix_xattr_{from,to}_user().
      
      To see how this fixes things let's go back to the original example. Assume
      the user chose option (1) and mounted overlayfs on top of idmapped mounts
      on the host:
      
              idmapped mount /vol/contpool/merge:      0:10000000:65536
              caller's idmapping:                      0:10000000:65536
              overlayfs idmapping (ofs->creator_cred): 0:0:4k /* initial idmapping */
      
              sys_getxattr()
              -> path_getxattr()
                 -> getxattr()
                    -> do_getxattr()
                        |> vfs_getxattr()
                        |  |> __vfs_getxattr()
                        |  |  -> handler->get == ovl_posix_acl_xattr_get()
                        |  |     -> ovl_xattr_get()
                        |  |        -> vfs_getxattr()
                        |  |           |> __vfs_getxattr()
                        |  |           |  -> handler->get() /* lower filesystem callback */
                        |  |           |> posix_acl_getxattr_idmapped_mnt()
                        |  |              {
                        |  |                              4 = make_kuid(&init_user_ns, 4);
                        |  |                       10000004 = mapped_kuid_fs(0:10000000:65536 /* lower idmapped mount */, 4);
                        |  |                       10000004 = from_kuid(&init_user_ns, 10000004);
                        |  |                       |_______________________
                        |  |              }                               |
                        |  |                                              |
                        |  |> posix_acl_getxattr_idmapped_mnt()           |
                        |     {                                           |
                        |                                                 V
                        |             10000004 = make_kuid(&init_user_ns, 10000004);
                        |             10000004 = mapped_kuid_fs(&init_user_ns /* no idmapped mount */, 10000004);
                        |             10000004 = from_kuid(&init_user_ns, 10000004);
                        |     }       |_________________________________________________
                        |                                                              |
                        |                                                              |
                        |> posix_acl_fix_xattr_to_user()                               |
                           {                                                           V
                                       10000004 = make_kuid(0:0:4k /* init_user_ns */, 10000004);
                                              /* SUCCESS */
                                              4 = from_kuid(0:10000000:65536 /* caller's idmapping */, 10000004);
                           }
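
      The fixed sequence can be checked with the same kind of userspace
      model. Again this is only a sketch: the model_* names are hypothetical,
      0:0:4k is modelled as 0:0:4294967295, and each
      posix_acl_getxattr_idmapped_mnt() call is reduced to the three
      translations shown in the trace.

```c
#include <stdint.h>

#define MODEL_INVALID_ID ((uint32_t)-1)

struct idmap_model {
    uint32_t first;       /* first id of the upper range */
    uint32_t lower_first; /* first id of the lower range */
    uint32_t count;       /* number of ids covered by the extent */
};

const struct idmap_model MODEL_INIT = { 0, 0, 4294967295u };
const struct idmap_model MODEL_CONTAINER = { 0, 10000000, 65536 };

uint32_t model_make_kuid(struct idmap_model m, uint32_t id)
{
    if (id >= m.first && id - m.first < m.count)
        return m.lower_first + (id - m.first);
    return MODEL_INVALID_ID;
}

uint32_t model_from_kuid(struct idmap_model m, uint32_t kid)
{
    if (kid >= m.lower_first && kid - m.lower_first < m.count)
        return m.first + (kid - m.lower_first);
    return MODEL_INVALID_ID;
}

/* One per-mount translation step as drawn in the trace. */
uint32_t model_mnt_step(struct idmap_model mnt, uint32_t raw)
{
    uint32_t kuid = model_make_kuid(MODEL_INIT, raw);
    uint32_t mapped = model_make_kuid(mnt, kuid);

    return model_from_kuid(MODEL_INIT, mapped);
}

uint32_t model_acl_uid_after_fix(struct idmap_model lower_mnt,
                                 struct idmap_model caller,
                                 uint32_t stored_uid)
{
    /* vfs_getxattr() on the lower mount applies the lower idmapping. */
    uint32_t raw = model_mnt_step(lower_mnt, stored_uid);

    /* vfs_getxattr() on overlayfs: no idmapped mount, value unchanged. */
    raw = model_mnt_step(MODEL_INIT, raw);

    /* posix_acl_fix_xattr_to_user(): translate for the caller. */
    return model_from_kuid(caller, model_make_kuid(MODEL_INIT, raw));
}
```

      With the lower mount and the caller both using 0:10000000:65536, the
      stored uid 4 now reaches userspace as 4 instead of 4294967295.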
      
      And similarly if the user chooses option (2) and mounts overlayfs on top of
      idmapped mounts inside the container:
      
              idmapped mount /vol/contpool/merge:      0:10000000:65536
              caller's idmapping:                      0:10000000:65536
              overlayfs idmapping (ofs->creator_cred): 0:10000000:65536
      
              sys_getxattr()
              -> path_getxattr()
                 -> getxattr()
                    -> do_getxattr()
                        |> vfs_getxattr()
                        |  |> __vfs_getxattr()
                        |  |  -> handler->get == ovl_posix_acl_xattr_get()
                        |  |     -> ovl_xattr_get()
                        |  |        -> vfs_getxattr()
                        |  |           |> __vfs_getxattr()
                        |  |           |  -> handler->get() /* lower filesystem callback */
                        |  |           |> posix_acl_getxattr_idmapped_mnt()
                        |  |              {
                        |  |                              4 = make_kuid(&init_user_ns, 4);
                        |  |                       10000004 = mapped_kuid_fs(0:10000000:65536 /* lower idmapped mount */, 4);
                        |  |                       10000004 = from_kuid(&init_user_ns, 10000004);
                        |  |                       |_______________________
                        |  |              }                               |
                        |  |                                              |
                        |  |> posix_acl_getxattr_idmapped_mnt()           |
                        |     {                                           V
                        |             10000004 = make_kuid(&init_user_ns, 10000004);
                        |             10000004 = mapped_kuid_fs(&init_user_ns /* no idmapped mount */, 10000004);
                        |             10000004 = from_kuid(&init_user_ns, 10000004);
                        |             |_________________________________________________
                        |     }                                                        |
                        |                                                              |
                        |> posix_acl_fix_xattr_to_user()                               |
                           {                                                           V
                                       10000004 = make_kuid(0:0:4k /* init_user_ns */, 10000004);
                                              /* SUCCESS */
                                              4 = from_kuid(0:10000000:65536 /* caller's idmapping */, 10000004);
                           }
      
      The last remaining problem we need to fix here is ovl_get_acl(). During
      ovl_permission() overlayfs will call:
      
              ovl_permission()
              -> generic_permission()
                 -> acl_permission_check()
                    -> check_acl()
                       -> get_acl()
                          -> inode->i_op->get_acl() == ovl_get_acl()
                             -> get_acl() /* on the underlying filesystem */
                                -> inode->i_op->get_acl() == /* lower filesystem callback */
                       -> posix_acl_permission()
      
      passing through the get_acl request to the underlying filesystem. This will
      retrieve the acls stored in the lower filesystem without taking the
      idmapping of the underlying mount into account as this would mean altering
      the cached values for the lower filesystem. The simple solution is to have
      ovl_get_acl() duplicate the ACLs, update the values according to the
      idmapped mount and return them to acl_permission_check() so they can be used
      in posix_acl_permission(). Since overlayfs doesn't cache ACLs they'll be
      released right after.
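
      The duplicate-and-translate idea can be sketched in userspace C. This
      is not the ovl_get_acl() change itself: the entry layout and every
      model_* name are hypothetical, and only the tag values follow the
      POSIX ACL tag numbering.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define MODEL_INVALID_ID ((uint32_t)-1)

/* Tag values follow the POSIX ACL numbering; the struct is a model. */
#define MODEL_ACL_USER_OBJ 0x01
#define MODEL_ACL_USER     0x02
#define MODEL_ACL_GROUP    0x08

struct acl_entry_model {
    uint16_t tag;
    uint16_t perm;
    uint32_t id;   /* only meaningful for the USER and GROUP tags */
};

struct idmap_model {
    uint32_t first;
    uint32_t lower_first;
    uint32_t count;
};

const struct idmap_model MODEL_LOWER_MNT = { 0, 10000000, 65536 };

uint32_t model_make_kuid(struct idmap_model m, uint32_t id)
{
    if (id >= m.first && id - m.first < m.count)
        return m.lower_first + (id - m.first);
    return MODEL_INVALID_ID;
}

/*
 * Return a translated copy of the cached entries and leave the cached
 * array untouched. Assumes count > 0; the caller frees the copy.
 */
struct acl_entry_model *model_acl_dup_mapped(const struct acl_entry_model *cached,
                                             size_t count,
                                             struct idmap_model mnt)
{
    struct acl_entry_model *copy = malloc(count * sizeof(*copy));
    size_t i;

    if (!copy)
        return NULL;
    memcpy(copy, cached, count * sizeof(*copy));

    for (i = 0; i < count; i++) {
        if (copy[i].tag == MODEL_ACL_USER || copy[i].tag == MODEL_ACL_GROUP)
            copy[i].id = model_make_kuid(mnt, copy[i].id);
    }
    return copy;
}
```

      The point of the copy is that the array cached for the lower
      filesystem keeps its raw values, while the permission check works on
      the translated duplicate and frees it afterwards.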
      
      Link: https://github.com/brauner/mount-idmapped/issues/9
      
      
      Cc: Seth Forshee <sforshee@digitalocean.com>
      Cc: Amir Goldstein <amir73il@gmail.com>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Aleksa Sarai <cyphar@cyphar.com>
      Cc: linux-unionfs@vger.kernel.org
      Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
      Fixes: bc70682a ("ovl: support idmapped layers")
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
      4a47c638
  21. Jun 26, 2022
    • attr: port attribute changes to new types · b27c82e1
      Christian Brauner authored
      Now that we introduced new infrastructure to increase the type safety
      for filesystems supporting idmapped mounts, port the first part of the
      vfs over to them.
      
      This ports the attribute change codepaths to rely on the new, better
      helpers using a dedicated type.
      
      Before this change we used to take a shortcut and place the actual
      values that would be written to inode->i_{g,u}id into struct iattr. This
      had the advantage that we moved idmappings mostly out of the picture
      early on but it made reasoning about changes more difficult than it
      should be.
      
      The filesystem was never explicitly told that it dealt with an idmapped
      mount. The transition to the value that needed to be stored in
      inode->i_{g,u}id appeared way too early and increased the probability of
      bugs in various codepaths.
      
      We now place the same value in struct iattr no matter if this is an
      idmapped mount or not. The vfs will only deal with type safe
      vfs{g,u}id_t. This makes it massively safer to perform permission checks
      as the type will tell us what checks we need to perform and what helpers
      we need to use.
      
      Filesystems raising FS_ALLOW_IDMAP can't simply write ia_vfs{g,u}id to
      inode->i_{g,u}id since they are different types. Instead they need to
      use the dedicated vfs{g,u}id_to_k{g,u}id() helpers that map the
      vfs{g,u}id into the filesystem.
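
      The type-safety argument can be illustrated with a userspace sketch.
      Everything here is a model: the model_* types and helpers are
      hypothetical stand-ins for the dedicated vfs{g,u}id types and
      conversion helpers described above, not the kernel definitions.

```c
#include <stdint.h>

#define MODEL_INVALID_ID ((uint32_t)-1)

struct idmap_model {
    uint32_t first;
    uint32_t lower_first;
    uint32_t count;
};

const struct idmap_model MODEL_MNT = { 0, 10000000, 65536 };

/* Distinct wrapper types: the compiler rejects mixing them up. */
typedef struct { uint32_t val; } model_kuid_t;   /* as stored in the inode */
typedef struct { uint32_t val; } model_vfsuid_t; /* as seen through the mount */

struct model_inode { model_kuid_t i_uid; };
struct model_iattr { model_vfsuid_t ia_vfsuid; };

/* Stored id -> id as seen through the mount. */
model_vfsuid_t model_make_vfsuid(struct idmap_model mnt, model_kuid_t kuid)
{
    model_vfsuid_t v = { MODEL_INVALID_ID };

    if (kuid.val >= mnt.first && kuid.val - mnt.first < mnt.count)
        v.val = mnt.lower_first + (kuid.val - mnt.first);
    return v;
}

/* Id as seen through the mount -> id to store in the filesystem. */
model_kuid_t model_vfsuid_to_kuid(struct idmap_model mnt, model_vfsuid_t vfsuid)
{
    model_kuid_t k = { MODEL_INVALID_ID };

    if (vfsuid.val >= mnt.lower_first && vfsuid.val - mnt.lower_first < mnt.count)
        k.val = mnt.first + (vfsuid.val - mnt.lower_first);
    return k;
}

void model_setattr(struct model_inode *inode, const struct model_iattr *attr,
                   struct idmap_model mnt)
{
    /*
     * "inode->i_uid = attr->ia_vfsuid;" would not compile because the
     * two struct types are incompatible, so the conversion is explicit.
     */
    inode->i_uid = model_vfsuid_to_kuid(mnt, attr->ia_vfsuid);
}
```

      Because the two ids are separate struct types, forgetting the
      conversion is a compile error in this model rather than a silent
      ownership bug.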
      
      The other nice effect is that filesystems like overlayfs don't need to
      care about idmappings explicitly anymore and can simply set up struct
      iattr accordingly directly.
      
      Link: https://lore.kernel.org/lkml/CAHk-=win6+ahs1EwLkcq8apqLi_1wXFWbrPf340zYEhObpz4jA@mail.gmail.com [1]
      Link: https://lore.kernel.org/r/20220621141454.2914719-9-brauner@kernel.org
      
      
      Cc: Seth Forshee <sforshee@digitalocean.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Aleksa Sarai <cyphar@cyphar.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: linux-fsdevel@vger.kernel.org
      Reviewed-by: Seth Forshee <sforshee@digitalocean.com>
      Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
      b27c82e1
  22. May 10, 2022
  23. Apr 28, 2022