  1. Feb 12, 2024
    • fast_dput(): handle underflows gracefully · a56aab1f
      Al Viro authored and Frieder Schrempf committed
      
      [ Upstream commit 504e08ce ]
      
      If refcount is less than 1, we should just warn, unlock dentry and
      return true, so that the caller doesn't try to do anything else.
      
      Taking care of that leaves the rest of "lockref_put_return() has
      failed" case equivalent to "decrement refcount and rejoin the
      normal slow path after the point where we grab ->d_lock".
      
      NOTE: lockref_put_return() is strictly a fastpath thing - unlike
      the rest of lockref primitives, it does not contain a fallback.
      Caller (and it looks like fast_dput() is the only legitimate one
      in the entire kernel) has to do that itself.  Reasons for
      lockref_put_return() failures:
      	* ->d_lock held by somebody
      	* refcount <= 0
      	* ... or an architecture not supporting lockref use of
      cmpxchg - sparc, anything non-SMP, config with spinlock debugging...
      
      We could add a fallback, but it would be a clumsy API - we'd have
      to distinguish between:
      	(1) refcount > 1 - decremented, lock not held on return
      	(2) refcount < 1 - left alone, probably no sense to hold the lock
      	(3) refcount is 1, no cmpxchg - decremented, lock held on return
      	(4) refcount is 1, cmpxchg supported - decremented, lock *NOT* held
      	    on return.
      We want to return with no lock held in case (4); that's the whole point of that
      thing.  We very much do not want to have the fallback in case (3) return without
      a lock, since the caller might have to retake it in that case.
      So it wouldn't be more convenient than doing the fallback in the caller and
      it would be very easy to screw up, especially since the test coverage would
      suck - no way to test (3) and (4) on the same kernel build.
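      
      Schematically, the resulting fallback in fast_dput() looks like this
      (an illustrative sketch based on the description above, not
      necessarily the verbatim upstream diff):
      
        ret = lockref_put_return(&dentry->d_lockref);
        if (unlikely(ret < 0)) {        /* fastpath failed, fall back */
                spin_lock(&dentry->d_lock);
                if (WARN_ON_ONCE(dentry->d_lockref.count <= 0)) {
                        /* underflow: warn, unlock, tell caller to stop */
                        spin_unlock(&dentry->d_lock);
                        return true;
                }
                /* decrement and rejoin the normal slow path under ->d_lock */
                dentry->d_lockref.count--;
                goto locked;
        }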
      
      Reviewed-by: Christian Brauner <brauner@kernel.org>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
  2. Sep 24, 2022
    • vfs: open inside ->tmpfile() · 863f144f
      Miklos Szeredi authored
      
      This is in preparation for adding tmpfile support to fuse, which requires
      that the tmpfile creation and opening are done as a single operation.
      
      Replace the 'struct dentry *' argument of i_op->tmpfile with
      'struct file *'.
      
      Call finish_open_simple() as the last thing in ->tmpfile() instances (may
      be omitted in the error case).
      
      Change d_tmpfile() argument to 'struct file *' as well to make callers more
      readable.
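      
      For illustration, an instance would now look roughly like this
      (foo_tmpfile() and foo_create_tmp_inode() are hypothetical names;
      finish_open_simple() is the helper this change introduces):
      
        static int foo_tmpfile(struct user_namespace *mnt_userns,
                               struct inode *dir, struct file *file,
                               umode_t mode)
        {
                /* create the tmpfile on file->f_path.dentry (fs-specific) */
                int err = foo_create_tmp_inode(dir, file->f_path.dentry, mode);
      
                /* last thing in ->tmpfile(): open the result */
                return finish_open_simple(file, err);
        }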
      
      Reviewed-by: Christian Brauner (Microsoft) <brauner@kernel.org>
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
  3. Aug 17, 2022
    • dcache: move the DCACHE_OP_COMPARE case out of the __d_lookup_rcu loop · ae2a8236
      Linus Torvalds authored
      
      __d_lookup_rcu() is one of the hottest functions in the kernel on
      certain loads, and it is complicated by filesystems that might want to
      have their own name compare function.
      
      We can improve code generation by moving the test of DCACHE_OP_COMPARE
      outside the loop, which makes the loop itself much simpler, at the cost
      of some code duplication.  But both cases end up being simpler, and the
      "native" direct case-sensitive compare particularly so.
      
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  4. Jul 30, 2022
    • fs/dcache: Move wakeup out of i_dir_seq write held region. · 50417d22
      Sebastian Andrzej Siewior authored
      
      __d_add() and __d_move() wake up waiters on dentry::d_wait from within
      the i_dir_seq write held region.  This violates the PREEMPT_RT
      constraints as the wake up acquires wait_queue_head::lock which is a
      "sleeping" spinlock on RT.
      
      There is no requirement to do so. __d_lookup_unhash() has cleared
      DCACHE_PAR_LOOKUP and dentry::d_wait and returned the now unreachable wait
      queue head pointer to the caller, so the actual wake up can be postponed
      until the i_dir_seq write side critical section is left. The only
      requirement is that dentry::lock is held across the whole sequence
      including the wake up. The previous commit includes an analysis why this
      is considered safe.
      
      Move the wake up past end_dir_add() which leaves the i_dir_seq write side
      critical section and enables preemption.
      
      For non RT kernels there is no difference because preemption is still
      disabled due to dentry::lock being held, but it shortens the time between
      wake up and unlocking dentry::lock, which reduces the contention for the
      woken up waiter.
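      
      After the change the write-side exit and the wakeup are ordered like
      this (sketch, close to but not guaranteed to match the upstream code;
      on !PREEMPT_RT the preempt_enable() is compiled out):
      
        static inline void end_dir_add(struct inode *dir, unsigned int n,
                                       wait_queue_head_t *d_wait)
        {
                /* leave the i_dir_seq write side critical section */
                smp_store_release(&dir->i_dir_seq, n + 2);
                if (IS_ENABLED(CONFIG_PREEMPT_RT))
                        preempt_enable();
                /* wake up waiters, still under dentry::d_lock */
                wake_up_all(d_wait);
        }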
      
      Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • fs/dcache: Move the wakeup from __d_lookup_done() to the caller. · 45f78b0a
      Sebastian Andrzej Siewior authored
      
      __d_lookup_done() wakes waiters on dentry->d_wait.  On PREEMPT_RT we are
      not allowed to do that with preemption disabled, since the wakeup
      acquires wait_queue_head::lock, which is a "sleeping" spinlock on RT.
      
      Calling it under dentry->d_lock is not a problem, since that is also a
      "sleeping" spinlock on the same configs.  Unfortunately, two of its
      callers (__d_add() and __d_move()) are holding more than just ->d_lock
      and that needs to be dealt with.
      
      The key observation is that wakeup can be moved to any point before
      dropping ->d_lock.
      
      As a first step to solve this, move the wake up outside of the
      hlist_bl_lock() held section.
      
      This is safe because:
      
      Waiters get inserted into ->d_wait only after they'd taken ->d_lock
      and observed DCACHE_PAR_LOOKUP in flags.  As long as they are
      woken up (and evicted from the queue) between the moment __d_lookup_done()
      has removed DCACHE_PAR_LOOKUP and dropping ->d_lock, we are safe,
      since the waitqueue ->d_wait points to won't get destroyed without
      having __d_lookup_done(dentry) called (under ->d_lock).
      
      ->d_wait is set only by d_alloc_parallel() and only in case when
      it returns a freshly allocated in-lookup dentry.  Whenever that happens,
      we are guaranteed that __d_lookup_done() will be called for resulting
      dentry (under ->d_lock) before the wq in question gets destroyed.
      
      With two exceptions, wq lives in the call frame of the caller of
      d_alloc_parallel() and we have an explicit d_lookup_done() on the
      resulting in-lookup dentry before we leave that frame.
      
      One of those exceptions is nfs_call_unlink(), where wq is embedded into
      (dynamically allocated) struct nfs_unlinkdata.  It is destroyed in
      nfs_async_unlink_release() after an explicit d_lookup_done() on the
      dentry wq went into.
      
      Remaining exception is d_add_ci(). There wq is what we'd found in
      ->d_wait of d_add_ci() argument. Callers of d_add_ci() are two
      instances of ->d_lookup() and they must have been given an in-lookup
      dentry.  Which means that they'd been called by __lookup_slow() or
      lookup_open(), with wq in the call frame of one of those.
      
      Result of d_alloc_parallel() in d_add_ci() is fed to
      d_splice_alias(), which either returns non-NULL (and d_add_ci() does
      d_lookup_done()) or feeds dentry to __d_add() that will do
      __d_lookup_done() under ->d_lock.  That concludes the analysis.
      
      Let __d_lookup_unhash():
      
        1) Lock the lookup hash and clear DCACHE_PAR_LOOKUP
        2) Unhash the dentry
        3) Retrieve and clear dentry::d_wait
        4) Unlock the hash and return the retrieved waitqueue head pointer
        5) Let the caller handle the wake up.
        6) Rename __d_lookup_done() to __d_lookup_unhash_wake() to enforce
           build failures for OOT code that used __d_lookup_done() and is not
           aware of the new return value.
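      
      A sketch of the resulting helper (close to, but not guaranteed to
      match, the upstream code):
      
        static wait_queue_head_t *__d_lookup_unhash(struct dentry *dentry)
        {
                wait_queue_head_t *d_wait;
                struct hlist_bl_head *b;
      
                lockdep_assert_held(&dentry->d_lock);
      
                b = in_lookup_hash(dentry->d_parent, dentry->d_name.hash);
                hlist_bl_lock(b);
                dentry->d_flags &= ~DCACHE_PAR_LOOKUP;
                __hlist_bl_del(&dentry->d_u.d_in_lookup_hash);
                d_wait = dentry->d_wait;
                dentry->d_wait = NULL;
                hlist_bl_unlock(b);
                INIT_HLIST_NODE(&dentry->d_u.d_alias);
                INIT_LIST_HEAD(&dentry->d_lru);
                return d_wait;          /* caller does the wake up later */
        }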
      
      This does not yet solve the PREEMPT_RT problem completely because
      preemption is still disabled due to i_dir_seq being held for write. This
      will be addressed in subsequent steps.
      
      An alternative solution would be to switch the waitqueue to a simple
      waitqueue, but aside of Linus not being a fan of them, moving the wake up
      closer to the place where dentry::lock is unlocked reduces lock contention
      time for the woken up waiter.
      
      Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Link: https://lkml.kernel.org/r/20220613140712.77932-3-bigeasy@linutronix.de
      
      
      Signed-off-by: default avatarAl Viro <viro@zeniv.linux.org.uk>
    • fs/dcache: Disable preemption on i_dir_seq write side on PREEMPT_RT · cf634d54
      Sebastian Andrzej Siewior authored
      
      i_dir_seq is a sequence counter with a lock which is represented by the
      lowest bit. The writer atomically updates the counter which ensures that it
      can be modified by only one writer at a time. This requires preemption to
      be disabled across the write side critical section.
      
      On !PREEMPT_RT kernels this is implicitly ensured by the caller acquiring
      dentry::lock. On PREEMPT_RT kernels spin_lock() does not disable preemption,
      which means that a preempting writer or reader would live lock. It's
      therefore required to disable preemption explicitly.
      
      An alternative solution would be to replace i_dir_seq with a seqlock_t for
      PREEMPT_RT, but that comes with its own set of problems due to arbitrary
      lock nesting. A pure sequence count with an associated spinlock is not
      possible because the locks held by the caller are not necessarily related.
      
      As the critical section is small, disabling preemption is a sensible
      solution.
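      
      Sketch of the write-side entry (close to the upstream change):
      
        static inline unsigned start_dir_add(struct inode *dir)
        {
                /*
                 * The caller holds a spinlock (dentry::d_lock). On !PREEMPT_RT
                 * kernels spin_lock() implicitly disables preemption; on
                 * PREEMPT_RT it has to be done explicitly.
                 */
                if (IS_ENABLED(CONFIG_PREEMPT_RT))
                        preempt_disable();
      
                for (;;) {
                        unsigned n = dir->i_dir_seq;
                        if (!(n & 1) && cmpxchg(&dir->i_dir_seq, n, n + 1) == n)
                                return n;
                        cpu_relax();
                }
        }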
      
      Reported-by: <Oleg.Karfich@wago.com>
      Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Link: https://lkml.kernel.org/r/20220613140712.77932-2-bigeasy@linutronix.de
      
      
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • d_add_ci(): make sure we don't miss d_lookup_done() · 40a3cb0d
      Al Viro authored
      
      All callers of d_alloc_parallel() must make sure that resulting
      in-lookup dentry (if any) will encounter __d_lookup_done() before
      the final dput().  d_add_ci() might end up creating in-lookup
      dentries; they are fed to d_splice_alias(), which will normally
      make sure they meet __d_lookup_done().  However, it is possible
      to end up with d_splice_alias() failing with ERR_PTR(-ELOOP)
      without having done so.  It takes a corrupted ntfs or case-insensitive
      xfs image, but neither should end up with memory corruption...
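      
      The fix, schematically (in d_add_ci(), after d_splice_alias(); sketch
      based on the description above):
      
        res = d_splice_alias(inode, found);
        if (res) {
                /* covers ERR_PTR(-ELOOP) as well as a found alias */
                d_lookup_done(found);
                dput(found);
                return res;
        }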
      
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
  5. Jan 16, 2021
    • new helper: d_find_alias_rcu() · bca585d2
      Al Viro authored
      
      similar to d_find_alias(inode), except that
      	* the caller must be holding rcu_read_lock()
      	* inode must not be freed until matching rcu_read_unlock()
      	* result is *NOT* pinned and can only be dereferenced until
      the matching rcu_read_unlock().
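      
      Illustrative use (a hypothetical caller; do_something_rcu_safe() is
      made up):
      
        rcu_read_lock();
        dentry = d_find_alias_rcu(inode);       /* result is NOT pinned */
        if (dentry) {
                /* may be dereferenced only until rcu_read_unlock() */
                do_something_rcu_safe(dentry);
        }
        rcu_read_unlock();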
      
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
  6. Dec 10, 2020
    • fs: Kill DCACHE_DONTCACHE dentry even if DCACHE_REFERENCED is set · 77573fa3
      Hao Li authored
      
      If DCACHE_REFERENCED is set, fast_dput() will return true, and then
      retain_dentry() has no chance to check DCACHE_DONTCACHE. As a result,
      the dentry won't be killed and the corresponding inode can't be evicted.
      In the following example, the DAX policy can't take effect unless we
      do a manual drop_caches.
      
        # DCACHE_LRU_LIST will be set
        echo abcdefg > test.txt
      
        # DCACHE_REFERENCED will be set and DCACHE_DONTCACHE can't do anything
        xfs_io -c 'chattr +x' test.txt
      
        # Drop caches to make the DAX change take effect
        echo 2 > /proc/sys/vm/drop_caches
      
      What this patch does is prevent fast_dput() from returning true if
      DCACHE_DONTCACHE is set. Then retain_dentry() will detect
      DCACHE_DONTCACHE and return false. As a result, the dentry will be
      killed and the inode will be evicted. In this way, if we change a
      per-file DAX policy, it will take effect automatically after the file
      is closed by all processes.
      
      I also add some comments to make the code more clear.
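      
      The change itself, schematically: extend the flag mask in fast_dput()
      so a set DCACHE_DONTCACHE can never match the "nothing to do" pattern
      (sketch, not the verbatim diff):
      
        smp_rmb();
        d_flags = READ_ONCE(dentry->d_flags);
        d_flags &= DCACHE_REFERENCED | DCACHE_LRU_LIST |
                   DCACHE_DISCONNECTED | DCACHE_DONTCACHE;
      
        /* Nothing to do? Dropping the reference was all we needed? */
        if (d_flags == (DCACHE_REFERENCED | DCACHE_LRU_LIST) &&
            !d_unhashed(dentry))
                return true;
        /* with DCACHE_DONTCACHE set we never match above, so we take the
         * slow path, where retain_dentry() refuses to keep the dentry */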
      
      Signed-off-by: Hao Li <lihao2018.fnst@cn.fujitsu.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Reviewed-by: Ira Weiny <ira.weiny@intel.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
  7. Nov 15, 2019
    • fs/namei.c: fix missing barriers when checking positivity · 2fa6b1e0
      Al Viro authored
      
      Pinned negative dentries can, generally, be made positive
      by another thread.  Conditions that prevent that are
      	* ->d_lock on dentry in question
      	* parent directory held at least shared
      	* nobody else could have observed the address of dentry
      Most of the places working with those fall into one of those
      categories; however, d_lookup() and friends need to be used
      with some care.  Fortunately, there's not a lot of call sites,
      and with few exceptions all of those fall under one of the
      cases above.
      
      Exceptions are all in fs/namei.c - in lookup_fast(), lookup_dcache()
      and mountpoint_last().  Another one is lookup_slow() - there
      dcache lookup is done with parent held shared, but the result
      is used after we'd drop the lock.  The same happens in do_last() -
      the lookup (in lookup_one()) is done with parent locked, but
      result is used after unlocking.
      
      lookup_fast(), do_last() and mountpoint_last() flat-out reject
      negatives.
      
      Most of lookup_dcache() calls are made with parent locked at least
      shared; the only exception is lookup_one_len_unlocked().  It might
      return pinned negative, needs serious care from callers.  Fortunately,
      almost nobody calls it directly anymore; all but two callers have
      converted to lookup_positive_unlocked(), which rejects negatives.
      
      lookup_slow() is called by the same lookup_one_len_unlocked() (see
      above), mountpoint_last() and walk_component().  In those two negatives
      are rejected.
      
      In other words, there is a small set of places where we need to
      check carefully if a pinned potentially negative dentry is, in
      fact, positive.  After that check we want to be sure that both
      ->d_inode and type bits in ->d_flags are stable and observed.
      The set consists of follow_managed() (where the rejection happens
      for lookup_fast(), walk_component() and do_last()), mountpoint_last()
      and lookup_positive_unlocked().
      
      Solution:
      	1) transition from negative to positive (in __d_set_inode_and_type())
      stores ->d_inode, then uses smp_store_release() to set ->d_flags type bits.
      	2) the aforementioned 3 places in fs/namei.c fetch ->d_flags with
      smp_load_acquire() and bugger off if its type bits say "negative".
      That way anyone downstream of those checks knows the dentry is positive
      and pinned, with ->d_inode and the type bits of ->d_flags stable and observed.
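      
      Both sides, schematically (sketch, not the verbatim diff;
      d_flags_negative() tests the type bits for DCACHE_MISS_TYPE):
      
        /* writer: __d_set_inode_and_type() */
        dentry->d_inode = inode;
        flags = READ_ONCE(dentry->d_flags);
        flags &= ~(DCACHE_ENTRY_TYPE | DCACHE_FALLTHRU);
        flags |= type_flags;
        smp_store_release(&dentry->d_flags, flags);  /* publishes ->d_inode */
      
        /* reader, e.g. in lookup_positive_unlocked(): */
        if (d_flags_negative(smp_load_acquire(&dentry->d_flags))) {
                /* still negative - reject; otherwise ->d_inode is stable */
        }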
      
      I considered splitting off d_lookup_positive(), so that the checks could
      be done right there, under ->d_lock.  However, that leads to massive
      duplication of rather subtle code in fs/namei.c and fs/dcache.c.  It's
      worse than it might seem, thanks to autofs ->d_manage() getting involved ;-/
      No matter what, autofs_d_manage()/autofs_d_automount() must live with
      the possibility of pinned negative dentry passed their way, becoming
      positive under them - that's the intended behaviour when lookup comes
      in the middle of automount in progress, so we can't keep them out of
      the area that has to deal with those, more's the pity...
      
      Reported-by: Ritesh Harjani <riteshh@linux.ibm.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • fix dget_parent() fastpath race · e8400933
      Al Viro authored
      
      We are overoptimistic about taking the fast path there; seeing
      the same value in ->d_parent after having grabbed a reference
      to that parent does *not* mean that it has remained our parent
      all along.
      
      That wouldn't be a big deal (in the end it is our parent and
      we have grabbed the reference we are about to return), but...
      the situation with barriers is messed up.
      
      We might have hit the following sequence:
      
      d is a dentry of /tmp/a/b
      CPU1:					CPU2:
      parent = d->d_parent (i.e. dentry of /tmp/a)
      					rename /tmp/a/b to /tmp/b
      					rmdir /tmp/a, making its dentry negative
      grab reference to parent,
      end up with cached parent->d_inode (NULL)
      					mkdir /tmp/a, rename /tmp/b to /tmp/a/b
      recheck d->d_parent, which is back to original
      decide that everything's fine and return the reference we'd got.
      
      The trouble is, caller (on CPU1) will observe dget_parent()
      returning an apparently negative dentry.  It actually is positive,
      but CPU1 has stale ->d_inode cached.
      
      Use d->d_seq to see if it has been moved instead of rechecking ->d_parent.
      NOTE: we are *NOT* going to retry on any kind of ->d_seq mismatch;
      we just go into the slow path in such case.  We don't wait for ->d_seq
      to become even either - again, if we are racing with renames, we
      can bloody well go to slow path anyway.
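      
      The fixed fastpath, schematically (close to the upstream code, hedged):
      
        struct dentry *ret;
        unsigned seq;
        int gotref;
      
        rcu_read_lock();
        seq = raw_seqcount_begin(&dentry->d_seq);
        ret = READ_ONCE(dentry->d_parent);
        gotref = lockref_get_not_zero(&ret->d_lockref);
        rcu_read_unlock();
        if (likely(gotref)) {
                if (!read_seqcount_retry(&dentry->d_seq, seq))
                        return ret;     /* never moved - reference is good */
                dput(ret);              /* any mismatch: take the slow path */
        }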
      
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
  8. Oct 09, 2019
    • locking/lockdep: Remove unused @nested argument from lock_release() · 5facae4f
      Qian Cai authored
      
      Since the following commit:
      
        b4adfe8e ("locking/lockdep: Remove unused argument in __lock_release")
      
      @nested is no longer used in lock_release(), so remove it from all
      lock_release() calls and friends.
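      
      A typical call-site change looks like this (illustrative before/after):
      
        - lock_release(&lock->dep_map, 1, _RET_IP_);   /* @nested was ignored */
        + lock_release(&lock->dep_map, _RET_IP_);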
      
      Signed-off-by: Qian Cai <cai@lca.pw>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Will Deacon <will@kernel.org>
      Acked-by: Daniel Vetter <daniel.vetter@ffwll.ch>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: airlied@linux.ie
      Cc: akpm@linux-foundation.org
      Cc: alexander.levin@microsoft.com
      Cc: daniel@iogearbox.net
      Cc: davem@davemloft.net
      Cc: dri-devel@lists.freedesktop.org
      Cc: duyuyang@gmail.com
      Cc: gregkh@linuxfoundation.org
      Cc: hannes@cmpxchg.org
      Cc: intel-gfx@lists.freedesktop.org
      Cc: jack@suse.com
      Cc: jlbec@evilplan.org
      Cc: joonas.lahtinen@linux.intel.com
      Cc: joseph.qi@linux.alibaba.com
      Cc: jslaby@suse.com
      Cc: juri.lelli@redhat.com
      Cc: maarten.lankhorst@linux.intel.com
      Cc: mark@fasheh.com
      Cc: mhocko@kernel.org
      Cc: mripard@kernel.org
      Cc: ocfs2-devel@oss.oracle.com
      Cc: rodrigo.vivi@intel.com
      Cc: sean@poorly.run
      Cc: st@kernel.org
      Cc: tj@kernel.org
      Cc: tytso@mit.edu
      Cc: vdavydov.dev@gmail.com
      Cc: vincent.guittot@linaro.org
      Cc: viro@zeniv.linux.org.uk
      Link: https://lkml.kernel.org/r/1568909380-32199-1-git-send-email-cai@lca.pw
      
      
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  9. Jul 10, 2019
    • Teach shrink_dcache_parent() to cope with mixed-filesystem shrink lists · 9bdebc2b
      Al Viro authored
      
      Currently, running into a shrink list that contains dentries from different
      filesystems can cause several unpleasant things for shrink_dcache_parent()
      and for umount(2).
      
      The first problem is that there's a window during shrink_dentry_list()
      between the moment __dentry_kill() takes a victim out and the moment the
      reference to its parent is dropped.  During that window the parent looks
      like a genuine busy dentry.  shrink_dcache_parent() (or, worse yet,
      shrink_dcache_for_umount()) coming at that time will see no eviction
      candidates and no indication that it needs to wait for some
      shrink_dentry_list() to proceed further.
      
      That applies for any shrink list that might intersect with the subtree we are
      trying to shrink; the only reason it does not blow up on umount(2) in the mainline
      is that we unregister the memory shrinker before hitting shrink_dcache_for_umount().
      
      Another problem happens if something in a mixed-filesystem shrink list
      gets stuck in e.g. iput(), causing the umount of an unrelated fs to spin,
      waiting for the stuck shrinker to get around to our dentries.
      
      Solution:
              1) have shrink_dentry_list() decrement the parent's refcount and
      make sure it's on a shrink list (ours unless it already had been on some
      other) before calling __dentry_kill().  That eliminates the window when
      shrink_dcache_parent() would've blown past the entire subtree without
      noticing anything with zero refcount not on shrink lists.
      	2) when shrink_dcache_parent() has found no eviction candidates,
      but some dentries are still sitting on shrink lists, rather than
      repeating the scan in hope that shrinkers have progressed, scan looking
      for something on shrink lists with zero refcount.  If such a thing is
      found, grab rcu_read_lock() and stop the scan, with caller locking
      it for eviction, dropping out of RCU and doing __dentry_kill(), with
      the same treatment for parent as shrink_dentry_list() would do.
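      
      Point (1) in rough code form (a schematic sketch only; on_shrink_list()
      and our_list are hypothetical names and locking details are elided):
      
        /* in shrink_dentry_list(), before killing the victim: */
        parent = victim->d_parent;
        if (parent != victim) {
                spin_lock(&parent->d_lock);
                if (!--parent->d_lockref.count && !on_shrink_list(parent))
                        d_shrink_add(parent, &our_list); /* keep it visible */
                spin_unlock(&parent->d_lock);
        }
        __dentry_kill(victim);  /* no window with unaccounted zero refcount */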
      
      Note that right now mixed-filesystem shrink lists do not occur, so this
      is not a mainline bug.  However, there's a bunch of uses for such
      beasts (e.g. the "try and evict everything we can out of given page"
      patches; there are potential uses in mount-related code, considerably
      simplifying the life in fs/namespace.c, etc.)
      
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
  10. Jun 20, 2019
    • fsnotify: move fsnotify_nameremove() hook out of d_delete() · 49246466
      Amir Goldstein authored
      
      d_delete() was piggybacked for the fsnotify_nameremove() hook when
      in fact not all callers of d_delete() care about fsnotify events.
      
      For all callers of d_delete() that may be interested in fsnotify events,
      we made sure to call one of fsnotify_{unlink,rmdir}() hooks before
      calling d_delete().
      
      Now we can move the fsnotify_nameremove() call from d_delete() to the
      fsnotify_{unlink,rmdir}() hooks.
      
      Two explicit calls to fsnotify_nameremove() from nfs/afs sillyrename
      are also removed. This will cause a change of behavior - nfs/afs will
      NOT generate an fsnotify delete event when renaming over a positive
      dentry.  This change is desirable, because it is consistent with the
      behavior of all other filesystems.
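      
      The resulting hook, schematically (close to the upstream helper in
      include/linux/fsnotify.h; fsnotify_rmdir() is analogous with isdir=1):
      
        static inline void fsnotify_unlink(struct inode *dir, struct dentry *dentry)
        {
                /* Expected to be called before d_delete() */
                WARN_ON_ONCE(d_is_negative(dentry));
      
                fsnotify_nameremove(dentry, 0);
        }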
      
      Signed-off-by: Amir Goldstein <amir73il@gmail.com>
      Signed-off-by: Jan Kara <jack@suse.cz>
  11. Apr 09, 2019
    • unexport d_alloc_pseudo() · ab1152dd
      Al Viro authored
      
      No modular uses since the introduction of alloc_file_pseudo(),
      and the only non-modular user not in alloc_file_pseudo()
      had actually been wrong - should've been d_alloc_anon().
      
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • dcache: sort the freeing-without-RCU-delay mess for good. · 5467a68c
      Al Viro authored
      
      For lockless accesses to dentries we don't have pinned we rely
      (among other things) upon having an RCU delay between dropping
      the last reference and actually freeing the memory.
      
      On the other hand, for things like pipes and sockets we neither
      do that kind of lockless access, nor want to deal with the
      overhead of an RCU delay every time a socket gets closed.
      
      So delay was made optional - setting DCACHE_RCUACCESS in ->d_flags
      made sure it would happen.  We tried to avoid setting it unless
      we knew we needed it.  Unfortunately, that had led to a recurring
      class of bugs, in which we missed the need to set it.
      
      We only really need it for dentries that are created by
      d_alloc_pseudo(), so let's not bother with trying to be smart -
      just make having an RCU delay the default.  The ones that do
      *not* need it set the replacement flag (DCACHE_NORCU) and we'd
      better use that sparingly.  d_alloc_pseudo() is the only
      such user right now.
      
      FWIW, the race that finally prompted that switch had been
      between __lock_parent() of immediate subdirectory of what's
      currently the root of a disconnected tree (e.g. from
      open-by-handle in progress) racing with d_splice_alias()
      elsewhere picking another alias for the same inode, either
      on outright corrupted fs image, or (in case of open-by-handle
      on NFS) that subdirectory having been just moved on server.
      It's not easy to hit, so the sky is not falling, but that's
      not the first race on similar missed cases and the logic
      for setting DCACHE_RCUACCESS had gotten ridiculously
      convoluted.
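      
      The freeing side, schematically (close to the upstream result, hedged):
      
        static void dentry_free(struct dentry *dentry)
        {
                ...
                /* if dentry was never visible to RCU, immediate free is OK */
                if (dentry->d_flags & DCACHE_NORCU)
                        __d_free(&dentry->d_u.d_rcu);
                else
                        call_rcu(&dentry->d_u.d_rcu, __d_free);
        }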
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
  12. Jan 30, 2019
    • fs/dcache: Track & report number of negative dentries · af0c9af1
      Waiman Long authored
      
      The current dentry number tracking code doesn't distinguish between
      positive & negative dentries.  It just reports the total number of
      dentries in the LRU lists.
      
      As an excessive number of negative dentries can have an impact on system
      performance, it is wise to track the number of positive and negative
      dentries separately.
      
      This patch adds tracking for the total number of negative dentries in
      the system LRU lists and reports it in the 5th field of the
      /proc/sys/fs/dentry-state file.  The number, however, includes neither
      negative dentries that are in flight but not yet on the LRU, nor those
      on the shrinker lists, which are on their way out anyway.
      
      The number of positive dentries in the LRU lists can be roughly found by
      subtracting the number of negative dentries from the unused count.
      
      Matthew Wilcox had confirmed that the dummy array has been there since
      the introduction of the dentry_stat structure in 2.1.60, probably for
      future extension; its entries were not replacements of pre-existing
      fields.  So no sane application that reads /proc/sys/fs/dentry-state
      will misbehave if the last 2 fields of the sysctl parameter become
      non-zero.  IOW, it is safe to use one of the dummy array entries for
      the negative dentry count.
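      
      Reading the new field (the numbers below are made up for illustration):
      
        # cat /proc/sys/fs/dentry-state
        86619  68401  45  0  41912  0
        # fields: nr_dentry nr_unused age_limit want_pages nr_negative (dummy)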
      
      Signed-off-by: Waiman Long <longman@redhat.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • fs/dcache: Fix incorrect nr_dentry_unused accounting in shrink_dcache_sb() · 1dbd449c
      Waiman Long authored
      
      The nr_dentry_unused per-cpu counter tracks dentries in both the LRU
      lists and the shrink lists where the DCACHE_LRU_LIST bit is set.
      
      The shrink_dcache_sb() function moves dentries from the LRU list to a
      shrink list and subtracts the dentry count from nr_dentry_unused.  This
      is incorrect as the nr_dentry_unused count will also be decremented in
      shrink_dentry_list() via d_shrink_del().
      
      To fix this double decrement, the decrement in the shrink_dcache_sb()
      function is taken out.
      
      Fixes: 4e717f5c ("list_lru: remove special case function list_lru_dispose_all.")
      Cc: stable@kernel.org
      Signed-off-by: Waiman Long <longman@redhat.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  13. Aug 17, 2018
    • fs/dcache.c: fix kmemcheck splat at take_dentry_name_snapshot() · 6cd00a01
      Tetsuo Handa authored
      Since only dentry->d_name.len + 1 bytes out of DNAME_INLINE_LEN bytes
      are initialized at __d_alloc(), we can't copy the whole size
      unconditionally.
      
       WARNING: kmemcheck: Caught 32-bit read from uninitialized memory (ffff8fa27465ac50)
       636f6e66696766732e746d70000000000010000000000000020000000188ffff
        i i i i i i i i i i i i i u u u u u u u u u u i i i i i u u u u
                                        ^
       RIP: 0010:take_dentry_name_snapshot+0x28/0x50
       RSP: 0018:ffffa83000f5bdf8 EFLAGS: 00010246
       RAX: 0000000000000020 RBX: ffff8fa274b20550 RCX: 0000000000000002
       RDX: ffffa83000f5be40 RSI: ffff8fa27465ac50 RDI: ffffa83000f5be60
       RBP: ffffa83000f5bdf8 R08: ffffa83000f5be48 R09: 0000000000000001
       R10: ffff8fa27465ac00 R11: ffff8fa27465acc0 R12: ffff8fa27465ac00
       R13: ffff8fa27465acc0 R14: 0000000000000000 R15: 0000000000000000
       FS:  00007f79737ac8c0(0000) GS:ffffffff8fc30000(0000) knlGS:0000000000000000
       CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
       CR2: ffff8fa274c0b000 CR3: 0000000134aa7002 CR4: 00000000000606f0
        take_dentry_name_snapshot+0x28/0x50
        vfs_rename+0x128/0x870
        SyS_rename+0x3b2/0x3d0
        entry_SYSCALL_64_fastpath+0x1a/0xa4
        0xffffffffffffffff
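      
      The fix, as a sketch of the change in take_dentry_name_snapshot():
      copy only the initialized part of the inline name, not the whole
      DNAME_INLINE_LEN buffer.
      
        -       memcpy(name->inline_name, dentry->d_iname, DNAME_INLINE_LEN);
        +       memcpy(name->inline_name, dentry->d_iname,
        +              dentry->d_name.len + 1);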
      
      Link: http://lkml.kernel.org/r/201709131912.GBG39012.QMJLOVFSFFOOtH@I-love.SAKURA.ne.jp
      
      
      Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Vegard Nossum <vegard.nossum@gmail.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  14. Aug 09, 2018
    • make sure that __dentry_kill() always invalidates d_seq, unhashed or not · 4c0d7cd5
      Al Viro authored
      
      RCU pathwalk relies upon the assumption that anything that changes
      ->d_inode of a dentry will invalidate its ->d_seq.  That's almost
      true - the one exception is that the final dput() of an already unhashed
      dentry does *not* touch ->d_seq at all.  Unhashing does, though,
      so for anything we'd found by RCU dcache lookup we are fine.
      Unfortunately, we can *start* with an unhashed dentry or jump into
      it.
      
      We could try and be careful in the (few) places where that could
      happen.  Or we could just make the final dput() invalidate the damn
      thing, unhashed or not.  The latter is much simpler and easier to
      backport, so let's do it that way.
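      
      Schematically, the fix drops the "skip ->d_seq if unhashed" optimization
      on the kill path (a sketch of dentry_unlink_inode(), not the verbatim
      diff):
      
        static void dentry_unlink_inode(struct dentry *dentry)
        {
                struct inode *inode = dentry->d_inode;
      
                /* bump ->d_seq unconditionally - even if already unhashed */
                raw_write_seqcount_begin(&dentry->d_seq);
                __d_clear_type_and_inode(dentry);
                hlist_del_init(&dentry->d_u.d_alias);
                raw_write_seqcount_end(&dentry->d_seq);
                ...
        }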
      
      Reported-by: "Dae R. Jeong" <threeearcat@gmail.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
  15. Aug 06, 2018
    • root dentries need RCU-delayed freeing · 90bad5e0
      Al Viro authored
      
      Since mountpoint crossing can happen without leaving lazy mode,
      root dentries do need the same protection against having their
      memory freed without RCU delay as everything else in the tree.
      
      It's partially hidden by RCU delay between detaching from the
      mount tree and dropping the vfsmount reference, but the starting
      point of pathwalk can be on an already detached mount, in which
      case umount-caused RCU delay has already passed by the time the
      lazy pathwalk grabs rcu_read_lock().  If the starting point
      happens to be at the root of that vfsmount *and* that vfsmount
      covers the entire filesystem, we get trouble.
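      
      The fix in d_make_root(), schematically (a sketch; flag name per the
      pre-DCACHE_NORCU scheme that was current at the time):
      
        res = d_alloc_anon(root_inode->i_sb);
        if (res) {
                res->d_flags |= DCACHE_RCUACCESS;  /* RCU-delay the freeing */
                d_instantiate(res, root_inode);
        }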
      
      Fixes: 48a066e7 ("RCU'd vfsmounts")
      Cc: stable@vger.kernel.org
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>