Skip to content
Snippets Groups Projects
  • Hugh Dickins's avatar
    f8f931bb
    mm/thp: fix deferred split unqueue naming and locking · f8f931bb
    Hugh Dickins authored
    Recent changes are putting more pressure on THP deferred split queues:
    under load revealing long-standing races, causing list_del corruptions,
    "Bad page state"s and worse (I keep BUGs in both of those, so usually
    don't get to see how badly they end up without).  The relevant recent
    changes being 6.8's mTHP, 6.10's mTHP swapout, and 6.12's mTHP swapin,
    improved swap allocation, and underused THP splitting.
    
    Before fixing locking: rename misleading folio_undo_large_rmappable(),
    which does not undo large_rmappable, to folio_unqueue_deferred_split(),
    which is what it does.  But that and its out-of-line __callee are mm
    internals of very limited usability: add comment and WARN_ON_ONCEs to
    check usage; and return a bool to say if a deferred split was unqueued,
    which can then be used in WARN_ON_ONCEs around safety checks (sparing
    callers the arcane conditionals in __folio_unqueue_deferred_split()).
    
    Just omit the folio_unqueue_deferred_split() from free_unref_folios(), all
    of whose callers now call it beforehand (and if any forget then bad_page()
    will tell) - except for its caller put_pages_list(), which itself no
    longer has any callers (and will be deleted separately).
    
    Swapout: mem_cgroup_swapout() has been resetting folio->memcg_data 0
    without checking and unqueueing a THP folio from deferred split list;
    which is unfortunate, since the split_queue_lock depends on the memcg
    (when memcg is enabled); so swapout has been unqueueing such THPs later,
    when freeing the folio, using the pgdat's lock instead: potentially
    corrupting the memcg's list.  __remove_mapping() has frozen refcount to 0
    here, so no problem with calling folio_unqueue_deferred_split() before
    resetting memcg_data.
    
    That goes back to 5.4 commit 87eaceb3 ("mm: thp: make deferred split
    shrinker memcg aware"): which included a check on swapcache before adding
    to deferred queue, but no check on deferred queue before adding THP to
    swapcache.  That worked fine with the usual sequence of events in reclaim
    (though there were a couple of rare ways in which a THP on deferred queue
    could have been swapped out), but 6.12 commit dafff3f4 ("mm: split
    underused THPs") avoids splitting underused THPs in reclaim, which makes
    swapcache THPs on deferred queue commonplace.
    
    Keep the check on swapcache before adding to deferred queue?  Yes: it is
    no longer essential, but preserves the existing behaviour, and is likely
    to be a worthwhile optimization (vmstat showed much more traffic on the
    queue under swapping load if the check was removed); update its comment.
    
    Memcg-v1 move (deprecated): mem_cgroup_move_account() has been changing
    folio->memcg_data without checking and unqueueing a THP folio from the
    deferred list, sometimes corrupting "from" memcg's list, like swapout. 
    Refcount is non-zero here, so folio_unqueue_deferred_split() can only be
    used in a WARN_ON_ONCE to validate the fix, which must be done earlier:
    mem_cgroup_move_charge_pte_range() first try to split the THP (splitting
    of course unqueues), or skip it if that fails.  Not ideal, but moving
    charge has been requested, and khugepaged should repair the THP later:
    nobody wants new custom unqueueing code just for this deprecated case.
    
    The 87eaceb3 commit did have the code to move from one deferred list
    to another (but was not conscious of its unsafety while refcount non-0);
    but that was removed by 5.6 commit fac0516b ("mm: thp: don't need care
    deferred split queue in memcg charge move path"), which argued that the
    existence of a PMD mapping guarantees that the THP cannot be on a deferred
    list.  As above, false in rare cases, and now commonly false.
    
    Backport to 6.11 should be straightforward.  Earlier backports must take
    care that other _deferred_list fixes and dependencies are included.  There
    is not a strong case for backports, but they can fix cornercases.
    
    Link: https://lkml.kernel.org/r/8dc111ae-f6db-2da7-b25c-7a20b1effe3b@google.com
    
    
    Fixes: 87eaceb3 ("mm: thp: make deferred split shrinker memcg aware")
    Fixes: dafff3f4 ("mm: split underused THPs")
    Signed-off-by: default avatarHugh Dickins <hughd@google.com>
    Acked-by: default avatarDavid Hildenbrand <david@redhat.com>
    Reviewed-by: default avatarYang Shi <shy828301@gmail.com>
    Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
    Cc: Barry Song <baohua@kernel.org>
    Cc: Chris Li <chrisl@kernel.org>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
    Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
    Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
    Cc: Nhat Pham <nphamcs@gmail.com>
    Cc: Ryan Roberts <ryan.roberts@arm.com>
    Cc: Shakeel Butt <shakeel.butt@linux.dev>
    Cc: Usama Arif <usamaarif642@gmail.com>
    Cc: Wei Yang <richard.weiyang@gmail.com>
    Cc: Zi Yan <ziy@nvidia.com>
    Cc: <stable@vger.kernel.org>
    Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
    f8f931bb
    History
    mm/thp: fix deferred split unqueue naming and locking
    Hugh Dickins authored
    Recent changes are putting more pressure on THP deferred split queues:
    under load revealing long-standing races, causing list_del corruptions,
    "Bad page state"s and worse (I keep BUGs in both of those, so usually
    don't get to see how badly they end up without).  The relevant recent
    changes being 6.8's mTHP, 6.10's mTHP swapout, and 6.12's mTHP swapin,
    improved swap allocation, and underused THP splitting.
    
    Before fixing locking: rename misleading folio_undo_large_rmappable(),
    which does not undo large_rmappable, to folio_unqueue_deferred_split(),
    which is what it does.  But that and its out-of-line __callee are mm
    internals of very limited usability: add comment and WARN_ON_ONCEs to
    check usage; and return a bool to say if a deferred split was unqueued,
    which can then be used in WARN_ON_ONCEs around safety checks (sparing
    callers the arcane conditionals in __folio_unqueue_deferred_split()).
    
    Just omit the folio_unqueue_deferred_split() from free_unref_folios(), all
    of whose callers now call it beforehand (and if any forget then bad_page()
    will tell) - except for its caller put_pages_list(), which itself no
    longer has any callers (and will be deleted separately).
    
    Swapout: mem_cgroup_swapout() has been resetting folio->memcg_data 0
    without checking and unqueueing a THP folio from deferred split list;
    which is unfortunate, since the split_queue_lock depends on the memcg
    (when memcg is enabled); so swapout has been unqueueing such THPs later,
    when freeing the folio, using the pgdat's lock instead: potentially
    corrupting the memcg's list.  __remove_mapping() has frozen refcount to 0
    here, so no problem with calling folio_unqueue_deferred_split() before
    resetting memcg_data.
    
    That goes back to 5.4 commit 87eaceb3 ("mm: thp: make deferred split
    shrinker memcg aware"): which included a check on swapcache before adding
    to deferred queue, but no check on deferred queue before adding THP to
    swapcache.  That worked fine with the usual sequence of events in reclaim
    (though there were a couple of rare ways in which a THP on deferred queue
    could have been swapped out), but 6.12 commit dafff3f4 ("mm: split
    underused THPs") avoids splitting underused THPs in reclaim, which makes
    swapcache THPs on deferred queue commonplace.
    
    Keep the check on swapcache before adding to deferred queue?  Yes: it is
    no longer essential, but preserves the existing behaviour, and is likely
    to be a worthwhile optimization (vmstat showed much more traffic on the
    queue under swapping load if the check was removed); update its comment.
    
    Memcg-v1 move (deprecated): mem_cgroup_move_account() has been changing
    folio->memcg_data without checking and unqueueing a THP folio from the
    deferred list, sometimes corrupting "from" memcg's list, like swapout. 
    Refcount is non-zero here, so folio_unqueue_deferred_split() can only be
    used in a WARN_ON_ONCE to validate the fix, which must be done earlier:
    mem_cgroup_move_charge_pte_range() first try to split the THP (splitting
    of course unqueues), or skip it if that fails.  Not ideal, but moving
    charge has been requested, and khugepaged should repair the THP later:
    nobody wants new custom unqueueing code just for this deprecated case.
    
    The 87eaceb3 commit did have the code to move from one deferred list
    to another (but was not conscious of its unsafety while refcount non-0);
    but that was removed by 5.6 commit fac0516b ("mm: thp: don't need care
    deferred split queue in memcg charge move path"), which argued that the
    existence of a PMD mapping guarantees that the THP cannot be on a deferred
    list.  As above, false in rare cases, and now commonly false.
    
    Backport to 6.11 should be straightforward.  Earlier backports must take
    care that other _deferred_list fixes and dependencies are included.  There
    is not a strong case for backports, but they can fix cornercases.
    
    Link: https://lkml.kernel.org/r/8dc111ae-f6db-2da7-b25c-7a20b1effe3b@google.com
    
    
    Fixes: 87eaceb3 ("mm: thp: make deferred split shrinker memcg aware")
    Fixes: dafff3f4 ("mm: split underused THPs")
    Signed-off-by: default avatarHugh Dickins <hughd@google.com>
    Acked-by: default avatarDavid Hildenbrand <david@redhat.com>
    Reviewed-by: default avatarYang Shi <shy828301@gmail.com>
    Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
    Cc: Barry Song <baohua@kernel.org>
    Cc: Chris Li <chrisl@kernel.org>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
    Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
    Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
    Cc: Nhat Pham <nphamcs@gmail.com>
    Cc: Ryan Roberts <ryan.roberts@arm.com>
    Cc: Shakeel Butt <shakeel.butt@linux.dev>
    Cc: Usama Arif <usamaarif642@gmail.com>
    Cc: Wei Yang <richard.weiyang@gmail.com>
    Cc: Zi Yan <ziy@nvidia.com>
    Cc: <stable@vger.kernel.org>
    Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>