  1. Jan 23, 2024
    • mm: merge folio_has_private()/filemap_release_folio() call pairs · 9278b13d
      David Howells authored and Frieder Schrempf committed
      [ Upstream commit 0201ebf2 ]
      
      Patch series "mm, netfs, fscache: Stop read optimisation when folio
      removed from pagecache", v7.
      
      This fixes an optimisation in fscache whereby we don't read from the cache
      for a particular file until we know that there's data there that we don't
      have in the pagecache.  The problem is that I'm no longer using PG_fscache
      (aka PG_private_2) to indicate that the page is cached and so I don't get
      a notification when a cached page is dropped from the pagecache.
      
      The first patch merges some folio_has_private() and
      filemap_release_folio() pairs and introduces a helper,
      folio_needs_release(), to indicate if a release is required.
      
      The second patch is the actual fix.  Following Willy's suggestions[1], it
      adds an AS_RELEASE_ALWAYS flag to an address_space that will make
      filemap_release_folio() always call ->release_folio(), even if
      PG_private/PG_private_2 aren't set.  folio_needs_release() is altered to
      add a check for this.
      
      This patch (of 2):
      
      Make filemap_release_folio() check folio_has_private().  Then, in most
      cases where a call to folio_has_private() is immediately followed by a
      call to filemap_release_folio(), we can get rid of the test in the pair.
      
      There are a couple of sites in mm/vmscan.c where this can't so easily be
      done.  In shrink_folio_list(), there are actually three cases (something
      different is done for incompletely invalidated buffers), but
      filemap_release_folio() elides two of them.
      
      In shrink_active_list(), we don't have the folio lock yet, so the check
      allows us to avoid locking the page unnecessarily.
      
      A wrapper function to check if a folio needs release is provided for those
      places that still need to do it in the mm/ directory.  Its condition will
      gain additional parts in a future patch.
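
      As a rough illustration only (not the literal patch), the helper could
      take a shape like the following; mapping_release_always() here stands
      for the AS_RELEASE_ALWAYS test added by the second patch and is an
      assumption, not a quote of the final code:

        /* Hedged sketch: does this folio need ->release_folio() to run? */
        static inline bool folio_needs_release(struct folio *folio)
        {
                struct address_space *mapping = folio_mapping(folio);

                return folio_has_private(folio) ||
                       (mapping && mapping_release_always(mapping));
        }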
      
      After this, the only remaining caller of folio_has_private() outside of
      mm/ is a check in fuse.
      
      Link: https://lkml.kernel.org/r/20230628104852.3391651-1-dhowells@redhat.com
      Link: https://lkml.kernel.org/r/20230628104852.3391651-2-dhowells@redhat.com
      
      
      Reported-by: Rohith Surabattula <rohiths.msft@gmail.com>
      Suggested-by: Matthew Wilcox <willy@infradead.org>
      Signed-off-by: David Howells <dhowells@redhat.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Steve French <sfrench@samba.org>
      Cc: Shyam Prasad N <nspmangalore@gmail.com>
      Cc: Rohith Surabattula <rohiths.msft@gmail.com>
      Cc: Dave Wysochanski <dwysocha@redhat.com>
      Cc: Dominique Martinet <asmadeus@codewreck.org>
      Cc: Ilya Dryomov <idryomov@gmail.com>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Andreas Dilger <adilger.kernel@dilger.ca>
      Cc: Xiubo Li <xiubli@redhat.com>
      Cc: Jingbo Xu <jefflexu@linux.alibaba.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Stable-dep-of: 1898efcd ("block: update the stable_writes flag in bdev_add")
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      9278b13d
    • mm: migrate high-order folios in swap cache correctly · 14f9dbe1
      Charan Teja Kalla authored and Frieder Schrempf committed
      commit fc346d0a upstream.
      
      Large folios occupy N consecutive entries in the swap cache instead of
      using multi-index entries like the page cache.  However, if a large folio
      is re-added to the LRU list, it can be migrated.  The migration code was
      not aware of the difference between the swap cache and the page cache and
      assumed that a single xas_store() would be sufficient.
      
      This leaves potentially many stale pointers to the now-migrated folio in
      the swap cache, which can lead to almost arbitrary data corruption in the
      future.  This can also manifest as infinite loops with the RCU read lock
      held.
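
      As an illustration of the fix's shape (a simplified sketch of what
      folio_migrate_mapping() needs to do, with variable names assumed from
      the surrounding code), every swap cache slot covered by the large
      folio must be repointed, not just the first:

        long i, nr = folio_nr_pages(folio);

        /* Repoint the first slot ... */
        xas_store(&xas, newfolio);
        /* ... and every remaining slot the large folio occupies. */
        for (i = 1; i < nr; i++) {
                xas_next(&xas);
                xas_store(&xas, newfolio);
        }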
      
      [willy@infradead.org: modifications to the changelog & tweaked the fix]
      Fixes: 3417013e ("mm/migrate: Add folio_migrate_mapping()")
      Link: https://lkml.kernel.org/r/20231214045841.961776-1-willy@infradead.org
      
      
      Signed-off-by: Charan Teja Kalla <quic_charante@quicinc.com>
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Reported-by: Charan Teja Kalla <quic_charante@quicinc.com>
      Closes: https://lkml.kernel.org/r/1700569840-17327-1-git-send-email-quic_charante@quicinc.com
      
      
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      14f9dbe1
  2. Oct 13, 2022
    • mm/memory.c: fix race when faulting a device private page · 16ce101d
      Alistair Popple authored
      Patch series "Fix several device private page reference counting issues",
      v2
      
      This series aims to fix a number of page reference counting issues in
      drivers dealing with device private ZONE_DEVICE pages.  These result in
      use-after-free type bugs, either from accessing a struct page which no
      longer exists because it has been removed or accessing fields within the
      struct page which are no longer valid because the page has been freed.
      
      During normal usage it is unlikely these will cause any problems.  However
      without these fixes it is possible to crash the kernel from userspace. 
      These crashes can be triggered either by unloading the kernel module or
      unbinding the device from the driver prior to a userspace task exiting. 
      In modules such as Nouveau it is also possible to trigger some of these
      issues by explicitly closing the device file-descriptor prior to the task
      exiting and then accessing device private memory.
      
      This involves some minor changes to both PowerPC and AMD GPU code.
      Unfortunately I lack hardware to test either of those, so any help there
      would be appreciated.  The changes mimic what is done for both Nouveau
      and hmm-tests though, so I doubt they will cause problems.
      
      
      This patch (of 8):
      
      When the CPU tries to access a device private page the migrate_to_ram()
      callback associated with the pgmap for the page is called.  However no
      reference is taken on the faulting page.  Therefore a concurrent migration
      of the device private page can free the page and possibly the underlying
      pgmap.  This results in a race which can crash the kernel due to the
      migrate_to_ram() function pointer becoming invalid.  It also means drivers
      can't reliably read the zone_device_data field because the page may have
      been freed with memunmap_pages().
      
      Close the race by getting a reference on the page while holding the ptl to
      ensure it has not been freed.  Unfortunately the elevated reference count
      will cause the migration required to handle the fault to fail.  To avoid
      this failure pass the faulting page into the migrate_vma functions so that
      if an elevated reference count is found it can be checked to see if it's
      expected or not.
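
      A rough sketch of the idea in do_swap_page() (simplified, not the
      literal diff):

        } else if (is_device_private_entry(entry)) {
                vmf->page = pfn_swap_entry_to_page(entry);
                vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
                                               vmf->address, &vmf->ptl);
                if (unlikely(!pte_same(*vmf->pte, vmf->orig_pte))) {
                        pte_unmap_unlock(vmf->pte, vmf->ptl);
                        goto out;
                }
                /*
                 * Take a reference while the ptl is held so a concurrent
                 * migration cannot free the page (and its pgmap) before
                 * migrate_to_ram() runs.
                 */
                get_page(vmf->page);
                pte_unmap_unlock(vmf->pte, vmf->ptl);
                ret = vmf->page->pgmap->ops->migrate_to_ram(vmf);
                put_page(vmf->page);
        }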
      
      [mpe@ellerman.id.au: fix build]
        Link: https://lkml.kernel.org/r/87fsgbf3gh.fsf@mpe.ellerman.id.au
      Link: https://lkml.kernel.org/r/cover.60659b549d8509ddecafad4f498ee7f03bb23c69.1664366292.git-series.apopple@nvidia.com
      Link: https://lkml.kernel.org/r/d3e813178a59e565e8d78d9b9a4e2562f6494f90.1664366292.git-series.apopple@nvidia.com
      
      
      Signed-off-by: Alistair Popple <apopple@nvidia.com>
      Acked-by: Felix Kuehling <Felix.Kuehling@amd.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Lyude Paul <lyude@redhat.com>
      Cc: Alex Deucher <alexander.deucher@amd.com>
      Cc: Alex Sierra <alex.sierra@amd.com>
      Cc: Ben Skeggs <bskeggs@redhat.com>
      Cc: Christian König <christian.koenig@amd.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      16ce101d
  3. Sep 27, 2022
    • mm: fix the handling Non-LRU pages returned by follow_page · f7091ed6
      Haiyue Wang authored
      The handling of non-LRU pages returned by follow_page() jumps out
      directly without calling put_page() to drop the reference count, even
      though follow_page() was called with the FOLL_GET flag and therefore
      took a reference via get_page().  Fix the zone device page check by
      handling the page reference count correctly before returning.
      
      And as David reviewed, "device pages are never PageKsm pages".  Drop this
      zone device page check for break_ksm().
      
      Since a zone device page can't be a transparent huge page, drop the
      redundant zone device page check from split_huge_pages_pid() as well.
      (by Miaohe)
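
      Sketch of the corrected pattern (simplified; the labels and error
      handling are assumptions): since FOLL_GET pinned the page, the
      non-LRU/zone-device bail-out must drop that reference instead of
      jumping straight out:

        page = follow_page(vma, addr, FOLL_GET | FOLL_DUMP);
        if (IS_ERR_OR_NULL(page))
                goto out;
        if (is_zone_device_page(page))
                goto out_putpage;       /* was "goto out", leaking the reference */
        /* ... normal isolation path ... */
        out_putpage:
                put_page(page);
        out:
                return err;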
      
      Link: https://lkml.kernel.org/r/20220823135841.934465-3-haiyue.wang@intel.com
      
      
      Fixes: 3218f871 ("mm: handling Non-LRU pages returned by vm_normal_pages")
      Signed-off-by: Haiyue Wang <haiyue.wang@intel.com>
      Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
      Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
      Reviewed-by: Alistair Popple <apopple@nvidia.com>
      Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
      Acked-by: David Hildenbrand <david@redhat.com>
      Cc: Alex Sierra <alex.sierra@amd.com>
      Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      f7091ed6
    • mm/demotion: update node_is_toptier to work with memory tiers · 467b171a
      Aneesh Kumar K.V authored
      With memory tier support we can have memory-only NUMA nodes in the top
      tier, and we want to avoid promotion-tracking NUMA faults on them.
      Update node_is_toptier() to work with memory tiers.  All NUMA nodes are
      top tier nodes by default.  With lower (slower) memory tiers added, we
      consider all memory tiers above a memory tier that has CPU NUMA nodes to
      be top memory tiers.
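
      Conceptually the check becomes something like the sketch below; this is
      illustrative only, and node_abstract_distance() and toptier_adistance
      are placeholder names rather than the kernel's actual symbols:

        bool node_is_toptier(int node)
        {
                /* Nodes with CPUs are always top tier. */
                if (node_state(node, N_CPU))
                        return true;
                /*
                 * Otherwise the node is top tier only if its tier sits at or
                 * above the inclusive upper bound of the top tiers.
                 */
                return node_abstract_distance(node) <= toptier_adistance;
        }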
      
      [sj@kernel.org: include missed header file, memory-tiers.h]
        Link: https://lkml.kernel.org/r/20220820190720.248704-1-sj@kernel.org
      [akpm@linux-foundation.org: mm/memory.c needs linux/memory-tiers.h]
      [aneesh.kumar@linux.ibm.com: make toptier_distance inclusive upper bound of toptiers]
        Link: https://lkml.kernel.org/r/20220830081457.118960-1-aneesh.kumar@linux.ibm.com
      Link: https://lkml.kernel.org/r/20220818131042.113280-10-aneesh.kumar@linux.ibm.com
      
      
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
      Acked-by: Wei Xu <weixugc@google.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Bharata B Rao <bharata@amd.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Hesham Almatary <hesham.almatary@huawei.com>
      Cc: Jagdish Gediya <jvgediya.oss@gmail.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Tim Chen <tim.c.chen@intel.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: SeongJae Park <sj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      467b171a
    • mm/demotion: build demotion targets based on explicit memory tiers · 6c542ab7
      Aneesh Kumar K.V authored
      This patch switches the demotion target building logic to use memory
      tiers instead of NUMA distance.  All N_MEMORY NUMA nodes will be placed
      in the default memory tier, and additional memory tiers will be added by
      drivers like dax kmem.
      
      This patch builds the demotion target for a NUMA node by looking at all
      memory tiers below the tier to which the NUMA node belongs.  The closest
      node in the immediately following memory tier is used as a demotion
      target.
      
      Since we now only build demotion targets for N_MEMORY NUMA nodes, the
      CPU hotplug calls are removed in this patch.
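
      A simplified sketch of the target selection (helper and field names
      here are assumptions; the real code walks the tier list under the
      appropriate locking):

        static int pick_demotion_target(int node, struct memory_tier *tier)
        {
                struct memory_tier *lower = next_lower_tier(tier);  /* placeholder */
                int target, best = NUMA_NO_NODE, best_dist = INT_MAX;

                if (!lower)
                        return NUMA_NO_NODE;    /* lowest tier: nowhere to demote */

                /* Pick the closest node in the immediately lower tier. */
                for_each_node_mask(target, lower->nodelist) {       /* field name assumed */
                        if (node_distance(node, target) < best_dist) {
                                best_dist = node_distance(node, target);
                                best = target;
                        }
                }
                return best;
        }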
      
      Link: https://lkml.kernel.org/r/20220818131042.113280-6-aneesh.kumar@linux.ibm.com
      
      
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
      Acked-by: Wei Xu <weixugc@google.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Bharata B Rao <bharata@amd.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Hesham Almatary <hesham.almatary@huawei.com>
      Cc: Jagdish Gediya <jvgediya.oss@gmail.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Tim Chen <tim.c.chen@intel.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: SeongJae Park <sj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      6c542ab7
    • mm/demotion: move memory demotion related code · 91952440
      Aneesh Kumar K.V authored
      This moves memory demotion related code to mm/memory-tiers.c.  No
      functional change in this patch.
      
      Link: https://lkml.kernel.org/r/20220818131042.113280-3-aneesh.kumar@linux.ibm.com
      
      
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
      Acked-by: Wei Xu <weixugc@google.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Bharata B Rao <bharata@amd.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Hesham Almatary <hesham.almatary@huawei.com>
      Cc: Jagdish Gediya <jvgediya.oss@gmail.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Tim Chen <tim.c.chen@intel.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: SeongJae Park <sj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      91952440
    • mm: migrate: do not retry 10 times for the subpages of fail-to-migrate THP · 7047b5a4
      Baolin Wang authored
      If a THP fails to migrate due to -ENOSYS or -ENOMEM, the THP will be
      split, and the subpages of the fail-to-migrate THP will be retried.  So
      we should not account the retry counter in the second loop, since we
      already accounted 'nr_thp_failed' in the first loop.
      
      Moreover, we do not need to retry 10 times on -EAGAIN for the subpages
      of a fail-to-migrate THP in the second loop, since we already regarded
      the THP as a migration failure; this also saves some migration time (in
      the worst case, 512 * 10 attempts) according to the previous discussion
      [1].
      
      [1] https://lore.kernel.org/linux-mm/87r13a7n04.fsf@yhuang6-desk2.ccr.corp.intel.com/
      
      Link: https://lkml.kernel.org/r/20220817081408.513338-9-ying.huang@intel.com
      
      
      Tested-by: "Huang, Ying" <ying.huang@intel.com>
      Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
      Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Zi Yan <ziy@nvidia.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      7047b5a4
    • migrate_pages(): fix failure counting for retry · 077309bc
      Huang Ying authored
      After 10 retries, we give up and the remaining pages are counted as
      failures in nr_failed and nr_thp_failed.  We should count the failures
      in nr_failed_pages too.  This is done in this patch.
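
      Sketch of the accounting after the retry loop (variable names follow
      the series; the exact shape is assumed):

        /* Pages still marked for retry after 10 passes are failures. */
        nr_failed += retry;
        nr_thp_failed += thp_retry;
        nr_failed_pages += nr_retry_pages;      /* previously missing */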
      
      Link: https://lkml.kernel.org/r/20220817081408.513338-8-ying.huang@intel.com
      
      
      Fixes: 5984fabb ("mm: move_pages: report the number of non-attempted pages")
      Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
      Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Cc: Zi Yan <ziy@nvidia.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      077309bc
    • migrate_pages(): fix failure counting for THP splitting · e6fa8a79
      Huang Ying authored
      If a THP fails to migrate, it may be split and retried.  But after
      splitting, the head page is left on the "from" list, although the THP
      migration failure has already been counted.  If the head page then fails
      to migrate too, the failure is counted twice incorrectly.  Fix this by
      moving the head page of the split THP to "thp_split_pages" as well.
      
      Link: https://lkml.kernel.org/r/20220817081408.513338-7-ying.huang@intel.com
      
      
      Fixes: 5984fabb ("mm: move_pages: report the number of non-attempted pages")
      Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
      Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Cc: Zi Yan <ziy@nvidia.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      e6fa8a79
    • migrate_pages(): fix failure counting for THP on -ENOSYS · 577be05c
      Huang Ying authored
      If THP or hugetlbfs page migration isn't supported, unmap_and_move() or
      unmap_and_move_huge_page() will return -ENOSYS.  For THP, splitting will
      be tried, but if splitting doesn't succeed, the THP will wrongly be left
      on the "from" list.  If some other pages are retried, the THP migration
      failure will be counted again.  Fix this by moving the failed THP from
      "from" to "ret_pages".
      
      Another issue with the original code is that the handling of the
      unsupported failure isn't consistent between THP and hugetlbfs pages.
      Make them consistent in this patch, which also makes the code easier to
      understand.
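
      Sketch of the intended -ENOSYS handling (simplified, with names assumed
      from the surrounding code):

        case -ENOSYS:
                /* THP migration is unsupported: try to split and retry. */
                if (is_thp) {
                        nr_thp_failed++;
                        if (!try_split_thp(page, &thp_split_pages)) {
                                nr_thp_split++;
                                break;
                        }
                }
                /*
                 * Hugetlb page, or a THP that could not be split: move it
                 * off "from" so it is not walked and counted again.
                 */
                list_move_tail(&page->lru, &ret_pages);
                break;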
      
      Link: https://lkml.kernel.org/r/20220817081408.513338-6-ying.huang@intel.com
      
      
      Fixes: 5984fabb ("mm: move_pages: report the number of non-attempted pages")
      Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
      Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      577be05c
    • migrate_pages(): fix failure counting for THP subpages retrying · 5fc30916
      Huang Ying authored
      If a THP fails to migrate with -ENOSYS or -ENOMEM, it is split into
      thp_split_pages, and after the other pages are migrated, the pages in
      thp_split_pages are migrated with no_subpage_counting == true, because
      their failure has already been counted.  If some pages in
      thp_split_pages are retried during migration, we should likewise not
      count their failure when no_subpage_counting == true.  This is done in
      this patch to fix the failure counting for retried THP subpages.
      
      Link: https://lkml.kernel.org/r/20220817081408.513338-5-ying.huang@intel.com
      
      
      Fixes: 5984fabb ("mm: move_pages: report the number of non-attempted pages")
      Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
      Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Cc: Zi Yan <ziy@nvidia.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      5fc30916
    • migrate_pages(): fix THP failure counting for -ENOMEM · fbed53b4
      Huang Ying authored
      In unmap_and_move(), if the new THP cannot be allocated, -ENOMEM will be
      returned, and migrate_pages() will try to split the THP unless "reason" is
      MR_NUMA_MISPLACED (that is, nosplit == true).  But when nosplit == true,
      the THP migration failure will not be counted.
      
      This is incorrect, so in this patch the THP migration failure will be
      counted for -ENOMEM regardless of whether nosplit is true or false.  The
      nr_failed counting isn't fixed because it's not used.  Some comments
      were added for it per Baolin's suggestion.
      
      Link: https://lkml.kernel.org/r/20220817081408.513338-4-ying.huang@intel.com
      
      
      Fixes: 5984fabb ("mm: move_pages: report the number of non-attempted pages")
      Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
      Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Cc: Zi Yan <ziy@nvidia.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      fbed53b4
    • migrate_pages(): remove unnecessary list_safe_reset_next() · 9c62ff00
      Huang Ying authored
      Before commit b5bade97 ("mm: migrate: fix the return value of
      migrate_pages()"), the tail pages of THP will be put in the "from"
      list directly.  So one of the loop cursors (page2) needs to be reset,
      as is done in try_split_thp() via list_safe_reset_next().  But after
      the commit, the tail pages of THP will be put in a dedicated
      list (thp_split_pages).  That is, the "from" list will not be changed
      during splitting.  So, it's unnecessary to call list_safe_reset_next()
      anymore.
      
      This is a code cleanup, no functionality changes are expected.
      
      Link: https://lkml.kernel.org/r/20220817081408.513338-3-ying.huang@intel.com
      
      
      Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
      Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Cc: Zi Yan <ziy@nvidia.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      9c62ff00
    • migrate: fix syscall move_pages() return value for failure · a7504ed1
      Huang Ying authored
      Patch series "migrate_pages(): fix several bugs in error path", v3.
      
      While reviewing the code of migrate_pages() and building a test program
      for it, several bugs in the error path were identified; they are fixed
      in this series.
      
      Most patches are tested via
      
      - Apply error-inject.patch in Linux kernel
      - Compile test-migrate.c (with -lnuma)
      - Test with test-migrate.sh
      
      error-inject.patch, test-migrate.c, and test-migrate.sh are as below.
      It turns out that error injection is an important tool for fixing bugs
      in the error path.
      
      
      This patch (of 8):
      
      The return value of the move_pages() syscall is incorrect when counting
      the remaining pages to be migrated.  For example, for the following
      test program,
      
      "
       #define _GNU_SOURCE
      
       #include <stdbool.h>
       #include <stdio.h>
       #include <string.h>
       #include <stdlib.h>
       #include <errno.h>
      
       #include <fcntl.h>
       #include <sys/uio.h>
       #include <sys/mman.h>
       #include <sys/types.h>
       #include <unistd.h>
       #include <numaif.h>
       #include <numa.h>
      
       #ifndef MADV_FREE
       #define MADV_FREE	8		/* free pages only if memory pressure */
       #endif
      
       #define ONE_MB		(1024 * 1024)
       #define MAP_SIZE	(16 * ONE_MB)
       #define THP_SIZE	(2 * ONE_MB)
       #define THP_MASK	(THP_SIZE - 1)
      
       #define ERR_EXIT_ON(cond, msg)					\
      	 do {							\
      		 int __cond_in_macro = (cond);			\
      		 if (__cond_in_macro)				\
      			 error_exit(__cond_in_macro, (msg));	\
      	 } while (0)
      
       void error_msg(int ret, int nr, int *status, const char *msg)
       {
      	 int i;
      
      	 fprintf(stderr, "Error: %s, ret : %d, error: %s\n",
      		 msg, ret, strerror(errno));
      
      	 if (!nr)
      		 return;
      	 fprintf(stderr, "status: ");
      	 for (i = 0; i < nr; i++)
      		 fprintf(stderr, "%d ", status[i]);
      	 fprintf(stderr, "\n");
       }
      
       void error_exit(int ret, const char *msg)
       {
      	 error_msg(ret, 0, NULL, msg);
      	 exit(1);
       }
      
       int page_size;
      
       bool do_vmsplice;
       bool do_thp;
      
       static int pipe_fds[2];
       void *addr;
       char *pn;
       char *pn1;
       void *pages[2];
       int status[2];
      
       void prepare()
       {
      	 int ret;
      	 struct iovec iov;
      
      	 if (addr) {
      		 munmap(addr, MAP_SIZE);
      		 close(pipe_fds[0]);
      		 close(pipe_fds[1]);
      	 }
      
      	 ret = pipe(pipe_fds);
      	 ERR_EXIT_ON(ret, "pipe");
      
      	 addr = mmap(NULL, MAP_SIZE, PROT_READ | PROT_WRITE,
      		     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
      	 ERR_EXIT_ON(addr == MAP_FAILED, "mmap");
      	 if (do_thp) {
      		 ret = madvise(addr, MAP_SIZE, MADV_HUGEPAGE);
      		 ERR_EXIT_ON(ret, "advise hugepage");
      	 }
      
      	 pn = (char *)(((unsigned long)addr + THP_SIZE) & ~THP_MASK);
      	 pn1 = pn + THP_SIZE;
      	 pages[0] = pn;
      	 pages[1] = pn1;
      	 *pn = 1;
      
      	 if (do_vmsplice) {
      		 iov.iov_base = pn;
      		 iov.iov_len = page_size;
      		 ret = vmsplice(pipe_fds[1], &iov, 1, 0);
      		 ERR_EXIT_ON(ret < 0, "vmsplice");
      	 }
      
      	 status[0] = status[1] = 1024;
       }
      
       void test_migrate()
       {
      	 int ret;
      	 int nodes[2] = { 1, 1 };
      	 pid_t pid = getpid();
      
      	 prepare();
      	 ret = move_pages(pid, 1, pages, nodes, status, MPOL_MF_MOVE_ALL);
      	 error_msg(ret, 1, status, "move 1 page");
      
      	 prepare();
      	 ret = move_pages(pid, 2, pages, nodes, status, MPOL_MF_MOVE_ALL);
      	 error_msg(ret, 2, status, "move 2 pages, page 1 not mapped");
      
      	 prepare();
      	 *pn1 = 1;
      	 ret = move_pages(pid, 2, pages, nodes, status, MPOL_MF_MOVE_ALL);
      	 error_msg(ret, 2, status, "move 2 pages");
      
      	 prepare();
      	 *pn1 = 1;
      	 nodes[1] = 0;
      	 ret = move_pages(pid, 2, pages, nodes, status, MPOL_MF_MOVE_ALL);
      	 error_msg(ret, 2, status, "move 2 pages, page 1 to node 0");
       }
      
       int main(int argc, char *argv[])
       {
      	 numa_run_on_node(0);
      	 page_size = getpagesize();
      
      	 test_migrate();
      
      	 fprintf(stderr, "\nMake page 0 cannot be migrated:\n");
      	 do_vmsplice = true;
      	 test_migrate();
      
      	 fprintf(stderr, "\nTest THP:\n");
      	 do_thp = true;
      	 do_vmsplice = false;
      	 test_migrate();
      
      	 fprintf(stderr, "\nTHP: make page 0 cannot be migrated:\n");
      	 do_vmsplice = true;
      	 test_migrate();
      
      	 return 0;
       }
      "
      
      The output of the current kernel is,
      
      "
      Error: move 1 page, ret : 0, error: Success
      status: 1
      Error: move 2 pages, page 1 not mapped, ret : 0, error: Success
      status: 1 -14
      Error: move 2 pages, ret : 0, error: Success
      status: 1 1
      Error: move 2 pages, page 1 to node 0, ret : 0, error: Success
      status: 1 0
      
      Make page 0 cannot be migrated:
      Error: move 1 page, ret : 0, error: Success
      status: 1024
      Error: move 2 pages, page 1 not mapped, ret : 1, error: Success
      status: 1024 -14
      Error: move 2 pages, ret : 0, error: Success
      status: 1024 1024
      Error: move 2 pages, page 1 to node 0, ret : 1, error: Success
      status: 1024 1024
      "
      
      While the expected output is,
      
      "
      Error: move 1 page, ret : 0, error: Success
      status: 1
      Error: move 2 pages, page 1 not mapped, ret : 0, error: Success
      status: 1 -14
      Error: move 2 pages, ret : 0, error: Success
      status: 1 1
      Error: move 2 pages, page 1 to node 0, ret : 0, error: Success
      status: 1 0
      
      Make page 0 cannot be migrated:
      Error: move 1 page, ret : 1, error: Success
      status: 1024
      Error: move 2 pages, page 1 not mapped, ret : 1, error: Success
      status: 1024 -14
      Error: move 2 pages, ret : 1, error: Success
      status: 1024 1024
      Error: move 2 pages, page 1 to node 0, ret : 2, error: Success
      status: 1024 1024
      "
      
      Fix this by correcting the counting of the remaining pages.  With the
      fix, the test program above produces the expected output.
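
      For reference, the contract the test relies on (see move_pages(2)): a
      positive return value is the number of pages that could not be moved at
      all, with per-page results in status[].  For example:

        ret = move_pages(pid, 2, pages, nodes, status, MPOL_MF_MOVE_ALL);
        if (ret < 0)
                perror("move_pages");
        else if (ret > 0)
                fprintf(stderr, "%d page(s) not attempted\n", ret);
        /* status[i] holds the target node id on success or a negative errno. */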
      
      Link: https://lkml.kernel.org/r/20220817081408.513338-1-ying.huang@intel.com
      Link: https://lkml.kernel.org/r/20220817081408.513338-2-ying.huang@intel.com
      
      
      Fixes: 5984fabb ("mm: move_pages: report the number of non-attempted pages")
      Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      a7504ed1
    • mm: remember young/dirty bit for page migrations · 2e346877
      Peter Xu authored
      When page migration happens, we always ignore the young/dirty bit
      settings in the old page table, marking the page as old in the new page
      table using either pte_mkold() or pmd_mkold(), and keeping the pte
      clean.
      
      That's fine functionally, but it's not friendly to page reclaim because
      the migrating page can be actively accessed during the procedure.  Not
      to mention that hardware setting the young bit can bring quite some
      overhead on some systems, e.g. x86_64 needs a few hundred nanoseconds to
      set the bit.  The same slowdown applies to the dirty bit when the memory
      is first written after the page migration.
      
      Actually we can easily remember the A/D bit configuration and recover
      the information after the page is migrated.  To achieve this, define a
      new set of bits in the migration swap offset field to cache the A/D bits
      of the old pte.  Then, when removing/recovering the migration entry, we
      can recover the A/D bits even if the page has changed.
      
      One thing to mention is that here we used max_swapfile_size() to detect
      how many swp offset bits we have, and we'll only enable this feature if we
      know the swp offset is big enough to store both the PFN value and the A/D
      bits.  Otherwise the A/D bits are dropped like before.
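
      An illustrative sketch of the encoding; the macro and helper names
      below are placeholders, not the exact kernel symbols:

        /* Two bits just above the PFN bits of a migration entry's offset. */
        #define SWP_MIG_YOUNG_BIT       (1UL << SWP_PFN_BITS)
        #define SWP_MIG_DIRTY_BIT       (1UL << (SWP_PFN_BITS + 1))

        static swp_entry_t make_migration_entry_ad(swp_entry_t entry, pte_t pte)
        {
                unsigned long off = swp_offset(entry);

                if (pte_young(pte))
                        off |= SWP_MIG_YOUNG_BIT;
                if (pte_dirty(pte))
                        off |= SWP_MIG_DIRTY_BIT;
                return swp_entry(swp_type(entry), off);
        }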
      
      Link: https://lkml.kernel.org/r/20220811161331.37055-6-peterx@redhat.com
      
      
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Andi Kleen <andi.kleen@intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      2e346877
  4. Sep 12, 2022
    • memory tiering: hot page selection with hint page fault latency · 33024536
      Huang Ying authored
      Patch series "memory tiering: hot page selection", v4.
      
      To optimize page placement in a memory tiering system with NUMA
      balancing, the hot pages in the slow memory nodes need to be identified.
      Essentially, the original NUMA balancing implementation selects the most
      recently accessed (MRU) pages to promote.  But this isn't a perfect
      algorithm to identify the hot pages, because pages with quite low access
      frequency may eventually be accessed, given that the NUMA balancing page
      table scanning period can be quite long (e.g. 60 seconds).  So in this
      patchset we implement a new hot page identification algorithm based on
      the latency between NUMA balancing page table scanning and the hint page
      fault, which is a kind of most frequently accessed (MFU) algorithm.
      
      In NUMA balancing memory tiering mode, if there are hot pages in slow
      memory node and cold pages in fast memory node, we need to promote/demote
      hot/cold pages between the fast and cold memory nodes.
      
      A choice is to promote/demote as fast as possible.  But the CPU cycles
      and memory bandwidth consumed by the high promoting/demoting throughput
      will hurt the latency of some workloads because of access latency
      inflation and contention for the slow memory bandwidth.
      
      A way to resolve this issue is to restrict the max promoting/demoting
      throughput.  It will take longer to finish the promoting/demoting.  But
      the workload latency will be better.  This is implemented in this patchset
      as the page promotion rate limit mechanism.
      
      The promotion hot threshold is workload and system configuration
      dependent.  So in this patchset, a method to adjust the hot threshold
      automatically is implemented.  The basic idea is to control the number of
      the candidate promotion pages to match the promotion rate limit.
      
      We used the pmbench memory accessing benchmark to test the patchset on a
      2-socket server system with DRAM and PMEM installed.  The test results
      are as follows,
      
      		pmbench score		promote rate
      		 (accesses/s)			MB/s
      		-------------		------------
      base		  146887704.1		       725.6
      hot selection     165695601.2		       544.0
      rate limit	  162814569.8		       165.2
      auto adjustment	  170495294.0                  136.9
      
      From the results above,
      
      With hot page selection patch [1/3], the pmbench score increases about
      12.8%, and promote rate (overhead) decreases about 25.0%, compared with
      base kernel.
      
      With rate limit patch [2/3], pmbench score decreases about 1.7%, and
      promote rate decreases about 69.6%, compared with hot page selection
      patch.
      
      With threshold auto adjustment patch [3/3], pmbench score increases about
      4.7%, and promote rate decrease about 17.1%, compared with rate limit
      patch.
      
      Baolin helped to test the patchset with MySQL on a machine which contains
      1 DRAM node (30G) and 1 PMEM node (126G).
      
      sysbench /usr/share/sysbench/oltp_read_write.lua \
      ......
      --tables=200 \
      --table-size=1000000 \
      --report-interval=10 \
      --threads=16 \
      --time=120
      
      The tps can be improved about 5%.
      
      
      This patch (of 3):
      
      To optimize page placement in a memory tiering system with NUMA
      balancing, the hot pages in the slow memory node need to be identified.
      Essentially, the original NUMA balancing implementation selects the most
      recently accessed (MRU) pages to promote.  But this isn't a perfect
      algorithm to identify the hot pages, because pages with quite low access
      frequency may eventually be accessed, given that the NUMA balancing page
      table scanning period can be quite long (e.g. 60 seconds).  The most
      frequently accessed (MFU) algorithm is better.
      
      So, in this patch we implemented a better hot page selection algorithm,
      based on NUMA balancing page table scanning and the hint page fault as
      follows:
      
      - When the page tables of the processes are scanned to change PTE/PMD
        to be PROT_NONE, the current time is recorded in struct page as scan
        time.
      
      - When the page is accessed, a hint page fault will occur.  The scan
        time is read from the struct page, and the hint page fault latency is
        defined as
      
          hint page fault latency = hint page fault time - scan time
      
      The shorter the hint page fault latency of a page, the more likely its
      access frequency is high.  So the hint page fault latency is a better
      estimate of whether a page is hot or cold.
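
      A conceptual sketch of the check at hint page fault time;
      page_scan_time() is a placeholder accessor for the scan time cached in
      struct page as described below, not an existing kernel function:

        static bool numa_hint_fault_is_hot(struct page *page,
                                           unsigned int threshold_ms)
        {
                unsigned int now = jiffies_to_msecs(jiffies);
                unsigned int scan_time = page_scan_time(page);  /* placeholder */

                /* Hot if the hint fault arrived soon after the scan. */
                return (now - scan_time) <= threshold_ms;
        }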
      
      It's hard to find some extra space in struct page to hold the scan time. 
      Fortunately, we can reuse some bits used by the original NUMA balancing.
      
      NUMA balancing uses some bits in struct page to store the page's last
      accessing CPU and PID (see page_cpupid_xchg_last()).  These are used by
      the multi-stage node selection algorithm to avoid migrating pages that
      are shared among NUMA nodes back and forth.  But for pages in the slow
      memory node, even if they are shared by multiple NUMA nodes, as long as
      the pages are hot, they need to be promoted to the fast memory node.  So
      the accessing CPU and PID information is unnecessary for the slow memory
      pages, and we can reuse these bits in struct page to record the scan
      time.  For the fast memory pages, these bits are used as before.
      
      For the hot threshold, the default value is 1 second, which works well in
      our performance test.  All pages with hint page fault latency < hot
      threshold will be considered hot.
      
      It's hard for users to determine the hot threshold.  So we don't provide a
      kernel ABI to set it, just provide a debugfs interface for advanced users
      to experiment.  We will continue to work on a hot threshold automatic
      adjustment mechanism.
      
      The downside of the above method is that the response time to the workload
      hot spot changing may be much longer.  For example,
      
      - A previous cold memory area becomes hot
      
      - The hint page fault will be triggered.  But the hint page fault
        latency isn't shorter than the hot threshold.  So the pages will
        not be promoted.
      
      - When the memory area is scanned again, maybe after a scan period,
        the hint page fault latency measured will be shorter than the hot
        threshold and the pages will be promoted.
      
      To mitigate this, if there is enough free space in the fast memory node,
      the hot threshold is not used and all pages are promoted upon the hint
      page fault for fast response.
      
      Thanks to Zhong Jiang, who reported and tested the fix for a bug when
      disabling memory tiering mode dynamically.
      
      Link: https://lkml.kernel.org/r/20220713083954.34196-1-ying.huang@intel.com
      Link: https://lkml.kernel.org/r/20220713083954.34196-2-ying.huang@intel.com
      
      
      Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
      Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
      Tested-by: Baolin Wang <baolin.wang@linux.alibaba.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Cc: Wei Xu <weixugc@google.com>
      Cc: osalvador <osalvador@suse.de>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Zhong Jiang <zhongjiang-ali@linux.alibaba.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      33024536
    • mm: migration: fix the FOLL_GET failure on following huge page · 83156821
      Haiyue Wang authored
      Not all huge page APIs support the FOLL_GET option, so the move_pages()
      syscall will fail to get the page node information for some huge pages.
      
      For example, on x86 with Linux 5.19, the 1GB huge page API
      follow_huge_pud() returns a NULL page for FOLL_GET; calling the
      move_pages() syscall with a NULL 'nodes' parameter then yields a '-2'
      error in the 'status' array.
      
      Note: follow_huge_pud() now supports FOLL_GET in linux 6.0.
            Link: https://lore.kernel.org/all/20220714042420.1847125-3-naoya.horiguchi@linux.dev
      
      But these huge page APIs don't support FOLL_GET:
        1. follow_huge_pud() in arch/s390/mm/hugetlbpage.c
        2. follow_huge_addr() in arch/ia64/mm/hugetlbpage.c
           It will cause WARN_ON_ONCE for FOLL_GET.
        3. follow_huge_pgd() in mm/hugetlb.c
      
      This is a temporary solution to mitigate the side effect of the race
      condition fix, which calls follow_page() with FOLL_GET set for huge
      pages.
      
      After following huge pages with FOLL_GET is fully supported, this fix
      can be reverted safely.
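
      Sketch of the mitigation's shape (assumed, simplified): only request
      FOLL_GET where it is known to be supported, since the node lookup in
      do_pages_stat_array() does not strictly need the extra reference for
      hugetlb VMAs:

        unsigned int foll_flags = FOLL_DUMP;

        /* Not all follow_huge_*() implementations honour FOLL_GET. */
        if (!is_vm_hugetlb_page(vma))
                foll_flags |= FOLL_GET;

        page = follow_page(vma, addr, foll_flags);
        if (!IS_ERR_OR_NULL(page)) {
                err = page_to_nid(page);
                if (foll_flags & FOLL_GET)
                        put_page(page);
        }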
      
      Link: https://lkml.kernel.org/r/20220823135841.934465-2-haiyue.wang@intel.com
      Link: https://lkml.kernel.org/r/20220812084921.409142-1-haiyue.wang@intel.com
      
      
      Fixes: 4cd61484 ("mm: migration: fix possible do_pages_stat_array racing with memory offline")
      Signed-off-by: Haiyue Wang <haiyue.wang@intel.com>
      Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
      Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
      Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      83156821