- Jan 14, 2025
-
-
commit 851ae642 upstream. In commit fd4a7ac3 ("mm: migrate: try again if THP split is failed due to page refcnt"), if the THP splitting fails due to page reference count, we will retry to improve migration successful rate. But the failed splitting is counted as migration failure and migration retry, which will cause duplicated failure counting. So, in this patch, this is fixed via undoing the failure counting if we decide to retry. The patch is tested via failure injection. Link: https://lkml.kernel.org/r/20230416235929.1040194-1-ying.huang@intel.com Fixes: fd4a7ac3 ("mm: migrate: try again if THP split is failed due to page refcnt") Signed-off-by:
"Huang, Ying" <ying.huang@intel.com> Reviewed-by:
Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: David Hildenbrand <david@redhat.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Signed-off-by:
Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
commit e77d587a upstream. The migration code ends up temporarily stashing information of the wrong type in unused fields of the newly allocated destination folio. That all works fine, but gcc does complain about the pointer type mis-use: mm/migrate.c: In function ‘__migrate_folio_extract’: mm/migrate.c:1050:20: note: randstruct: casting between randomized structure pointer types (ssa): ‘struct anon_vma’ and ‘struct address_space’ 1050 | *anon_vmap = (void *)dst->mapping; | ~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~ and gcc is actually right to complain since it really doesn't understand that this is a very temporary special case where this is ok. This could be fixed in different ways by just obfuscating the assignment sufficiently that gcc doesn't see what is going on, but the truly "proper C" way to do this is by explicitly using a union. Using unions for type conversions like this is normally hugely ugly and syntactically nasty, but this really is one of the few cases where we want to make it clear that we're not doing type conversion, we're really re-using the value bit-for-bit just using another type. IOW, this should not become a common pattern, but in this one case using that odd union is probably the best way to document to the compiler what is conceptually going on here. [ Side note: there are valid cases where we convert pointers to other pointer types, notably the whole "folio vs page" situation, where the types actually have fundamental commonalities. The fact that the gcc note is limited to just randomized structures means that we don't see equivalent warnings for those cases, but it migth also mean that we miss other cases where we do play these kinds of dodgy games, and this kind of explicit conversion might be a good idea. ] I verified that at least for an allmodconfig build on x86-64, this generates the exact same code, apart from line numbers and assembler comment changes. Fixes: 64c8902e ("migrate_pages: split unmap_and_move() to _unmap() and _move()") Cc: Huang, Ying <ying.huang@intel.com> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by:
Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
[ Upstream commit 35e41024 ] When numa balancing is enabled with demotion, vmscan will call migrate_pages when shrinking LRUs. migrate_pages will decrement the the node's isolated page count, leading to an imbalanced count when invoked from (MG)LRU code. The result is dmesg output like such: $ cat /proc/sys/vm/stat_refresh [77383.088417] vmstat_refresh: nr_isolated_anon -103212 [77383.088417] vmstat_refresh: nr_isolated_file -899642 This negative value may impact compaction and reclaim throttling. The following path produces the decrement: shrink_folio_list demote_folio_list migrate_pages migrate_pages_batch migrate_folio_move migrate_folio_done mod_node_page_state(-ve) <- decrement This path happens for SUCCESSFUL migrations, not failures. Typically callers to migrate_pages are required to handle putback/accounting for failures, but this is already handled in the shrink code. When accounting for migrations, instead do not decrement the count when the migration reason is MR_DEMOTION. As of v6.11, this demotion logic is the only source of MR_DEMOTION. Link: https://lkml.kernel.org/r/20241025141724.17927-1-gourry@gourry.net Fixes: 26aa2d19 ("mm/migrate: demote pages during reclaim") Signed-off-by:
Gregory Price <gourry@gourry.net> Reviewed-by:
Yang Shi <shy828301@gmail.com> Reviewed-by:
Davidlohr Bueso <dave@stgolabs.net> Reviewed-by:
Shakeel Butt <shakeel.butt@linux.dev> Reviewed-by:
"Huang, Ying" <ying.huang@intel.com> Reviewed-by:
Oscar Salvador <osalvador@suse.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Wei Xu <weixugc@google.com> Cc: <stable@vger.kernel.org> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Signed-off-by:
Sasha Levin <sashal@kernel.org>
-
[ Upstream commit 64c8902e ] This is a preparation patch to batch the folio unmapping and moving. In this patch, unmap_and_move() is split to migrate_folio_unmap() and migrate_folio_move(). So, we can batch _unmap() and _move() in different loops later. To pass some information between unmap and move, the original unused dst->mapping and dst->private are used. Link: https://lkml.kernel.org/r/20230213123444.155149-5-ying.huang@intel.com Signed-off-by:
"Huang, Ying" <ying.huang@intel.com> Reviewed-by:
Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by:
Xin Hao <xhao@linux.alibaba.com> Cc: Zi Yan <ziy@nvidia.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Matthew Wilcox <willy@infradead.org> Cc: Bharata B Rao <bharata@amd.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Stable-dep-of: 35e41024 ("vmscan,migrate: fix page count imbalance on node stats when demoting pages") Signed-off-by:
Sasha Levin <sashal@kernel.org>
-
[ Upstream commit 42012e04 ] This is a preparation patch to batch the folio unmapping and moving for non-hugetlb folios. If we had batched the folio unmapping, all folios to be migrated would be unmapped before copying the contents and flags of the folios. If the folios that were passed to migrate_pages() were too many in unit of pages, the execution of the processes would be stopped for too long time, thus too long latency. For example, migrate_pages() syscall will call migrate_pages() with all folios of a process. To avoid this possible issue, in this patch, we restrict the number of pages to be migrated to be no more than HPAGE_PMD_NR. That is, the influence is at the same level of THP migration. Link: https://lkml.kernel.org/r/20230213123444.155149-4-ying.huang@intel.com Signed-off-by:
"Huang, Ying" <ying.huang@intel.com> Reviewed-by:
Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Zi Yan <ziy@nvidia.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Matthew Wilcox <willy@infradead.org> Cc: Bharata B Rao <bharata@amd.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Xin Hao <xhao@linux.alibaba.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Stable-dep-of: 35e41024 ("vmscan,migrate: fix page count imbalance on node stats when demoting pages") Signed-off-by:
Sasha Levin <sashal@kernel.org>
-
[ Upstream commit e5bfff8b ] This is a preparation patch to batch the folio unmapping and moving for the non-hugetlb folios. Based on that we can batch the TLB shootdown during the folio migration and make it possible to use some hardware accelerator for the folio copying. In this patch the hugetlb folios and non-hugetlb folios migration is separated in migrate_pages() to make it easy to change the non-hugetlb folios migration implementation. Link: https://lkml.kernel.org/r/20230213123444.155149-3-ying.huang@intel.com Signed-off-by:
"Huang, Ying" <ying.huang@intel.com> Reviewed-by:
Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by:
Xin Hao <xhao@linux.alibaba.com> Cc: Zi Yan <ziy@nvidia.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Matthew Wilcox <willy@infradead.org> Cc: Bharata B Rao <bharata@amd.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Stable-dep-of: 35e41024 ("vmscan,migrate: fix page count imbalance on node stats when demoting pages") Signed-off-by:
Sasha Levin <sashal@kernel.org>
-
[ Upstream commit 5b855937 ] Patch series "migrate_pages(): batch TLB flushing", v5. Now, migrate_pages() migrates folios one by one, like the fake code as follows, for each folio unmap flush TLB copy restore map If multiple folios are passed to migrate_pages(), there are opportunities to batch the TLB flushing and copying. That is, we can change the code to something as follows, for each folio unmap for each folio flush TLB for each folio copy for each folio restore map The total number of TLB flushing IPI can be reduced considerably. And we may use some hardware accelerator such as DSA to accelerate the folio copying. So in this patch, we refactor the migrate_pages() implementation and implement the TLB flushing batching. Base on this, hardware accelerated folio copying can be implemented. If too many folios are passed to migrate_pages(), in the naive batched implementation, we may unmap too many folios at the same time. The possibility for a task to wait for the migrated folios to be mapped again increases. So the latency may be hurt. To deal with this issue, the max number of folios be unmapped in batch is restricted to no more than HPAGE_PMD_NR in the unit of page. That is, the influence is at the same level of THP migration. We use the following test to measure the performance impact of the patchset, On a 2-socket Intel server, - Run pmbench memory accessing benchmark - Run `migratepages` to migrate pages of pmbench between node 0 and node 1 back and forth. With the patch, the TLB flushing IPI reduces 99.1% during the test and the number of pages migrated successfully per second increases 291.7%. Xin Hao helped to test the patchset on an ARM64 server with 128 cores, 2 NUMA nodes. Test results show that the page migration performance increases up to 78%. This patch (of 9): Define struct migrate_pages_stats to organize the various statistics in migrate_pages(). This makes it easier to collect and consume the statistics in multiple functions. This will be needed in the following patches in the series. Link: https://lkml.kernel.org/r/20230213123444.155149-1-ying.huang@intel.com Link: https://lkml.kernel.org/r/20230213123444.155149-2-ying.huang@intel.com Signed-off-by:
"Huang, Ying" <ying.huang@intel.com> Reviewed-by:
Alistair Popple <apopple@nvidia.com> Reviewed-by:
Zi Yan <ziy@nvidia.com> Reviewed-by:
Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by:
Xin Hao <xhao@linux.alibaba.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Matthew Wilcox <willy@infradead.org> Cc: Bharata B Rao <bharata@amd.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Stable-dep-of: 35e41024 ("vmscan,migrate: fix page count imbalance on node stats when demoting pages") Signed-off-by:
Sasha Levin <sashal@kernel.org>
-
[ Upstream commit 4c74b65f ] mm/migrate.c:1198:24: warning: Using plain integer as NULL pointer Link: https://bugzilla.openanolis.cn/show_bug.cgi?id=3080 Link: https://lkml.kernel.org/r/20221116012345.84870-1-yang.lee@linux.alibaba.com Signed-off-by:
Yang Li <yang.lee@linux.alibaba.com> Reported-by:
Abaci Robot <abaci@linux.alibaba.com> Reviewed-by:
David Hildenbrand <david@redhat.com> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Stable-dep-of: 35e41024 ("vmscan,migrate: fix page count imbalance on node stats when demoting pages") Signed-off-by:
Sasha Levin <sashal@kernel.org>
-
[ Upstream commit eaec4e63 ] Quite straightforward, the page functions are converted to corresponding folio functions. Same for comments. THP specific code are converted to be large folio. Link: https://lkml.kernel.org/r/20221109012348.93849-3-ying.huang@intel.com Signed-off-by:
"Huang, Ying" <ying.huang@intel.com> Reviewed-by:
Baolin Wang <baolin.wang@linux.alibaba.com> Tested-by:
Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Zi Yan <ziy@nvidia.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Matthew Wilcox <willy@infradead.org> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Stable-dep-of: 35e41024 ("vmscan,migrate: fix page count imbalance on node stats when demoting pages") Signed-off-by:
Sasha Levin <sashal@kernel.org>
-
[ Upstream commit 49f51859 ] Patch series "migrate: convert migrate_pages()/unmap_and_move() to use folios", v2. The conversion is quite straightforward, just replace the page API to the corresponding folio API. migrate_pages() and unmap_and_move() mostly work with folios (head pages) only. This patch (of 2): Quite straightforward, the page functions are converted to corresponding folio functions. Same for comments. Link: https://lkml.kernel.org/r/20221109012348.93849-1-ying.huang@intel.com Link: https://lkml.kernel.org/r/20221109012348.93849-2-ying.huang@intel.com Signed-off-by:
"Huang, Ying" <ying.huang@intel.com> Reviewed-by:
Yang Shi <shy828301@gmail.com> Reviewed-by:
Zi Yan <ziy@nvidia.com> Reviewed-by:
Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by:
Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Oscar Salvador <osalvador@suse.de> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Stable-dep-of: 35e41024 ("vmscan,migrate: fix page count imbalance on node stats when demoting pages") Signed-off-by:
Sasha Levin <sashal@kernel.org>
-
[ Upstream commit fd4a7ac3 ] When creating a virtual machine, we will use memfd_create() to get a file descriptor which can be used to create share memory mappings using the mmap function, meanwhile the mmap() will set the MAP_POPULATE flag to allocate physical pages for the virtual machine. When allocating physical pages for the guest, the host can fallback to allocate some CMA pages for the guest when over half of the zone's free memory is in the CMA area. In guest os, when the application wants to do some data transaction with DMA, our QEMU will call VFIO_IOMMU_MAP_DMA ioctl to do longterm-pin and create IOMMU mappings for the DMA pages. However, when calling VFIO_IOMMU_MAP_DMA ioctl to pin the physical pages, we found it will be failed to longterm-pin sometimes. After some invetigation, we found the pages used to do DMA mapping can contain some CMA pages, and these CMA pages will cause a possible failure of the longterm-pin, due to failed to migrate the CMA pages. The reason of migration failure may be temporary reference count or memory allocation failure. So that will cause the VFIO_IOMMU_MAP_DMA ioctl returns error, which makes the application failed to start. I observed one migration failure case (which is not easy to reproduce) is that, the 'thp_migration_fail' count is 1 and the 'thp_split_page_failed' count is also 1. That means when migrating a THP which is in CMA area, but can not allocate a new THP due to memory fragmentation, so it will split the THP. However THP split is also failed, probably the reason is temporary reference count of this THP. And the temporary reference count can be caused by dropping page caches (I observed the drop caches operation in the system), but we can not drop the shmem page caches due to they are already dirty at that time. Especially for THP split failure, which is caused by temporary reference count, we can try again to mitigate the failure of migration in this case according to previous discussion [1]. [1] https://lore.kernel.org/all/470dc638-a300-f261-94b4-e27250e42f96@redhat.com/ Link: https://lkml.kernel.org/r/6784730480a1df82e8f4cba1ed088e4ac767994b.1666599848.git.baolin.wang@linux.alibaba.com Signed-off-by:
Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by:
"Huang, Ying" <ying.huang@intel.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: David Hildenbrand <david@redhat.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Stable-dep-of: 35e41024 ("vmscan,migrate: fix page count imbalance on node stats when demoting pages") Signed-off-by:
Sasha Levin <sashal@kernel.org>
-
- Jun 12, 2024
-
-
[ Upstream commit e51da3a9 ] Helper function to retrieve hstate information from a hugetlb folio. Link: https://lkml.kernel.org/r/20220922154207.1575343-6-sidhartha.kumar@oracle.com Signed-off-by:
Sidhartha Kumar <sidhartha.kumar@oracle.com> Reported-by:
kernel test robot <lkp@intel.com> Reviewed-by:
Mike Kravetz <mike.kravetz@oracle.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Colin Cross <ccross@google.com> Cc: David Howells <dhowells@redhat.com> Cc: "Eric W . Biederman" <ebiederm@xmission.com> Cc: Hugh Dickins <hughd@google.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Peter Xu <peterx@redhat.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: William Kucharski <william.kucharski@oracle.com> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Stable-dep-of: b76b4690 ("mm/hugetlb: fix missing hugetlb_lock for resv uncharge") Signed-off-by:
Sasha Levin <sashal@kernel.org>
-
- Apr 11, 2024
-
-
The tail pages in a THP can have swap entry information stored in their private field. When migrating to a new page, all tail pages of the new page need to update ->private to avoid future data corruption. This fix is stable-only, since after commit 07e09c48 ("mm/huge_memory: work on folio->swap instead of page->private when splitting folio"), subpages of a swapcached THP no longer requires the maintenance. Adding THPs to the swapcache was introduced in commit 38d8b4e6 ("mm, THP, swap: delay splitting THP during swap out"), where each subpage of a THP added to the swapcache had its own swapcache entry and required the ->private field to point to the correct swapcache entry. Later, when THP migration functionality was implemented in commit 616b8371 ("mm: thp: enable thp migration in generic path"), it initially did not handle the subpages of swapcached THPs, failing to update their ->private fields or replace the subpage pointers in the swapcache. Subsequently, commit e71769ae ("mm: enable thp migration for shmem thp") addressed the swapcache update aspect. This patch fixes the update of subpage ->private fields. Closes: https://lore.kernel.org/linux-mm/1707814102-22682-1-git-send-email-quic_charante@quicinc.com/ Fixes: 616b8371 ("mm: thp: enable thp migration in generic path") Signed-off-by:
Zi Yan <ziy@nvidia.com> Acked-by:
David Hildenbrand <david@redhat.com> Reported-and-tested-by:
Charan Teja Kalla <quic_charante@quicinc.com> Signed-off-by:
Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
- Jan 23, 2024
-
-
[ Upstream commit 0201ebf2 ] Patch series "mm, netfs, fscache: Stop read optimisation when folio removed from pagecache", v7. This fixes an optimisation in fscache whereby we don't read from the cache for a particular file until we know that there's data there that we don't have in the pagecache. The problem is that I'm no longer using PG_fscache (aka PG_private_2) to indicate that the page is cached and so I don't get a notification when a cached page is dropped from the pagecache. The first patch merges some folio_has_private() and filemap_release_folio() pairs and introduces a helper, folio_needs_release(), to indicate if a release is required. The second patch is the actual fix. Following Willy's suggestions[1], it adds an AS_RELEASE_ALWAYS flag to an address_space that will make filemap_release_folio() always call ->release_folio(), even if PG_private/PG_private_2 aren't set. folio_needs_release() is altered to add a check for this. This patch (of 2): Make filemap_release_folio() check folio_has_private(). Then, in most cases, where a call to folio_has_private() is immediately followed by a call to filemap_release_folio(), we can get rid of the test in the pair. There are a couple of sites in mm/vscan.c that this can't so easily be done. In shrink_folio_list(), there are actually three cases (something different is done for incompletely invalidated buffers), but filemap_release_folio() elides two of them. In shrink_active_list(), we don't have have the folio lock yet, so the check allows us to avoid locking the page unnecessarily. A wrapper function to check if a folio needs release is provided for those places that still need to do it in the mm/ directory. This will acquire additional parts to the condition in a future patch. After this, the only remaining caller of folio_has_private() outside of mm/ is a check in fuse. Link: https://lkml.kernel.org/r/20230628104852.3391651-1-dhowells@redhat.com Link: https://lkml.kernel.org/r/20230628104852.3391651-2-dhowells@redhat.com Reported-by:
Rohith Surabattula <rohiths.msft@gmail.com> Suggested-by:
Matthew Wilcox <willy@infradead.org> Signed-off-by:
David Howells <dhowells@redhat.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Steve French <sfrench@samba.org> Cc: Shyam Prasad N <nspmangalore@gmail.com> Cc: Rohith Surabattula <rohiths.msft@gmail.com> Cc: Dave Wysochanski <dwysocha@redhat.com> Cc: Dominique Martinet <asmadeus@codewreck.org> Cc: Ilya Dryomov <idryomov@gmail.com> Cc: "Theodore Ts'o" <tytso@mit.edu> Cc: Andreas Dilger <adilger.kernel@dilger.ca> Cc: Xiubo Li <xiubli@redhat.com> Cc: Jingbo Xu <jefflexu@linux.alibaba.com> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Stable-dep-of: 1898efcd ("block: update the stable_writes flag in bdev_add") Signed-off-by:
Sasha Levin <sashal@kernel.org>
-
commit fc346d0a upstream. Large folios occupy N consecutive entries in the swap cache instead of using multi-index entries like the page cache. However, if a large folio is re-added to the LRU list, it can be migrated. The migration code was not aware of the difference between the swap cache and the page cache and assumed that a single xas_store() would be sufficient. This leaves potentially many stale pointers to the now-migrated folio in the swap cache, which can lead to almost arbitrary data corruption in the future. This can also manifest as infinite loops with the RCU read lock held. [willy@infradead.org: modifications to the changelog & tweaked the fix] Fixes: 3417013e ("mm/migrate: Add folio_migrate_mapping()") Link: https://lkml.kernel.org/r/20231214045841.961776-1-willy@infradead.org Signed-off-by:
Charan Teja Kalla <quic_charante@quicinc.com> Signed-off-by:
Matthew Wilcox (Oracle) <willy@infradead.org> Reported-by:
Charan Teja Kalla <quic_charante@quicinc.com> Closes: https://lkml.kernel.org/r/1700569840-17327-1-git-send-email-quic_charante@quicinc.com Cc: David Hildenbrand <david@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: <stable@vger.kernel.org> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Signed-off-by:
Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
- Nov 30, 2023
-
-
commit 229e2253 upstream. do_pages_move does not handle compat pointers for the page list. correctly. Add in_compat_syscall check and appropriate get_user fetch when iterating the page list. It makes the syscall in compat mode (32-bit userspace, 64-bit kernel) work the same way as the native 32-bit syscall again, restoring the behavior before my broken commit 5b1b561b ("mm: simplify compat_sys_move_pages"). More specifically, my patch moved the parsing of the 'pages' array from the main entry point into do_pages_stat(), which left the syscall working correctly for the 'stat' operation (nodes = NULL), while the 'move' operation (nodes != NULL) is now missing the conversion and interprets 'pages' as an array of 64-bit pointers instead of the intended 32-bit userspace pointers. It is possible that nobody noticed this bug because the few applications that actually call move_pages are unlikely to run in compat mode because of their large memory requirements, but this clearly fixes a user-visible regression and should have been caught by ltp. Link: https://lkml.kernel.org/r/20231003144857.752952-1-gregory.price@memverge.com Fixes: 5b1b561b ("mm: simplify compat_sys_move_pages") Signed-off-by:
Gregory Price <gregory.price@memverge.com> Reported-by:
Arnd Bergmann <arnd@arndb.de> Co-developed-by:
Arnd Bergmann <arnd@arndb.de> Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com> Cc: <stable@vger.kernel.org> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Signed-off-by:
Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
- Feb 22, 2023
-
-
Peter Xu authored
commit 96a9c287 upstream. Nick Bowler reported another sparc64 breakage after the young/dirty persistent work for page migration (per "Link:" below). That's after a similar report [2]. It turns out page migration was overlooked, and it wasn't failing before because page migration was not enabled in the initial report test environment. David proposed another way [2] to fix this from sparc64 side, but that patch didn't land somehow. Neither did I check whether there's any other arch that has similar issues. Let's fix it for now as simple as moving the write bit handling to be after dirty, like what we did before. Note: this is based on mm-unstable, because the breakage was since 6.1 and we're at a very late stage of 6.2 (-rc8), so I assume for this specific case we should target this at 6.3. [1] https://lore.kernel.org/all/20221021160603.GA23307@u164.east.ru/ [2] https://lore.kernel.org/all/20221212130213.136267-1-david@redhat.com/ Link: https://lkml.kernel.org/r/20230216153059.256739-1-peterx@redhat.com Fixes: 2e346877 ("mm: remember young/dirty bit for page migrations") Link: https://lore.kernel.org/all/CADyTPExpEqaJiMGoV+Z6xVgL50ZoMJg49B10LcZ=8eg19u34BA@mail.gmail.com/ Signed-off-by:
Peter Xu <peterx@redhat.com> Reported-by:
Nick Bowler <nbowler@draconx.ca> Acked-by:
David Hildenbrand <david@redhat.com> Tested-by:
Nick Bowler <nbowler@draconx.ca> Cc: <regressions@lists.linux.dev> Cc: <stable@vger.kernel.org> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Signed-off-by:
Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
- Oct 28, 2022
-
-
Baolin Wang authored
During THP migration, if THPs are not migrated but they are split and all subpages are migrated successfully, migrate_pages() will still return the number of THP pages that were not migrated. This will confuse the callers of migrate_pages(). For example, the longterm pinning will failed though all pages are migrated successfully. Thus we should return 0 to indicate that all pages are migrated in this case Link: https://lkml.kernel.org/r/de386aa864be9158d2f3b344091419ea7c38b2f7.1666599848.git.baolin.wang@linux.alibaba.com Fixes: b5bade97 ("mm: migrate: fix the return value of migrate_pages()") Signed-off-by:
Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by:
Alistair Popple <apopple@nvidia.com> Reviewed-by:
Yang Shi <shy828301@gmail.com> Cc: David Hildenbrand <david@redhat.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Zi Yan <ziy@nvidia.com> Cc: <stable@vger.kernel.org> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org>
-
- Oct 13, 2022
-
-
Alistair Popple authored
Patch series "Fix several device private page reference counting issues", v2 This series aims to fix a number of page reference counting issues in drivers dealing with device private ZONE_DEVICE pages. These result in use-after-free type bugs, either from accessing a struct page which no longer exists because it has been removed or accessing fields within the struct page which are no longer valid because the page has been freed. During normal usage it is unlikely these will cause any problems. However without these fixes it is possible to crash the kernel from userspace. These crashes can be triggered either by unloading the kernel module or unbinding the device from the driver prior to a userspace task exiting. In modules such as Nouveau it is also possible to trigger some of these issues by explicitly closing the device file-descriptor prior to the task exiting and then accessing device private memory. This involves some minor changes to both PowerPC and AMD GPU code. Unfortunately I lack hardware to test either of those so any help there would be appreciated. The changes mimic what is done in for both Nouveau and hmm-tests though so I doubt they will cause problems. This patch (of 8): When the CPU tries to access a device private page the migrate_to_ram() callback associated with the pgmap for the page is called. However no reference is taken on the faulting page. Therefore a concurrent migration of the device private page can free the page and possibly the underlying pgmap. This results in a race which can crash the kernel due to the migrate_to_ram() function pointer becoming invalid. It also means drivers can't reliably read the zone_device_data field because the page may have been freed with memunmap_pages(). Close the race by getting a reference on the page while holding the ptl to ensure it has not been freed. Unfortunately the elevated reference count will cause the migration required to handle the fault to fail. To avoid this failure pass the faulting page into the migrate_vma functions so that if an elevated reference count is found it can be checked to see if it's expected or not. [mpe@ellerman.id.au: fix build] Link: https://lkml.kernel.org/r/87fsgbf3gh.fsf@mpe.ellerman.id.au Link: https://lkml.kernel.org/r/cover.60659b549d8509ddecafad4f498ee7f03bb23c69.1664366292.git-series.apopple@nvidia.com Link: https://lkml.kernel.org/r/d3e813178a59e565e8d78d9b9a4e2562f6494f90.1664366292.git-series.apopple@nvidia.com Signed-off-by:
Alistair Popple <apopple@nvidia.com> Acked-by:
Felix Kuehling <Felix.Kuehling@amd.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Ralph Campbell <rcampbell@nvidia.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Lyude Paul <lyude@redhat.com> Cc: Alex Deucher <alexander.deucher@amd.com> Cc: Alex Sierra <alex.sierra@amd.com> Cc: Ben Skeggs <bskeggs@redhat.com> Cc: Christian König <christian.koenig@amd.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Yang Shi <shy828301@gmail.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org>
-
- Oct 03, 2022
-
-
Matthew Wilcox (Oracle) authored
With all callers now passing in a folio, rename the function and convert all callers. Removes a couple of calls to compound_head() and a reference to page->mapping. Link: https://lkml.kernel.org/r/20220902194653.1739778-55-willy@infradead.org Signed-off-by:
Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org>
-
Matthew Wilcox (Oracle) authored
Saves several calls to compound_head() and removes a couple of uses of page->lru. Link: https://lkml.kernel.org/r/20220902194653.1739778-52-willy@infradead.org Signed-off-by:
Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org>
-
Matthew Wilcox (Oracle) authored
Removes a lot of calls to compound_head(). Also remove a VM_BUG_ON that can never trigger as the PageAnon bit is the bottom bit of page->mapping. Link: https://lkml.kernel.org/r/20220902194653.1739778-51-willy@infradead.org Signed-off-by:
Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org>
-
- Sep 27, 2022
-
-
Haiyue Wang authored
The handling Non-LRU pages returned by follow_page() jumps directly, it doesn't call put_page() to handle the reference count, since 'FOLL_GET' flag for follow_page() has get_page() called. Fix the zone device page check by handling the page reference count correctly before returning. And as David reviewed, "device pages are never PageKsm pages". Drop this zone device page check for break_ksm(). Since the zone device page can't be a transparent huge page, so drop the redundant zone device page check for split_huge_pages_pid(). (by Miaohe) Link: https://lkml.kernel.org/r/20220823135841.934465-3-haiyue.wang@intel.com Fixes: 3218f871 ("mm: handling Non-LRU pages returned by vm_normal_pages") Signed-off-by:
Haiyue Wang <haiyue.wang@intel.com> Reviewed-by:
"Huang, Ying" <ying.huang@intel.com> Reviewed-by:
Felix Kuehling <Felix.Kuehling@amd.com> Reviewed-by:
Alistair Popple <apopple@nvidia.com> Reviewed-by:
Miaohe Lin <linmiaohe@huawei.com> Acked-by:
David Hildenbrand <david@redhat.com> Cc: Alex Sierra <alex.sierra@amd.com> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Muchun Song <songmuchun@bytedance.com> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org>
-
Aneesh Kumar K.V authored
With memory tier support we can have memory only NUMA nodes in the top tier from which we want to avoid promotion tracking NUMA faults. Update node_is_toptier to work with memory tiers. All NUMA nodes are by default top tier nodes. With lower(slower) memory tiers added we consider all memory tiers above a memory tier having CPU NUMA nodes as a top memory tier [sj@kernel.org: include missed header file, memory-tiers.h] Link: https://lkml.kernel.org/r/20220820190720.248704-1-sj@kernel.org [akpm@linux-foundation.org: mm/memory.c needs linux/memory-tiers.h] [aneesh.kumar@linux.ibm.com: make toptier_distance inclusive upper bound of toptiers] Link: https://lkml.kernel.org/r/20220830081457.118960-1-aneesh.kumar@linux.ibm.com Link: https://lkml.kernel.org/r/20220818131042.113280-10-aneesh.kumar@linux.ibm.com Signed-off-by:
Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Reviewed-by:
"Huang, Ying" <ying.huang@intel.com> Acked-by:
Wei Xu <weixugc@google.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Bharata B Rao <bharata@amd.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Hesham Almatary <hesham.almatary@huawei.com> Cc: Jagdish Gediya <jvgediya.oss@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Tim Chen <tim.c.chen@intel.com> Cc: Yang Shi <shy828301@gmail.com> Cc: SeongJae Park <sj@kernel.org> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org>
-
Aneesh Kumar K.V authored
This patch switch the demotion target building logic to use memory tiers instead of NUMA distance. All N_MEMORY NUMA nodes will be placed in the default memory tier and additional memory tiers will be added by drivers like dax kmem. This patch builds the demotion target for a NUMA node by looking at all memory tiers below the tier to which the NUMA node belongs. The closest node in the immediately following memory tier is used as a demotion target. Since we are now only building demotion target for N_MEMORY NUMA nodes the CPU hotplug calls are removed in this patch. Link: https://lkml.kernel.org/r/20220818131042.113280-6-aneesh.kumar@linux.ibm.com Signed-off-by:
Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Reviewed-by:
"Huang, Ying" <ying.huang@intel.com> Acked-by:
Wei Xu <weixugc@google.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Bharata B Rao <bharata@amd.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Hesham Almatary <hesham.almatary@huawei.com> Cc: Jagdish Gediya <jvgediya.oss@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Tim Chen <tim.c.chen@intel.com> Cc: Yang Shi <shy828301@gmail.com> Cc: SeongJae Park <sj@kernel.org> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org>
-
Aneesh Kumar K.V authored
This moves memory demotion related code to mm/memory-tiers.c. No functional change in this patch. Link: https://lkml.kernel.org/r/20220818131042.113280-3-aneesh.kumar@linux.ibm.com Signed-off-by:
Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Reviewed-by:
"Huang, Ying" <ying.huang@intel.com> Acked-by:
Wei Xu <weixugc@google.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Bharata B Rao <bharata@amd.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Hesham Almatary <hesham.almatary@huawei.com> Cc: Jagdish Gediya <jvgediya.oss@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Tim Chen <tim.c.chen@intel.com> Cc: Yang Shi <shy828301@gmail.com> Cc: SeongJae Park <sj@kernel.org> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org>
-
Baolin Wang authored
If THP is failed to migrate due to -ENOSYS or -ENOMEM case, the THP will be split, and the subpages of fail-to-migrate THP will be tried to migrate again, so we should not account the retry counter in the second loop, since we already accounted 'nr_thp_failed' in the first loop. Moreover we also do not need retry 10 times for -EAGAIN case for the subpages of fail-to-migrate THP in the second loop, since we already regarded the THP as migration failure, and save some migration time (for the worst case, will try 512 * 10 times) according to previous discussion [1]. [1] https://lore.kernel.org/linux-mm/87r13a7n04.fsf@yhuang6-desk2.ccr.corp.intel.com/ Link: https://lkml.kernel.org/r/20220817081408.513338-9-ying.huang@intel.com Tested-by:
"Huang, Ying" <ying.huang@intel.com> Signed-off-by:
Baolin Wang <baolin.wang@linux.alibaba.com> Signed-off-by:
"Huang, Ying" <ying.huang@intel.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Zi Yan <ziy@nvidia.com> Cc: Yang Shi <shy828301@gmail.com> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org>
-
Huang Ying authored
After 10 retries, we will give up and the remaining pages will be counted as failure in nr_failed and nr_thp_failed. We should count the failure in nr_failed_pages too. This is done in this patch. Link: https://lkml.kernel.org/r/20220817081408.513338-8-ying.huang@intel.com Fixes: 5984fabb ("mm: move_pages: report the number of non-attempted pages") Signed-off-by:
"Huang, Ying" <ying.huang@intel.com> Reviewed-by:
Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by:
Oscar Salvador <osalvador@suse.de> Cc: Zi Yan <ziy@nvidia.com> Cc: Yang Shi <shy828301@gmail.com> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org>
-
Huang Ying authored
If THP is failed to be migrated, it may be split and retry. But after splitting, the head page will be left in "from" list, although THP migration failure has been counted already. If the head page is failed to be migrated too, the failure will be counted twice incorrectly. So this is fixed in this patch via moving the head page of THP after splitting to "thp_split_pages" too. Link: https://lkml.kernel.org/r/20220817081408.513338-7-ying.huang@intel.com Fixes: 5984fabb ("mm: move_pages: report the number of non-attempted pages") Signed-off-by:
"Huang, Ying" <ying.huang@intel.com> Reviewed-by:
Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by:
Oscar Salvador <osalvador@suse.de> Cc: Zi Yan <ziy@nvidia.com> Cc: Yang Shi <shy828301@gmail.com> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org>
-
Huang Ying authored
If THP or hugetlbfs page migration isn't supported, unmap_and_move() or unmap_and_move_huge_page() will return -ENOSYS. For THP, splitting will be tried, but if splitting doesn't succeed, the THP will be left in "from" list wrongly. If some other pages are retried, the THP migration failure will counted again. This is fixed via moving the failure THP from "from" to "ret_pages". Another issue of the original code is that the unsupported failure processing isn't consistent between THP and hugetlbfs page. Make them consistent in this patch to make the code easier to be understood too. Link: https://lkml.kernel.org/r/20220817081408.513338-6-ying.huang@intel.com Fixes: 5984fabb ("mm: move_pages: report the number of non-attempted pages") Signed-off-by:
"Huang, Ying" <ying.huang@intel.com> Reviewed-by:
Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Zi Yan <ziy@nvidia.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Oscar Salvador <osalvador@suse.de> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org>
-
Huang Ying authored
If THP is failed to be migrated for -ENOSYS and -ENOMEM, the THP will be split into thp_split_pages, and after other pages are migrated, pages in thp_split_pages will be migrated with no_subpage_counting == true, because its failure have been counted already. If some pages in thp_split_pages are retried during migration, we should not count their failure if no_subpage_counting == true too. This is done this patch to fix the failure counting for THP subpages retrying. Link: https://lkml.kernel.org/r/20220817081408.513338-5-ying.huang@intel.com Fixes: 5984fabb ("mm: move_pages: report the number of non-attempted pages") Signed-off-by:
"Huang, Ying" <ying.huang@intel.com> Reviewed-by:
Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by:
Oscar Salvador <osalvador@suse.de> Cc: Zi Yan <ziy@nvidia.com> Cc: Yang Shi <shy828301@gmail.com> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org>
-
Huang Ying authored
In unmap_and_move(), if the new THP cannot be allocated, -ENOMEM will be returned, and migrate_pages() will try to split the THP unless "reason" is MR_NUMA_MISPLACED (that is, nosplit == true). But when nosplit == true, the THP migration failure will not be counted. This is incorrect, so in this patch, the THP migration failure will be counted for -ENOMEM regardless of nosplit is true or false. The nr_failed counting isn't fixed because it's not used. Added some comments for it per Baolin's suggestion. Link: https://lkml.kernel.org/r/20220817081408.513338-4-ying.huang@intel.com Fixes: 5984fabb ("mm: move_pages: report the number of non-attempted pages") Signed-off-by:
"Huang, Ying" <ying.huang@intel.com> Reviewed-by:
Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by:
Oscar Salvador <osalvador@suse.de> Cc: Zi Yan <ziy@nvidia.com> Cc: Yang Shi <shy828301@gmail.com> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org>
-
Huang Ying authored
Before commit b5bade97 ("mm: migrate: fix the return value of migrate_pages()"), the tail pages of THP will be put in the "from" list directly. So one of the loop cursors (page2) needs to be reset, as is done in try_split_thp() via list_safe_reset_next(). But after the commit, the tail pages of THP will be put in a dedicated list (thp_split_pages). That is, the "from" list will not be changed during splitting. So, it's unnecessary to call list_safe_reset_next() anymore. This is a code cleanup, no functionality changes are expected. Link: https://lkml.kernel.org/r/20220817081408.513338-3-ying.huang@intel.com Signed-off-by:
"Huang, Ying" <ying.huang@intel.com> Reviewed-by:
Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by:
Oscar Salvador <osalvador@suse.de> Cc: Zi Yan <ziy@nvidia.com> Cc: Yang Shi <shy828301@gmail.com> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org>
-
Huang Ying authored
Patch series "migrate_pages(): fix several bugs in error path", v3. During review the code of migrate_pages() and build a test program for it. Several bugs in error path are identified and fixed in this series. Most patches are tested via - Apply error-inject.patch in Linux kernel - Compile test-migrate.c (with -lnuma) - Test with test-migrate.sh error-inject.patch, test-migrate.c, and test-migrate.sh are as below. It turns out that error injection is an important tool to fix bugs in error path. This patch (of 8): The return value of move_pages() syscall is incorrect when counting the remaining pages to be migrated. For example, for the following test program, " #define _GNU_SOURCE #include <stdbool.h> #include <stdio.h> #include <string.h> #include <stdlib.h> #include <errno.h> #include <fcntl.h> #include <sys/uio.h> #include <sys/mman.h> #include <sys/types.h> #include <unistd.h> #include <numaif.h> #include <numa.h> #ifndef MADV_FREE #define MADV_FREE 8 /* free pages only if memory pressure */ #endif #define ONE_MB (1024 * 1024) #define MAP_SIZE (16 * ONE_MB) #define THP_SIZE (2 * ONE_MB) #define THP_MASK (THP_SIZE - 1) #define ERR_EXIT_ON(cond, msg) \ do { \ int __cond_in_macro = (cond); \ if (__cond_in_macro) \ error_exit(__cond_in_macro, (msg)); \ } while (0) void error_msg(int ret, int nr, int *status, const char *msg) { int i; fprintf(stderr, "Error: %s, ret : %d, error: %s\n", msg, ret, strerror(errno)); if (!nr) return; fprintf(stderr, "status: "); for (i = 0; i < nr; i++) fprintf(stderr, "%d ", status[i]); fprintf(stderr, "\n"); } void error_exit(int ret, const char *msg) { error_msg(ret, 0, NULL, msg); exit(1); } int page_size; bool do_vmsplice; bool do_thp; static int pipe_fds[2]; void *addr; char *pn; char *pn1; void *pages[2]; int status[2]; void prepare() { int ret; struct iovec iov; if (addr) { munmap(addr, MAP_SIZE); close(pipe_fds[0]); close(pipe_fds[1]); } ret = pipe(pipe_fds); ERR_EXIT_ON(ret, "pipe"); addr = mmap(NULL, MAP_SIZE, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0); ERR_EXIT_ON(addr == MAP_FAILED, "mmap"); if (do_thp) { ret = madvise(addr, MAP_SIZE, MADV_HUGEPAGE); ERR_EXIT_ON(ret, "advise hugepage"); } pn = (char *)(((unsigned long)addr + THP_SIZE) & ~THP_MASK); pn1 = pn + THP_SIZE; pages[0] = pn; pages[1] = pn1; *pn = 1; if (do_vmsplice) { iov.iov_base = pn; iov.iov_len = page_size; ret = vmsplice(pipe_fds[1], &iov, 1, 0); ERR_EXIT_ON(ret < 0, "vmsplice"); } status[0] = status[1] = 1024; } void test_migrate() { int ret; int nodes[2] = { 1, 1 }; pid_t pid = getpid(); prepare(); ret = move_pages(pid, 1, pages, nodes, status, MPOL_MF_MOVE_ALL); error_msg(ret, 1, status, "move 1 page"); prepare(); ret = move_pages(pid, 2, pages, nodes, status, MPOL_MF_MOVE_ALL); error_msg(ret, 2, status, "move 2 pages, page 1 not mapped"); prepare(); *pn1 = 1; ret = move_pages(pid, 2, pages, nodes, status, MPOL_MF_MOVE_ALL); error_msg(ret, 2, status, "move 2 pages"); prepare(); *pn1 = 1; nodes[1] = 0; ret = move_pages(pid, 2, pages, nodes, status, MPOL_MF_MOVE_ALL); error_msg(ret, 2, status, "move 2 pages, page 1 to node 0"); } int main(int argc, char *argv[]) { numa_run_on_node(0); page_size = getpagesize(); test_migrate(); fprintf(stderr, "\nMake page 0 cannot be migrated:\n"); do_vmsplice = true; test_migrate(); fprintf(stderr, "\nTest THP:\n"); do_thp = true; do_vmsplice = false; test_migrate(); fprintf(stderr, "\nTHP: make page 0 cannot be migrated:\n"); do_vmsplice = true; test_migrate(); return 0; } " The output of the current kernel is, " Error: move 1 page, ret : 0, error: Success status: 1 Error: move 2 pages, page 1 not mapped, ret : 0, error: Success status: 1 -14 Error: move 2 pages, ret : 0, error: Success status: 1 1 Error: move 2 pages, page 1 to node 0, ret : 0, error: Success status: 1 0 Make page 0 cannot be migrated: Error: move 1 page, ret : 0, error: Success status: 1024 Error: move 2 pages, page 1 not mapped, ret : 1, error: Success status: 1024 -14 Error: move 2 pages, ret : 0, error: Success status: 1024 1024 Error: move 2 pages, page 1 to node 0, ret : 1, error: Success status: 1024 1024 " While the expected output is, " Error: move 1 page, ret : 0, error: Success status: 1 Error: move 2 pages, page 1 not mapped, ret : 0, error: Success status: 1 -14 Error: move 2 pages, ret : 0, error: Success status: 1 1 Error: move 2 pages, page 1 to node 0, ret : 0, error: Success status: 1 0 Make page 0 cannot be migrated: Error: move 1 page, ret : 1, error: Success status: 1024 Error: move 2 pages, page 1 not mapped, ret : 1, error: Success status: 1024 -14 Error: move 2 pages, ret : 1, error: Success status: 1024 1024 Error: move 2 pages, page 1 to node 0, ret : 2, error: Success status: 1024 1024 " Fix this via correcting the remaining pages counting. With the fix, the output for the test program as above is expected. Link: https://lkml.kernel.org/r/20220817081408.513338-1-ying.huang@intel.com Link: https://lkml.kernel.org/r/20220817081408.513338-2-ying.huang@intel.com Fixes: 5984fabb ("mm: move_pages: report the number of non-attempted pages") Signed-off-by:
"Huang, Ying" <ying.huang@intel.com> Reviewed-by:
Oscar Salvador <osalvador@suse.de> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Zi Yan <ziy@nvidia.com> Cc: Yang Shi <shy828301@gmail.com> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org>
-
Peter Xu authored
When page migration happens, we always ignore the young/dirty bit settings in the old pgtable, and marking the page as old in the new page table using either pte_mkold() or pmd_mkold(), and keeping the pte clean. That's fine from functional-wise, but that's not friendly to page reclaim because the moving page can be actively accessed within the procedure. Not to mention hardware setting the young bit can bring quite some overhead on some systems, e.g. x86_64 needs a few hundreds nanoseconds to set the bit. The same slowdown problem to dirty bits when the memory is first written after page migration happened. Actually we can easily remember the A/D bit configuration and recover the information after the page is migrated. To achieve it, define a new set of bits in the migration swap offset field to cache the A/D bits for old pte. Then when removing/recovering the migration entry, we can recover the A/D bits even if the page changed. One thing to mention is that here we used max_swapfile_size() to detect how many swp offset bits we have, and we'll only enable this feature if we know the swp offset is big enough to store both the PFN value and the A/D bits. Otherwise the A/D bits are dropped like before. Link: https://lkml.kernel.org/r/20220811161331.37055-6-peterx@redhat.com Signed-off-by:
Peter Xu <peterx@redhat.com> Reviewed-by:
"Huang, Ying" <ying.huang@intel.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Andi Kleen <andi.kleen@intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: "Kirill A . Shutemov" <kirill@shutemov.name> Cc: Minchan Kim <minchan@kernel.org> Cc: Nadav Amit <nadav.amit@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Dave Hansen <dave.hansen@intel.com> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org>
-
- Sep 12, 2022
-
-
Huang Ying authored
Patch series "memory tiering: hot page selection", v4. To optimize page placement in a memory tiering system with NUMA balancing, the hot pages in the slow memory nodes need to be identified. Essentially, the original NUMA balancing implementation selects the mostly recently accessed (MRU) pages to promote. But this isn't a perfect algorithm to identify the hot pages. Because the pages with quite low access frequency may be accessed eventually given the NUMA balancing page table scanning period could be quite long (e.g. 60 seconds). So in this patchset, we implement a new hot page identification algorithm based on the latency between NUMA balancing page table scanning and hint page fault. Which is a kind of mostly frequently accessed (MFU) algorithm. In NUMA balancing memory tiering mode, if there are hot pages in slow memory node and cold pages in fast memory node, we need to promote/demote hot/cold pages between the fast and cold memory nodes. A choice is to promote/demote as fast as possible. But the CPU cycles and memory bandwidth consumed by the high promoting/demoting throughput will hurt the latency of some workload because of accessing inflating and slow memory bandwidth contention. A way to resolve this issue is to restrict the max promoting/demoting throughput. It will take longer to finish the promoting/demoting. But the workload latency will be better. This is implemented in this patchset as the page promotion rate limit mechanism. The promotion hot threshold is workload and system configuration dependent. So in this patchset, a method to adjust the hot threshold automatically is implemented. The basic idea is to control the number of the candidate promotion pages to match the promotion rate limit. We used the pmbench memory accessing benchmark tested the patchset on a 2-socket server system with DRAM and PMEM installed. The test results are as follows, pmbench score promote rate (accesses/s) MB/s ------------- ------------ base 146887704.1 725.6 hot selection 165695601.2 544.0 rate limit 162814569.8 165.2 auto adjustment 170495294.0 136.9 From the results above, With hot page selection patch [1/3], the pmbench score increases about 12.8%, and promote rate (overhead) decreases about 25.0%, compared with base kernel. With rate limit patch [2/3], pmbench score decreases about 1.7%, and promote rate decreases about 69.6%, compared with hot page selection patch. With threshold auto adjustment patch [3/3], pmbench score increases about 4.7%, and promote rate decrease about 17.1%, compared with rate limit patch. Baolin helped to test the patchset with MySQL on a machine which contains 1 DRAM node (30G) and 1 PMEM node (126G). sysbench /usr/share/sysbench/oltp_read_write.lua \ ...... --tables=200 \ --table-size=1000000 \ --report-interval=10 \ --threads=16 \ --time=120 The tps can be improved about 5%. This patch (of 3): To optimize page placement in a memory tiering system with NUMA balancing, the hot pages in the slow memory node need to be identified. Essentially, the original NUMA balancing implementation selects the mostly recently accessed (MRU) pages to promote. But this isn't a perfect algorithm to identify the hot pages. Because the pages with quite low access frequency may be accessed eventually given the NUMA balancing page table scanning period could be quite long (e.g. 60 seconds). The most frequently accessed (MFU) algorithm is better. So, in this patch we implemented a better hot page selection algorithm. Which is based on NUMA balancing page table scanning and hint page fault as follows, - When the page tables of the processes are scanned to change PTE/PMD to be PROT_NONE, the current time is recorded in struct page as scan time. - When the page is accessed, hint page fault will occur. The scan time is gotten from the struct page. And The hint page fault latency is defined as hint page fault time - scan time The shorter the hint page fault latency of a page is, the higher the probability of their access frequency to be higher. So the hint page fault latency is a better estimation of the page hot/cold. It's hard to find some extra space in struct page to hold the scan time. Fortunately, we can reuse some bits used by the original NUMA balancing. NUMA balancing uses some bits in struct page to store the page accessing CPU and PID (referring to page_cpupid_xchg_last()). Which is used by the multi-stage node selection algorithm to avoid to migrate pages shared accessed by the NUMA nodes back and forth. But for pages in the slow memory node, even if they are shared accessed by multiple NUMA nodes, as long as the pages are hot, they need to be promoted to the fast memory node. So the accessing CPU and PID information are unnecessary for the slow memory pages. We can reuse these bits in struct page to record the scan time. For the fast memory pages, these bits are used as before. For the hot threshold, the default value is 1 second, which works well in our performance test. All pages with hint page fault latency < hot threshold will be considered hot. It's hard for users to determine the hot threshold. So we don't provide a kernel ABI to set it, just provide a debugfs interface for advanced users to experiment. We will continue to work on a hot threshold automatic adjustment mechanism. The downside of the above method is that the response time to the workload hot spot changing may be much longer. For example, - A previous cold memory area becomes hot - The hint page fault will be triggered. But the hint page fault latency isn't shorter than the hot threshold. So the pages will not be promoted. - When the memory area is scanned again, maybe after a scan period, the hint page fault latency measured will be shorter than the hot threshold and the pages will be promoted. To mitigate this, if there are enough free space in the fast memory node, the hot threshold will not be used, all pages will be promoted upon the hint page fault for fast response. Thanks Zhong Jiang reported and tested the fix for a bug when disabling memory tiering mode dynamically. Link: https://lkml.kernel.org/r/20220713083954.34196-1-ying.huang@intel.com Link: https://lkml.kernel.org/r/20220713083954.34196-2-ying.huang@intel.com Signed-off-by:
"Huang, Ying" <ying.huang@intel.com> Reviewed-by:
Baolin Wang <baolin.wang@linux.alibaba.com> Tested-by:
Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Rik van Riel <riel@surriel.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Zi Yan <ziy@nvidia.com> Cc: Wei Xu <weixugc@google.com> Cc: osalvador <osalvador@suse.de> Cc: Shakeel Butt <shakeelb@google.com> Cc: Zhong Jiang <zhongjiang-ali@linux.alibaba.com> Cc: Oscar Salvador <osalvador@suse.de> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org>
-
Haiyue Wang authored
Not all huge page APIs support FOLL_GET option, so move_pages() syscall will fail to get the page node information for some huge pages. Like x86 on linux 5.19 with 1GB huge page API follow_huge_pud(), it will return NULL page for FOLL_GET when calling move_pages() syscall with the NULL 'nodes' parameter, the 'status' parameter has '-2' error in array. Note: follow_huge_pud() now supports FOLL_GET in linux 6.0. Link: https://lore.kernel.org/all/20220714042420.1847125-3-naoya.horiguchi@linux.dev But these huge page APIs don't support FOLL_GET: 1. follow_huge_pud() in arch/s390/mm/hugetlbpage.c 2. follow_huge_addr() in arch/ia64/mm/hugetlbpage.c It will cause WARN_ON_ONCE for FOLL_GET. 3. follow_huge_pgd() in mm/hugetlb.c This is an temporary solution to mitigate the side effect of the race condition fix by calling follow_page() with FOLL_GET set for huge pages. After supporting follow huge page by FOLL_GET is done, this fix can be reverted safely. Link: https://lkml.kernel.org/r/20220823135841.934465-2-haiyue.wang@intel.com Link: https://lkml.kernel.org/r/20220812084921.409142-1-haiyue.wang@intel.com Fixes: 4cd61484 ("mm: migration: fix possible do_pages_stat_array racing with memory offline") Signed-off-by:
Haiyue Wang <haiyue.wang@intel.com> Reviewed-by:
Miaohe Lin <linmiaohe@huawei.com> Reviewed-by:
"Huang, Ying" <ying.huang@intel.com> Reviewed-by:
Baolin Wang <baolin.wang@linux.alibaba.com> Cc: David Hildenbrand <david@redhat.com> Cc: Muchun Song <songmuchun@bytedance.com> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org>
-
- Aug 02, 2022
-
-
Matthew Wilcox (Oracle) authored
With all users converted to migrate_folio(), remove this operation. Signed-off-by:
Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by:
Christoph Hellwig <hch@lst.de>
-
Matthew Wilcox (Oracle) authored
This involves converting migrate_huge_page_move_mapping(). We also need a folio variant of hugetlb_set_page_subpool(), but that's for a later patch. Signed-off-by:
Matthew Wilcox (Oracle) <willy@infradead.org> Acked-by:
Muchun Song <songmuchun@bytedance.com> Reviewed-by:
Mike Kravetz <mike.kravetz@oracle.com>
-
Matthew Wilcox (Oracle) authored
There is nothing iomap-specific about iomap_migratepage(), and it fits a pattern used by several other filesystems, so move it to mm/migrate.c, convert it to be filemap_migrate_folio() and convert the iomap filesystems to use it. Signed-off-by:
Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by:
Christoph Hellwig <hch@lst.de> Reviewed-by:
Darrick J. Wong <djwong@kernel.org>
-