- Aug 08, 2024
-
-
Shakeel Butt authored
Commit 73f576c0 ("mm: memcontrol: fix cgroup creation failure after many small jobs") decoupled the memcg IDs from the CSS ID space to fix the cgroup creation failures. It introduced IDR to maintain the memcg ID space. The IDR depends on external synchronization mechanisms for modifications. For the mem_cgroup_idr, the idr_alloc() and idr_replace() happen within css callback and thus are protected through cgroup_mutex from concurrent modifications. However idr_remove() for mem_cgroup_idr was not protected against concurrency and can be run concurrently for different memcgs when they hit their refcnt to zero. Fix that. We have been seeing list_lru based kernel crashes at a low frequency in our fleet for a long time. These crashes were in different part of list_lru code including list_lru_add(), list_lru_del() and reparenting code. Upon further inspection, it looked like for a given object (dentry and inode), the super_block's list_lru didn't have list_lru_one for the memcg of that object. The initial suspicions were either the object is not allocated through kmem_cache_alloc_lru() or somehow memcg_list_lru_alloc() failed to allocate list_lru_one() for a memcg but returned success. No evidence were found for these cases. Looking more deeply, we started seeing situations where valid memcg's id is not present in mem_cgroup_idr and in some cases multiple valid memcgs have same id and mem_cgroup_idr is pointing to one of them. So, the most reasonable explanation is that these situations can happen due to race between multiple idr_remove() calls or race between idr_alloc()/idr_replace() and idr_remove(). These races are causing multiple memcgs to acquire the same ID and then offlining of one of them would cleanup list_lrus on the system for all of them. Later access from other memcgs to the list_lru cause crashes due to missing list_lru_one. Link: https://lkml.kernel.org/r/20240802235822.1830976-1-shakeel.butt@linux.dev Fixes: 73f576c0 ("mm: memcontrol: fix cgroup creation failure after many small jobs") Signed-off-by:
Shakeel Butt <shakeel.butt@linux.dev> Acked-by:
Muchun Song <muchun.song@linux.dev> Reviewed-by:
Roman Gushchin <roman.gushchin@linux.dev> Acked-by:
Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.com> Cc: <stable@vger.kernel.org> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org>
-
Baolin Wang authored
In the shmem_suitable_orders() function, xa_find() is used to check for conflicts in the pagecache to select suitable huge orders. However, when checking each huge order in every loop, the aligned index is calculated from the previous iteration, which may cause suitable huge orders to be missed. We should use the original index each time in the loop to calculate a new aligned index for checking conflicts to avoid this issue. Link: https://lkml.kernel.org/r/07433b0f16a152bffb8cee34934a5c040e8e2ad6.1722404078.git.baolin.wang@linux.alibaba.com Fixes: e7a2ab7b ("mm: shmem: add mTHP support for anonymous shmem") Signed-off-by:
Baolin Wang <baolin.wang@linux.alibaba.com> Acked-by:
David Hildenbrand <david@redhat.com> Cc: Barry Song <21cnbao@gmail.com> Cc: Gavin Shan <gshan@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Lance Yang <ioworker0@gmail.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Zi Yan <ziy@nvidia.com> Cc: Barry Song <baohua@kernel.org> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org>
-
Baolin Wang authored
Similar to commit d659b715 ("mm/huge_memory: avoid PMD-size page cache if needed"), ARM64 can support 512MB PMD-sized THP when the base page size is 64KB, which is larger than the maximum supported page cache size MAX_PAGECACHE_ORDER. This is not expected. To fix this issue, use THP_ORDERS_ALL_FILE_DEFAULT for shmem to filter allowable huge orders. [baolin.wang@linux.alibaba.com: remove comment, per Barry] Link: https://lkml.kernel.org/r/c55d7ef7-78aa-4ed6-b897-c3e03a3f3ab7@linux.alibaba.com [wangkefeng.wang@huawei.com: remove local `orders'] Link: https://lkml.kernel.org/r/87769ae8-b6c6-4454-925d-1864364af9c8@huawei.com Link: https://lkml.kernel.org/r/117121665254442c3c7f585248296495e5e2b45c.1722404078.git.baolin.wang@linux.alibaba.com Fixes: e7a2ab7b ("mm: shmem: add mTHP support for anonymous shmem") Signed-off-by:
Baolin Wang <baolin.wang@linux.alibaba.com> Signed-off-by:
Kefeng Wang <wangkefeng.wang@huawei.com> Reviewed-by:
Barry Song <baohua@kernel.org> Cc: Barry Song <21cnbao@gmail.com> Cc: David Hildenbrand <david@redhat.com> Cc: Gavin Shan <gshan@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Lance Yang <ioworker0@gmail.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org>
-
Muchun Song authored
The mem_cgroup_from_slab_obj() is supposed to be called under rcu lock or cgroup_mutex or others which could prevent returned memcg from being freed. Fix it by adding missing rcu read lock. Found by code inspection. [songmuchun@bytedance.com: only grab rcu lock when necessary, per Vlastimil] Link: https://lkml.kernel.org/r/20240801024603.1865-1-songmuchun@bytedance.com Link: https://lkml.kernel.org/r/20240718083607.42068-1-songmuchun@bytedance.com Fixes: 0a97c01c ("list_lru: allow explicit memcg and NUMA node selection") Signed-off-by:
Muchun Song <songmuchun@bytedance.com> Acked-by:
Shakeel Butt <shakeel.butt@linux.dev> Acked-by:
Vlastimil Babka <vbabka@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Nhat Pham <nphamcs@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org>
-
- Jul 30, 2024
-
-
Rik van Riel authored
In 782f8906 the freeing of kfence objects was moved from deep inside do_slab_free to the wrapper functions outside. This is a nice change, but unfortunately it missed one spot in __kmem_cache_free_bulk. This results in a crash like this: BUG skbuff_head_cache (Tainted: G S B E ): Padding overwritten. 0xffff88907fea0f00-0xffff88907fea0fff @offset=3840 slab_err (mm/slub.c:1129) free_to_partial_list (mm/slub.c:? mm/slub.c:4036) slab_pad_check (mm/slub.c:864 mm/slub.c:1290) check_slab (mm/slub.c:?) free_to_partial_list (mm/slub.c:3171 mm/slub.c:4036) kmem_cache_alloc_bulk (mm/slub.c:? mm/slub.c:4495 mm/slub.c:4586 mm/slub.c:4635) napi_build_skb (net/core/skbuff.c:348 net/core/skbuff.c:527 net/core/skbuff.c:549) All the other callers to do_slab_free appear to be ok. Add a kfence_free check in __kmem_cache_free_bulk to avoid the crash. Reported-by:
Chris Mason <clm@meta.com> Fixes: 782f8906 ("mm/slub: free KFENCE objects in slab_free_hook()") Cc: stable@kernel.org Signed-off-by:
Rik van Riel <riel@surriel.com> Signed-off-by:
Vlastimil Babka <vbabka@suse.cz>
-
- Jul 28, 2024
-
-
Linus Torvalds authored
This just standardizes the use of MIN() and MAX() macros, with the very traditional semantics. The goal is to use these for C constant expressions and for top-level / static initializers, and so be able to simplify the min()/max() macros. These macro names were used by various kernel code - they are very traditional, after all - and all such users have been fixed up, with a few different approaches: - trivial duplicated macro definitions have been removed Note that 'trivial' here means that it's obviously kernel code that already included all the major kernel headers, and thus gets the new generic MIN/MAX macros automatically. - non-trivial duplicated macro definitions are guarded with #ifndef This is the "yes, they define their own versions, but no, the include situation is not entirely obvious, and maybe they don't get the generic version automatically" case. - strange use case #1 A couple of drivers decided that the way they want to describe their versioning is with #define MAJ 1 #define MIN 2 #define DRV_VERSION __stringify(MAJ) "." __stringify(MIN) which adds zero value and I just did my Alexander the Great impersonation, and rewrote that pointless Gordian knot as #define DRV_VERSION "1.2" instead. - strange use case #2 A couple of drivers thought that it's a good idea to have a random 'MIN' or 'MAX' define for a value or index into a table, rather than the traditional macro that takes arguments. These values were re-written as C enum's instead. The new function-line macros only expand when followed by an open parenthesis, and thus don't clash with enum use. Happily, there weren't really all that many of these cases, and a lot of users already had the pattern of using '#ifndef' guarding (or in one case just using '#undef MIN') before defining their own private version that does the same thing. I left such cases alone. Cc: David Laight <David.Laight@aculab.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
-
- Jul 26, 2024
-
-
Li Zhijian authored
It's expected that no page should be left in pcp_list after calling zone_pcp_disable() in offline_pages(). Previously, it's observed that offline_pages() gets stuck [1] due to some pages remaining in pcp_list. Cause: There is a race condition between drain_pages_zone() and __rmqueue_pcplist() involving the pcp->count variable. See below scenario: CPU0 CPU1 ---------------- --------------- spin_lock(&pcp->lock); __rmqueue_pcplist() { zone_pcp_disable() { /* list is empty */ if (list_empty(list)) { /* add pages to pcp_list */ alloced = rmqueue_bulk() mutex_lock(&pcp_batch_high_lock) ... __drain_all_pages() { drain_pages_zone() { /* read pcp->count, it's 0 here */ count = READ_ONCE(pcp->count) /* 0 means nothing to drain */ /* update pcp->count */ pcp->count += alloced << order; ... ... spin_unlock(&pcp->lock); In this case, after calling zone_pcp_disable() though, there are still some pages in pcp_list. And these pages in pcp_list are neither movable nor isolated, offline_pages() gets stuck as a result. Solution: Expand the scope of the pcp->lock to also protect pcp->count in drain_pages_zone(), to ensure no pages are left in the pcp list after zone_pcp_disable() [1] https://lore.kernel.org/linux-mm/6a07125f-e720-404c-b2f9-e55f3f166e85@fujitsu.com/ Link: https://lkml.kernel.org/r/20240723064428.1179519-1-lizhijian@fujitsu.com Fixes: 4b23a68f ("mm/page_alloc: protect PCP lists with a spinlock") Signed-off-by:
Li Zhijian <lizhijian@fujitsu.com> Reported-by:
Yao Xingtao <yaoxt.fnst@fujitsu.com> Reviewed-by:
Vlastimil Babka <vbabka@suse.cz> Cc: David Hildenbrand <david@redhat.com> Cc: <stable@vger.kernel.org> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org>
-
Suren Baghdasaryan authored
Outline and export free_reserved_page() because modules use it and it in turn uses page_ext_{get|put} which should not be exported. The same result could be obtained by outlining {get|put}_page_tag_ref() but that would have higher performance impact as these functions are used in more performance critical paths. Link: https://lkml.kernel.org/r/20240717212844.2749975-1-surenb@google.com Fixes: dcfe378c ("lib: introduce support for page allocation tagging") Signed-off-by:
Suren Baghdasaryan <surenb@google.com> Reported-by:
kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202407080044.DWMC9N9I-lkp@intel.com/ Suggested-by:
Christoph Hellwig <hch@infradead.org> Suggested-by:
Vlastimil Babka <vbabka@suse.cz> Acked-by:
Vlastimil Babka <vbabka@suse.cz> Cc: Kees Cook <keescook@chromium.org> Cc: Kent Overstreet <kent.overstreet@linux.dev> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Sourav Panda <souravpanda@google.com> Cc: <stable@vger.kernel.org> [6.10] Signed-off-by:
Andrew Morton <akpm@linux-foundation.org>
-
Gavin Shan authored
xarray can't support arbitrary page cache size. the largest and supported page cache size is defined as MAX_PAGECACHE_ORDER by commit 099d9064 ("mm/filemap: make MAX_PAGECACHE_ORDER acceptable to xarray"). However, it's possible to have 512MB page cache in the huge memory's collapsing path on ARM64 system whose base page size is 64KB. 512MB page cache is breaking the limitation and a warning is raised when the xarray entry is split as shown in the following example. [root@dhcp-10-26-1-207 ~]# cat /proc/1/smaps | grep KernelPageSize KernelPageSize: 64 kB [root@dhcp-10-26-1-207 ~]# cat /tmp/test.c : int main(int argc, char **argv) { const char *filename = TEST_XFS_FILENAME; int fd = 0; void *buf = (void *)-1, *p; int pgsize = getpagesize(); int ret = 0; if (pgsize != 0x10000) { fprintf(stdout, "System with 64KB base page size is required!\n"); return -EPERM; } system("echo 0 > /sys/devices/virtual/bdi/253:0/read_ahead_kb"); system("echo 1 > /proc/sys/vm/drop_caches"); /* Open the xfs file */ fd = open(filename, O_RDONLY); assert(fd > 0); /* Create VMA */ buf = mmap(NULL, TEST_MEM_SIZE, PROT_READ, MAP_SHARED, fd, 0); assert(buf != (void *)-1); fprintf(stdout, "mapped buffer at 0x%p\n", buf); /* Populate VMA */ ret = madvise(buf, TEST_MEM_SIZE, MADV_NOHUGEPAGE); assert(ret == 0); ret = madvise(buf, TEST_MEM_SIZE, MADV_POPULATE_READ); assert(ret == 0); /* Collapse VMA */ ret = madvise(buf, TEST_MEM_SIZE, MADV_HUGEPAGE); assert(ret == 0); ret = madvise(buf, TEST_MEM_SIZE, MADV_COLLAPSE); if (ret) { fprintf(stdout, "Error %d to madvise(MADV_COLLAPSE)\n", errno); goto out; } /* Split xarray entry. Write permission is needed */ munmap(buf, TEST_MEM_SIZE); buf = (void *)-1; close(fd); fd = open(filename, O_RDWR); assert(fd > 0); fallocate(fd, FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE, TEST_MEM_SIZE - pgsize, pgsize); out: if (buf != (void *)-1) munmap(buf, TEST_MEM_SIZE); if (fd > 0) close(fd); return ret; } [root@dhcp-10-26-1-207 ~]# gcc /tmp/test.c -o /tmp/test [root@dhcp-10-26-1-207 ~]# /tmp/test ------------[ cut here ]------------ WARNING: CPU: 25 PID: 7560 at lib/xarray.c:1025 xas_split_alloc+0xf8/0x128 Modules linked in: nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib \ nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct \ nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 \ ip_set rfkill nf_tables nfnetlink vfat fat virtio_balloon drm fuse \ xfs libcrc32c crct10dif_ce ghash_ce sha2_ce sha256_arm64 virtio_net \ sha1_ce net_failover virtio_blk virtio_console failover dimlib virtio_mmio CPU: 25 PID: 7560 Comm: test Kdump: loaded Not tainted 6.10.0-rc7-gavin+ #9 Hardware name: QEMU KVM Virtual Machine, BIOS edk2-20240524-1.el9 05/24/2024 pstate: 83400005 (Nzcv daif +PAN -UAO +TCO +DIT -SSBS BTYPE=--) pc : xas_split_alloc+0xf8/0x128 lr : split_huge_page_to_list_to_order+0x1c4/0x780 sp : ffff8000ac32f660 x29: ffff8000ac32f660 x28: ffff0000e0969eb0 x27: ffff8000ac32f6c0 x26: 0000000000000c40 x25: ffff0000e0969eb0 x24: 000000000000000d x23: ffff8000ac32f6c0 x22: ffffffdfc0700000 x21: 0000000000000000 x20: 0000000000000000 x19: ffffffdfc0700000 x18: 0000000000000000 x17: 0000000000000000 x16: ffffd5f3708ffc70 x15: 0000000000000000 x14: 0000000000000000 x13: 0000000000000000 x12: 0000000000000000 x11: ffffffffffffffc0 x10: 0000000000000040 x9 : ffffd5f3708e692c x8 : 0000000000000003 x7 : 0000000000000000 x6 : ffff0000e0969eb8 x5 : ffffd5f37289e378 x4 : 0000000000000000 x3 : 0000000000000c40 x2 : 000000000000000d x1 : 000000000000000c x0 : 0000000000000000 Call trace: xas_split_alloc+0xf8/0x128 split_huge_page_to_list_to_order+0x1c4/0x780 truncate_inode_partial_folio+0xdc/0x160 truncate_inode_pages_range+0x1b4/0x4a8 truncate_pagecache_range+0x84/0xa0 xfs_flush_unmap_range+0x70/0x90 [xfs] xfs_file_fallocate+0xfc/0x4d8 [xfs] vfs_fallocate+0x124/0x2f0 ksys_fallocate+0x4c/0xa0 __arm64_sys_fallocate+0x24/0x38 invoke_syscall.constprop.0+0x7c/0xd8 do_el0_svc+0xb4/0xd0 el0_svc+0x44/0x1d8 el0t_64_sync_handler+0x134/0x150 el0t_64_sync+0x17c/0x180 Fix it by correcting the supported page cache orders, different sets for DAX and other files. With it corrected, 512MB page cache becomes disallowed on all non-DAX files on ARM64 system where the base page size is 64KB. After this patch is applied, the test program fails with error -EINVAL returned from __thp_vma_allowable_orders() and the madvise() system call to collapse the page caches. Link: https://lkml.kernel.org/r/20240715000423.316491-1-gshan@redhat.com Fixes: 6b24ca4a ("mm: Use multi-index entries in the page cache") Signed-off-by:
Gavin Shan <gshan@redhat.com> Acked-by:
David Hildenbrand <david@redhat.com> Reviewed-by:
Ryan Roberts <ryan.roberts@arm.com> Acked-by:
Zi Yan <ziy@nvidia.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: Don Dutile <ddutile@redhat.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Peter Xu <peterx@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: William Kucharski <william.kucharski@oracle.com> Cc: <stable@vger.kernel.org> [5.17+] Signed-off-by:
Andrew Morton <akpm@linux-foundation.org>
-
Yang Shi authored
Yves-Alexis Perez reported commit 4ef9ad19 ("mm: huge_memory: don't force huge page alignment on 32 bit") didn't work for x86_32 [1]. It is because x86_32 uses CONFIG_X86_32 instead of CONFIG_32BIT. !CONFIG_64BIT should cover all 32 bit machines. [1] https://lore.kernel.org/linux-mm/CAHbLzkr1LwH3pcTgM+aGQ31ip2bKqiqEQ8=FQB+t2c3dhNKNHA@mail.gmail.com/ Link: https://lkml.kernel.org/r/20240712155855.1130330-1-yang@os.amperecomputing.com Fixes: 4ef9ad19 ("mm: huge_memory: don't force huge page alignment on 32 bit") Signed-off-by:
Yang Shi <yang@os.amperecomputing.com> Reported-by:
Yves-Alexis Perez <corsac@debian.org> Tested-by:
Yves-Alexis Perez <corsac@debian.org> Acked-by:
David Hildenbrand <david@redhat.com> Cc: Ben Hutchings <ben@decadent.org.uk> Cc: Christoph Lameter <cl@linux.com> Cc: Jiri Slaby <jirislaby@kernel.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Rik van Riel <riel@surriel.com> Cc: Salvatore Bonaccorso <carnil@debian.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: <stable@vger.kernel.org> [6.8+] Signed-off-by:
Andrew Morton <akpm@linux-foundation.org>
-
Ram Tummala authored
Commit 3bd786f7 ("mm: convert do_set_pte() to set_pte_range()") replaced do_set_pte() with set_pte_range() and that introduced a regression in the following faulting path of non-anonymous vmas which caused the PTE for the faulting address to be marked as old instead of young. handle_pte_fault() do_pte_missing() do_fault() do_read_fault() || do_cow_fault() || do_shared_fault() finish_fault() set_pte_range() The polarity of prefault calculation is incorrect. This leads to prefault being incorrectly set for the faulting address. The following check will incorrectly mark the PTE old rather than young. On some architectures this will cause a double fault to mark it young when the access is retried. if (prefault && arch_wants_old_prefaulted_pte()) entry = pte_mkold(entry); On a subsequent fault on the same address, the faulting path will see a non NULL vmf->pte and instead of reaching the do_pte_missing() path, PTE will then be correctly marked young in handle_pte_fault() itself. Due to this bug, performance degradation in the fault handling path will be observed due to unnecessary double faulting. Link: https://lkml.kernel.org/r/20240710014539.746200-1-rtummala@nvidia.com Fixes: 3bd786f7 ("mm: convert do_set_pte() to set_pte_range()") Signed-off-by:
Ram Tummala <rtummala@nvidia.com> Reviewed-by:
Yin Fengwei <fengwei.yin@intel.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Yin Fengwei <fengwei.yin@intel.com> Cc: <stable@vger.kernel.org> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org>
-
- Jul 24, 2024
-
-
Joel Granados authored
const qualify the struct ctl_table argument in the proc_handler function signatures. This is a prerequisite to moving the static ctl_table structs into .rodata data which will ensure that proc_handler function pointers cannot be modified. This patch has been generated by the following coccinelle script: ``` virtual patch @r1@ identifier ctl, write, buffer, lenp, ppos; identifier func !~ "appldata_(timer|interval)_handler|sched_(rt|rr)_handler|rds_tcp_skbuf_handler|proc_sctp_do_(hmac_alg|rto_min|rto_max|udp_port|alpha_beta|auth|probe_interval)"; @@ int func( - struct ctl_table *ctl + const struct ctl_table *ctl ,int write, void *buffer, size_t *lenp, loff_t *ppos); @r2@ identifier func, ctl, write, buffer, lenp, ppos; @@ int func( - struct ctl_table *ctl + const struct ctl_table *ctl ,int write, void *buffer, size_t *lenp, loff_t *ppos) { ... } @r3@ identifier func; @@ int func( - struct ctl_table * + const struct ctl_table * ,int , void *, size_t *, loff_t *); @r4@ identifier func, ctl; @@ int func( - struct ctl_table *ctl + const struct ctl_table *ctl ,int , void *, size_t *, loff_t *); @r5@ identifier func, write, buffer, lenp, ppos; @@ int func( - struct ctl_table * + const struct ctl_table * ,int write, void *buffer, size_t *lenp, loff_t *ppos); ``` * Code formatting was adjusted in xfs_sysctl.c to comply with code conventions. The xfs_stats_clear_proc_handler, xfs_panic_mask_proc_handler and xfs_deprecated_dointvec_minmax where adjusted. * The ctl_table argument in proc_watchdog_common was const qualified. This is called from a proc_handler itself and is calling back into another proc_handler, making it necessary to change it as part of the proc_handler migration. Co-developed-by:
Thomas Weißschuh <linux@weissschuh.net> Signed-off-by:
Thomas Weißschuh <linux@weissschuh.net> Co-developed-by:
Joel Granados <j.granados@samsung.com> Signed-off-by:
Joel Granados <j.granados@samsung.com>
-
- Jul 19, 2024
-
-
Jason A. Donenfeld authored
The vDSO getrandom() implementation works with a buffer allocated with a new system call that has certain requirements: - It shouldn't be written to core dumps. * Easy: VM_DONTDUMP. - It should be zeroed on fork. * Easy: VM_WIPEONFORK. - It shouldn't be written to swap. * Uh-oh: mlock is rlimited. * Uh-oh: mlock isn't inherited by forks. - It shouldn't reserve actual memory, but it also shouldn't crash when page faulting in memory if none is available * Uh-oh: VM_NORESERVE means segfaults. It turns out that the vDSO getrandom() function has three really nice characteristics that we can exploit to solve this problem: 1) Due to being wiped during fork(), the vDSO code is already robust to having the contents of the pages it reads zeroed out midway through the function's execution. 2) In the absolute worst case of whatever contingency we're coding for, we have the option to fallback to the getrandom() syscall, and everything is fine. 3) The buffers the function uses are only ever useful for a maximum of 60 seconds -- a sort of cache, rather than a long term allocation. These characteristics mean that we can introduce VM_DROPPABLE, which has the following semantics: a) It never is written out to swap. b) Under memory pressure, mm can just drop the pages (so that they're zero when read back again). c) It is inherited by fork. d) It doesn't count against the mlock budget, since nothing is locked. e) If there's not enough memory to service a page fault, it's not fatal, and no signal is sent. This way, allocations used by vDSO getrandom() can use: VM_DROPPABLE | VM_DONTDUMP | VM_WIPEONFORK | VM_NORESERVE And there will be no problem with OOMing, crashing on overcommitment, using memory when not in use, not wiping on fork(), coredumps, or writing out to swap. In order to let vDSO getrandom() use this, expose these via mmap(2) as MAP_DROPPABLE. Note that this involves removing the MADV_FREE special case from sort_folio(), which according to Yu Zhao is unnecessary and will simply result in an extra call to shrink_folio_list() in the worst case. The chunk removed reenables the swapbacked flag, which we don't want for VM_DROPPABLE, and we can't conditionalize it here because there isn't a vma reference available. Finally, the provided self test ensures that this is working as desired. Cc: linux-mm@kvack.org Acked-by:
David Hildenbrand <david@redhat.com> Signed-off-by:
Jason A. Donenfeld <Jason@zx2c4.com>
-
- Jul 18, 2024
-
-
Yu Zhao authored
mem_cgroup_calculate_protection() is not stateless and should only be used as part of a top-down tree traversal. shrink_one() traverses the per-node memcg LRU instead of the root_mem_cgroup tree, and therefore it should not call mem_cgroup_calculate_protection(). The existing misuse in shrink_one() can cause ineffective protection of sub-trees that are grandchildren of root_mem_cgroup. Fix it by reusing lru_gen_age_node(), which already traverses the root_mem_cgroup tree, to calculate the protection. Previously lru_gen_age_node() opportunistically skips the first pass, i.e., when scan_control->priority is DEF_PRIORITY. On the second pass, lruvec_is_sizable() uses appropriate scan_control->priority, set by set_initial_priority() from lru_gen_shrink_node(), to decide whether a memcg is too small to reclaim from. Now lru_gen_age_node() unconditionally traverses the root_mem_cgroup tree. So it should call set_initial_priority() upfront, to make sure lruvec_is_sizable() uses appropriate scan_control->priority on the first pass. Otherwise, lruvec_is_reclaimable() can return false negatives and result in premature OOM kills when min_ttl_ms is used. Link: https://lkml.kernel.org/r/20240712232956.1427127-1-yuzhao@google.com Fixes: e4dde56c ("mm: multi-gen LRU: per-node lru_gen_folio lists") Signed-off-by:
Yu Zhao <yuzhao@google.com> Reported-by:
T.J. Mercier <tjmercier@google.com> Cc: <stable@vger.kernel.org> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org>
-
Dan Carpenter authored
We accidentally deleted a tab in commit f84152e9efc5 ("mm/zswap: use only one pool in zswap"). Add it back. Link: https://lkml.kernel.org/r/c15066a0-f061-42c9-b0f5-d60281d3d5d8@stanley.mountain Signed-off-by:
Dan Carpenter <dan.carpenter@linaro.org> Reviewed-by:
Chengming Zhou <chengming.zhou@linux.dev> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org>
-
Miaohe Lin authored
A kernel crash was observed when migrating hugetlb folio: BUG: kernel NULL pointer dereference, address: 0000000000000008 PGD 0 P4D 0 Oops: Oops: 0002 [#1] PREEMPT SMP NOPTI CPU: 0 PID: 3435 Comm: bash Not tainted 6.10.0-rc6-00450-g8578ca01f21f #66 RIP: 0010:__folio_undo_large_rmappable+0x70/0xb0 RSP: 0018:ffffb165c98a7b38 EFLAGS: 00000097 RAX: fffffbbc44528090 RBX: 0000000000000000 RCX: 0000000000000000 RDX: ffffa30e000a2800 RSI: 0000000000000246 RDI: ffffa3153ffffcc0 RBP: fffffbbc44528000 R08: 0000000000002371 R09: ffffffffbe4e5868 R10: 0000000000000001 R11: 0000000000000001 R12: ffffa3153ffffcc0 R13: fffffbbc44468000 R14: 0000000000000001 R15: 0000000000000001 FS: 00007f5b3a716740(0000) GS:ffffa3151fc00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000008 CR3: 000000010959a000 CR4: 00000000000006f0 Call Trace: <TASK> __folio_migrate_mapping+0x59e/0x950 __migrate_folio.constprop.0+0x5f/0x120 move_to_new_folio+0xfd/0x250 migrate_pages+0x383/0xd70 soft_offline_page+0x2ab/0x7f0 soft_offline_page_store+0x52/0x90 kernfs_fop_write_iter+0x12c/0x1d0 vfs_write+0x380/0x540 ksys_write+0x64/0xe0 do_syscall_64+0xb9/0x1d0 entry_SYSCALL_64_after_hwframe+0x77/0x7f RIP: 0033:0x7f5b3a514887 RSP: 002b:00007ffe138fce68 EFLAGS: 00000246 ORIG_RAX: 0000000000000001 RAX: ffffffffffffffda RBX: 000000000000000c RCX: 00007f5b3a514887 RDX: 000000000000000c RSI: 0000556ab809ee10 RDI: 0000000000000001 RBP: 0000556ab809ee10 R08: 00007f5b3a5d1460 R09: 000000007fffffff R10: 0000000000000000 R11: 0000000000000246 R12: 000000000000000c R13: 00007f5b3a61b780 R14: 00007f5b3a617600 R15: 00007f5b3a616a00 It's because hugetlb folio is passed to __folio_undo_large_rmappable() unexpectedly. large_rmappable flag is imperceptibly set to hugetlb folio since commit f6a8dd98 ("hugetlb: convert alloc_buddy_hugetlb_folio to use a folio"). Then commit be9581ea ("mm: fix crashes from deferred split racing folio migration") makes folio_migrate_mapping() call folio_undo_large_rmappable() triggering the bug. Fix this issue by clearing large_rmappable flag for hugetlb folios. They don't need that flag set anyway. Link: https://lkml.kernel.org/r/20240709120433.4136700-1-linmiaohe@huawei.com Fixes: f6a8dd98 ("hugetlb: convert alloc_buddy_hugetlb_folio to use a folio") Fixes: be9581ea ("mm: fix crashes from deferred split racing folio migration") Signed-off-by:
Miaohe Lin <linmiaohe@huawei.com> Cc: Hugh Dickins <hughd@google.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: <stable@vger.kernel.org> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org>
-
Miaohe Lin authored
When tries to demote 1G hugetlb folios, a lockdep warning is observed: ============================================ WARNING: possible recursive locking detected 6.10.0-rc6-00452-ga4d0275fa660-dirty #79 Not tainted -------------------------------------------- bash/710 is trying to acquire lock: ffffffff8f0a7850 (&h->resize_lock){+.+.}-{3:3}, at: demote_store+0x244/0x460 but task is already holding lock: ffffffff8f0a6f48 (&h->resize_lock){+.+.}-{3:3}, at: demote_store+0xae/0x460 other info that might help us debug this: Possible unsafe locking scenario: CPU0 ---- lock(&h->resize_lock); lock(&h->resize_lock); *** DEADLOCK *** May be due to missing lock nesting notation 4 locks held by bash/710: #0: ffff8f118439c3f0 (sb_writers#5){.+.+}-{0:0}, at: ksys_write+0x64/0xe0 #1: ffff8f11893b9e88 (&of->mutex#2){+.+.}-{3:3}, at: kernfs_fop_write_iter+0xf8/0x1d0 #2: ffff8f1183dc4428 (kn->active#98){.+.+}-{0:0}, at: kernfs_fop_write_iter+0x100/0x1d0 #3: ffffffff8f0a6f48 (&h->resize_lock){+.+.}-{3:3}, at: demote_store+0xae/0x460 stack backtrace: CPU: 3 PID: 710 Comm: bash Not tainted 6.10.0-rc6-00452-ga4d0275fa660-dirty #79 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014 Call Trace: <TASK> dump_stack_lvl+0x68/0xa0 __lock_acquire+0x10f2/0x1ca0 lock_acquire+0xbe/0x2d0 __mutex_lock+0x6d/0x400 demote_store+0x244/0x460 kernfs_fop_write_iter+0x12c/0x1d0 vfs_write+0x380/0x540 ksys_write+0x64/0xe0 do_syscall_64+0xb9/0x1d0 entry_SYSCALL_64_after_hwframe+0x77/0x7f RIP: 0033:0x7fa61db14887 RSP: 002b:00007ffc56c48358 EFLAGS: 00000246 ORIG_RAX: 0000000000000001 RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007fa61db14887 RDX: 0000000000000002 RSI: 000055a030050220 RDI: 0000000000000001 RBP: 000055a030050220 R08: 00007fa61dbd1460 R09: 000000007fffffff R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000002 R13: 00007fa61dc1b780 R14: 00007fa61dc17600 R15: 00007fa61dc16a00 </TASK> Lockdep considers this an AA deadlock because the different resize_lock mutexes reside in the same lockdep class, but this is a false positive. Place them in distinct classes to avoid these warnings. Link: https://lkml.kernel.org/r/20240712031314.2570452-1-linmiaohe@huawei.com Fixes: 8531fc6f ("hugetlb: add hugetlb demote page support") Signed-off-by:
Miaohe Lin <linmiaohe@huawei.com> Acked-by:
Muchun Song <muchun.song@linux.dev> Cc: <stable@vger.kernel.org> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org>
-
yangge authored
If a large number of CMA memory are configured in system (for example, the CMA memory accounts for 50% of the system memory), starting a virtual virtual machine with device passthrough, it will call pin_user_pages_remote(..., FOLL_LONGTERM, ...) to pin memory. Normally if a page is present and in CMA area, pin_user_pages_remote() will migrate the page from CMA area to non-CMA area because of FOLL_LONGTERM flag. But the current code will cause the migration failure due to unexpected page refcounts, and eventually cause the virtual machine fail to start. If a page is added in LRU batch, its refcount increases one, remove the page from LRU batch decreases one. Page migration requires the page is not referenced by others except page mapping. Before migrating a page, we should try to drain the page from LRU batch in case the page is in it, however, folio_test_lru() is not sufficient to tell whether the page is in LRU batch or not, if the page is in LRU batch, the migration will fail. To solve the problem above, we modify the logic of adding to LRU batch. Before adding a page to LRU batch, we clear the LRU flag of the page so that we can check whether the page is in LRU batch by folio_test_lru(page). It's quite valuable, because likely we don't want to blindly drain the LRU batch simply because there is some unexpected reference on a page, as described above. This change makes the LRU flag of a page invisible for longer, which may impact some programs. For example, as long as a page is on a LRU batch, we cannot isolate it, and we cannot check if it's an LRU page. Further, a page can now only be on exactly one LRU batch. This doesn't seem to matter much, because a new page is allocated from buddy and added to the lru batch, or be isolated, it's LRU flag may also be invisible for a long time. Link: https://lkml.kernel.org/r/1720075944-27201-1-git-send-email-yangge1116@126.com Link: https://lkml.kernel.org/r/1720008153-16035-1-git-send-email-yangge1116@126.com Fixes: 9a4e9f3b ("mm: update get_user_pages_longterm to migrate pages allocated from CMA region") Signed-off-by:
yangge <yangge1116@126.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: David Hildenbrand <david@redhat.com> Cc: Barry Song <21cnbao@gmail.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org>
-
Tvrtko Ursulin authored
Since balancing mode was added in bda420b9 ("numa balancing: migrate on fault among multiple bound nodes"), it was possible to set this mode but it wouldn't be shown in /proc/<pid>/numa_maps since there was no support for it in the mpol_to_str() helper. Furthermore, because the balancing mode sets the MPOL_F_MORON flag, it would be displayed as 'default' due a workaround introduced a few years earlier in 8790c71a ("mm/mempolicy.c: fix mempolicy printing in numa_maps"). To tidy this up we implement two changes: Replace the MPOL_F_MORON check by pointer comparison against the preferred_node_policy array. By doing this we generalise the current special casing and replace the incorrect 'default' with the correct 'bind' for the mode. Secondly, we add a string representation and corresponding handling for the MPOL_F_NUMA_BALANCING flag. With the two changes together we start showing the balancing flag when it is set and therefore complete the fix. Representation format chosen is to separate multiple flags with vertical bars, following what existed long time ago in kernel 2.6.25. But as between then and now there wasn't a way to display multiple flags, this patch does not change the format in practice. Some /proc/<pid>/numa_maps output examples: 555559580000 bind=balancing:0-1,3 file=... 555585800000 bind=balancing|static:0,2 file=... 555635240000 prefer=relative:0 file= Link: https://lkml.kernel.org/r/20240708075632.95857-1-tursulin@igalia.com Signed-off-by:
Tvrtko Ursulin <tvrtko.ursulin@igalia.com> Fixes: bda420b9 ("numa balancing: migrate on fault among multiple bound nodes") References: 8790c71a ("mm/mempolicy.c: fix mempolicy printing in numa_maps") Reviewed-by:
"Huang, Ying" <ying.huang@intel.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Rik van Riel <riel@surriel.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Michal Hocko <mhocko@suse.com> Cc: David Rientjes <rientjes@google.com> Cc: <stable@vger.kernel.org> [5.12+] Signed-off-by:
Andrew Morton <akpm@linux-foundation.org>
-
Roman Gushchin authored
Currently MOVE_ANON and MOVE_FILE flags are defined as integers and it leads to the following Smatch static checker warning: mm/memcontrol-v1.c:609 mem_cgroup_move_charge_write() warn: was expecting a 64 bit value instead of '~(1 | 2)' Fix this be redefining them as unsigned long long. Even though the issue allows to set high 32 bits of mc.flags to an arbitrary number, these bits are never used, so it doesn't have any significant consequences. Link: https://lkml.kernel.org/r/ZpF8Q9zBsIY7d2P9@google.com Signed-off-by:
Roman Gushchin <roman.gushchin@linux.dev> Reported-by:
Dan Carpenter <dan.carpenter@linaro.org> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org>
-
Yu Zhao authored
set_initial_priority() tries to jump-start global reclaim by estimating the priority based on cold/hot LRU pages. The estimation does not account for shrinker objects, and it cannot do so because their sizes can be in different units other than page. If shrinker objects are the majority, e.g., on TrueNAS SCALE 24.04.0 where ZFS ARC can use almost all system memory, set_initial_priority() can vastly underestimate how much memory ARC shrinker can evict and assign extreme low values to scan_control->priority, resulting in overshoots of shrinker objects. To reproduce the problem, using TrueNAS SCALE 24.04.0 with 32GB DRAM, a test ZFS pool and the following commands: fio --name=mglru.file --numjobs=36 --ioengine=io_uring \ --directory=/root/test-zfs-pool/ --size=1024m --buffered=1 \ --rw=randread --random_distribution=random \ --time_based --runtime=1h & for ((i = 0; i < 20; i++)) do sleep 120 fio --name=mglru.anon --numjobs=16 --ioengine=mmap \ --filename=/dev/zero --size=1024m --fadvise_hint=0 \ --rw=randrw --random_distribution=random \ --time_based --runtime=1m done To fix the problem: 1. Cap scan_control->priority at or above DEF_PRIORITY/2, to prevent the jump-start from being overly aggressive. 2. Account for the progress from mm_account_reclaimed_pages(), to prevent kswapd_shrink_node() from raising the priority unnecessarily. Link: https://lkml.kernel.org/r/20240711191957.939105-2-yuzhao@google.com Fixes: e4dde56c ("mm: multi-gen LRU: per-node lru_gen_folio lists") Signed-off-by:
Yu Zhao <yuzhao@google.com> Reported-by:
Alexander Motin <mav@ixsystems.com> Cc: Wei Xu <weixugc@google.com> Cc: <stable@vger.kernel.org> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org>
-
Yu Zhao authored
evict_folios() uses a second pass to reclaim folios that have gone through page writeback and become clean before it finishes the first pass, since folio_rotate_reclaimable() cannot handle those folios due to the isolation. The second pass tries to avoid potential double counting by deducting scan_control->nr_scanned. However, this can result in underflow of nr_scanned, under a condition where shrink_folio_list() does not increment nr_scanned, i.e., when folio_trylock() fails. The underflow can cause the divisor, i.e., scale=scanned+reclaimed in vmpressure_calc_level(), to become zero, resulting in the following crash: [exception RIP: vmpressure_work_fn+101] process_one_work at ffffffffa3313f2b Since scan_control->nr_scanned has no established semantics, the potential double counting has minimal risks. Therefore, fix the problem by not deducting scan_control->nr_scanned in evict_folios(). Link: https://lkml.kernel.org/r/20240711191957.939105-1-yuzhao@google.com Fixes: 359a5e14 ("mm: multi-gen LRU: retry folios written back while isolated") Reported-by:
Wei Xu <weixugc@google.com> Signed-off-by:
Yu Zhao <yuzhao@google.com> Cc: Alexander Motin <mav@ixsystems.com> Cc: <stable@vger.kernel.org> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org>
-
Kees Cook authored
Replace the depreciated[1] strncpy() calls with strscpy(). Uses of object->comm do not depend on the padding side-effect. Link: https://github.com/KSPP/linux/issues/90 [1] Link: https://lkml.kernel.org/r/20240710001300.work.004-kees@kernel.org Signed-off-by:
Kees Cook <kees@kernel.org> Acked-by:
Catalin Marinas <catalin.marinas@arm.com> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org>
-
Vlastimil Babka authored
This mostly reverts commit af3b8544 ("mm/page_alloc.c: allow error injection"). The commit made should_fail_alloc_page() a noinline function that's always called from the page allocation hotpath, even if it's empty because CONFIG_FAIL_PAGE_ALLOC is not enabled, and there is no option to disable it and prevent the associated function call overhead. As with the preceding patch "mm, slab: put should_failslab back behind CONFIG_SHOULD_FAILSLAB" and for the same reasons, put the should_fail_alloc_page() back behind the config option. When enabled, the ALLOW_ERROR_INJECTION and BTF_ID records are preserved so it's not a complete revert. Link: https://lkml.kernel.org/r/20240711-b4-fault-injection-reverts-v1-2-9e2651945d68@suse.cz Signed-off-by:
Vlastimil Babka <vbabka@suse.cz> Cc: Akinobu Mita <akinobu.mita@gmail.com> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Andrii Nakryiko <andrii@kernel.org> Cc: Christoph Lameter <cl@linux.com> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: David Rientjes <rientjes@google.com> Cc: Eduard Zingerman <eddyz87@gmail.com> Cc: Hao Luo <haoluo@google.com> Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: John Fastabend <john.fastabend@gmail.com> Cc: KP Singh <kpsingh@kernel.org> Cc: Martin KaFai Lau <martin.lau@linux.dev> Cc: Mateusz Guzik <mjguzik@gmail.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Song Liu <song@kernel.org> Cc: Stanislav Fomichev <sdf@fomichev.me> Cc: Yonghong Song <yonghong.song@linux.dev> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org>
-
Vlastimil Babka authored
Patch series "revert unconditional slab and page allocator fault injection calls". These two patches largely revert commits that added function call overhead into slab and page allocation hotpaths and that cannot be currently disabled even though related CONFIG_ options do exist. A much more involved solution that can keep the callsites always existing but hidden behind a static key if unused, is possible [1] and can be pursued by anyone who believes it's necessary. Meanwhile the fact the should_failslab() error injection is already not functional on kernels built with current gcc without anyone noticing [2], and lukewarm response to [1] suggests the need is not there. I believe it will be more fair to have the state after this series as a baseline for possible further optimisation, instead of the unconditional overhead. For example a possible compromise for anyone who's fine with an empty function call overhead but not the full CONFIG_FAILSLAB / CONFIG_FAIL_PAGE_ALLOC overhead is to reuse patch 1 from [1] but insert a static key check only inside should_failslab() and should_fail_alloc_page() before performing the more expensive checks. [1] https://lore.kernel.org/all/20240620-fault-injection-statickeys-v2-0-e23947d3d84b@suse.cz/#t [2] https://github.com/bpftrace/bpftrace/issues/3258 This patch (of 2): This mostly reverts commit 4f6923fb ("mm: make should_failslab always available for fault injection"). The commit made should_failslab() a noinline function that's always called from the slab allocation hotpath, even if it's empty because CONFIG_SHOULD_FAILSLAB is not enabled, and there is no option to disable that call. This is visible in profiles and the function call overhead can be noticeable especially with cpu mitigations. Meanwhile the bpftrace program example in the commit silently does not work without CONFIG_SHOULD_FAILSLAB anyway with a recent gcc, because the empty function gets a .constprop clone that is actually being called (uselessly) from the slab hotpath, while the error injection is hooked to the original function that's not being called at all [1]. Thus put the whole should_failslab() function back behind CONFIG_SHOULD_FAILSLAB. It's not a complete revert of 4f6923fb - the int return type that returns -ENOMEM on failure is preserved, as well ALLOW_ERROR_INJECTION annotation. The BTF_ID() record that was meanwhile added is also guarded by CONFIG_SHOULD_FAILSLAB. [1] https://github.com/bpftrace/bpftrace/issues/3258 Link: https://lkml.kernel.org/r/20240711-b4-fault-injection-reverts-v1-0-9e2651945d68@suse.cz Link: https://lkml.kernel.org/r/20240711-b4-fault-injection-reverts-v1-1-9e2651945d68@suse.cz Signed-off-by:
Vlastimil Babka <vbabka@suse.cz> Cc: Akinobu Mita <akinobu.mita@gmail.com> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Andrii Nakryiko <andrii@kernel.org> Cc: Christoph Lameter <cl@linux.com> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: David Rientjes <rientjes@google.com> Cc: Eduard Zingerman <eddyz87@gmail.com> Cc: Hao Luo <haoluo@google.com> Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: John Fastabend <john.fastabend@gmail.com> Cc: KP Singh <kpsingh@kernel.org> Cc: Martin KaFai Lau <martin.lau@linux.dev> Cc: Mateusz Guzik <mjguzik@gmail.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Song Liu <song@kernel.org> Cc: Stanislav Fomichev <sdf@fomichev.me> Cc: Yonghong Song <yonghong.song@linux.dev> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org>
-
Pei Li authored
Syzbot reported a possible data race: BUG: KCSAN: data-race in __swap_writepage / scan_swap_map_slots read-write to 0xffff888102fca610 of 8 bytes by task 7106 on cpu 1. read to 0xffff888102fca610 of 8 bytes by task 7080 on cpu 0. While we are in __swap_writepage to read sis->flags, scan_swap_map_slots is trying to update it with SWP_SCANNING. value changed: 0x0000000000008083 -> 0x0000000000004083. While this can be updated non-atomicially, this won't affect SWP_SYNCHRONOUS_IO, so we consider this data-race safe. This is possibly introduced by commit 3222d8c2 ("block: remove ->rw_page"), where this if branch is introduced. Link: https://lkml.kernel.org/r/20240711-bug13-v1-1-cea2b8ae8d76@gmail.com Fixes: 3222d8c2 ("block: remove ->rw_page") Signed-off-by:
Pei Li <peili.dev@gmail.com> Reported-by:
<syzbot+da25887cc13da6bf3b8c@syzkaller.appspotmail.com> Closes: https://syzkaller.appspot.com/bug?extid=da25887cc13da6bf3b8c Cc: Dan Williams <dan.j.williams@intel.com> Cc: Shuah Khan <skhan@linuxfoundation.org> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org>
-
- Jul 15, 2024
-
-
Alex Shi (Tencent) authored
commit 21c690a3 ("mm: introduce slabobj_ext to support slab object extensions") changed the folio/page->memcg_data define condition from MEMCG to SLAB_OBJ_EXT. This action make memcg_data exposed while !MEMCG. As Vlastimil Babka suggested, let's add _unused_slab_obj_exts for SLAB_MATCH for slab.obj_exts while !MEMCG. That could resolve the match issue, clean up the feature logical. Signed-off-by:
Alex Shi (Tencent) <alexs@kernel.org> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Yoann Congal <yoann.congal@smile.fr> Cc: Masahiro Yamada <masahiroy@kernel.org> Cc: Petr Mladek <pmladek@suse.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by:
Vlastimil Babka <vbabka@suse.cz>
-
- Jul 12, 2024
-
-
Ryan Roberts authored
The legacy PMD-sized THP counters at /proc/vmstat include thp_file_alloc, thp_file_fallback and thp_file_fallback_charge, which rather confusingly refer to shmem THP and do not include any other types of file pages. This is inconsistent since in most other places in the kernel, THP counters are explicitly separated for anon, shmem and file flavours. However, we are stuck with it since it constitutes a user ABI. Recently, commit 66f44583 ("mm: shmem: add mTHP counters for anonymous shmem") added equivalent mTHP stats for shmem, keeping the same "file_" prefix in the names. But in future, we may want to add extra stats to cover actual file pages, at which point, it would all become very confusing. So let's take the opportunity to rename these new counters "shmem_" before the change makes it upstream and the ABI becomes immutable. While we are at it, let's improve the documentation for the legacy counters to make it clear that they count shmem pages only. Link: https://lkml.kernel.org/r/20240710095503.3193901-1-ryan.roberts@arm.com Signed-off-by:
Ryan Roberts <ryan.roberts@arm.com> Reviewed-by:
Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by:
Lance Yang <ioworker0@gmail.com> Reviewed-by:
Zi Yan <ziy@nvidia.com> Reviewed-by:
Barry Song <baohua@kernel.org> Acked-by:
David Hildenbrand <david@redhat.com> Cc: Daniel Gomez <da.gomez@samsung.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: <stable@vger.kernel.org> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org>
-
Kefeng Wang authored
Convert to use folio_alloc_mpol() helper() in __read_swap_cache_async(). Link: https://lkml.kernel.org/r/20240709105508.3933823-1-wangkefeng.wang@huawei.com Signed-off-by:
Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org>
-
Peter Xu authored
This issue is not from any report yet, but by code observation only. This is yet another fix besides Hugh's patch [1] but on relevant code path, where eager split of folio can happen if the folio is already on deferred list during a folio migration. Here the issue is NUMA path (migrate_misplaced_folio()) may start to encounter such folio split now even with MR_NUMA_MISPLACED hint applied. Then when migrate_pages() didn't migrate all the folios, it's possible the split small folios be put onto the list instead of the original folio. Then putting back only the head page won't be enough. Fix it by putting back all the folios on the list. [1] https://lore.kernel.org/all/46c948b4-4dd8-6e03-4c7b-ce4e81cfa536@google.com/ [akpm@linux-foundation.org: remove now unused local `nr_pages'] Link: https://lkml.kernel.org/r/20240708215537.2630610-1-peterx@redhat.com Fixes: 7262f208 ("mm/migrate: split source folio if it is on deferred split list") Signed-off-by:
Peter Xu <peterx@redhat.com> Reviewed-by:
Zi Yan <ziy@nvidia.com> Reviewed-by:
Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Hugh Dickins <hughd@google.com> Cc: Huang Ying <ying.huang@intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: <stable@vger.kernel.org> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org>
-
Yu Zhao authored
Make clear_shadow_entry() clear shadow entries in `struct folio_batch` so that it can reduce contention on i_lock and i_pages locks, e.g., watchdog: BUG: soft lockup - CPU#29 stuck for 11s! [fio:2701649] clear_shadow_entry+0x3d/0x100 mapping_try_invalidate+0x117/0x1d0 invalidate_mapping_pages+0x10/0x20 invalidate_bdev+0x3c/0x50 blkdev_common_ioctl+0x5f7/0xa90 blkdev_ioctl+0x109/0x270 Also, rename clear_shadow_entry() to clear_shadow_entries() accordingly. [yuzhao@google.com: v2] Link: https://lkml.kernel.org/r/20240710060933.3979380-1-yuzhao@google.com Link: https://lkml.kernel.org/r/20240708212753.3120511-1-yuzhao@google.com Reported-by:
Bharata B Rao <bharata@amd.com> Closes: https://lore.kernel.org/d2841226-e27b-4d3d-a578-63587a3aa4f3@amd.com/ Signed-off-by:
Yu Zhao <yuzhao@google.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org>
-
Miaohe Lin authored
The page cannot become compound pages again just after a folio is split as an extra refcnt is held. So the MF_MSG_DIFFERENT_COMPOUND case is obsolete and can be removed to get rid of this false assumption and code burden. But add one WARN_ON() here to keep the situation clear. Link: https://lkml.kernel.org/r/20240708030544.196919-1-linmiaohe@huawei.com Signed-off-by:
Miaohe Lin <linmiaohe@huawei.com> Cc: Borislav Petkov (AMD) <bp@alien8.de> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: Tony Luck <tony.luck@intel.com> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org>
-
Hugh Dickins authored
Now that folio_undo_large_rmappable() is an inline function checking order and large_rmappable for itself (and __folio_undo_large_rmappable() is now declared even when CONFIG_TRANASPARENT_HUGEPAGE is off) there is no need for folio_migrate_mapping() to check large and large_rmappable first (in the mapping case when it has had to freeze anyway). Link: https://lkml.kernel.org/r/68feee73-050e-8e98-7a3a-abf78738d92c@google.com Signed-off-by:
Hugh Dickins <hughd@google.com> Reviewed-by:
Zi Yan <ziy@nvidia.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: David Hildenbrand <david@redhat.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Yang Shi <shy828301@gmail.com> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org>
-
Wei Yang authored
__free_pages_core() is only used in bootmem init and hot-add memory init path. Let's put it in __meminit section. Link: https://lkml.kernel.org/r/20240706061615.30322-1-richard.weiyang@gmail.com Signed-off-by:
Wei Yang <richard.weiyang@gmail.com> Acked-by:
David Hildenbrand <david@redhat.com> Cc: Oscar Salvador <osalvador@suse.de> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org>
-
Bang Li authored
After the commit 7fb1b252afb5 ("mm: shmem: add mTHP support for anonymous shmem"), we can configure different policies through the multi-size THP sysfs interface for anonymous shmem. But currently "THPeligible" indicates only whether the mapping is eligible for allocating THP-pages as well as the THP is PMD mappable or not for anonymous shmem, we need to support semantics for mTHP with anonymous shmem similar to those for mTHP with anonymous memory. Link: https://lkml.kernel.org/r/20240705032309.24933-1-libang.li@antgroup.com Signed-off-by:
Bang Li <libang.li@antgroup.com> Reviewed-by:
Baolin Wang <baolin.wang@linux.alibaba.com> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Lance Yang <ioworker0@gmail.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org>
-
Ryan Roberts authored
Since the introduction of mTHP, the docuementation has stated that khugepaged would be enabled when any mTHP size is enabled, and disabled when all mTHP sizes are disabled. There are 2 problems with this; 1. this is not what was implemented by the code and 2. this is not the desirable behavior. Desirable behavior is for khugepaged to be enabled when any PMD-sized THP is enabled, anon or file. (Note that file THP is still controlled by the top-level control so we must always consider that, as well as the PMD-size mTHP control for anon). khugepaged only supports collapsing to PMD-sized THP so there is no value in enabling it when PMD-sized THP is disabled. So let's change the code and documentation to reflect this policy. Further, per-size enabled control modification events were not previously forwarded to khugepaged to give it an opportunity to start or stop. Consequently the following was resulting in khugepaged eroneously not being activated: echo never > /sys/kernel/mm/transparent_hugepage/enabled echo always > /sys/kernel/mm/transparent_hugepage/hugepages-2048kB/enabled [ryan.roberts@arm.com: v3] Link: https://lkml.kernel.org/r/20240705102849.2479686-1-ryan.roberts@arm.com Link: https://lkml.kernel.org/r/20240705102849.2479686-1-ryan.roberts@arm.com Link: https://lkml.kernel.org/r/20240704091051.2411934-1-ryan.roberts@arm.com Signed-off-by:
Ryan Roberts <ryan.roberts@arm.com> Fixes: 3485b883 ("mm: thp: introduce multi-size THP sysfs interface") Closes: https://lore.kernel.org/linux-mm/7a0bbe69-1e3d-4263-b206-da007791a5c4@redhat.com/ Acked-by:
David Hildenbrand <david@redhat.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Lance Yang <ioworker0@gmail.com> Cc: Yang Shi <shy828301@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org>
-
Ho-Ren (Jack) Chuang authored
The current memory tier initialization process is distributed across two different functions, memory_tier_init() and memory_tier_late_init(). This design is hard to maintain. Thus, this patch is proposed to reduce the possible code paths by consolidating different initialization patches into one. The earlier discussion with Jonathan and Ying is listed here: https://lore.kernel.org/lkml/20240405150244.00004b49@Huawei.com/ If we want to put these two initializations together, they must be placed together in the later function. Because only at that time, the HMAT information will be ready, adist between nodes can be calculated, and memory tiering can be established based on the adist. So we position the initialization at memory_tier_init() to the memory_tier_late_init() call. Moreover, it's natural to keep memory_tier initialization in drivers at device_initcall() level. If we simply move the set_node_memory_tier() from memory_tier_init() to late_initcall(), it will result in HMAT not registering the mt_adistance_algorithm callback function, because set_node_memory_tier() is not performed during the memory tiering initialization phase, leading to a lack of correct default_dram information. Therefore, we introduced a nodemask to pass the information of the default DRAM nodes. The reason for not choosing to reuse default_dram_type->nodes is that it is not clean enough. So in the end, we use a __initdata variable, which is a variable that is released once initialization is complete, including both CPU and memory nodes for HMAT to iterate through. Link: https://lkml.kernel.org/r/20240704072646.437579-1-horen.chuang@linux.dev Signed-off-by:
Ho-Ren (Jack) Chuang <horenchuang@bytedance.com> Suggested-by:
Jonathan Cameron <Jonathan.Cameron@huawei.com> Reviewed-by:
"Huang, Ying" <ying.huang@intel.com> Reviewed-by:
Jonathan Cameron <Jonathan.Cameron@huawei.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Jiang <dave.jiang@intel.com> Cc: Gregory Price <gourry.memverge@gmail.com> Cc: Len Brown <lenb@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Rafael J. Wysocki <rafael@kernel.org> Cc: Ravi Jonnalagadda <ravis.opensrc@micron.com> Cc: SeongJae Park <sj@kernel.org> Cc: Tejun Heo <tj@kernel.org> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org>
-
Maarten Lankhorst authored
It's a lot of math, and there is nothing memcontrol specific about it. This makes it easier to use inside of the drm cgroup controller. [akpm@linux-foundation.org: fix kerneldoc, per Jeff Johnson] Link: https://lkml.kernel.org/r/20240703112510.36424-1-maarten.lankhorst@linux.intel.com Signed-off-by:
Maarten Lankhorst <maarten.lankhorst@linux.intel.com> Acked-by:
Roman Gushchin <roman.gushchin@linux.dev> Acked-by:
Shakeel Butt <shakeel.butt@linux.dev> Cc: Michal Hocko <mhocko@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Jeff Johnson <quic_jjohnson@quicinc.com> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org>
-
Christoph Hellwig authored
vmf_insert_mixed_mkwrite is only used by the built-in DAX code. Link: https://lkml.kernel.org/r/20240702072327.1640911-1-hch@lst.de Signed-off-by:
Christoph Hellwig <hch@lst.de> Reviewed-by:
Alistair Popple <apopple@nvidia.com> Reviewed-by:
David Hildenbrand <david@redhat.com> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org>
-
Christophe Leroy authored
powerpc was the only user of CONFIG_ARCH_HAS_HUGEPD and doesn't use it anymore, so remove all related code. Link: https://lkml.kernel.org/r/4b10c54c794780b955f3ad6c657d0199dd792146.1719928057.git.christophe.leroy@csgroup.eu Signed-off-by:
Christophe Leroy <christophe.leroy@csgroup.eu> Acked-by:
Oscar Salvador <osalvador@suse.de> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Peter Xu <peterx@redhat.com> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org>
-