  1. Aug 08, 2024
  2. Jul 30, 2024
    • mm, slub: do not call do_slab_free for kfence object · a371d558
      Rik van Riel authored
      
      In 782f8906 the freeing of kfence objects was moved from deep
      inside do_slab_free to the wrapper functions outside. This is a nice
      change, but unfortunately it missed one spot in __kmem_cache_free_bulk.
      
      This results in a crash like this:
      
      BUG skbuff_head_cache (Tainted: G S  B       E     ): Padding overwritten. 0xffff88907fea0f00-0xffff88907fea0fff @offset=3840
      
      slab_err (mm/slub.c:1129)
      free_to_partial_list (mm/slub.c:? mm/slub.c:4036)
      slab_pad_check (mm/slub.c:864 mm/slub.c:1290)
      check_slab (mm/slub.c:?)
      free_to_partial_list (mm/slub.c:3171 mm/slub.c:4036)
      kmem_cache_alloc_bulk (mm/slub.c:? mm/slub.c:4495 mm/slub.c:4586 mm/slub.c:4635)
      napi_build_skb (net/core/skbuff.c:348 net/core/skbuff.c:527 net/core/skbuff.c:549)
      
      All the other callers of do_slab_free() appear to be OK.
      
      Add a kfence_free check in __kmem_cache_free_bulk to avoid the crash.
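
      A minimal sketch of the missing check, with variable names illustrative
      rather than taken from the actual patch:

      	/* inside the bulk-free loop, before handing the object to SLUB: */
      	if (kfence_free(object))
      		continue;	/* KFENCE owns this object */

      	do_slab_free(s, slab, object, object, 1, _RET_IP_);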
      
      Reported-by: Chris Mason <clm@meta.com>
      Fixes: 782f8906 ("mm/slub: free KFENCE objects in slab_free_hook()")
      Cc: stable@kernel.org
      Signed-off-by: Rik van Riel <riel@surriel.com>
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      a371d558
  3. Jul 28, 2024
    • minmax: make generic MIN() and MAX() macros available everywhere · 1a251f52
      Linus Torvalds authored
      
      This just standardizes the use of MIN() and MAX() macros, with the very
      traditional semantics.  The goal is to use these for C constant
      expressions and for top-level / static initializers, and so be able to
      simplify the min()/max() macros.
      
      These macro names were used by various kernel code - they are very
      traditional, after all - and all such users have been fixed up, with a
      few different approaches:
      
       - trivial duplicated macro definitions have been removed
      
         Note that 'trivial' here means that it's obviously kernel code that
         already included all the major kernel headers, and thus gets the new
         generic MIN/MAX macros automatically.
      
       - non-trivial duplicated macro definitions are guarded with #ifndef
      
         This is the "yes, they define their own versions, but no, the include
         situation is not entirely obvious, and maybe they don't get the
         generic version automatically" case.
      
       - strange use case #1
      
         A couple of drivers decided that the way they want to describe their
         versioning is with
      
      	#define MAJ 1
      	#define MIN 2
      	#define DRV_VERSION __stringify(MAJ) "." __stringify(MIN)
      
         which adds zero value and I just did my Alexander the Great
         impersonation, and rewrote that pointless Gordian knot as
      
      	#define DRV_VERSION "1.2"
      
         instead.
      
       - strange use case #2
      
         A couple of drivers thought that it's a good idea to have a random
         'MIN' or 'MAX' define for a value or index into a table, rather than
         the traditional macro that takes arguments.
      
         These values were rewritten as C enums instead. The new
         function-like macros only expand when followed by an open
         parenthesis, and thus don't clash with the enum use.
      
      Happily, there weren't really all that many of these cases, and a lot of
      users already had the pattern of using '#ifndef' guarding (or in one
      case just using '#undef MIN') before defining their own private version
      that does the same thing. I left such cases alone.
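
      For reference, a sketch of the two pieces involved - the traditional
      semantics of the new generic macros (shown in their plain form; the
      actual <linux/minmax.h> definitions reuse its internal comparison
      helpers) and the #ifndef guard used for the non-trivial duplicates:

      	/* constant-expression friendly, usable in static initializers */
      	#define MIN(a, b)	((a) < (b) ? (a) : (b))
      	#define MAX(a, b)	((a) > (b) ? (a) : (b))

      	/* local definitions kept, guarded against the generic ones */
      	#ifndef MAX
      	#define MAX(a, b)	((a) > (b) ? (a) : (b))
      	#endif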
      
      Cc: David Laight <David.Laight@aculab.com>
      Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1a251f52
  4. Jul 26, 2024
    • mm/page_alloc: fix pcp->count race between drain_pages_zone() vs __rmqueue_pcplist() · 66eca102
      Li Zhijian authored
      It's expected that no page should be left in pcp_list after calling
      zone_pcp_disable() in offline_pages().  However, it has been observed
      that offline_pages() can get stuck [1] due to some pages remaining in
      pcp_list.

      Cause:
      There is a race condition between drain_pages_zone() and
      __rmqueue_pcplist() involving the pcp->count variable.  See the
      scenario below:
      
               CPU0                              CPU1
          ----------------                    ---------------
                                            spin_lock(&pcp->lock);
                                            __rmqueue_pcplist() {
      zone_pcp_disable() {
                                              /* list is empty */
                                              if (list_empty(list)) {
                                                /* add pages to pcp_list */
                                                alloced = rmqueue_bulk()
        mutex_lock(&pcp_batch_high_lock)
        ...
        __drain_all_pages() {
          drain_pages_zone() {
            /* read pcp->count, it's 0 here */
            count = READ_ONCE(pcp->count)
            /* 0 means nothing to drain */
                                                /* update pcp->count */
                                                pcp->count += alloced << order;
            ...
                                            ...
                                            spin_unlock(&pcp->lock);
      
      In this case, even after zone_pcp_disable() returns, some pages are
      still left in pcp_list.  Because these pages are neither movable nor
      isolated, offline_pages() gets stuck as a result.

      Solution:
      Expand the scope of the pcp->lock to also protect pcp->count in
      drain_pages_zone(), ensuring that no pages are left in the pcp list
      after zone_pcp_disable().
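
      A hedged sketch of the resulting drain_pages_zone(), close in spirit to
      (but not a literal copy of) the fixed code - the count is read and the
      pages are freed under the same pcp->lock:

      	static void drain_pages_zone(unsigned int cpu, struct zone *zone)
      	{
      		struct per_cpu_pages *pcp = per_cpu_ptr(zone->per_cpu_pageset, cpu);
      		int count;

      		do {
      			spin_lock(&pcp->lock);
      			count = pcp->count;
      			if (count) {
      				int to_drain = min(count,
      					pcp->batch << CONFIG_PCP_BATCH_SCALE_MAX);

      				free_pcppages_bulk(zone, to_drain, pcp, 0);
      				count -= to_drain;
      			}
      			spin_unlock(&pcp->lock);
      		} while (count);
      	}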
      
      [1] https://lore.kernel.org/linux-mm/6a07125f-e720-404c-b2f9-e55f3f166e85@fujitsu.com/
      
      Link: https://lkml.kernel.org/r/20240723064428.1179519-1-lizhijian@fujitsu.com
      
      
      Fixes: 4b23a68f ("mm/page_alloc: protect PCP lists with a spinlock")
      Signed-off-by: Li Zhijian <lizhijian@fujitsu.com>
      Reported-by: Yao Xingtao <yaoxt.fnst@fujitsu.com>
      Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      66eca102
    • alloc_tag: outline and export free_reserved_page() · b3bebe44
      Suren Baghdasaryan authored
      Outline and export free_reserved_page() because modules use it, and it
      in turn uses page_ext_{get|put}, which should not be exported.  The
      same result could be obtained by outlining {get|put}_page_tag_ref(),
      but that would have a higher performance impact, as those functions are
      used in more performance-critical paths.
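
      A sketch of the shape of the change (the alloc_tag bookkeeping that
      actually pulls in page_ext is only hinted at by a comment):

      	/* mm/page_alloc.c: formerly a static inline in <linux/mm.h> */
      	void free_reserved_page(struct page *page)
      	{
      		/* alloc_tag bookkeeping here is what needs page_ext_{get|put} */
      		ClearPageReserved(page);
      		init_page_count(page);
      		__free_page(page);
      		adjust_managed_page_count(page, 1);
      	}
      	EXPORT_SYMBOL(free_reserved_page);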
      
      Link: https://lkml.kernel.org/r/20240717212844.2749975-1-surenb@google.com
      
      
      Fixes: dcfe378c ("lib: introduce support for page allocation tagging")
      Signed-off-by: Suren Baghdasaryan <surenb@google.com>
      Reported-by: kernel test robot <lkp@intel.com>
      Closes: https://lore.kernel.org/oe-kbuild-all/202407080044.DWMC9N9I-lkp@intel.com/
      
      
      Suggested-by: Christoph Hellwig <hch@infradead.org>
      Suggested-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Kent Overstreet <kent.overstreet@linux.dev>
      Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
      Cc: Sourav Panda <souravpanda@google.com>
      Cc: <stable@vger.kernel.org>	[6.10]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      b3bebe44
    • mm/huge_memory: avoid PMD-size page cache if needed · d659b715
      Gavin Shan authored
      xarray can't support an arbitrary page cache size.  The largest
      supported page cache size is defined as MAX_PAGECACHE_ORDER by commit
      099d9064 ("mm/filemap: make MAX_PAGECACHE_ORDER acceptable to xarray").
      However, a 512MB page cache can be created in the huge memory collapse
      path on an ARM64 system whose base page size is 64KB.  Such a page
      cache breaks the limitation, and a warning is raised when the xarray
      entry is split, as shown in the following example.
      
      [root@dhcp-10-26-1-207 ~]# cat /proc/1/smaps | grep KernelPageSize
      KernelPageSize:       64 kB
      [root@dhcp-10-26-1-207 ~]# cat /tmp/test.c
         :
      int main(int argc, char **argv)
      {
      	const char *filename = TEST_XFS_FILENAME;
      	int fd = 0;
      	void *buf = (void *)-1, *p;
      	int pgsize = getpagesize();
      	int ret = 0;
      
      	if (pgsize != 0x10000) {
      		fprintf(stdout, "System with 64KB base page size is required!\n");
      		return -EPERM;
      	}
      
      	system("echo 0 > /sys/devices/virtual/bdi/253:0/read_ahead_kb");
      	system("echo 1 > /proc/sys/vm/drop_caches");
      
      	/* Open the xfs file */
      	fd = open(filename, O_RDONLY);
      	assert(fd > 0);
      
      	/* Create VMA */
      	buf = mmap(NULL, TEST_MEM_SIZE, PROT_READ, MAP_SHARED, fd, 0);
      	assert(buf != (void *)-1);
      	fprintf(stdout, "mapped buffer at 0x%p\n", buf);
      
      	/* Populate VMA */
      	ret = madvise(buf, TEST_MEM_SIZE, MADV_NOHUGEPAGE);
      	assert(ret == 0);
      	ret = madvise(buf, TEST_MEM_SIZE, MADV_POPULATE_READ);
      	assert(ret == 0);
      
      	/* Collapse VMA */
      	ret = madvise(buf, TEST_MEM_SIZE, MADV_HUGEPAGE);
      	assert(ret == 0);
      	ret = madvise(buf, TEST_MEM_SIZE, MADV_COLLAPSE);
      	if (ret) {
      		fprintf(stdout, "Error %d to madvise(MADV_COLLAPSE)\n", errno);
      		goto out;
      	}
      
      	/* Split xarray entry. Write permission is needed */
      	munmap(buf, TEST_MEM_SIZE);
      	buf = (void *)-1;
      	close(fd);
      	fd = open(filename, O_RDWR);
      	assert(fd > 0);
      	fallocate(fd, FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE,
       		  TEST_MEM_SIZE - pgsize, pgsize);
      out:
      	if (buf != (void *)-1)
      		munmap(buf, TEST_MEM_SIZE);
      	if (fd > 0)
      		close(fd);
      
      	return ret;
      }
      
      [root@dhcp-10-26-1-207 ~]# gcc /tmp/test.c -o /tmp/test
      [root@dhcp-10-26-1-207 ~]# /tmp/test
       ------------[ cut here ]------------
       WARNING: CPU: 25 PID: 7560 at lib/xarray.c:1025 xas_split_alloc+0xf8/0x128
       Modules linked in: nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib    \
       nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct      \
       nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4      \
       ip_set rfkill nf_tables nfnetlink vfat fat virtio_balloon drm fuse   \
       xfs libcrc32c crct10dif_ce ghash_ce sha2_ce sha256_arm64 virtio_net  \
       sha1_ce net_failover virtio_blk virtio_console failover dimlib virtio_mmio
       CPU: 25 PID: 7560 Comm: test Kdump: loaded Not tainted 6.10.0-rc7-gavin+ #9
       Hardware name: QEMU KVM Virtual Machine, BIOS edk2-20240524-1.el9 05/24/2024
       pstate: 83400005 (Nzcv daif +PAN -UAO +TCO +DIT -SSBS BTYPE=--)
       pc : xas_split_alloc+0xf8/0x128
       lr : split_huge_page_to_list_to_order+0x1c4/0x780
       sp : ffff8000ac32f660
       x29: ffff8000ac32f660 x28: ffff0000e0969eb0 x27: ffff8000ac32f6c0
       x26: 0000000000000c40 x25: ffff0000e0969eb0 x24: 000000000000000d
       x23: ffff8000ac32f6c0 x22: ffffffdfc0700000 x21: 0000000000000000
       x20: 0000000000000000 x19: ffffffdfc0700000 x18: 0000000000000000
       x17: 0000000000000000 x16: ffffd5f3708ffc70 x15: 0000000000000000
       x14: 0000000000000000 x13: 0000000000000000 x12: 0000000000000000
       x11: ffffffffffffffc0 x10: 0000000000000040 x9 : ffffd5f3708e692c
       x8 : 0000000000000003 x7 : 0000000000000000 x6 : ffff0000e0969eb8
       x5 : ffffd5f37289e378 x4 : 0000000000000000 x3 : 0000000000000c40
       x2 : 000000000000000d x1 : 000000000000000c x0 : 0000000000000000
       Call trace:
        xas_split_alloc+0xf8/0x128
        split_huge_page_to_list_to_order+0x1c4/0x780
        truncate_inode_partial_folio+0xdc/0x160
        truncate_inode_pages_range+0x1b4/0x4a8
        truncate_pagecache_range+0x84/0xa0
        xfs_flush_unmap_range+0x70/0x90 [xfs]
        xfs_file_fallocate+0xfc/0x4d8 [xfs]
        vfs_fallocate+0x124/0x2f0
        ksys_fallocate+0x4c/0xa0
        __arm64_sys_fallocate+0x24/0x38
        invoke_syscall.constprop.0+0x7c/0xd8
        do_el0_svc+0xb4/0xd0
        el0_svc+0x44/0x1d8
        el0t_64_sync_handler+0x134/0x150
        el0t_64_sync+0x17c/0x180
      
      Fix it by correcting the supported page cache orders, with different
      sets for DAX and other files.  With this correction, a 512MB page cache
      becomes disallowed for all non-DAX files on ARM64 systems where the
      base page size is 64KB.  After this patch is applied, the test program
      fails with -EINVAL returned from __thp_vma_allowable_orders() and from
      the madvise() system call that collapses the page cache.
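
      A hedged sketch of the idea only, not the literal patch (the local
      'orders' mask is illustrative): cap the orders allowed for non-DAX
      file mappings at what the page cache can hold.

      	/* non-DAX file-backed VMAs must not exceed MAX_PAGECACHE_ORDER */
      	if (!vma_is_anonymous(vma) && !vma_is_dax(vma))
      		orders &= BIT(MAX_PAGECACHE_ORDER + 1) - 1;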
      
      Link: https://lkml.kernel.org/r/20240715000423.316491-1-gshan@redhat.com
      
      
      Fixes: 6b24ca4a ("mm: Use multi-index entries in the page cache")
      Signed-off-by: Gavin Shan <gshan@redhat.com>
      Acked-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
      Acked-by: Zi Yan <ziy@nvidia.com>
      Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
      Cc: Barry Song <baohua@kernel.org>
      Cc: Don Dutile <ddutile@redhat.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Ryan Roberts <ryan.roberts@arm.com>
      Cc: William Kucharski <william.kucharski@oracle.com>
      Cc: <stable@vger.kernel.org>	[5.17+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      d659b715
    • mm: huge_memory: use !CONFIG_64BIT to relax huge page alignment on 32 bit machines · d9592025
      Yang Shi authored
      Yves-Alexis Perez reported that commit 4ef9ad19 ("mm: huge_memory:
      don't force huge page alignment on 32 bit") didn't work for x86_32 [1].
      This is because x86_32 uses CONFIG_X86_32 instead of CONFIG_32BIT.

      !CONFIG_64BIT should cover all 32-bit machines.
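
      A minimal sketch of the intent (surrounding logic omitted):

      	/* !CONFIG_64BIT covers every 32-bit architecture, including x86_32,
      	 * which defines CONFIG_X86_32 but not CONFIG_32BIT. */
      	if (!IS_ENABLED(CONFIG_64BIT))
      		return 0;	/* don't force huge page alignment */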
      
      [1] https://lore.kernel.org/linux-mm/CAHbLzkr1LwH3pcTgM+aGQ31ip2bKqiqEQ8=FQB+t2c3dhNKNHA@mail.gmail.com/
      
      Link: https://lkml.kernel.org/r/20240712155855.1130330-1-yang@os.amperecomputing.com
      
      
      Fixes: 4ef9ad19 ("mm: huge_memory: don't force huge page alignment on 32 bit")
      Signed-off-by: Yang Shi <yang@os.amperecomputing.com>
      Reported-by: Yves-Alexis Perez <corsac@debian.org>
      Tested-by: Yves-Alexis Perez <corsac@debian.org>
      Acked-by: David Hildenbrand <david@redhat.com>
      Cc: Ben Hutchings <ben@decadent.org.uk>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Jiri Slaby <jirislaby@kernel.org>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Salvatore Bonaccorso <carnil@debian.org>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: <stable@vger.kernel.org>	[6.8+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      d9592025
    • mm: fix old/young bit handling in the faulting path · 4cd7ba16
      Ram Tummala authored
      Commit 3bd786f7 ("mm: convert do_set_pte() to set_pte_range()")
      replaced do_set_pte() with set_pte_range(), and that introduced a
      regression in the following fault path for non-anonymous VMAs, causing
      the PTE for the faulting address to be marked as old instead of young.
      
      handle_pte_fault()
        do_pte_missing()
          do_fault()
            do_read_fault() || do_cow_fault() || do_shared_fault()
              finish_fault()
                set_pte_range()
      
      The polarity of the prefault calculation is incorrect.  This leads to
      prefault being incorrectly set for the faulting address.  The following
      check will then incorrectly mark the PTE old rather than young.  On
      some architectures this causes a second fault, just to mark the PTE
      young when the access is retried.
      
          if (prefault && arch_wants_old_prefaulted_pte())
              entry = pte_mkold(entry);
      
      On a subsequent fault on the same address, the fault path will see a
      non-NULL vmf->pte, and instead of reaching the do_pte_missing() path,
      the PTE will then be correctly marked young in handle_pte_fault()
      itself.

      Due to this bug, a performance degradation is observed in the fault
      handling path because of the unnecessary second fault.
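
      A hedged sketch of the intended polarity in set_pte_range() (not
      necessarily the literal one-liner): an entry is a prefault only when it
      does not cover the faulting address, and only prefaulted entries may be
      marked old.

      	bool prefault = !in_range(vmf->address, addr, nr * PAGE_SIZE);

      	if (prefault && arch_wants_old_prefaulted_pte())
      		entry = pte_mkold(entry);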
      
      Link: https://lkml.kernel.org/r/20240710014539.746200-1-rtummala@nvidia.com
      
      
      Fixes: 3bd786f7 ("mm: convert do_set_pte() to set_pte_range()")
      Signed-off-by: Ram Tummala <rtummala@nvidia.com>
      Reviewed-by: Yin Fengwei <fengwei.yin@intel.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Yin Fengwei <fengwei.yin@intel.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      4cd7ba16
  5. Jul 24, 2024
    • sysctl: treewide: constify the ctl_table argument of proc_handlers · 78eb4ea2
      Joel Granados authored
      
      Const-qualify the struct ctl_table argument in the proc_handler
      function signatures.  This is a prerequisite for moving the static
      ctl_table structs into .rodata, which will ensure that the proc_handler
      function pointers stored in them cannot be modified.
      
      This patch has been generated by the following coccinelle script:
      
      ```
        virtual patch
      
        @r1@
        identifier ctl, write, buffer, lenp, ppos;
        identifier func !~ "appldata_(timer|interval)_handler|sched_(rt|rr)_handler|rds_tcp_skbuf_handler|proc_sctp_do_(hmac_alg|rto_min|rto_max|udp_port|alpha_beta|auth|probe_interval)";
        @@
      
        int func(
        - struct ctl_table *ctl
        + const struct ctl_table *ctl
          ,int write, void *buffer, size_t *lenp, loff_t *ppos);
      
        @r2@
        identifier func, ctl, write, buffer, lenp, ppos;
        @@
      
        int func(
        - struct ctl_table *ctl
        + const struct ctl_table *ctl
          ,int write, void *buffer, size_t *lenp, loff_t *ppos)
        { ... }
      
        @r3@
        identifier func;
        @@
      
        int func(
        - struct ctl_table *
        + const struct ctl_table *
          ,int , void *, size_t *, loff_t *);
      
        @r4@
        identifier func, ctl;
        @@
      
        int func(
        - struct ctl_table *ctl
        + const struct ctl_table *ctl
          ,int , void *, size_t *, loff_t *);
      
        @r5@
        identifier func, write, buffer, lenp, ppos;
        @@
      
        int func(
        - struct ctl_table *
        + const struct ctl_table *
          ,int write, void *buffer, size_t *lenp, loff_t *ppos);
      
      ```
      
      * Code formatting was adjusted in xfs_sysctl.c to comply with code
        conventions. The xfs_stats_clear_proc_handler,
        xfs_panic_mask_proc_handler and xfs_deprecated_dointvec_minmax were
        adjusted.
      
      * The ctl_table argument in proc_watchdog_common was const qualified.
        This is called from a proc_handler itself and is calling back into
        another proc_handler, making it necessary to change it as part of the
        proc_handler migration.
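
      For illustration, the resulting const-qualified handler signature
      (proc_dointvec shown as a representative example):

      	int proc_dointvec(const struct ctl_table *table, int write,
      			  void *buffer, size_t *lenp, loff_t *ppos);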
      
      Co-developed-by: Thomas Weißschuh <linux@weissschuh.net>
      Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
      Co-developed-by: Joel Granados <j.granados@samsung.com>
      Signed-off-by: Joel Granados <j.granados@samsung.com>
      78eb4ea2
  6. Jul 19, 2024
    • mm: add MAP_DROPPABLE for designating always lazily freeable mappings · 9651fced
      Jason A. Donenfeld authored
      
      The vDSO getrandom() implementation works with a buffer allocated with a
      new system call that has certain requirements:
      
      - It shouldn't be written to core dumps.
        * Easy: VM_DONTDUMP.
      - It should be zeroed on fork.
        * Easy: VM_WIPEONFORK.
      
      - It shouldn't be written to swap.
        * Uh-oh: mlock is rlimited.
        * Uh-oh: mlock isn't inherited by forks.
      
      - It shouldn't reserve actual memory, but it also shouldn't crash when
        page faulting in memory if none is available
        * Uh-oh: VM_NORESERVE means segfaults.
      
      It turns out that the vDSO getrandom() function has three really nice
      characteristics that we can exploit to solve this problem:
      
      1) Due to being wiped during fork(), the vDSO code is already robust to
         having the contents of the pages it reads zeroed out midway through
         the function's execution.
      
      2) In the absolute worst case of whatever contingency we're coding for,
         we have the option to fallback to the getrandom() syscall, and
         everything is fine.
      
      3) The buffers the function uses are only ever useful for a maximum of
         60 seconds -- a sort of cache, rather than a long term allocation.
      
      These characteristics mean that we can introduce VM_DROPPABLE, which
      has the following semantics:
      
      a) It never is written out to swap.
      b) Under memory pressure, mm can just drop the pages (so that they're
         zero when read back again).
      c) It is inherited by fork.
      d) It doesn't count against the mlock budget, since nothing is locked.
      e) If there's not enough memory to service a page fault, it's not fatal,
         and no signal is sent.
      
      This way, allocations used by vDSO getrandom() can use:
      
          VM_DROPPABLE | VM_DONTDUMP | VM_WIPEONFORK | VM_NORESERVE
      
      And there will be no problem with OOMing, crashing on overcommitment,
      using memory when not in use, not wiping on fork(), coredumps, or
      writing out to swap.
      
      In order to let vDSO getrandom() use this, expose these via mmap(2) as
      MAP_DROPPABLE.
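
      A hedged userspace sketch of how such a buffer could be set up (the
      fallback helper name is illustrative; VM_DONTDUMP/VM_WIPEONFORK would
      correspond to additional madvise() calls per the list above):

      	void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
      			 MAP_ANONYMOUS | MAP_PRIVATE | MAP_NORESERVE |
      			 MAP_DROPPABLE, -1, 0);
      	if (buf == MAP_FAILED)
      		return fallback_to_getrandom_syscall();	/* illustrative */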
      
      Note that this involves removing the MADV_FREE special case from
      sort_folio(), which according to Yu Zhao is unnecessary and will simply
      result in an extra call to shrink_folio_list() in the worst case. The
      chunk removed reenables the swapbacked flag, which we don't want for
      VM_DROPPABLE, and we can't conditionalize it here because there isn't a
      vma reference available.
      
      Finally, the provided self test ensures that this is working as desired.
      
      Cc: linux-mm@kvack.org
      Acked-by: David Hildenbrand <david@redhat.com>
      Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
      9651fced
  7. Jul 18, 2024
    • mm/mglru: fix ineffective protection calculation · 30d77b7e
      Yu Zhao authored
      mem_cgroup_calculate_protection() is not stateless and should only be used
      as part of a top-down tree traversal.  shrink_one() traverses the per-node
      memcg LRU instead of the root_mem_cgroup tree, and therefore it should not
      call mem_cgroup_calculate_protection().
      
      The existing misuse in shrink_one() can cause ineffective protection of
      sub-trees that are grandchildren of root_mem_cgroup.  Fix it by reusing
      lru_gen_age_node(), which already traverses the root_mem_cgroup tree, to
      calculate the protection.
      
      Previously lru_gen_age_node() opportunistically skips the first pass,
      i.e., when scan_control->priority is DEF_PRIORITY.  On the second pass,
      lruvec_is_sizable() uses appropriate scan_control->priority, set by
      set_initial_priority() from lru_gen_shrink_node(), to decide whether a
      memcg is too small to reclaim from.
      
      Now lru_gen_age_node() unconditionally traverses the root_mem_cgroup tree.
      So it should call set_initial_priority() upfront, to make sure
      lruvec_is_sizable() uses appropriate scan_control->priority on the first
      pass.  Otherwise, lruvec_is_reclaimable() can return false negatives and
      result in premature OOM kills when min_ttl_ms is used.
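
      A hedged sketch of the resulting flow (details of the aging loop
      elided); it is not the literal patch:

      	static void lru_gen_age_node(struct pglist_data *pgdat, struct scan_control *sc)
      	{
      		struct mem_cgroup *memcg;

      		/* before any lruvec_is_sizable() check on the first pass */
      		set_initial_priority(pgdat, sc);

      		memcg = mem_cgroup_iter(NULL, NULL, NULL);
      		do {
      			/* valid here: this is a top-down tree traversal */
      			mem_cgroup_calculate_protection(NULL, memcg);
      			/* ... age the memcg's lruvec, honouring min_ttl_ms ... */
      		} while ((memcg = mem_cgroup_iter(NULL, memcg, NULL)));
      	}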
      
      Link: https://lkml.kernel.org/r/20240712232956.1427127-1-yuzhao@google.com
      
      
      Fixes: e4dde56c ("mm: multi-gen LRU: per-node lru_gen_folio lists")
      Signed-off-by: Yu Zhao <yuzhao@google.com>
      Reported-by: T.J. Mercier <tjmercier@google.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      30d77b7e
    • mm/zswap: fix a white space issue · b749cb0d
      Dan Carpenter authored
      We accidentally deleted a tab in commit f84152e9efc5 ("mm/zswap: use only
      one pool in zswap").  Add it back.
      
      Link: https://lkml.kernel.org/r/c15066a0-f061-42c9-b0f5-d60281d3d5d8@stanley.mountain
      
      
      Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
      Reviewed-by: Chengming Zhou <chengming.zhou@linux.dev>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      b749cb0d
    • mm/hugetlb: fix kernel NULL pointer dereference when migrating hugetlb folio · 1390a333
      Miaohe Lin authored
      A kernel crash was observed when migrating hugetlb folio:
      
      BUG: kernel NULL pointer dereference, address: 0000000000000008
      PGD 0 P4D 0
      Oops: Oops: 0002 [#1] PREEMPT SMP NOPTI
      CPU: 0 PID: 3435 Comm: bash Not tainted 6.10.0-rc6-00450-g8578ca01f21f #66
      RIP: 0010:__folio_undo_large_rmappable+0x70/0xb0
      RSP: 0018:ffffb165c98a7b38 EFLAGS: 00000097
      RAX: fffffbbc44528090 RBX: 0000000000000000 RCX: 0000000000000000
      RDX: ffffa30e000a2800 RSI: 0000000000000246 RDI: ffffa3153ffffcc0
      RBP: fffffbbc44528000 R08: 0000000000002371 R09: ffffffffbe4e5868
      R10: 0000000000000001 R11: 0000000000000001 R12: ffffa3153ffffcc0
      R13: fffffbbc44468000 R14: 0000000000000001 R15: 0000000000000001
      FS:  00007f5b3a716740(0000) GS:ffffa3151fc00000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 0000000000000008 CR3: 000000010959a000 CR4: 00000000000006f0
      Call Trace:
       <TASK>
       __folio_migrate_mapping+0x59e/0x950
       __migrate_folio.constprop.0+0x5f/0x120
       move_to_new_folio+0xfd/0x250
       migrate_pages+0x383/0xd70
       soft_offline_page+0x2ab/0x7f0
       soft_offline_page_store+0x52/0x90
       kernfs_fop_write_iter+0x12c/0x1d0
       vfs_write+0x380/0x540
       ksys_write+0x64/0xe0
       do_syscall_64+0xb9/0x1d0
       entry_SYSCALL_64_after_hwframe+0x77/0x7f
      RIP: 0033:0x7f5b3a514887
      RSP: 002b:00007ffe138fce68 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
      RAX: ffffffffffffffda RBX: 000000000000000c RCX: 00007f5b3a514887
      RDX: 000000000000000c RSI: 0000556ab809ee10 RDI: 0000000000000001
      RBP: 0000556ab809ee10 R08: 00007f5b3a5d1460 R09: 000000007fffffff
      R10: 0000000000000000 R11: 0000000000000246 R12: 000000000000000c
      R13: 00007f5b3a61b780 R14: 00007f5b3a617600 R15: 00007f5b3a616a00
      
      This happens because a hugetlb folio is unexpectedly passed to
      __folio_undo_large_rmappable().  The large_rmappable flag has been set
      on hugetlb folios, unnoticed, since commit f6a8dd98 ("hugetlb: convert
      alloc_buddy_hugetlb_folio to use a folio").  Then commit be9581ea ("mm:
      fix crashes from deferred split racing folio migration") made
      folio_migrate_mapping() call folio_undo_large_rmappable(), triggering
      the bug.  Fix this issue by clearing the large_rmappable flag for
      hugetlb folios.  They don't need that flag set anyway.
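
      A hedged sketch of the idea only; the helper and call-site names are
      illustrative, not the literal patch:

      	/* when a hugetlb folio is prepared, drop the stale flag so that
      	 * migration never reaches __folio_undo_large_rmappable() for it */
      	if (folio_test_large_rmappable(folio))
      		folio_clear_large_rmappable(folio);	/* assumed flag helper */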
      
      Link: https://lkml.kernel.org/r/20240709120433.4136700-1-linmiaohe@huawei.com
      
      
      Fixes: f6a8dd98 ("hugetlb: convert alloc_buddy_hugetlb_folio to use a folio")
      Fixes: be9581ea ("mm: fix crashes from deferred split racing folio migration")
      Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Muchun Song <muchun.song@linux.dev>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      1390a333
    • mm/hugetlb: fix possible recursive locking detected warning · 667574e8
      Miaohe Lin authored
      When trying to demote 1G hugetlb folios, a lockdep warning is observed:
      
      ============================================
      WARNING: possible recursive locking detected
      6.10.0-rc6-00452-ga4d0275fa660-dirty #79 Not tainted
      --------------------------------------------
      bash/710 is trying to acquire lock:
      ffffffff8f0a7850 (&h->resize_lock){+.+.}-{3:3}, at: demote_store+0x244/0x460
      
      but task is already holding lock:
      ffffffff8f0a6f48 (&h->resize_lock){+.+.}-{3:3}, at: demote_store+0xae/0x460
      
      other info that might help us debug this:
       Possible unsafe locking scenario:
      
             CPU0
             ----
        lock(&h->resize_lock);
        lock(&h->resize_lock);
      
       *** DEADLOCK ***
      
       May be due to missing lock nesting notation
      
      4 locks held by bash/710:
       #0: ffff8f118439c3f0 (sb_writers#5){.+.+}-{0:0}, at: ksys_write+0x64/0xe0
       #1: ffff8f11893b9e88 (&of->mutex#2){+.+.}-{3:3}, at: kernfs_fop_write_iter+0xf8/0x1d0
       #2: ffff8f1183dc4428 (kn->active#98){.+.+}-{0:0}, at: kernfs_fop_write_iter+0x100/0x1d0
       #3: ffffffff8f0a6f48 (&h->resize_lock){+.+.}-{3:3}, at: demote_store+0xae/0x460
      
      stack backtrace:
      CPU: 3 PID: 710 Comm: bash Not tainted 6.10.0-rc6-00452-ga4d0275fa660-dirty #79
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
      Call Trace:
       <TASK>
       dump_stack_lvl+0x68/0xa0
       __lock_acquire+0x10f2/0x1ca0
       lock_acquire+0xbe/0x2d0
       __mutex_lock+0x6d/0x400
       demote_store+0x244/0x460
       kernfs_fop_write_iter+0x12c/0x1d0
       vfs_write+0x380/0x540
       ksys_write+0x64/0xe0
       do_syscall_64+0xb9/0x1d0
       entry_SYSCALL_64_after_hwframe+0x77/0x7f
      RIP: 0033:0x7fa61db14887
      RSP: 002b:00007ffc56c48358 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
      RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007fa61db14887
      RDX: 0000000000000002 RSI: 000055a030050220 RDI: 0000000000000001
      RBP: 000055a030050220 R08: 00007fa61dbd1460 R09: 000000007fffffff
      R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000002
      R13: 00007fa61dc1b780 R14: 00007fa61dc17600 R15: 00007fa61dc16a00
       </TASK>
      
      Lockdep considers this an AA deadlock because the different resize_lock
      mutexes reside in the same lockdep class, but this is a false positive.
      Place them in distinct classes to avoid these warnings.
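
      A hedged sketch of one way to express that (not the literal patch):
      give each hstate's resize_lock its own lockdep class, so holding the 1G
      lock while taking the 2M lock during demotion is no longer reported as
      recursive locking.

      	static struct lock_class_key resize_lock_keys[HUGE_MAX_HSTATE];

      	static void hugetlb_init_resize_lock(struct hstate *h)
      	{
      		mutex_init(&h->resize_lock);
      		lockdep_set_class(&h->resize_lock,
      				  &resize_lock_keys[hstate_index(h)]);
      	}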
      
      Link: https://lkml.kernel.org/r/20240712031314.2570452-1-linmiaohe@huawei.com
      
      
      Fixes: 8531fc6f ("hugetlb: add hugetlb demote page support")
      Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
      Acked-by: Muchun Song <muchun.song@linux.dev>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      667574e8
    • mm/gup: clear the LRU flag of a page before adding to LRU batch · 33dfe920
      yangge authored
      If a large amount of CMA memory is configured in the system (for
      example, CMA memory accounts for 50% of the system memory), starting a
      virtual machine with device passthrough will call
      pin_user_pages_remote(..., FOLL_LONGTERM, ...) to pin memory.  Normally,
      if a page is present and in a CMA area, pin_user_pages_remote() will
      migrate the page from the CMA area to a non-CMA area because of the
      FOLL_LONGTERM flag.  But the current code causes the migration to fail
      due to unexpected page refcounts, and eventually causes the virtual
      machine to fail to start.

      When a page is added to an LRU batch, its refcount increases by one;
      removing the page from the LRU batch decreases it by one.  Page
      migration requires that the page not be referenced by anything except
      its page mapping.  Before migrating a page, we should try to drain it
      from the LRU batch in case it is there; however, folio_test_lru() is
      not sufficient to tell whether the page is in an LRU batch or not, and
      if the page is in an LRU batch, the migration will fail.
      
      To solve the problem above, modify the logic of adding to an LRU batch:
      clear the LRU flag of the page before adding it, so that whether the
      page is in an LRU batch can be checked with folio_test_lru(page).  This
      is quite valuable, because we likely don't want to blindly drain the
      LRU batches simply because there is some unexpected reference on a
      page, as described above.
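
      A hedged sketch of the new add-to-batch pattern (local names are
      illustrative): claim the LRU flag before queueing, so a folio whose LRU
      flag is still set is guaranteed not to be sitting in any LRU batch.

      	if (folio_test_clear_lru(folio)) {
      		folio_get(folio);
      		if (!folio_batch_add(fbatch, folio))
      			folio_batch_move_lru(fbatch, move_fn);
      	}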
      
      This change makes the LRU flag of a page invisible for longer, which
      may impact some programs.  For example, as long as a page is in an LRU
      batch, we cannot isolate it, and we cannot check whether it's an LRU
      page.  Further, a page can now only be in exactly one LRU batch.  This
      doesn't seem to matter much, because when a new page is allocated from
      the buddy allocator and added to an LRU batch, or is isolated, its LRU
      flag may also be invisible for a long time.
      
      Link: https://lkml.kernel.org/r/1720075944-27201-1-git-send-email-yangge1116@126.com
      Link: https://lkml.kernel.org/r/1720008153-16035-1-git-send-email-yangge1116@126.com
      
      
      Fixes: 9a4e9f3b ("mm: update get_user_pages_longterm to migrate pages allocated from CMA region")
      Signed-off-by: yangge <yangge1116@126.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Barry Song <21cnbao@gmail.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      33dfe920
    • mm/numa_balancing: teach mpol_to_str about the balancing mode · af649773
      Tvrtko Ursulin authored
      Since balancing mode was added in bda420b9 ("numa balancing: migrate
      on fault among multiple bound nodes"), it was possible to set this mode
      but it wouldn't be shown in /proc/<pid>/numa_maps since there was no
      support for it in the mpol_to_str() helper.
      
      Furthermore, because the balancing mode sets the MPOL_F_MORON flag, it
      would be displayed as 'default' due to a workaround introduced a few
      years earlier in 8790c71a ("mm/mempolicy.c: fix mempolicy printing in
      numa_maps").
      
      To tidy this up we implement two changes:
      
      Replace the MPOL_F_MORON check by pointer comparison against the
      preferred_node_policy array.  By doing this we generalise the current
      special casing and replace the incorrect 'default' with the correct 'bind'
      for the mode.
      
      Secondly, we add a string representation and corresponding handling for
      the MPOL_F_NUMA_BALANCING flag.
      
      With the two changes together we start showing the balancing flag when it
      is set and therefore complete the fix.
      
      The chosen representation format separates multiple flags with vertical
      bars, following what existed long ago in kernel 2.6.25.  But since
      between then and now there was no way to display multiple flags, this
      patch does not change the format in practice.
      
      Some /proc/<pid>/numa_maps output examples:
      
       555559580000 bind=balancing:0-1,3 file=...
       555585800000 bind=balancing|static:0,2 file=...
       555635240000 prefer=relative:0 file=
      
      Link: https://lkml.kernel.org/r/20240708075632.95857-1-tursulin@igalia.com
      
      
      Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
      Fixes: bda420b9 ("numa balancing: migrate on fault among multiple bound nodes")
      References: 8790c71a ("mm/mempolicy.c: fix mempolicy printing in numa_maps")
      Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: <stable@vger.kernel.org>	[5.12+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      af649773
    • mm: memcg1: convert charge move flags to unsigned long long · 5316b497
      Roman Gushchin authored
      Currently MOVE_ANON and MOVE_FILE flags are defined as integers
      and it leads to the following Smatch static checker warning:
          mm/memcontrol-v1.c:609 mem_cgroup_move_charge_write()
          warn: was expecting a 64 bit value instead of '~(1 | 2)'
      
      Fix this by redefining them as unsigned long long.

      Even though the issue allows the high 32 bits of mc.flags to be set to
      an arbitrary number, those bits are never used, so it doesn't have any
      significant consequences.
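
      A sketch of the change; the values follow the '~(1 | 2)' mask in the
      warning above:

      	#define MOVE_ANON	0x1ULL
      	#define MOVE_FILE	0x2ULL
      	#define MOVE_MASK	(MOVE_ANON | MOVE_FILE)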
      
      Link: https://lkml.kernel.org/r/ZpF8Q9zBsIY7d2P9@google.com
      
      
      Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
      Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      5316b497
    • mm/mglru: fix overshooting shrinker memory · 3f74e6bd
      Yu Zhao authored
      set_initial_priority() tries to jump-start global reclaim by estimating
      the priority based on cold/hot LRU pages.  The estimation does not
      account for shrinker objects, and it cannot do so because their sizes
      can be in units other than pages.

      If shrinker objects are the majority, e.g., on TrueNAS SCALE 24.04.0
      where the ZFS ARC can use almost all system memory,
      set_initial_priority() can vastly underestimate how much memory the ARC
      shrinker can evict and assign extremely low values to
      scan_control->priority, resulting in overshoots of shrinker objects.
      
      To reproduce the problem, use TrueNAS SCALE 24.04.0 with 32GB DRAM, a
      test ZFS pool and the following commands:
      
        fio --name=mglru.file --numjobs=36 --ioengine=io_uring \
            --directory=/root/test-zfs-pool/ --size=1024m --buffered=1 \
            --rw=randread --random_distribution=random \
            --time_based --runtime=1h &
      
        for ((i = 0; i < 20; i++))
        do
          sleep 120
          fio --name=mglru.anon --numjobs=16 --ioengine=mmap \
            --filename=/dev/zero --size=1024m --fadvise_hint=0 \
            --rw=randrw --random_distribution=random \
            --time_based --runtime=1m
        done
      
      To fix the problem:
      1. Cap scan_control->priority at or above DEF_PRIORITY/2, to prevent
         the jump-start from being overly aggressive.
      2. Account for the progress from mm_account_reclaimed_pages(), to
         prevent kswapd_shrink_node() from raising the priority
         unnecessarily.
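
      A hedged sketch of point 1 above, inside set_initial_priority() (not
      necessarily the literal line): never let the jump-start estimate push
      the priority below DEF_PRIORITY/2.

      	sc->priority = clamp(priority, DEF_PRIORITY / 2, DEF_PRIORITY);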
      
      Link: https://lkml.kernel.org/r/20240711191957.939105-2-yuzhao@google.com
      
      
      Fixes: e4dde56c ("mm: multi-gen LRU: per-node lru_gen_folio lists")
      Signed-off-by: Yu Zhao <yuzhao@google.com>
      Reported-by: Alexander Motin <mav@ixsystems.com>
      Cc: Wei Xu <weixugc@google.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      3f74e6bd
    • mm/mglru: fix div-by-zero in vmpressure_calc_level() · 8b671fe1
      Yu Zhao authored
      evict_folios() uses a second pass to reclaim folios that have gone through
      page writeback and become clean before it finishes the first pass, since
      folio_rotate_reclaimable() cannot handle those folios due to the
      isolation.
      
      The second pass tries to avoid potential double counting by deducting
      scan_control->nr_scanned.  However, this can result in underflow of
      nr_scanned, under a condition where shrink_folio_list() does not increment
      nr_scanned, i.e., when folio_trylock() fails.
      
      The underflow can cause the divisor, i.e., scale=scanned+reclaimed in
      vmpressure_calc_level(), to become zero, resulting in the following crash:
      
        [exception RIP: vmpressure_work_fn+101]
        process_one_work at ffffffffa3313f2b
      
      Since scan_control->nr_scanned has no established semantics, the potential
      double counting has minimal risks.  Therefore, fix the problem by not
      deducting scan_control->nr_scanned in evict_folios().
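
      A hedged sketch of the deduction that the fix removes from the second
      pass in evict_folios() (list and variable names are illustrative):

      	list_for_each_entry(folio, &clean, lru)
      		/* underflows when shrink_folio_list() never counted the folio,
      		 * e.g. because folio_trylock() failed; removed by this fix */
      		sc->nr_scanned -= folio_nr_pages(folio);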
      
      Link: https://lkml.kernel.org/r/20240711191957.939105-1-yuzhao@google.com
      
      
      Fixes: 359a5e14 ("mm: multi-gen LRU: retry folios written back while isolated")
      Reported-by: Wei Xu <weixugc@google.com>
      Signed-off-by: Yu Zhao <yuzhao@google.com>
      Cc: Alexander Motin <mav@ixsystems.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      8b671fe1
    • mm/kmemleak: replace strncpy() with strscpy() · 0b847801
      Kees Cook authored
      Replace the deprecated[1] strncpy() calls with strscpy().  Uses of
      object->comm do not depend on the padding side effect.
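
      The replacement pattern, for illustration (object->comm is kmemleak's
      copy of the task name):

      	strscpy(object->comm, current->comm, sizeof(object->comm));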
      
      Link: https://github.com/KSPP/linux/issues/90 [1]
      Link: https://lkml.kernel.org/r/20240710001300.work.004-kees@kernel.org
      
      
      Signed-off-by: Kees Cook <kees@kernel.org>
      Acked-by: Catalin Marinas <catalin.marinas@arm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      0b847801
    • mm, page_alloc: put should_fail_alloc_page() back behing CONFIG_FAIL_PAGE_ALLOC · 53dabce2
      Vlastimil Babka authored
      This mostly reverts commit af3b8544 ("mm/page_alloc.c: allow error
      injection").  The commit made should_fail_alloc_page() a noinline function
      that's always called from the page allocation hotpath, even if it's empty
      because CONFIG_FAIL_PAGE_ALLOC is not enabled, and there is no option to
      disable it and prevent the associated function call overhead.
      
      As with the preceding patch "mm, slab: put should_failslab back behind
      CONFIG_SHOULD_FAILSLAB" and for the same reasons, put the
      should_fail_alloc_page() back behind the config option.  When enabled, the
      ALLOW_ERROR_INJECTION and BTF_ID records are preserved so it's not a
      complete revert.
      
      Link: https://lkml.kernel.org/r/20240711-b4-fault-injection-reverts-v1-2-9e2651945d68@suse.cz
      
      
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Akinobu Mita <akinobu.mita@gmail.com>
      Cc: Alexei Starovoitov <ast@kernel.org>
      Cc: Andrii Nakryiko <andrii@kernel.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Daniel Borkmann <daniel@iogearbox.net>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Eduard Zingerman <eddyz87@gmail.com>
      Cc: Hao Luo <haoluo@google.com>
      Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
      Cc: Jiri Olsa <jolsa@kernel.org>
      Cc: John Fastabend <john.fastabend@gmail.com>
      Cc: KP Singh <kpsingh@kernel.org>
      Cc: Martin KaFai Lau <martin.lau@linux.dev>
      Cc: Mateusz Guzik <mjguzik@gmail.com>
      Cc: Roman Gushchin <roman.gushchin@linux.dev>
      Cc: Song Liu <song@kernel.org>
      Cc: Stanislav Fomichev <sdf@fomichev.me>
      Cc: Yonghong Song <yonghong.song@linux.dev>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      53dabce2
    • mm, slab: put should_failslab() back behind CONFIG_SHOULD_FAILSLAB · a7526fe8
      Vlastimil Babka authored
      Patch series "revert unconditional slab and page allocator fault injection
      calls".
      
      These two patches largely revert commits that added function call overhead
      into slab and page allocation hotpaths and that cannot be currently
      disabled even though related CONFIG_ options do exist.
      
      A much more involved solution that can keep the callsites always existing
      but hidden behind a static key if unused, is possible [1] and can be
      pursued by anyone who believes it's necessary.  Meanwhile the fact the
      should_failslab() error injection is already not functional on kernels
      built with current gcc without anyone noticing [2], and lukewarm response
      to [1] suggests the need is not there.  I believe it will be more fair to
      have the state after this series as a baseline for possible further
      optimisation, instead of the unconditional overhead.
      
      For example a possible compromise for anyone who's fine with an empty
      function call overhead but not the full CONFIG_FAILSLAB /
      CONFIG_FAIL_PAGE_ALLOC overhead is to reuse patch 1 from [1] but insert a
      static key check only inside should_failslab() and
      should_fail_alloc_page() before performing the more expensive checks.
      
      [1] https://lore.kernel.org/all/20240620-fault-injection-statickeys-v2-0-e23947d3d84b@suse.cz/#t
      [2] https://github.com/bpftrace/bpftrace/issues/3258
      
      
      This patch (of 2):
      
      This mostly reverts commit 4f6923fb ("mm: make should_failslab always
      available for fault injection").  The commit made should_failslab() a
      noinline function that's always called from the slab allocation hotpath,
      even if it's empty because CONFIG_SHOULD_FAILSLAB is not enabled, and
      there is no option to disable that call.  This is visible in profiles and
      the function call overhead can be noticeable especially with cpu
      mitigations.
      
      Meanwhile the bpftrace program example in the commit silently does not
      work without CONFIG_SHOULD_FAILSLAB anyway with a recent gcc, because the
      empty function gets a .constprop clone that is actually being called
      (uselessly) from the slab hotpath, while the error injection is hooked to
      the original function that's not being called at all [1].
      
      Thus put the whole should_failslab() function back behind
      CONFIG_SHOULD_FAILSLAB.  It's not a complete revert of 4f6923fb - the
      int return type that returns -ENOMEM on failure is preserved, as well
      as the ALLOW_ERROR_INJECTION annotation.  The BTF_ID() record that was
      meanwhile added is also guarded by CONFIG_SHOULD_FAILSLAB.
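
      A hedged sketch of the guarded declaration pattern this results in (the
      header placement is illustrative); the stub keeps callers compiling
      while removing the call overhead when the option is disabled:

      	#ifdef CONFIG_SHOULD_FAILSLAB
      	int should_failslab(struct kmem_cache *s, gfp_t gfpflags);
      	#else
      	static inline int should_failslab(struct kmem_cache *s, gfp_t gfpflags)
      	{
      		return 0;
      	}
      	#endif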
      
      [1] https://github.com/bpftrace/bpftrace/issues/3258
      
      Link: https://lkml.kernel.org/r/20240711-b4-fault-injection-reverts-v1-0-9e2651945d68@suse.cz
      Link: https://lkml.kernel.org/r/20240711-b4-fault-injection-reverts-v1-1-9e2651945d68@suse.cz
      
      
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Akinobu Mita <akinobu.mita@gmail.com>
      Cc: Alexei Starovoitov <ast@kernel.org>
      Cc: Andrii Nakryiko <andrii@kernel.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Daniel Borkmann <daniel@iogearbox.net>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Eduard Zingerman <eddyz87@gmail.com>
      Cc: Hao Luo <haoluo@google.com>
      Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
      Cc: Jiri Olsa <jolsa@kernel.org>
      Cc: John Fastabend <john.fastabend@gmail.com>
      Cc: KP Singh <kpsingh@kernel.org>
      Cc: Martin KaFai Lau <martin.lau@linux.dev>
      Cc: Mateusz Guzik <mjguzik@gmail.com>
      Cc: Roman Gushchin <roman.gushchin@linux.dev>
      Cc: Song Liu <song@kernel.org>
      Cc: Stanislav Fomichev <sdf@fomichev.me>
      Cc: Yonghong Song <yonghong.song@linux.dev>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      a7526fe8
    • mm: ignore data-race in __swap_writepage · 7b7aca6d
      Pei Li authored
      Syzbot reported a possible data race:
      
      BUG: KCSAN: data-race in __swap_writepage / scan_swap_map_slots
      
      read-write to 0xffff888102fca610 of 8 bytes by task 7106 on cpu 1.
      read to 0xffff888102fca610 of 8 bytes by task 7080 on cpu 0.
      
      While we are in __swap_writepage to read sis->flags, scan_swap_map_slots
      is trying to update it with SWP_SCANNING.
      
      value changed: 0x0000000000008083 -> 0x0000000000004083.
      
      While this can be updated non-atomically, it won't affect
      SWP_SYNCHRONOUS_IO, so we consider this data race safe.
      
      This is possibly introduced by commit 3222d8c2 ("block: remove
      ->rw_page"), where this if branch is introduced.
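
      A hedged sketch of how the benign race can be annotated in
      __swap_writepage() so KCSAN ignores it (surrounding code elided):

      	/* racy read of sis->flags; SWP_SCANNING updates don't matter here */
      	bool sync_io = data_race(sis->flags & SWP_SYNCHRONOUS_IO);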
      
      Link: https://lkml.kernel.org/r/20240711-bug13-v1-1-cea2b8ae8d76@gmail.com
      
      
      Fixes: 3222d8c2 ("block: remove ->rw_page")
      Signed-off-by: Pei Li <peili.dev@gmail.com>
      Reported-by: <syzbot+da25887cc13da6bf3b8c@syzkaller.appspotmail.com>
      Closes: https://syzkaller.appspot.com/bug?extid=da25887cc13da6bf3b8c
      
      
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Shuah Khan <skhan@linuxfoundation.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      7b7aca6d
  8. Jul 15, 2024
    • mm/memcg: alignment memcg_data define condition · a52c6330
      Alex Shi (Tencent) authored
      
      Commit 21c690a3 ("mm: introduce slabobj_ext to support slab object
      extensions") changed the define condition of folio/page->memcg_data
      from MEMCG to SLAB_OBJ_EXT.  This exposes memcg_data even when !MEMCG.

      As Vlastimil Babka suggested, add _unused_slab_obj_exts to the
      SLAB_MATCH check for slab.obj_exts when !MEMCG.  That resolves the
      match issue and cleans up the feature logic.
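
      A hedged sketch of the intent, in the style of the existing SLAB_MATCH
      offset assertions in mm/slab.h (the exact guards follow the description
      above, not a verified copy of the patch):

      	#ifdef CONFIG_MEMCG
      	SLAB_MATCH(memcg_data, obj_exts);
      	#elif defined(CONFIG_SLAB_OBJ_EXT)
      	SLAB_MATCH(_unused_slab_obj_exts, obj_exts);
      	#endif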
      
      Signed-off-by: Alex Shi (Tencent) <alexs@kernel.org>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Yoann Congal <yoann.congal@smile.fr>
      Cc: Masahiro Yamada <masahiroy@kernel.org>
      Cc: Petr Mladek <pmladek@suse.com>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      a52c6330
  9. Jul 12, 2024