  1. Mar 03, 2025
    • mm/migrate_device: don't add folio to be freed to LRU in migrate_device_finalize() · ddfbbebc
      David Hildenbrand authored and Frieder Schrempf committed
      commit 41cddf83d8b00f29fd105e7a0777366edc69a5cf upstream.
      
      If migration succeeded, we called
      folio_migrate_flags()->mem_cgroup_migrate() to migrate the memcg from the
      old to the new folio.  This will set memcg_data of the old folio to 0.
      
      Similarly, if migration failed, memcg_data of the dst folio is left unset.
      
      If we call folio_putback_lru() on such folios (memcg_data == 0), we will
      add the folio to be freed to the LRU, making memcg code unhappy.  Running
      the hmm selftests:
      
        # ./hmm-tests
        ...
        #  RUN           hmm.hmm_device_private.migrate ...
        [  102.078007][T14893] page: refcount:1 mapcount:0 mapping:0000000000000000 index:0x7ff27d200 pfn:0x13cc00
        [  102.079974][T14893] anon flags: 0x17ff00000020018(uptodate|dirty|swapbacked|node=0|zone=2|lastcpupid=0x7ff)
        [  102.082037][T14893] raw: 017ff00000020018 dead000000000100 dead000000000122 ffff8881353896c9
        [  102.083687][T14893] raw: 00000007ff27d200 0000000000000000 00000001ffffffff 0000000000000000
        [  102.085331][T14893] page dumped because: VM_WARN_ON_ONCE_FOLIO(!memcg && !mem_cgroup_disabled())
        [  102.087230][T14893] ------------[ cut here ]------------
        [  102.088279][T14893] WARNING: CPU: 0 PID: 14893 at ./include/linux/memcontrol.h:726 folio_lruvec_lock_irqsave+0x10e/0x170
        [  102.090478][T14893] Modules linked in:
        [  102.091244][T14893] CPU: 0 UID: 0 PID: 14893 Comm: hmm-tests Not tainted 6.13.0-09623-g6c216bc522fd #151
        [  102.093089][T14893] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-2.fc40 04/01/2014
        [  102.094848][T14893] RIP: 0010:folio_lruvec_lock_irqsave+0x10e/0x170
        [  102.096104][T14893] Code: ...
        [  102.099908][T14893] RSP: 0018:ffffc900236c37b0 EFLAGS: 00010293
        [  102.101152][T14893] RAX: 0000000000000000 RBX: ffffea0004f30000 RCX: ffffffff8183f426
        [  102.102684][T14893] RDX: ffff8881063cb880 RSI: ffffffff81b8117f RDI: ffff8881063cb880
        [  102.104227][T14893] RBP: 0000000000000000 R08: 0000000000000005 R09: 0000000000000000
        [  102.105757][T14893] R10: 0000000000000001 R11: 0000000000000002 R12: ffffc900236c37d8
        [  102.107296][T14893] R13: ffff888277a2bcb0 R14: 000000000000001f R15: 0000000000000000
        [  102.108830][T14893] FS:  00007ff27dbdd740(0000) GS:ffff888277a00000(0000) knlGS:0000000000000000
        [  102.110643][T14893] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        [  102.111924][T14893] CR2: 00007ff27d400000 CR3: 000000010866e000 CR4: 0000000000750ef0
        [  102.113478][T14893] PKRU: 55555554
        [  102.114172][T14893] Call Trace:
        [  102.114805][T14893]  <TASK>
        [  102.115397][T14893]  ? folio_lruvec_lock_irqsave+0x10e/0x170
        [  102.116547][T14893]  ? __warn.cold+0x110/0x210
        [  102.117461][T14893]  ? folio_lruvec_lock_irqsave+0x10e/0x170
        [  102.118667][T14893]  ? report_bug+0x1b9/0x320
        [  102.119571][T14893]  ? handle_bug+0x54/0x90
        [  102.120494][T14893]  ? exc_invalid_op+0x17/0x50
        [  102.121433][T14893]  ? asm_exc_invalid_op+0x1a/0x20
        [  102.122435][T14893]  ? __wake_up_klogd.part.0+0x76/0xd0
        [  102.123506][T14893]  ? dump_page+0x4f/0x60
        [  102.124352][T14893]  ? folio_lruvec_lock_irqsave+0x10e/0x170
        [  102.125500][T14893]  folio_batch_move_lru+0xd4/0x200
        [  102.126577][T14893]  ? __pfx_lru_add+0x10/0x10
        [  102.127505][T14893]  __folio_batch_add_and_move+0x391/0x720
        [  102.128633][T14893]  ? __pfx_lru_add+0x10/0x10
        [  102.129550][T14893]  folio_putback_lru+0x16/0x80
        [  102.130564][T14893]  migrate_device_finalize+0x9b/0x530
        [  102.131640][T14893]  dmirror_migrate_to_device.constprop.0+0x7c5/0xad0
        [  102.133047][T14893]  dmirror_fops_unlocked_ioctl+0x89b/0xc80
      
      Likely, nothing else goes wrong: putting the last folio reference will
      remove the folio from the LRU again.  So besides memcg complaining, adding
      the folio to be freed to the LRU is just an unnecessary step.
      
      The new flow resembles what we have in migrate_folio_move(): add the dst
      to the lru, remove migration ptes, unlock and unref dst.
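
      A hedged sketch of that ordering (helper names are the usual folio and
      migration helpers; the last argument of remove_migration_ptes() and the
      error handling are simplified, so this is not the verbatim hunk):

        if (!folio_is_zone_device(dst))
                folio_add_lru(dst);             /* 1. dst hits the LRU while its memcg is still set */
        remove_migration_ptes(src, dst, 0);     /* 2. re-point the PTEs at dst */
        folio_unlock(dst);
        folio_put(dst);                         /* 3. just drop the ref; no folio_putback_lru() */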
      
      Link: https://lkml.kernel.org/r/20250210161317.717936-1-david@redhat.com
      
      
      Fixes: 8763cb45 ("mm/migrate: new memory migration helper for use with device memory")
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Cc: Jérôme Glisse <jglisse@redhat.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • mm/compaction: fix UBSAN shift-out-of-bounds warning · 94533e0e
      Liu Shixin authored and Frieder Schrempf committed
      commit d1366e74342e75555af2648a2964deb2d5c92200 upstream.
      
      syzkaller reported a UBSAN shift-out-of-bounds warning of (1UL << order)
      in isolate_freepages_block().  The bogus compound_order can be any value
      because it shares a union with the page flags.  Add back the
      MAX_PAGE_ORDER check to fix the warning.
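
      A hedged sketch of the restored guard in isolate_freepages_block()
      (illustrative, not the exact hunk): compound_order() read from an
      unlocked page can be garbage, so it must be bounded before it is used
      in a shift.

        if (PageCompound(page)) {
                const unsigned int order = compound_order(page);

                /* Only trust the order if it is a possible page order. */
                if (order <= MAX_PAGE_ORDER) {
                        blockpfn += (1UL << order) - 1;
                        page += (1UL << order) - 1;
                }
                goto isolate_fail;
        }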
      
      Link: https://lkml.kernel.org/r/20250123021029.2826736-1-liushixin2@huawei.com
      
      
      Fixes: 3da0272a ("mm/compaction: correctly return failure with bogus compound_order in strict mode")
      Signed-off-by: Liu Shixin <liushixin2@huawei.com>
      Reviewed-by: Kemeng Shi <shikemeng@huaweicloud.com>
      Acked-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Kemeng Shi <shikemeng@huaweicloud.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Nanyong Sun <sunnanyong@huawei.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • mm/hugetlb: fix hugepage allocation for interleaved memory nodes · ba50d9e1
      Ritesh Harjani (IBM) authored and Frieder Schrempf committed
      commit 76e961157e078bc5d3cd2df08317e00b00a829eb upstream.
      
      gather_bootmem_prealloc() assumes the start nid is 0 and the size is
      num_node_state(N_MEMORY).  That means that if the memory-attached NUMA
      nodes are interleaved, gather_bootmem_prealloc_parallel() will fail to
      scan some of these nodes.

      Since memory-attached NUMA nodes can be interleaved in any fashion,
      ensure that the code checks all NUMA node ids (.size = nr_node_ids).
      Still keep max_threads as num_node_state(N_MEMORY), so that the work for
      all nr_node_ids node ids is distributed among that many threads.
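
      A hedged sketch of the intent in gather_bootmem_prealloc() (struct
      padata_mt_job and its fields are real; the values shown follow the
      description above rather than the verbatim hunk):

        struct padata_mt_job job = {
                .thread_fn      = gather_bootmem_prealloc_parallel,
                .fn_arg         = NULL,
                .start          = 0,
                .size           = nr_node_ids,              /* scan every possible node id */
                .align          = 1,
                .min_chunk      = 1,
                .max_threads    = num_node_state(N_MEMORY), /* one thread per memory node */
                .numa_aware     = true,
        };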
      
      e.g. qemu cmdline
      ========================
      numa_cmd="-numa node,nodeid=1,memdev=mem1,cpus=2-3 -numa node,nodeid=0,cpus=0-1 -numa dist,src=0,dst=1,val=20"
      mem_cmd="-object memory-backend-ram,id=mem1,size=16G"
      
      w/o this patch for cmdline (default_hugepagesz=1GB hugepagesz=1GB hugepages=2):
      ==========================
      ~ # cat /proc/meminfo  |grep -i huge
      AnonHugePages:         0 kB
      ShmemHugePages:        0 kB
      FileHugePages:         0 kB
      HugePages_Total:       0
      HugePages_Free:        0
      HugePages_Rsvd:        0
      HugePages_Surp:        0
      Hugepagesize:    1048576 kB
      Hugetlb:               0 kB
      
      with this patch for cmdline (default_hugepagesz=1GB hugepagesz=1GB hugepages=2):
      ===========================
      ~ # cat /proc/meminfo |grep -i huge
      AnonHugePages:         0 kB
      ShmemHugePages:        0 kB
      FileHugePages:         0 kB
      HugePages_Total:       2
      HugePages_Free:        2
      HugePages_Rsvd:        0
      HugePages_Surp:        0
      Hugepagesize:    1048576 kB
      Hugetlb:         2097152 kB
      
      Link: https://lkml.kernel.org/r/f8d8dad3a5471d284f54185f65d575a6aaab692b.1736592534.git.ritesh.list@gmail.com
      
      
      Fixes: b78b27d0 ("hugetlb: parallelize 1G hugetlb initialization")
      Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
      Reported-by: Pavithra Prakash <pavrampu@linux.ibm.com>
      Suggested-by: Muchun Song <muchun.song@linux.dev>
      Tested-by: Sourabh Jain <sourabhjain@linux.ibm.com>
      Reviewed-by: Luiz Capitulino <luizcap@redhat.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Cc: Donet Tom <donettom@linux.ibm.com>
      Cc: Gang Li <gang.li@linux.dev>
      Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • mm/vmscan: accumulate nr_demoted for accurate demotion statistics · 58cba289
      Li Zhijian authored and Frieder Schrempf committed
      commit a479b078fddb0ad7f9e3c6da22d9cf8f2b5c7799 upstream.
      
      In shrink_folio_list(), demote_folio_list() can be called 2 times.
      Currently, stat->nr_demoted only stores the last nr_demoted (the later
      nr_demoted is always zero, so the former nr_demoted gets lost); as a
      result, the number of demoted pages is not accurate.
      
      Accumulate the nr_demoted count across multiple calls to
      demote_folio_list(), ensuring accurate reporting of demotion statistics.
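
      A hedged sketch of the idea (illustrative, not the verbatim hunk): use a
      local counter and add it into the stats instead of overwriting them, so
      nr_reclaimed is not double counted either.

        unsigned int nr_demoted;

        nr_demoted = demote_folio_list(&demote_folios, pgdat);
        stat->nr_demoted += nr_demoted;         /* was: stat->nr_demoted = ... */
        nr_reclaimed += nr_demoted;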
      
      [lizhijian@fujitsu.com: introduce local nr_demoted to fix nr_reclaimed double counting]
        Link: https://lkml.kernel.org/r/20250111015253.425693-1-lizhijian@fujitsu.com
      Link: https://lkml.kernel.org/r/20250110122133.423481-1-lizhijian@fujitsu.com
      
      
      Fixes: f77f0c75 ("mm,memcg: provide per-cgroup counters for NUMA balancing operations")
      Signed-off-by: Li Zhijian <lizhijian@fujitsu.com>
      Acked-by: Kaiyang Zhao <kaiyang2@cs.cmu.edu>
      Tested-by: Donet Tom <donettom@linux.ibm.com>
      Reviewed-by: Donet Tom <donettom@linux.ibm.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • mm: gup: fix infinite loop within __get_longterm_locked · 300d5664
      Zhaoyang Huang authored and Frieder Schrempf committed
      commit 1aaf8c122918aa8897605a9aa1e8ed6600d6f930 upstream.
      
      We can run into an infinite loop in __get_longterm_locked() when
      collect_longterm_unpinnable_folios() finds only folios that are isolated
      from the LRU or were never added to the LRU.  This can happen when all
      folios to be pinned are never added to the LRU, for example when
      vm_ops->fault allocated pages using cma_alloc() and never added them to
      the LRU.
      
      Fix it by simply taking a look at the list in the single caller, to see if
      anything was added.
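
      A hedged sketch of the caller-side check (collect_longterm_unpinnable_folios()
      is the helper named above; the 'pofs' argument and the surrounding code
      are assumptions, not the verbatim hunk):

        LIST_HEAD(movable_folio_list);

        collect_longterm_unpinnable_folios(&movable_folio_list, pofs);
        if (list_empty(&movable_folio_list))
                return 0;       /* nothing collected: succeed instead of looping */

        /* otherwise unpin and migrate what was collected, as before */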
      
      [zhaoyang.huang@unisoc.com: move definition of local]
        Link: https://lkml.kernel.org/r/20250122012604.3654667-1-zhaoyang.huang@unisoc.com
      Link: https://lkml.kernel.org/r/20250121020159.3636477-1-zhaoyang.huang@unisoc.com
      
      
      Fixes: 67e139b0 ("mm/gup.c: refactor check_and_migrate_movable_pages()")
      Signed-off-by: Zhaoyang Huang <zhaoyang.huang@unisoc.com>
      Reviewed-by: John Hubbard <jhubbard@nvidia.com>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Suggested-by: David Hildenbrand <david@redhat.com>
      Acked-by: David Hildenbrand <david@redhat.com>
      Cc: Aijun Sun <aijun.sun@unisoc.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • mm: kmemleak: fix upper boundary check for physical address objects · fb20631b
      Catalin Marinas authored and Frieder Schrempf committed
      commit 488b5b9eca68497b533ced059be5eff19578bbca upstream.
      
      Memblock allocations are registered by kmemleak separately, based on their
      physical address.  During the scanning stage, it checks whether an object
      is within the min_low_pfn and max_low_pfn boundaries and ignores it
      otherwise.
      
      With the recent addition of __percpu pointer leak detection (commit
      6c99d4eb ("kmemleak: enable tracking for percpu pointers")), kmemleak
      started reporting leaks in setup_zone_pageset() and
      setup_per_cpu_pageset().  These were caused by the node_data[0] object
      (initialised in alloc_node_data()) ending on the PFN_PHYS(max_low_pfn)
      boundary.  The non-strict upper boundary check introduced by commit
      84c32629 ("mm: kmemleak: check physical address when scan") causes the
      pg_data_t object to be ignored (not scanned) and the __percpu pointers it
      contains to be reported as leaks.
      
      Make the max_low_pfn upper boundary check strict when deciding whether to
      ignore a physical address object and not scan it.
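
      A hedged sketch of the scan-stage filter (field and macro names are
      approximations based on the description above, not the verbatim hunk):
      only skip a physical-address object when it ends strictly above
      max_low_pfn, so an object ending exactly on the boundary is still
      scanned.

        if (object->flags & OBJECT_PHYS) {
                if (PHYS_PFN(object->pointer) < min_low_pfn ||
                    PHYS_PFN(object->pointer + object->size) > max_low_pfn)    /* was: >= */
                        continue;       /* outside lowmem: do not scan */
        }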
      
      Link: https://lkml.kernel.org/r/20250127184233.2974311-1-catalin.marinas@arm.com
      
      
      Fixes: 84c32629 ("mm: kmemleak: check physical address when scan")
      Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
      Reported-by: Jakub Kicinski <kuba@kernel.org>
      Tested-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
      Cc: Patrick Wang <patrick.wang.shcn@gmail.com>
      Cc: <stable@vger.kernel.org>	[6.0.x]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • mm/hugetlb: fix avoid_reserve to allow taking folio from subpool · cba520ba
      Peter Xu authored and Frieder Schrempf committed
      commit 58db7c5fbe7daa42098d4965133a864f98ba90ba upstream.
      
      Patch series "mm/hugetlb: Refactor hugetlb allocation resv accounting",
      v2.
      
      This is a follow up on Ackerley's series here as replacement:
      
      https://lore.kernel.org/r/cover.1728684491.git.ackerleytng@google.com
      
      The goal of this series is to cleanup hugetlb resv accounting, especially
      during folio allocation, to decouple a few things:
      
        - Hugetlb folios v.s. Hugetlbfs: IOW, the hope is in the future hugetlb
          folios can be allocated completely without hugetlbfs.
      
        - Decouple VMA v.s. hugetlb folio allocations: allocating a hugetlb folio
          should not always require a hugetlbfs VMA.  For example, either it got
          allocated from the inode level (see hugetlbfs_fallocate() where it used
          a pseudo VMA for allocation), or it can be allocated by other kernel
          subsystems.
      
      It paves the way for other users to allocate hugetlb folios out of
      either system reservations or subpools (instead of hugetlbfs, as a file
      system).  For the longer term, this prepares hugetlb as a concept
      separate from hugetlbfs, so that hugetlb folios can be allocated not
      only by hugetlbfs but also by other subsystems.
      
      Tests I've done:
      
      - I had a reproducer in patch 1 for the bug I found, this will start to
        work after patch 1 or the whole set applied.
      
      - Hugetlb regression tests (on x86_64 2MBs), includes:
      
        - All vmtests on hugetlbfs
      
        - libhugetlbfs test suite (which may fail some tests, but no new failures
          are introduced by this series; all such failures also happen without
          this series, so they are not relevant).
      
      
      This patch (of 7):
      
      Since commit 04f2cbe3 ("hugetlb: guarantee that COW faults for a
      process that called mmap(MAP_PRIVATE) on hugetlbfs will succeed"),
      avoid_reserve was introduced for a special case of CoW on hugetlb private
      mappings, and only if the owner VMA is trying to allocate yet another
      hugetlb folio that is not reserved within the private vma reserved map.
      
      Later on, in commit d85f69b0 ("mm/hugetlb: alloc_huge_page handle
      areas hole punched by fallocate"), alloc_huge_page() was changed to not
      consume any global reservation as long as avoid_reserve=true.  This
      doesn't look correct, because even though it keeps the allocation from
      using the global reservation directly, it will still try to take one
      reservation from the spool (if the subpool exists).  Then, since the
      spool's reserved pages are taken from the global reservation, it'll also
      take one reservation globally.
      
      Logically it can cause global reservation to go wrong.
      
      I wrote the reproducer below, which triggers this special path; every
      run of the program causes the global reservation count to increment by
      one, until it hits the number of free pages:
      
        #define _GNU_SOURCE             /* See feature_test_macros(7) */
        #include <stdio.h>
        #include <fcntl.h>
        #include <errno.h>
        #include <unistd.h>
        #include <stdlib.h>
        #include <sys/mman.h>
      
        #define  MSIZE  (2UL << 20)
      
        int main(int argc, char *argv[])
        {
            const char *path;
            int *buf;
            int fd, ret;
            pid_t child;
      
            if (argc < 2) {
                printf("usage: %s <hugetlb_file>\n", argv[0]);
                return -1;
            }
      
            path = argv[1];
      
            fd = open(path, O_RDWR | O_CREAT, 0666);
            if (fd < 0) {
                perror("open failed");
                return -1;
            }
      
            ret = fallocate(fd, 0, 0, MSIZE);
            if (ret != 0) {
                perror("fallocate");
                return -1;
            }
      
            buf = mmap(NULL, MSIZE, PROT_READ|PROT_WRITE,
                       MAP_PRIVATE, fd, 0);
      
            if (buf == MAP_FAILED) {
                perror("mmap() failed");
                return -1;
            }
      
            /* Allocate a page */
            *buf = 1;
      
            child = fork();
            if (child == 0) {
                /* child doesn't need to do anything */
                exit(0);
            }
      
            /* Trigger CoW from owner */
            *buf = 2;
      
            munmap(buf, MSIZE);
            close(fd);
            unlink(path);
      
            return 0;
        }
      
      It can only be reproduced with a sub-mount when there are reserved pages
      on the spool, like:
      
        # sysctl vm.nr_hugepages=128
        # mkdir ./hugetlb-pool
        # mount -t hugetlbfs -o min_size=8M,pagesize=2M none ./hugetlb-pool
      
      Then run the reproducer on the mountpoint:
      
        # ./reproducer ./hugetlb-pool/test
      
      Fix it by taking the reservation from the spool if available.  In
      general, avoid_reserve is IMHO more about "avoid the vma resv map", not
      the spool's reservation.

      I copied stable; however, I have no intention of backporting this if it
      is not a clean cherry-pick, because a private hugetlb mapping with a
      fork() on top is too rare to hit.
      
      Link: https://lkml.kernel.org/r/20250107204002.2683356-1-peterx@redhat.com
      Link: https://lkml.kernel.org/r/20250107204002.2683356-2-peterx@redhat.com
      
      
      Fixes: d85f69b0 ("mm/hugetlb: alloc_huge_page handle areas hole punched by fallocate")
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Reviewed-by: Ackerley Tng <ackerleytng@google.com>
      Tested-by: Ackerley Tng <ackerleytng@google.com>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Cc: Breno Leitao <leitao@debian.org>
      Cc: Muchun Song <muchun.song@linux.dev>
      Cc: Naoya Horiguchi <nao.horiguchi@gmail.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Roman Gushchin <roman.gushchin@linux.dev>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • kfence: skip __GFP_THISNODE allocations on NUMA systems · 32257e73
      Marco Elver authored and Frieder Schrempf committed
      commit e64f81946adf68cd75e2207dd9a51668348a4af8 upstream.
      
      On NUMA systems, __GFP_THISNODE indicates that an allocation _must_ be on
      a particular node, and failure to allocate on the desired node will result
      in a failed allocation.
      
      Skip __GFP_THISNODE allocations if we are running on a NUMA system, since
      KFENCE can't guarantee which node its pool pages are allocated on.
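
      A hedged sketch of the gate (illustrative; in the real code it sits next
      to the existing GFP_ZONEMASK skip in the KFENCE allocation path and is
      not reproduced verbatim here):

        /* KFENCE pool pages may live on any node, so honour __GFP_THISNODE by
         * skipping KFENCE whenever more than one node is online. */
        if ((flags & __GFP_THISNODE) && num_online_nodes() > 1)
                return NULL;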
      
      Link: https://lkml.kernel.org/r/20250124120145.410066-1-elver@google.com
      
      
      Fixes: 236e9f15 ("kfence: skip all GFP_ZONEMASK allocations")
      Signed-off-by: Marco Elver <elver@google.com>
      Reported-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Chistoph Lameter <cl@linux.com>
      Cc: Dmitriy Vyukov <dvyukov@google.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • memcg: fix soft lockup in the OOM process · 0f8f6c7a
      Chen Ridong authored and Frieder Schrempf committed
      commit ade81479c7dda1ce3eedb215c78bc615bbd04f06 upstream.
      
      A soft lockup issue was found in a production system with about 56,000
      tasks in the OOM cgroup; the soft lockup was triggered while traversing
      them.
      
      watchdog: BUG: soft lockup - CPU#2 stuck for 23s! [VM Thread:1503066]
      CPU: 2 PID: 1503066 Comm: VM Thread Kdump: loaded Tainted: G
      Hardware name: Huawei Cloud OpenStack Nova, BIOS
      RIP: 0010:console_unlock+0x343/0x540
      RSP: 0000:ffffb751447db9a0 EFLAGS: 00000247 ORIG_RAX: ffffffffffffff13
      RAX: 0000000000000001 RBX: 0000000000000000 RCX: 00000000ffffffff
      RDX: 0000000000000000 RSI: 0000000000000004 RDI: 0000000000000247
      RBP: ffffffffafc71f90 R08: 0000000000000000 R09: 0000000000000040
      R10: 0000000000000080 R11: 0000000000000000 R12: ffffffffafc74bd0
      R13: ffffffffaf60a220 R14: 0000000000000247 R15: 0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 00007f2fe6ad91f0 CR3: 00000004b2076003 CR4: 0000000000360ee0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      Call Trace:
       vprintk_emit+0x193/0x280
       printk+0x52/0x6e
       dump_task+0x114/0x130
       mem_cgroup_scan_tasks+0x76/0x100
       dump_header+0x1fe/0x210
       oom_kill_process+0xd1/0x100
       out_of_memory+0x125/0x570
       mem_cgroup_out_of_memory+0xb5/0xd0
       try_charge+0x720/0x770
       mem_cgroup_try_charge+0x86/0x180
       mem_cgroup_try_charge_delay+0x1c/0x40
       do_anonymous_page+0xb5/0x390
       handle_mm_fault+0xc4/0x1f0
      
      This is because thousands of processes are in the OOM cgroup and it
      takes a long time to traverse them all.  As a result, this leads to a
      soft lockup in the OOM process.

      To fix this issue, call cond_resched() in the mem_cgroup_scan_tasks()
      function every 1000 iterations.  For global OOM, call
      touch_softlockup_watchdog() every 1000 iterations to avoid this issue.
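
      A hedged sketch of both call sites (illustrative; the iteration details
      are simplified and this is not the verbatim hunk):

        /* memcg OOM: mem_cgroup_scan_tasks() may sleep between tasks */
        while (!ret && (task = css_task_iter_next(&it))) {
                if ((++i % 1000) == 0)
                        cond_resched();
                ret = fn(task, arg);
        }

        /* global OOM: the task list is walked under RCU and cannot sleep,
         * so pet the watchdog instead */
        for_each_process(p) {
                if ((++i % 1000) == 0)
                        touch_softlockup_watchdog();
                dump_task(p, oc);
        }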
      
      Link: https://lkml.kernel.org/r/20241224025238.3768787-1-chenridong@huaweicloud.com
      
      
      Fixes: 9cbb78bb ("mm, memcg: introduce own oom handler to iterate only over its own threads")
      Signed-off-by: Chen Ridong <chenridong@huawei.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Roman Gushchin <roman.gushchin@linux.dev>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Cc: Michal Koutný <mkoutny@suse.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
  2. Feb 01, 2025
  3. Jan 23, 2025
  4. Jan 09, 2025
    • mm: hugetlb: independent PMD page table shared count · 2e31443a
      Liu Shixin authored
      commit 59d9094df3d79443937add8700b2ef1a866b1081 upstream.
      
      The folio refcount may be increased unexpectedly through try_get_folio()
      by callers such as split_huge_pages.  In huge_pmd_unshare(), we use the
      refcount to check whether a pmd page table is shared.  The check is
      incorrect if the refcount is increased by the above callers, and this
      can cause the page table to be leaked:
      
       BUG: Bad page state in process sh  pfn:109324
       page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x66 pfn:0x109324
       flags: 0x17ffff800000000(node=0|zone=2|lastcpupid=0xfffff)
       page_type: f2(table)
       raw: 017ffff800000000 0000000000000000 0000000000000000 0000000000000000
       raw: 0000000000000066 0000000000000000 00000000f2000000 0000000000000000
       page dumped because: nonzero mapcount
       ...
       CPU: 31 UID: 0 PID: 7515 Comm: sh Kdump: loaded Tainted: G    B              6.13.0-rc2master+ #7
       Tainted: [B]=BAD_PAGE
       Hardware name: QEMU KVM Virtual Machine, BIOS 0.0.0 02/06/2015
       Call trace:
        show_stack+0x20/0x38 (C)
        dump_stack_lvl+0x80/0xf8
        dump_stack+0x18/0x28
        bad_page+0x8c/0x130
        free_page_is_bad_report+0xa4/0xb0
        free_unref_page+0x3cc/0x620
        __folio_put+0xf4/0x158
        split_huge_pages_all+0x1e0/0x3e8
        split_huge_pages_write+0x25c/0x2d8
        full_proxy_write+0x64/0xd8
        vfs_write+0xcc/0x280
        ksys_write+0x70/0x110
        __arm64_sys_write+0x24/0x38
        invoke_syscall+0x50/0x120
        el0_svc_common.constprop.0+0xc8/0xf0
        do_el0_svc+0x24/0x38
        el0_svc+0x34/0x128
        el0t_64_sync_handler+0xc8/0xd0
        el0t_64_sync+0x190/0x198
      
      The issue may be triggered by damon, offline_page, page_idle, etc.,
      which will increase the refcount of the page table.

      1. The page table itself will be discarded after reporting the
         "nonzero mapcount".

      2. The HugeTLB page mapped by the page table misses being freed, since
         we treat the page table as shared and a shared page table will not
         be unmapped.
      
      Fix it by introducing an independent PMD page table shared count.  As
      described by the comment, pt_index/pt_mm/pt_frag_refcount are used for
      s390 gmap, x86 pgds and powerpc; for x86/arm64/riscv pmds the field is
      otherwise unused, so we can reuse it as pt_share_count.
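
      A hedged sketch of the new helpers (the pt_share_count field name comes
      from the description above; the struct ptdesc layout and config guards
      are simplified):

        static inline void ptdesc_pmd_pts_init(struct ptdesc *ptdesc)
        {
                atomic_set(&ptdesc->pt_share_count, 0);
        }

        static inline void ptdesc_pmd_pts_inc(struct ptdesc *ptdesc)
        {
                atomic_inc(&ptdesc->pt_share_count);
        }

        static inline void ptdesc_pmd_pts_dec(struct ptdesc *ptdesc)
        {
                atomic_dec(&ptdesc->pt_share_count);
        }

        static inline int ptdesc_pmd_pts_count(struct ptdesc *ptdesc)
        {
                return atomic_read(&ptdesc->pt_share_count);
        }

      huge_pmd_unshare() can then test the share count instead of the folio
      refcount, which callers like split_huge_pages may bump at any time.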
      
      Link: https://lkml.kernel.org/r/20241216071147.3984217-1-liushixin2@huawei.com
      
      
      Fixes: 39dde65c ("[PATCH] shared page table for hugetlb page")
      Signed-off-by: Liu Shixin <liushixin2@huawei.com>
      Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
      Cc: Ken Chen <kenneth.w.chen@intel.com>
      Cc: Muchun Song <muchun.song@linux.dev>
      Cc: Nanyong Sun <sunnanyong@huawei.com>
      Cc: Jane Chu <jane.chu@oracle.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • mm: reinstate ability to map write-sealed memfd mappings read-only · 464770df
      Lorenzo Stoakes authored
      commit 8ec396d05d1b737c87311fb7311f753b02c2a6b1 upstream.
      
      Patch series "mm: reinstate ability to map write-sealed memfd mappings
      read-only".
      
      In commit 15897894 ("mm: perform the mapping_map_writable() check
      after call_mmap()") (and preceding changes in the same series) it became
      possible to mmap() F_SEAL_WRITE sealed memfd mappings read-only.
      
      Commit 5de19506 ("mm: resolve faulty mmap_region() error path
      behaviour") unintentionally undid this logic by moving the
      mapping_map_writable() check before the shmem_mmap() hook is invoked,
      thereby regressing this change.
      
      This series reworks how we both permit write-sealed mappings being mapped
      read-only and disallow mprotect() from undoing the write-seal, fixing this
      regression.
      
      We also add a regression test to ensure that we do not accidentally
      regress this in future.
      
      Thanks to Julian Orth for reporting this regression.
      
      
      This patch (of 2):
      
      In commit 15897894 ("mm: perform the mapping_map_writable() check
      after call_mmap()") (and preceding changes in the same series) it became
      possible to mmap() F_SEAL_WRITE sealed memfd mappings read-only.
      
      This was previously unnecessarily disallowed, despite the man page
      documentation indicating that it would be permitted, thereby limiting
      the usefulness of the F_SEAL_WRITE logic.
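
      A hedged userspace illustration of the reinstated behavior (error
      handling trimmed; needs glibc >= 2.27 for memfd_create()):

        #define _GNU_SOURCE
        #include <assert.h>
        #include <fcntl.h>
        #include <sys/mman.h>
        #include <unistd.h>

        int main(void)
        {
            int fd = memfd_create("sealed", MFD_ALLOW_SEALING);

            ftruncate(fd, 4096);
            fcntl(fd, F_ADD_SEALS, F_SEAL_WRITE);

            /* Read-only shared mapping of a write-sealed memfd: should succeed. */
            void *p = mmap(NULL, 4096, PROT_READ, MAP_SHARED, fd, 0);
            assert(p != MAP_FAILED);

            /* Upgrading it to writable must keep failing (VM_MAYWRITE cleared). */
            assert(mprotect(p, 4096, PROT_READ | PROT_WRITE) != 0);
            return 0;
        }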
      
      We fixed this by adapting logic that existed for the F_SEAL_FUTURE_WRITE
      seal (one which disallows future writes to the memfd) to also be used for
      F_SEAL_WRITE.
      
      For background - the F_SEAL_FUTURE_WRITE seal clears VM_MAYWRITE for a
      read-only mapping to disallow mprotect() from overriding the seal - an
      operation performed by seal_check_write(), invoked from shmem_mmap(), the
      f_op->mmap() hook used by shmem mappings.
      
      By extending this to F_SEAL_WRITE and critically - checking
      mapping_map_writable() to determine if we may map the memfd AFTER we
      invoke shmem_mmap() - the desired logic becomes possible.  This is because
      mapping_map_writable() explicitly checks for VM_MAYWRITE, which we will
      have cleared.
      
      Commit 5de19506 ("mm: resolve faulty mmap_region() error path
      behaviour") unintentionally undid this logic by moving the
      mapping_map_writable() check before the shmem_mmap() hook is invoked,
      thereby regressing this change.
      
      We reinstate this functionality by moving the check out of shmem_mmap()
      and instead performing it in do_mmap() at the point at which VMA flags are
      being determined, which seems in any case to be a more appropriate place
      in which to make this determination.
      
      In order to achieve this we rework memfd seal logic to allow us access to
      this information using existing logic and eliminate the clearing of
      VM_MAYWRITE from seal_check_write() which we are performing in do_mmap()
      instead.
      
      Link: https://lkml.kernel.org/r/99fc35d2c62bd2e05571cf60d9f8b843c56069e0.1732804776.git.lorenzo.stoakes@oracle.com
      
      
      Fixes: 5de19506 ("mm: resolve faulty mmap_region() error path behaviour")
      Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
      Reported-by: Julian Orth <ju.orth@gmail.com>
      Closes: https://lore.kernel.org/all/CAHijbEUMhvJTN9Xw1GmbM266FXXv=U7s4L_Jem5x3AaPZxrYpQ@mail.gmail.com/
      
      
      Cc: Jann Horn <jannh@google.com>
      Cc: Liam R. Howlett <Liam.Howlett@Oracle.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • mm: vmscan: account for free pages to prevent infinite Loop in throttle_direct_reclaim() · 58d0d02d
      Seiji Nishikawa authored
      commit 6aaced5abd32e2a57cd94fd64f824514d0361da8 upstream.
      
      The task sometimes continues looping in throttle_direct_reclaim() because
      allow_direct_reclaim(pgdat) keeps returning false.
      
       #0 [ffff80002cb6f8d0] __switch_to at ffff8000080095ac
       #1 [ffff80002cb6f900] __schedule at ffff800008abbd1c
       #2 [ffff80002cb6f990] schedule at ffff800008abc50c
       #3 [ffff80002cb6f9b0] throttle_direct_reclaim at ffff800008273550
       #4 [ffff80002cb6fa20] try_to_free_pages at ffff800008277b68
       #5 [ffff80002cb6fae0] __alloc_pages_nodemask at ffff8000082c4660
       #6 [ffff80002cb6fc50] alloc_pages_vma at ffff8000082e4a98
       #7 [ffff80002cb6fca0] do_anonymous_page at ffff80000829f5a8
       #8 [ffff80002cb6fce0] __handle_mm_fault at ffff8000082a5974
       #9 [ffff80002cb6fd90] handle_mm_fault at ffff8000082a5bd4
      
      At this point, the pgdat contains the following two zones:
      
              NODE: 4  ZONE: 0  ADDR: ffff00817fffe540  NAME: "DMA32"
                SIZE: 20480  MIN/LOW/HIGH: 11/28/45
                VM_STAT:
                      NR_FREE_PAGES: 359
              NR_ZONE_INACTIVE_ANON: 18813
                NR_ZONE_ACTIVE_ANON: 0
              NR_ZONE_INACTIVE_FILE: 50
                NR_ZONE_ACTIVE_FILE: 0
                NR_ZONE_UNEVICTABLE: 0
              NR_ZONE_WRITE_PENDING: 0
                           NR_MLOCK: 0
                          NR_BOUNCE: 0
                         NR_ZSPAGES: 0
                  NR_FREE_CMA_PAGES: 0
      
              NODE: 4  ZONE: 1  ADDR: ffff00817fffec00  NAME: "Normal"
                SIZE: 8454144  PRESENT: 98304  MIN/LOW/HIGH: 68/166/264
                VM_STAT:
                      NR_FREE_PAGES: 146
              NR_ZONE_INACTIVE_ANON: 94668
                NR_ZONE_ACTIVE_ANON: 3
              NR_ZONE_INACTIVE_FILE: 735
                NR_ZONE_ACTIVE_FILE: 78
                NR_ZONE_UNEVICTABLE: 0
              NR_ZONE_WRITE_PENDING: 0
                           NR_MLOCK: 0
                          NR_BOUNCE: 0
                         NR_ZSPAGES: 0
                  NR_FREE_CMA_PAGES: 0
      
      In allow_direct_reclaim(), while processing ZONE_DMA32, the sum of
      inactive/active file-backed pages calculated in zone_reclaimable_pages()
      based on the result of zone_page_state_snapshot() is zero.
      
      Additionally, since this system lacks swap, the calculation of inactive/
      active anonymous pages is skipped.
      
              crash> p nr_swap_pages
              nr_swap_pages = $1937 = {
                counter = 0
              }
      
      As a result, ZONE_DMA32 is deemed unreclaimable and skipped, moving on to
      the processing of the next zone, ZONE_NORMAL, despite ZONE_DMA32 having
      free pages significantly exceeding the high watermark.
      
      The problem is that the pgdat->kswapd_failures hasn't been incremented.
      
              crash> px ((struct pglist_data *) 0xffff00817fffe540)->kswapd_failures
              $1935 = 0x0
      
      This is because the node is deemed balanced.  The node balancing logic
      in balance_pgdat() evaluates all zones collectively.  If one or more
      zones (e.g., ZONE_DMA32) have enough free pages to meet their
      watermarks, the entire node is deemed balanced.  This causes
      balance_pgdat() to exit early before incrementing kswapd_failures, as it
      considers the overall memory state acceptable, even though some zones
      (like ZONE_NORMAL) remain under significant pressure.
      
      
      The patch ensures that zone_reclaimable_pages() includes free pages
      (NR_FREE_PAGES) in its calculation when no other reclaimable pages are
      available (e.g., file-backed or anonymous pages).  This change prevents
      zones like ZONE_DMA32, which have sufficient free pages, from being
      mistakenly deemed unreclaimable.  By doing so, the patch ensures proper
      node balancing, avoids masking pressure on other zones like ZONE_NORMAL,
      and prevents infinite loops in throttle_direct_reclaim() caused by
      allow_direct_reclaim(pgdat) repeatedly returning false.
      
      
      The kernel hangs due to a task stuck in throttle_direct_reclaim(), caused
      by a node being incorrectly deemed balanced despite pressure in certain
      zones, such as ZONE_NORMAL.  This issue arises from
      zone_reclaimable_pages() returning 0 for zones without reclaimable file-
      backed or anonymous pages, causing zones like ZONE_DMA32 with sufficient
      free pages to be skipped.
      
      The lack of swap or reclaimable pages results in ZONE_DMA32 being ignored
      during reclaim, masking pressure in other zones.  Consequently,
      pgdat->kswapd_failures remains 0 in balance_pgdat(), preventing fallback
      mechanisms in allow_direct_reclaim() from being triggered, leading to an
      infinite loop in throttle_direct_reclaim().
      
      This patch modifies zone_reclaimable_pages() to account for free pages
      (NR_FREE_PAGES) when no other reclaimable pages exist.  This ensures zones
      with sufficient free pages are not skipped, enabling proper balancing and
      reclaim behavior.
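
      A hedged sketch of the change in zone_reclaimable_pages() (illustrative;
      the existing file/anon accounting above it stays as is):

        /*
         * If there are no reclaimable file-backed or anonymous pages,
         * ensure zones with sufficient free pages are not skipped.
         */
        if (nr == 0)
                nr = zone_page_state_snapshot(zone, NR_FREE_PAGES);

        return nr;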
      
      [akpm@linux-foundation.org: coding-style cleanups]
      Link: https://lkml.kernel.org/r/20241130164346.436469-1-snishika@redhat.com
      Link: https://lkml.kernel.org/r/20241130161236.433747-2-snishika@redhat.com
      
      
      Fixes: 5a1c84b4 ("mm: remove reclaim and compaction retry approximations")
      Signed-off-by: Seiji Nishikawa <snishika@redhat.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • mm/kmemleak: fix sleeping function called from invalid context at print message · 64b2d32f
      Alessandro Carminati authored
      commit cddc76b165161a02ff14c4d84d0f5266d9d32b9e upstream.
      
      Address a bug in the kernel that triggers a "sleeping function called from
      invalid context" warning when /sys/kernel/debug/kmemleak is printed under
      specific conditions:
      - CONFIG_PREEMPT_RT=y
      - Set SELinux as the LSM for the system
      - Set kptr_restrict to 1
      - kmemleak buffer contains at least one item
      
      BUG: sleeping function called from invalid context at kernel/locking/spinlock_rt.c:48
      in_atomic(): 1, irqs_disabled(): 1, non_block: 0, pid: 136, name: cat
      preempt_count: 1, expected: 0
      RCU nest depth: 2, expected: 2
      6 locks held by cat/136:
       #0: ffff32e64bcbf950 (&p->lock){+.+.}-{3:3}, at: seq_read_iter+0xb8/0xe30
       #1: ffffafe6aaa9dea0 (scan_mutex){+.+.}-{3:3}, at: kmemleak_seq_start+0x34/0x128
       #3: ffff32e6546b1cd0 (&object->lock){....}-{2:2}, at: kmemleak_seq_show+0x3c/0x1e0
       #4: ffffafe6aa8d8560 (rcu_read_lock){....}-{1:2}, at: has_ns_capability_noaudit+0x8/0x1b0
       #5: ffffafe6aabbc0f8 (notif_lock){+.+.}-{2:2}, at: avc_compute_av+0xc4/0x3d0
      irq event stamp: 136660
      hardirqs last  enabled at (136659): [<ffffafe6a80fd7a0>] _raw_spin_unlock_irqrestore+0xa8/0xd8
      hardirqs last disabled at (136660): [<ffffafe6a80fd85c>] _raw_spin_lock_irqsave+0x8c/0xb0
      softirqs last  enabled at (0): [<ffffafe6a5d50b28>] copy_process+0x11d8/0x3df8
      softirqs last disabled at (0): [<0000000000000000>] 0x0
      Preemption disabled at:
      [<ffffafe6a6598a4c>] kmemleak_seq_show+0x3c/0x1e0
      CPU: 1 UID: 0 PID: 136 Comm: cat Tainted: G            E      6.11.0-rt7+ #34
      Tainted: [E]=UNSIGNED_MODULE
      Hardware name: linux,dummy-virt (DT)
      Call trace:
       dump_backtrace+0xa0/0x128
       show_stack+0x1c/0x30
       dump_stack_lvl+0xe8/0x198
       dump_stack+0x18/0x20
       rt_spin_lock+0x8c/0x1a8
       avc_perm_nonode+0xa0/0x150
       cred_has_capability.isra.0+0x118/0x218
       selinux_capable+0x50/0x80
       security_capable+0x7c/0xd0
       has_ns_capability_noaudit+0x94/0x1b0
       has_capability_noaudit+0x20/0x30
       restricted_pointer+0x21c/0x4b0
       pointer+0x298/0x760
       vsnprintf+0x330/0xf70
       seq_printf+0x178/0x218
       print_unreferenced+0x1a4/0x2d0
       kmemleak_seq_show+0xd0/0x1e0
       seq_read_iter+0x354/0xe30
       seq_read+0x250/0x378
       full_proxy_read+0xd8/0x148
       vfs_read+0x190/0x918
       ksys_read+0xf0/0x1e0
       __arm64_sys_read+0x70/0xa8
       invoke_syscall.constprop.0+0xd4/0x1d8
       el0_svc+0x50/0x158
       el0t_64_sync+0x17c/0x180
      
      %pS and %pK in the same backtrace line are redundant, and %pS can defeat
      the purpose of %pK in certain contexts.

      %pS alone already provides the necessary information, and if it cannot
      resolve the symbol, it falls back to printing the raw address, voiding
      the original intent behind %pK.
      
      Additionally, %pK requires a privilege check CAP_SYSLOG enforced through
      the LSM, which can trigger a "sleeping function called from invalid
      context" warning under RT_PREEMPT kernels when the check occurs in an
      atomic context. This issue may also affect other LSMs.
      
      This change avoids the unnecessary privilege check and resolves the
      sleeping function warning without any loss of information.
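
      A hedged sketch of the change in print_unreferenced() (the format string
      is illustrative, not verbatim; warn_or_seq_printf() is the helper
      kmemleak already uses for this output, and 'entry' stands for the
      backtrace element):

        /* Symbolic frame only: dropping %pK avoids the CAP_SYSLOG/LSM check
         * that may take a sleeping lock in this atomic context. */
        warn_or_seq_printf(seq, "    %pS\n", (void *)entry);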
      
      Link: https://lkml.kernel.org/r/20241217142032.55793-1-acarmina@redhat.com
      
      
      Fixes: 3a6f33d8 ("mm/kmemleak: use %pK to display kernel pointers in backtrace")
      Signed-off-by: Alessandro Carminati <acarmina@redhat.com>
      Acked-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Acked-by: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Clément Léger <clement.leger@bootlin.com>
      Cc: Alessandro Carminati <acarmina@redhat.com>
      Cc: Eric Chanudet <echanude@redhat.com>
      Cc: Gabriele Paoloni <gpaoloni@redhat.com>
      Cc: Juri Lelli <juri.lelli@redhat.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Weißschuh <thomas.weissschuh@linutronix.de>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • mm/readahead: fix large folio support in async readahead · 5802fe9c
      Yafang Shao authored
      commit 158cdce87c8c172787063998ad5dd3e2f658b963 upstream.
      
      When testing large folio support with XFS on our servers, we observed
      that only a few large folios are mapped when reading large files via
      mmap.  After a thorough analysis, I identified that it was caused by the
      `/sys/block/*/queue/read_ahead_kb` setting.  On our test servers, this
      parameter is set to 128KB.  After I tuned it to 2MB, the large folios
      worked as expected.  However, I believe the large folio behavior should
      not depend on the value of read_ahead_kb.  It would be more robust if
      the kernel could automatically adapt to it.
      
      With /sys/block/*/queue/read_ahead_kb set to 128KB and performing a
      sequential read on a 1GB file using MADV_HUGEPAGE, the differences in
      /proc/meminfo are as follows:
      
      - before this patch
        FileHugePages:     18432 kB
        FilePmdMapped:      4096 kB
      
      - after this patch
        FileHugePages:   1067008 kB
        FilePmdMapped:   1048576 kB
      
      This shows that after applying the patch, the entire 1GB file is mapped to
      huge pages.  The stable list is CCed, as without this patch, large folios
      don't function optimally in the readahead path.
      
      It's worth noting that if read_ahead_kb is set to a larger value that
      isn't aligned with huge page sizes (e.g., 4MB + 128KB), it may still fail
      to map to hugepages.
      
      Link: https://lkml.kernel.org/r/20241108141710.9721-1-laoar.shao@gmail.com
      Link: https://lkml.kernel.org/r/20241206083025.3478-1-laoar.shao@gmail.com
      
      
      Fixes: 4687fdbb ("mm/filemap: Support VM_HUGEPAGE for file mappings")
      Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
      Tested-by: kernel test robot <oliver.sang@intel.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • mm: shmem: fix incorrect index alignment for within_size policy · 9e4c11d4
      Baolin Wang authored
      commit d0e6983a6d1719738cf8d13982a68094f0a1872a upstream.
      
      With the shmem per-size within_size policy enabled, using an incorrect
      'order' size to round_up() the index can lead to incorrect i_size
      checks, resulting in inappropriately large orders being returned.

      Change to using '1 << order' to round_up() the index to fix this issue.
      Additionally, add an 'aligned_index' variable to avoid affecting the
      index checks.
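
      A hedged sketch of the corrected check (aligned_index and order follow
      the description above; order_mask is a placeholder for the returned set
      of allowed orders, and the loop over candidate orders is omitted):

        /* Round up by the number of pages in this order, not by the raw
         * order value, and leave 'index' itself untouched. */
        aligned_index = round_up(index + 1, 1 << order);
        if (i_size >> PAGE_SHIFT >= aligned_index)
                return order_mask;      /* this order still fits within i_size */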
      
      Link: https://lkml.kernel.org/r/77d8ef76a7d3d646e9225e9af88a76549a68aab1.1734593154.git.baolin.wang@linux.alibaba.com
      
      
      Fixes: e7a2ab7b ("mm: shmem: add mTHP support for anonymous shmem")
      Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
      Acked-by: David Hildenbrand <david@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • mm: shmem: fix the update of 'shmem_falloc->nr_unswapped' · cabacb18
      Baolin Wang authored
      commit d77b90d2b2642655b5f60953c36ad887257e1802 upstream.
      
      'shmem_falloc->nr_unswapped' is used to record how many writepage calls
      refused to swap out because fallocate() is allocating, but after shmem
      gained support for swapping out large folios, the update of
      'shmem_falloc->nr_unswapped' does not use the correct number of pages in
      the large folio, which may lead to fallocate() not exiting as soon as
      possible.

      Anyway, this was found through code inspection, and I am not sure
      whether it would actually cause serious issues.
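
      A hedged one-line sketch of the fix (illustrative): count every page of
      the large folio rather than bumping the counter by one.

        shmem_falloc->nr_unswapped += folio_nr_pages(folio);    /* was: ...nr_unswapped++ */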
      
      Link: https://lkml.kernel.org/r/f66a0119d0564c2c37c84f045835b870d1b2196f.1734593154.git.baolin.wang@linux.alibaba.com
      
      
      Fixes: 809bc865 ("mm: shmem: support large folio swap out")
      Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
      Acked-by: David Hildenbrand <david@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • mm/damon/core: fix new damon_target objects leaks on damon_commit_targets() · 3647932d
      SeongJae Park authored
      commit 8debfc5b1aa569d3d2ac836af2553da037611c61 upstream.
      
      Patch series "mm/damon/core: fix memory leaks and ignored inputs from
      damon_commit_ctx()".
      
      Due to two bugs in damon_commit_targets() and damon_commit_schemes(),
      which are called from damon_commit_ctx(), some user inputs can be
      ignored, and some memory objects can be leaked.  Fix those.

      Note that only DAMON sysfs interface users are affected.  Other DAMON
      core API user modules that are focused more on simple and dedicated
      production usages, including DAMON_RECLAIM and DAMON_LRU_SORT, are not
      using the buggy function in that way, so they are not affected.
      
      
      This patch (of 2):
      
      When new DAMON targets are added via damon_commit_targets(), the newly
      created targets are not deallocated when updating the internal data
      (damon_commit_target()) fails.  Worse yet, even if the setup is
      successfully done, the new target is not linked to the context.  Hence,
      the new targets are always leaked regardless of whether the internal
      data setup fails.  Fix the leaks.
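
      A hedged sketch of the intended flow in damon_commit_targets() (the
      DAMON core helpers damon_new_target(), damon_destroy_target() and
      damon_add_target() are real; the loop and error propagation are
      simplified):

        new_target = damon_new_target();
        if (!new_target)
                return -ENOMEM;
        err = damon_commit_target(new_target, src_target);
        if (err) {
                damon_destroy_target(new_target);       /* don't leak on failure */
                return err;
        }
        damon_add_target(dst_ctx, new_target);          /* and actually link it */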
      
      Link: https://lkml.kernel.org/r/20241222231222.85060-2-sj@kernel.org
      
      
      Fixes: 9cb3d0b9 ("mm/damon/core: implement DAMON context commit function")
      Signed-off-by: SeongJae Park <sj@kernel.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • mm/damon/core: fix ignored quota goals and filters of newly committed schemes · 69bbaa0f
      SeongJae Park authored
      commit 7d390b53067ef745e2d9bee5a9683df4c96b80a0 upstream.
      
      damon_commit_schemes() ignores the quota goals and filters of the newly
      committed schemes.  This makes the behavior confusing for users.
      Correctly handle those inputs.
      
      Link: https://lkml.kernel.org/r/20241222231222.85060-3-sj@kernel.org
      
      
      Fixes: 9cb3d0b9 ("mm/damon/core: implement DAMON context commit function")
      Signed-off-by: SeongJae Park <sj@kernel.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
  5. Dec 27, 2024
  6. Dec 19, 2024
    • memcg: slub: fix SUnreclaim for post charged objects · 825bccd9
      Shakeel Butt authored
      
      commit b7ffecbe198e2dfc44abf92ceb90f46150f7527a upstream.
      
      Large kmalloc directly allocates from the page allocator and then uses
      lruvec_stat_mod_folio() to increment the unreclaimable slab stats for
      the global and memcg counters.  However, when post memcg charging of
      slab objects was added in commit 9028cdeb ("memcg: add charging of
      already allocated slab objects"), it failed to correctly handle the
      unreclaimable slab stats for memcg.

      One user-visible effect of that bug is that the node-level unreclaimable
      slab stat works correctly, but the memcg-level stat can underflow: the
      kernel correctly handles the free path, while the charge path misses
      incrementing the memcg-level unreclaimable slab stat.  Fix it by
      handling this correctly in the post-charge code path.
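
      A hedged sketch of the post-charge accounting for large kmalloc folios
      (illustrative; the surrounding post-charge path in the slab allocator is
      simplified and this is not the verbatim hunk):

        /* The folio was already counted in the node stat at allocation time,
         * before it had a memcg.  Move the accounting to the memcg-aware
         * interface so the memcg-level stat gets incremented too. */
        mod_node_page_state(folio_pgdat(folio), NR_SLAB_UNRECLAIMABLE_B,
                            -(long)folio_size(folio));
        lruvec_stat_mod_folio(folio, NR_SLAB_UNRECLAIMABLE_B,
                              folio_size(folio));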
      
      Fixes: 9028cdeb ("memcg: add charging of already allocated slab objects")
      Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
  7. Dec 14, 2024
    • memblock: allow zero threshold in validate_numa_converage() · 2cec2d91
      Mike Rapoport (Microsoft) authored
      commit 9cdc6423acb49055efb444ecd895d853a70ef931 upstream.
      
      Currently, memblock's validate_numa_converage() returns a false negative
      when the threshold is set to zero.

      Fix that by making the check of whether the memory size with an invalid
      node ID is greater than the threshold an exclusive comparison.
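
      A hedged sketch of the comparison change (the variable names here are
      assumptions, not the verbatim hunk):

        /* With an exclusive comparison, a zero threshold no longer turns
         * "zero bytes without a valid node ID" into a failure. */
        if (nr_pages_wo_node << PAGE_SHIFT > threshold_bytes)   /* was: >= */
                return false;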
      
      Link: https://lore.kernel.org/all/Z0mIDBD4KLyxyOCm@kernel.org/
      
      
      Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • mm: respect mmap hint address when aligning for THP · fe1a34e9
      Kalesh Singh authored
      commit 249608ee47132cab3b1adacd9e463548f57bd316 upstream.
      
      Commit efa7df3e ("mm: align larger anonymous mappings on THP
      boundaries") updated __get_unmapped_area() to align the start address for
      the VMA to a PMD boundary if CONFIG_TRANSPARENT_HUGEPAGE=y.
      
      It does this by effectively looking up a region of size
      request_size + PMD_SIZE and aligning the start up to a PMD boundary.
      
      Commit 4ef9ad19 ("mm: huge_memory: don't force huge page alignment on
      32 bit") opted out of this for 32bit due to regressions in mmap base
      randomization.
      
      Commit d4148aea ("mm, mmap: limit THP alignment of anonymous mappings
      to PMD-aligned sizes") restricted this to only mmap sizes that are
      multiples of the PMD_SIZE due to reported regressions in some performance
      benchmarks -- which seemed mostly due to the reduced spatial locality of
      related mappings due to the forced PMD-alignment.
      
      Another unintended side effect has emerged: When a user specifies an mmap
      hint address, the THP alignment logic modifies the behavior, potentially
      ignoring the hint even if a sufficiently large gap exists at the requested
      hint location.
      
      Example Scenario:
      
      Consider the following simplified virtual address (VA) space:
      
          ...
      
          0x200000-0x400000 --- VMA A
          0x400000-0x600000 --- Hole
          0x600000-0x800000 --- VMA B
      
          ...
      
      A call to mmap() with hint=0x400000 and len=0x200000 behaves differently:
      
        - Before THP alignment: The requested region (size 0x200000) fits into
          the gap at 0x400000, so the hint is respected.
      
        - After alignment: The logic searches for a region of size
          0x400000 (len + PMD_SIZE) starting at 0x400000.
          This search fails due to the mapping at 0x600000 (VMA B), and the hint
          is ignored, falling back to arch_get_unmapped_area[_topdown]().
      
      In general, the hint is effectively ignored if there is any existing
      mapping in the below range:
      
           [mmap_hint + mmap_size, mmap_hint + mmap_size + PMD_SIZE)
      
      This changes the semantics of the mmap hint from "Respect the hint if a
      sufficiently large gap exists at the requested location" to "Respect the
      hint only if an additional PMD-sized gap exists beyond the requested
      size".
      
      This has performance implications for allocators that allocate their
      heap using mmap but try to keep it "as contiguous as possible" by using
      the end of the existing heap as the address hint.  With the new behavior
      it's more likely to get a much less contiguous heap, adding extra
      fragmentation and performance overhead.
      
      To restore the expected behavior; don't use
      thp_get_unmapped_area_vmflags() when the user provided a hint address, for
      anonymous mappings.
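
      A hedged sketch of the gate in __get_unmapped_area() (illustrative; the
      real condition carries additional checks that are not reproduced here):

        /* Only attempt PMD alignment for anonymous mappings when the caller
         * did not pass a hint address. */
        if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) && !addr && !file)
                addr = thp_get_unmapped_area_vmflags(file, addr, len,
                                                     pgoff, flags, vm_flags);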
      
      Note: As Yang Shi pointed out, the issue still remains for filesystems
      which are using thp_get_unmapped_area() for their get_unmapped_area()
      op.  It is unclear what workloads would regress if we ignore THP
      alignment when the hint address is provided for such file-backed
      mappings -- so this fix will be handled separately.
      
      Link: https://lkml.kernel.org/r/20241118214650.3667577-1-kaleshsingh@google.com
      
      
      Fixes: efa7df3e ("mm: align larger anonymous mappings on THP boundaries")
      Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
      Reviewed-by: Rik van Riel <riel@surriel.com>
      Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Yang Shi <yang@os.amperecomputing.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Ryan Roberts <ryan.roberts@arm.com>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Hans Boehm <hboehm@google.com>
      Cc: Lokesh Gidra <lokeshgidra@google.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      fe1a34e9
    • Andrii Nakryiko's avatar
      mm: fix vrealloc()'s KASAN poisoning logic · 536ffb40
      Andrii Nakryiko authored
      commit d699440f58ce9bd71103cc7b692e3ab76a20bfcd upstream.
      
      When vrealloc() reuses an already allocated vmap_area, we need to
      re-annotate the poisoned and unpoisoned portions of the underlying
      memory according to the new size.
      
      Failing to do so results in the KASAN splat recorded at [1]: KASAN
      reports an issue where there is none.
      
      Note: hard-coding KASAN_VMALLOC_PROT_NORMAL might not be exactly correct,
      but the KASAN flag logic is pretty involved and spread throughout
      __vmalloc_node_range_noprof(), so I'm using the bare minimum flag here and
      leaving the rest to the mm people to refactor this logic and reuse it here.
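      
      As a minimal sketch of that re-annotation (illustrative only, written
      for this description and not the literal diff; it assumes the existing
      kasan_poison_vmalloc()/kasan_unpoison_vmalloc() helpers), the reuse
      path would poison the now-unused tail and unpoison exactly the
      requested range:
      
          /* Sketch: vrealloc() keeps the existing allocation 'p' of
           * old_size bytes and shrinks it to 'size' bytes. */
          if (size <= old_size) {
                  kasan_poison_vmalloc(p + size, old_size - size);
                  kasan_unpoison_vmalloc(p, size, KASAN_VMALLOC_PROT_NORMAL);
                  return (void *)p;
          }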
      
      Link: https://lkml.kernel.org/r/20241126005206.3457974-1-andrii@kernel.org
      Link: https://lore.kernel.org/bpf/67450f9b.050a0220.21d33d.0004.GAE@google.com/ [1]
      Fixes: 3ddc2fef ("mm: vmalloc: implement vrealloc()")
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Cc: Alexei Starovoitov <ast@kernel.org>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Uladzislau Rezki (Sony) <urezki@gmail.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      536ffb40
    • Matthew Wilcox (Oracle)'s avatar
      mm: open-code page_folio() in dump_page() · bd4d2333
      Matthew Wilcox (Oracle) authored
      commit 6a7de1bf218d75f27f68d6a3f5ae1eb7332b941e upstream.
      
      page_folio() calls page_fixed_fake_head(), which will misidentify this
      page as a fake head and load off the end of 'precise'.  We may end up
      with a pointer to a fake head, but that's OK because it contains the
      right information for dump_page().
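      
      Conceptually, "open-coding page_folio()" here means deriving the folio
      pointer from the copied page's compound_head word directly, without the
      page_fixed_fake_head() check that reads beyond the single on-stack
      struct page.  A rough sketch of that idea (illustrative, not the
      upstream diff):
      
          unsigned long head = precise.compound_head;
      
          /* Bit 0 set means 'precise' is a copy of a tail page and the rest
           * of the word is the address of the head page; otherwise the copy
           * itself is the head. */
          const struct page *head_page =
                  (head & 1) ? (const struct page *)(head - 1) : &precise;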
      
      gcc-15 is smart enough to catch this with -Warray-bounds:
      
      In function 'page_fixed_fake_head',
          inlined from '_compound_head' at ../include/linux/page-flags.h:251:24,
          inlined from '__dump_page' at ../mm/debug.c:123:11:
      ../include/asm-generic/rwonce.h:44:26: warning: array subscript 9 is outside
      +array bounds of 'struct page[1]' [-Warray-bounds=]
      
      Link: https://lkml.kernel.org/r/20241125201721.2963278-2-willy@infradead.org
      
      
      Fixes: fae7d834 ("mm: add __dump_folio()")
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Reported-by: Kees Cook <kees@kernel.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      bd4d2333
    • John Sperbeck's avatar
      mm: memcg: declare do_memsw_account inline · 35e8f912
      John Sperbeck authored
      commit 89dd878282881306c38f7e354e7614fca98cb9a6 upstream.
      
      In commit 66d60c42 ("mm: memcg: move legacy memcg event code into
      memcontrol-v1.c"), the static do_memsw_account() function was moved from a
      .c file to a .h file.  Unfortunately, the traditional inline keyword
      wasn't added.  If a file (e.g., a unit test) includes the .h file, but
      doesn't refer to do_memsw_account(), it will get a warning like:
      
      mm/memcontrol-v1.h:41:13: warning: unused function 'do_memsw_account' [-Wunused-function]
         41 | static bool do_memsw_account(void)
            |             ^~~~~~~~~~~~~~~~
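      
      A standalone illustration of the mechanism (hypothetical names, not
      kernel code): a plain static function defined in a header is flagged as
      unused by every translation unit that includes the header without
      calling it, while a static inline one is not:
      
          /* hypothetical_header.h */
          #include <stdbool.h>
      
          static bool helper(void)               /* -Wunused-function if unused */
          {
                  return true;
          }
      
          static inline bool fixed_helper(void)  /* no warning when unused */
          {
                  return true;
          }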
      
      Link: https://lkml.kernel.org/r/20241128203959.726527-1-jsperbeck@google.com
      
      
      Fixes: 66d60c42 ("mm: memcg: move legacy memcg event code into memcontrol-v1.c")
      Signed-off-by: John Sperbeck <jsperbeck@google.com>
      Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Muchun Song <muchun.song@linux.dev>
      Cc: Shakeel Butt <shakeel.butt@linux.dev>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      35e8f912
    • David Hildenbrand's avatar
      mm/mempolicy: fix migrate_to_node() assuming there is at least one VMA in a MM · 42d9fe2a
      David Hildenbrand authored
      commit 091c1dd2d4df6edd1beebe0e5863d4034ade9572 upstream.
      
      We currently assume that there is at least one VMA in an MM, which isn't
      necessarily true.
      
      So find_vma() might return NULL, and we would then dereference NULL.
      Properly handle find_vma() returning NULL.
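      
      A sketch of what "properly handle" amounts to (illustrative; the exact
      error handling in the patch may differ): bail out of migrate_to_node()
      early when find_vma() reports that the MM has no VMAs at all:
      
          mmap_read_lock(mm);
          vma = find_vma(mm, 0);
          if (unlikely(!vma)) {
                  mmap_read_unlock(mm);
                  return 0;
          }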
      
      This fixes the report:
      
      Oops: general protection fault, probably for non-canonical address 0xdffffc0000000000: 0000 [#1] PREEMPT SMP KASAN PTI
      KASAN: null-ptr-deref in range [0x0000000000000000-0x0000000000000007]
      CPU: 1 UID: 0 PID: 6021 Comm: syz-executor284 Not tainted 6.12.0-rc7-syzkaller-00187-gf868cd251776 #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 10/30/2024
      RIP: 0010:migrate_to_node mm/mempolicy.c:1090 [inline]
      RIP: 0010:do_migrate_pages+0x403/0x6f0 mm/mempolicy.c:1194
      Code: ...
      RSP: 0018:ffffc9000375fd08 EFLAGS: 00010246
      RAX: 0000000000000000 RBX: ffffc9000375fd78 RCX: 0000000000000000
      RDX: ffff88807e171300 RSI: dffffc0000000000 RDI: ffff88803390c044
      RBP: ffff88807e171428 R08: 0000000000000014 R09: fffffbfff2039ef1
      R10: ffffffff901cf78f R11: 0000000000000000 R12: 0000000000000003
      R13: ffffc9000375fe90 R14: ffffc9000375fe98 R15: ffffc9000375fdf8
      FS:  00005555919e1380(0000) GS:ffff8880b8700000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 00005555919e1ca8 CR3: 000000007f12a000 CR4: 00000000003526f0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      Call Trace:
       <TASK>
       kernel_migrate_pages+0x5b2/0x750 mm/mempolicy.c:1709
       __do_sys_migrate_pages mm/mempolicy.c:1727 [inline]
       __se_sys_migrate_pages mm/mempolicy.c:1723 [inline]
       __x64_sys_migrate_pages+0x96/0x100 mm/mempolicy.c:1723
       do_syscall_x64 arch/x86/entry/common.c:52 [inline]
       do_syscall_64+0xcd/0x250 arch/x86/entry/common.c:83
       entry_SYSCALL_64_after_hwframe+0x77/0x7f
      
      [akpm@linux-foundation.org: add unlikely()]
      Link: https://lkml.kernel.org/r/20241120201151.9518-1-david@redhat.com
      
      
      Fixes: 39743889 ("[PATCH] Swap Migration V5: sys_migrate_pages interface")
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Reported-by: <syzbot+3511625422f7aa637f0d@syzkaller.appspotmail.com>
      Closes: https://lore.kernel.org/lkml/673d2696.050a0220.3c9d61.012f.GAE@google.com/T/
      
      
      Reviewed-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
      Reviewed-by: Christoph Lameter <cl@linux.com>
      Cc: Liam R. Howlett <Liam.Howlett@Oracle.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      42d9fe2a
    • John Hubbard's avatar
      mm/gup: handle NULL pages in unpin_user_pages() · 69d31945
      John Hubbard authored
      commit a1268be280d8e484ab3606d7476edd0f14bb9961 upstream.
      
      The recent addition of "pofs" (pages or folios) handling to gup has a
      flaw: it assumes that unpin_user_pages() handles NULL pages in the pages**
      array.  That's not the case, as I discovered when I ran on a new
      configuration on my test machine.
      
      Fix this by skipping NULL pages in unpin_user_pages(), just like
      unpin_folios() already does.
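      
      A sketch of that skip (illustrative, not necessarily the literal diff):
      step over NULL entries while walking the pages array so they never
      reach the folio helpers:
      
          for (i = 0; i < npages; i += nr) {
                  if (!pages[i]) {
                          nr = 1;
                          continue;
                  }
                  folio = gup_folio_next(pages, npages, i, &nr);
                  gup_put_folio(folio, nr, FOLL_PIN);
          }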
      
      Details: when booting on x86 with "numa=fake=2 movablecore=4G" on Linux
      6.12, and running this:
      
          tools/testing/selftests/mm/gup_longterm
      
      ...I get the following crash:
      
      BUG: kernel NULL pointer dereference, address: 0000000000000008
      RIP: 0010:sanity_check_pinned_pages+0x3a/0x2d0
      ...
      Call Trace:
       <TASK>
       ? __die_body+0x66/0xb0
       ? page_fault_oops+0x30c/0x3b0
       ? do_user_addr_fault+0x6c3/0x720
       ? irqentry_enter+0x34/0x60
       ? exc_page_fault+0x68/0x100
       ? asm_exc_page_fault+0x22/0x30
       ? sanity_check_pinned_pages+0x3a/0x2d0
       unpin_user_pages+0x24/0xe0
       check_and_migrate_movable_pages_or_folios+0x455/0x4b0
       __gup_longterm_locked+0x3bf/0x820
       ? mmap_read_lock_killable+0x12/0x50
       ? __pfx_mmap_read_lock_killable+0x10/0x10
       pin_user_pages+0x66/0xa0
       gup_test_ioctl+0x358/0xb20
       __se_sys_ioctl+0x6b/0xc0
       do_syscall_64+0x7b/0x150
       entry_SYSCALL_64_after_hwframe+0x76/0x7e
      
      Link: https://lkml.kernel.org/r/20241121034933.77502-1-jhubbard@nvidia.com
      
      
      Fixes: 94efde1d ("mm/gup: avoid an unnecessary allocation call for FOLL_LONGTERM cases")
      Signed-off-by: John Hubbard <jhubbard@nvidia.com>
      Acked-by: David Hildenbrand <david@redhat.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Vivek Kasireddy <vivek.kasireddy@intel.com>
      Cc: Dave Airlie <airlied@redhat.com>
      Cc: Gerd Hoffmann <kraxel@redhat.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
      Cc: Dongwon Kim <dongwon.kim@intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Junxiao Chang <junxiao.chang@intel.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      69d31945