  1. Mar 03, 2025
  2. Jan 14, 2025
    • lib: stackinit: hide never-taken branch from compiler · a138ad38
      Kees Cook authored and Frieder Schrempf committed
      commit 5c3793604f91123bf49bc792ce697a0bef4c173c upstream.
      
      The never-taken branch leads to an invalid bounds condition, which is by
      design. To avoid the unwanted warning from the compiler, hide the
      variable from the optimizer.
      
      ../lib/stackinit_kunit.c: In function 'do_nothing_u16_zero':
      ../lib/stackinit_kunit.c:51:49: error: array subscript 1 is outside array bounds of 'u16[0]' {aka 'short unsigned int[]'} [-Werror=array-bounds=]
         51 | #define DO_NOTHING_RETURN_SCALAR(ptr)           *(ptr)
            |                                                 ^~~~~~
      ../lib/stackinit_kunit.c:219:24: note: in expansion of macro 'DO_NOTHING_RETURN_SCALAR'
        219 |                 return DO_NOTHING_RETURN_ ## which(ptr + 1);    \
            |                        ^~~~~~~~~~~~~~~~~~
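
      For context, here is a minimal userspace sketch of the silencing technique:
      an empty inline asm that makes the pointer opaque to the optimizer, in the
      spirit of the kernel's OPTIMIZER_HIDE_VAR(). The names and the setup are
      illustrative, not the kunit test's own code, and whether this toy version
      warns depends on the GCC version and flags.

        /* Build with: gcc -O2 -Wall -Warray-bounds sketch.c */
        #include <stdio.h>

        static unsigned short backing[1];   /* the kunit test uses a u16[0] */

        static unsigned short do_nothing(unsigned short *ptr, int never)
        {
            if (never)
                return *(ptr + 1);          /* never executed, by design */
            return *ptr;
        }

        int main(void)
        {
            unsigned short *ptr = backing;

            /* Hide the variable from the optimizer so it cannot prove that
             * the never-taken access is out of bounds. */
            asm volatile("" : "+r"(ptr));

            printf("%u\n", do_nothing(ptr, 0));
            return 0;
        }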
      
      Link: https://lkml.kernel.org/r/20241117113813.work.735-kees@kernel.org
      
      
      Signed-off-by: Kees Cook <kees@kernel.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      a138ad38
    • maple_tree: refine mas_store_root() on storing NULL · 98504142
      Wei Yang authored and Frieder Schrempf committed
      commit 0ea120b278ad7f7cfeeb606e150ad04b192df60b upstream.
      
      Currently, when storing NULL on mas_store_root(), the behavior could be
      improved.
      
      Storing NULLs over the entire tree may result in a node being used to
      store a single range.  Further stores of NULL may leave the node and
      tree corrupted and cause incorrect behaviour.  Fixing the store of NULL
      to the root fixes the issue by ensuring that a range of 0 - ULONG_MAX
      results in an empty tree.
      
      Users of the tree may see incorrect values returned if the tree was
      expanded to store values, then overwritten entirely with NULLs, and
      then NULLs continued to be stored over the now-empty area.
      
      For example, possible cases are:
      
        * storing NULL at any range results in a new node
        * storing NULL at range [m, n], where m > 0, in a single-entry tree
          results in a new node with range [m, n] set to NULL
        * storing NULL at range [m, n], where m > 0, in an empty tree results
          in consecutive NULL slots
        * the root is allowed to expand to store multiple NULL entries in an
          empty tree
      
      This patch improves on this by:
      
        * being more memory efficient: the tree is set to empty instead of
          using a node
        * removing the possibility of consecutive NULL slots, which would
          prohibit extended NULLs in later operations
      
      Link: https://lkml.kernel.org/r/20241031231627.14316-5-richard.weiyang@gmail.com
      
      
      Fixes: 54a611b6 ("Maple Tree: add new data structure")
      Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
      Reviewed-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
      Cc: Liam R. Howlett <Liam.Howlett@Oracle.com>
      Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com>
      Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      98504142
    • lib: string_helpers: silence snprintf() output truncation warning · 00df2636
      Bartosz Golaszewski authored and Frieder Schrempf committed
      
      commit a508ef4b1dcc82227edc594ffae583874dd425d7 upstream.
      
      The output of ".%03u" with an unsigned int in the range [0, 4294966295]
      may get truncated if the target buffer is smaller than 12 bytes. This
      can't really happen here, as the 'remainder' variable cannot exceed 999,
      but the compiler doesn't know that. To make it happy, just increase the
      buffer to the point where the warning goes away.
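
      As a rough userspace illustration of this warning class (illustrative
      code, not the kernel helper): a 12-byte buffer covers the worst case
      ".%03u" can produce for any unsigned int (a dot, up to 10 digits and the
      terminating NUL), so -Wformat-truncation stays quiet.

        /* Build with: gcc -O2 -Wformat-truncation sketch.c */
        #include <stdio.h>

        int main(void)
        {
            unsigned int remainder = 123;  /* in practice never exceeds 999 */
            char tmp[12];                  /* a smaller buffer draws the warning */

            snprintf(tmp, sizeof(tmp), ".%03u", remainder);
            puts(tmp);
            return 0;
        }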
      
      Fixes: 3c9f3681 ("[SCSI] lib: add generic helper to print sizes rounded to the correct SI range")
      Signed-off-by: Bartosz Golaszewski <bartosz.golaszewski@linaro.org>
      Reviewed-by: Andy Shevchenko <andy@kernel.org>
      Cc: James E.J. Bottomley <James.Bottomley@HansenPartnership.com>
      Cc: Kees Cook <kees@kernel.org>
      Cc: stable@vger.kernel.org
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Link: https://lore.kernel.org/r/20241101205453.9353-1-brgl@bgdev.pl
      
      Signed-off-by: Kees Cook <kees@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      00df2636
    • lib/buildid: Fix build ID parsing logic · e2043bd5
      Jiri Olsa authored and Frieder Schrempf committed
      
      parse_build_id_buf() does not account for the Elf32_Nhdr header size
      when getting the build ID data pointer, and as a result returns wrong
      build ID data.
      
      This is a problem only for stable trees that merged the 84887f4c
      fix; the upstream build ID code was refactored and returns the proper
      build ID.
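
      For reference, a small userspace sketch (illustrative, not the kernel
      parser) of where the build ID payload actually starts: after the
      Elf32_Nhdr *and* the 4-byte-aligned name, which is the offset that goes
      wrong when the header size is not accounted for.

        #include <elf.h>
        #include <stdio.h>
        #include <string.h>

        static const unsigned char *note_desc(const unsigned char *note, size_t *len)
        {
            const Elf32_Nhdr *nhdr = (const Elf32_Nhdr *)note;
            size_t name_sz = (nhdr->n_namesz + 3) & ~3u;  /* name is padded to 4 bytes */

            *len = nhdr->n_descsz;
            return note + sizeof(*nhdr) + name_sz;        /* skip header AND name */
        }

        int main(void)
        {
            /* Hand-built "GNU" build-id note: 4-byte name, 4-byte descriptor. */
            _Alignas(Elf32_Nhdr) unsigned char note[sizeof(Elf32_Nhdr) + 4 + 4];
            Elf32_Nhdr hdr = { .n_namesz = 4, .n_descsz = 4, .n_type = NT_GNU_BUILD_ID };
            size_t len, i;

            memcpy(note, &hdr, sizeof(hdr));
            memcpy(note + sizeof(hdr), "GNU", 4);
            memcpy(note + sizeof(hdr) + 4, "\xde\xad\xbe\xef", 4);

            const unsigned char *id = note_desc(note, &len);
            for (i = 0; i < len; i++)
                printf("%02x", id[i]);
            printf("\n");
            return 0;
        }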
      
      Acked-by: Andrii Nakryiko <andrii@kernel.org>
      Fixes: 84887f4c ("lib/buildid: harden build ID parsing logic")
      Signed-off-by: Jiri Olsa <jolsa@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      e2043bd5
    • maple_tree: correct tree corruption on spanning store · 621f91dd
      Lorenzo Stoakes authored and Frieder Schrempf committed
      commit bea07fd6 upstream.
      
      Patch series "maple_tree: correct tree corruption on spanning store", v3.
      
      There has been a nasty yet subtle maple tree corruption bug that appears
      to have been in existence since the inception of the algorithm.
      
      This bug seems far more likely to happen since commit f8d112a4
      ("mm/mmap: avoid zeroing vma tree in mmap_region()"), which is the point
      at which reports started to be submitted concerning this bug.
      
      We were made definitely aware of the bug thanks to the kind efforts of
      Bert Karwatzki who helped enormously in my being able to track this down
      and identify the cause of it.
      
      The bug arises when an attempt is made to perform a spanning store across
      two leaf nodes, where the right leaf node is the rightmost child of the
      shared parent, AND the store completely consumes the rightmost node.
      
      This results in mas_wr_spanning_store() mistakenly duplicating the new and
      existing entries at the maximum pivot within the range, and thus maple
      tree corruption.
      
      The fix patch corrects this by detecting this scenario and disallowing the
      mistaken duplicate copy.
      
      The fix patch commit message goes into great detail as to how this occurs.
      
      This series also includes a test which reliably reproduces the issue, and
      asserts that the fix works correctly.
      
      Bert has kindly tested the fix and confirmed it resolved his issues.  Also
      Mikhail Gavrilov kindly reported what appears to be precisely the same
      bug, which this fix should also resolve.
      
      
      This patch (of 2):
      
      There has been a subtle bug present in the maple tree implementation from
      its inception.
      
      This arises from how stores are performed - when a store occurs, it will
      overwrite overlapping ranges and adjust the tree as necessary to
      accommodate this.
      
      A range may always ultimately span two leaf nodes.  In this instance we
      walk the two leaf nodes, determine which elements are not overwritten to
      the left and to the right of the start and end of the ranges respectively
      and then rebalance the tree to contain these entries and the newly
      inserted one.
      
      This kind of store is dubbed a 'spanning store' and is implemented by
      mas_wr_spanning_store().
      
      In order to reach this stage, mas_store_gfp() invokes
      mas_wr_preallocate(), mas_wr_store_type() and mas_wr_walk() in turn to
      walk the tree and update the object (mas) to traverse to the location
      where the write should be performed, determining its store type.
      
      When a spanning store is required, this function returns false stopping at
      the parent node which contains the target range, and mas_wr_store_type()
      marks the mas->store_type as wr_spanning_store to denote this fact.
      
      When we go to perform the store in mas_wr_spanning_store(), we first
      determine the elements AFTER the END of the range we wish to store (that
      is, to the right of the entry to be inserted) - we do this by walking to
      the NEXT pivot in the tree (i.e.  r_mas.last + 1), starting at the node we
      have just determined contains the range over which we intend to write.
      
      We then turn our attention to the entries to the left of the entry we are
      inserting, whose state is represented by l_mas, and copy these into a 'big
      node', which is a special node which contains enough slots to contain two
      leaf node's worth of data.
      
      We then copy the entry we wish to store immediately after this - the copy
      and the insertion of the new entry is performed by mas_store_b_node().
      
      After this we copy the elements to the right of the end of the range which
      we are inserting, if we have not exceeded the length of the node (i.e.
      r_mas.offset <= r_mas.end).
      
      Herein lies the bug - under very specific circumstances, this logic can
      break and corrupt the maple tree.
      
      Consider the following tree:
      
      Height
        0                             Root Node
                                       /      \
                       pivot = 0xffff /        \ pivot = ULONG_MAX
                                     /          \
        1                       A [-----]       ...
                                   /   \
                   pivot = 0x4fff /     \ pivot = 0xffff
                                 /       \
        2 (LEAVES)          B [-----]  [-----] C
                                            ^--- Last pivot 0xffff.
      
      Now imagine we wish to store an entry in the range [0x4000, 0xffff] (note
      that all ranges expressed in maple tree code are inclusive):
      
      1. mas_store_gfp() descends the tree, finds node A at <=0xffff, then
         determines that this is a spanning store across nodes B and C. The mas
         state is set such that the current node from which we traverse further
         is node A.
      
      2. In mas_wr_spanning_store() we try to find elements to the right of pivot
         0xffff by searching for an index of 0x10000:
      
          - mas_wr_walk_index() invokes mas_wr_walk_descend() and
            mas_wr_node_walk() in turn.
      
              - mas_wr_node_walk() loops over entries in node A until EITHER it
                finds an entry whose pivot equals or exceeds 0x10000 OR it
                reaches the final entry.
      
              - Since no entry has a pivot equal to or exceeding 0x10000, pivot
                0xffff is selected, leading to node C.
      
          - mas_wr_walk_traverse() resets the mas state to traverse node C. We
            loop around and invoke mas_wr_walk_descend() and mas_wr_node_walk()
            in turn once again.
      
               - Again, we reach the last entry in node C, which has a pivot of
                 0xffff.
      
      3. We then copy the elements to the left of 0x4000 in node B to the big
         node via mas_store_b_node(), and insert the new [0x4000, 0xffff] entry
         too.
      
      4. We determine whether we have any entries to copy from the right of the
         end of the range via the check described above - and with r_mas set up
         at the entry at pivot 0xffff, r_mas.offset <= r_mas.end holds, so we
         DUPLICATE the entry at pivot 0xffff.
      
      5. BUG! The maple tree is corrupted with a duplicate entry.
      
      This requires a very specific set of circumstances - we must be spanning
      the last element in a leaf node, which is the last element in the parent
      node.
      
      In other words, we must be performing a spanning store across two leaf
      nodes with a range that ends at that shared pivot.
      
      A potential solution to this problem would simply be to reset the walk
      each time we traverse r_mas, however given the rarity of this situation it
      seems that would be rather inefficient.
      
      Instead, this patch detects if the right hand node is populated, i.e.  has
      anything we need to copy.
      
      We do so by only copying elements from the right of the entry being
      inserted when the maximum value present exceeds the last, rather than
      basing this on offset position.
      
      The patch also updates some comments and eliminates the unused bool return
      value in mas_wr_walk_index().
      
      The work performed in commit f8d112a4 ("mm/mmap: avoid zeroing vma
      tree in mmap_region()") seems to have made the probability of this event
      much more likely, which is the point at which reports started to be
      submitted concerning this bug.
      
      The motivation for this change arose from Bert Karwatzki's report of
      encountering mm instability after the release of kernel v6.12-rc1 which,
      after the use of CONFIG_DEBUG_VM_MAPLE_TREE and similar configuration
      options, was identified as maple tree corruption.
      
      After Bert very generously provided his time and ability to reproduce this
      event consistently, I was able to finally identify that the issue
      discussed in this commit message was occurring for him.
      
      Link: https://lkml.kernel.org/r/cover.1728314402.git.lorenzo.stoakes@oracle.com
      Link: https://lkml.kernel.org/r/48b349a2a0f7c76e18772712d0997a5e12ab0a3b.1728314403.git.lorenzo.stoakes@oracle.com
      
      
      Fixes: 54a611b6 ("Maple Tree: add new data structure")
      Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
      Reported-by: Bert Karwatzki <spasswolf@web.de>
      Closes: https://lore.kernel.org/all/20241001023402.3374-1-spasswolf@web.de/
      Tested-by: Bert Karwatzki <spasswolf@web.de>
      Reported-by: Mikhail Gavrilov <mikhail.v.gavrilov@gmail.com>
      Closes: https://lore.kernel.org/all/CABXGCsOPwuoNOqSMmAvWO2Fz4TEmPnjFj-b7iF+XFRu1h7-+Dg@mail.gmail.com/
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Reviewed-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
      Tested-by: Mikhail Gavrilov <mikhail.v.gavrilov@gmail.com>
      Reviewed-by: Wei Yang <richard.weiyang@gmail.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      621f91dd
    • bootconfig: Fix the kerneldoc of _xbc_exit() · 91c256e6
      Masami Hiramatsu (Google) authored and Frieder Schrempf committed
      [ Upstream commit 298b871c ]
      
      Fix the kerneldoc of _xbc_exit(), which has been updated to take an
      @early argument and whose name has changed.
      
      Link: https://lore.kernel.org/all/171321744474.599864.13532445969528690358.stgit@devnote2/
      
      
      
      Reported-by: kernel test robot <lkp@intel.com>
      Closes: https://lore.kernel.org/oe-kbuild-all/202404150036.kPJ3HEFA-lkp@intel.com/
      Fixes: 89f9a1e8 ("bootconfig: use memblock_free_late to free xbc memory to buddy")
      Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      91c256e6
    • lib/buildid: harden build ID parsing logic · 5491300a
      Andrii Nakryiko authored and Frieder Schrempf committed
      
      [ Upstream commit 905415ff ]
      
      Harden build ID parsing logic, adding explicit READ_ONCE() where it's
      important to have a consistent value read and validated just once.
      
      Also, as pointed out by Andi Kleen, we need to make sure that entire ELF
      note is within a page bounds, so move the overflow check up and add an
      extra note_size boundaries validation.
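
      A userspace sketch of that hardening pattern (illustrative names; a
      volatile read stands in for the kernel's READ_ONCE()): the size field is
      read exactly once into a local, and that local copy is checked against
      the page boundary before any use.

        #include <stdint.h>
        #include <stdio.h>
        #include <string.h>

        #define PAGE_SZ 4096u

        static int copy_note_desc(const unsigned char *page, uint32_t offset,
                                  const volatile uint32_t *untrusted_desc_sz,
                                  unsigned char *out, size_t out_len)
        {
            uint32_t desc_sz = *untrusted_desc_sz;  /* read once, use the local copy */

            /* The whole note must stay within the page before we touch it. */
            if (desc_sz > out_len || offset > PAGE_SZ || desc_sz > PAGE_SZ - offset)
                return -1;

            memcpy(out, page + offset, desc_sz);
            return (int)desc_sz;
        }

        int main(void)
        {
            static unsigned char page[PAGE_SZ] = "build-id-bytes";
            uint32_t desc_sz = 14;
            unsigned char out[32];

            printf("copied %d bytes\n",
                   copy_note_desc(page, 0, &desc_sz, out, sizeof(out)));
            return 0;
        }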
      
      The Fixes tag below points to the commit that moved this code into
      lib/buildid.c; it was subsequently used in the perf subsystem, exposing
      this code to perf_event_open() users in v5.12+.
      
      Cc: stable@vger.kernel.org
      Reviewed-by: Eduard Zingerman <eddyz87@gmail.com>
      Reviewed-by: Jann Horn <jannh@google.com>
      Suggested-by: Andi Kleen <ak@linux.intel.com>
      Fixes: bd7525da ("bpf: Move stack_map_get_build_id into lib")
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/r/20240829174232.3133883-2-andrii@kernel.org
      
      
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      5491300a
    • build-id: require program headers to be right after ELF header · 6cdede39
      Alexey Dobriyan authored and Frieder Schrempf committed
      [ Upstream commit 961a2851 ]
      
      Neither the ELF spec nor the ELF loader requires the program header to be
      placed right after the ELF header, but the build-id code very much assumes
      such placement:
      
      See the
      
      	find_get_page(vma->vm_file->f_mapping, 0);
      
      line and the checks against PAGE_SIZE.
      
      Return errors for now until someone rewrites the build-id parser to be
      more in line with load_elf_binary().
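
      A userspace sketch of the placement assumption now being enforced
      (illustrative, not the kernel code): the parser only looks at the first
      page of the mapping, so it accepts the binary only if the program headers
      start immediately after the ELF header.

        #include <elf.h>
        #include <stdio.h>

        int main(void)
        {
            Elf64_Ehdr ehdr;
            FILE *f = fopen("/proc/self/exe", "rb");

            if (!f)
                return 1;
            if (fread(&ehdr, sizeof(ehdr), 1, f) != 1) {
                fclose(f);
                return 1;
            }
            fclose(f);

            if (ehdr.e_phoff == sizeof(Elf64_Ehdr))
                puts("program headers follow the ELF header: would be parsed");
            else
                puts("program headers live elsewhere: parser now returns an error");
            return 0;
        }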
      
      Link: https://lkml.kernel.org/r/d58bc281-6ca7-467a-9a64-40fa214bd63e@p183
      
      
      Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
      Reviewed-by: Jiri Olsa <jolsa@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Stable-dep-of: 905415ff ("lib/buildid: harden build ID parsing logic")
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      6cdede39
    • mm/filemap: optimize filemap folio adding · d1065196
      Kairui Song authored and Frieder Schrempf committed
      commit 6758c112 upstream.
      
      Instead of doing multiple tree walks, do one optimistic range check with
      the lock held, and exit if we raced with another insertion.  If a shadow
      exists, check it with the new xas_get_order helper before releasing the
      lock to avoid redundant tree walks for getting its order.
      
      Drop the lock and do the allocation only if a split is needed.
      
      In the best case, it only needs to walk the tree once.  If it needs to
      allocate and split, 3 walks are issued (one for the first ranged conflict
      check and order retrieval, one for the second check after allocation, and
      one for the insert after the split).
      
      Testing with 4K pages, in an 8G cgroup, with 16G brd as block device:
      
        echo 3 > /proc/sys/vm/drop_caches
      
        fio -name=cached --numjobs=16 --filename=/mnt/test.img \
          --buffered=1 --ioengine=mmap --rw=randread --time_based \
          --ramp_time=30s --runtime=5m --group_reporting
      
      Before:
      bw (  MiB/s): min= 1027, max= 3520, per=100.00%, avg=2445.02, stdev=18.90, samples=8691
      iops        : min=263001, max=901288, avg=625924.36, stdev=4837.28, samples=8691
      
      After (+7.3%):
      bw (  MiB/s): min=  493, max= 3947, per=100.00%, avg=2625.56, stdev=25.74, samples=8651
      iops        : min=126454, max=1010681, avg=672142.61, stdev=6590.48, samples=8651
      
      Test result with THP (do a THP randread then switch to 4K page in hope it
      issues a lot of splitting):
      
        echo 3 > /proc/sys/vm/drop_caches
      
        fio -name=cached --numjobs=16 --filename=/mnt/test.img \
            --buffered=1 --ioengine=mmap -thp=1 --readonly \
            --rw=randread --time_based --ramp_time=30s --runtime=10m \
            --group_reporting
      
        fio -name=cached --numjobs=16 --filename=/mnt/test.img \
            --buffered=1 --ioengine=mmap \
            --rw=randread --time_based --runtime=5s --group_reporting
      
      Before:
      bw (  KiB/s): min= 4141, max=14202, per=100.00%, avg=7935.51, stdev=96.85, samples=18976
      iops        : min= 1029, max= 3548, avg=1979.52, stdev=24.23, samples=18976
      
      READ: bw=4545B/s (4545B/s), 4545B/s-4545B/s (4545B/s-4545B/s), io=64.0KiB (65.5kB), run=14419-14419msec
      
      After (+12.5%):
      bw (  KiB/s): min= 4611, max=15370, per=100.00%, avg=8928.74, stdev=105.17, samples=19146
      iops        : min= 1151, max= 3842, avg=2231.27, stdev=26.29, samples=19146
      
      READ: bw=4635B/s (4635B/s), 4635B/s-4635B/s (4635B/s-4635B/s), io=64.0KiB (65.5kB), run=14137-14137msec
      
      The performance is better for both 4K (+7.5%) and THP (+12.5%) cached read.
      
      Link: https://lkml.kernel.org/r/20240415171857.19244-5-ryncsn@gmail.com
      
      
      Signed-off-by: Kairui Song <kasong@tencent.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Closes: https://lore.kernel.org/linux-mm/A5A976CB-DB57-4513-A700-656580488AB6@flyingcircus.io/
      [ kasong@tencent.com: minor adjustment of variable declarations ]
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      d1065196
    • lib/xarray: introduce a new helper xas_get_order · e54ffb48
      Kairui Song authored and Frieder Schrempf committed
      commit a4864671 upstream.
      
      It can be used after xas_load to check the order of loaded entries.
      Compared to xa_get_order, it saves an XA_STATE and avoids a rewalk.
      
      Add a new test for xas_get_order; to make the test work, we have to
      export xas_get_order with EXPORT_SYMBOL_GPL.
      
      Also fix a sparse warning by checking the slot value with xa_entry instead
      of accessing it directly, as suggested by Matthew Wilcox.
      
      [kasong@tencent.com: simplify comment, sparse warning fix, per Matthew Wilcox]
        Link: https://lkml.kernel.org/r/20240416071722.45997-4-ryncsn@gmail.com
      Link: https://lkml.kernel.org/r/20240415171857.19244-4-ryncsn@gmail.com
      
      
      Signed-off-by: Kairui Song <kasong@tencent.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Stable-dep-of: 6758c112 ("mm/filemap: optimize filemap folio adding")
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      e54ffb48
    • debugobjects: Fix conditions in fill_pool() · a8de7eb6
      Zhen Lei authored and Frieder Schrempf committed
      
      commit 684d28fe upstream.
      
      fill_pool() uses 'obj_pool_min_free' to decide whether objects should be
      handed back to the kmem cache. But 'obj_pool_min_free' records the lowest
      historical value of the number of objects in the object pool and not the
      minimum number of objects which should be kept in the pool.
      
      Use 'debug_objects_pool_min_level' instead, which holds the minimum number
      which was scaled to the number of CPUs at boot time.
      
      [ tglx: Massage change log ]
      
      Fixes: d26bf505 ("debugobjects: Reduce number of pool_lock acquisitions in fill_pool()")
      Fixes: 36c4ead6 ("debugobjects: Add global free list and the counter")
      Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: stable@vger.kernel.org
      Link: https://lore.kernel.org/all/20240904133944.2124-3-thunder.leizhen@huawei.com
      
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      a8de7eb6
    • lib/sbitmap: define swap_lock as raw_spinlock_t · 217bb0c8
      Ming Lei authored and Frieder Schrempf committed
      
      [ Upstream commit 65f666c6 ]
      
      When called from sbitmap_queue_get(), sbitmap_deferred_clear() may run
      with preemption disabled. In an RT kernel, spin_lock() can sleep, so the
      warning "BUG: sleeping function called from invalid context" can be
      triggered.
      
      Fix it by replacing the spinlock with a raw_spin_lock.
      
      Cc: Yang Yang <yang.yang@vivo.com>
      Fixes: 72d04bdc ("sbitmap: fix io hung due to race on sbitmap_word::cleared")
      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      Reviewed-by: Yang Yang <yang.yang@vivo.com>
      Link: https://lore.kernel.org/r/20240919021709.511329-1-ming.lei@redhat.com
      
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      217bb0c8
    • xz: cleanup CRC32 edits from 2018 · 6d52c015
      Lasse Collin authored and Frieder Schrempf committed
      [ Upstream commit 2ee96abe ]
      
      In 2018, a dependency on <linux/crc32poly.h> was added to avoid
      duplicating the same constant in multiple files.  Two months later it was
      found to be a bad idea and the definition of CRC32_POLY_LE macro was moved
      into xz_private.h to avoid including <linux/crc32poly.h>.
      
      xz_private.h is a wrong place for it too.  Revert back to the upstream
      version which has the poly in xz_crc32_init() in xz_crc32.c.
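
      For reference, a minimal userspace sketch of the table setup that
      xz_crc32_init() performs, with the little-endian polynomial written as a
      plain literal (illustrative code, not the kernel's xz_crc32.c).

        #include <stdint.h>
        #include <stdio.h>

        static uint32_t crc32_table[256];

        static void crc32_init(void)
        {
            const uint32_t poly = 0xEDB88320;  /* CRC32_POLY_LE */

            for (uint32_t i = 0; i < 256; i++) {
                uint32_t r = i;
                for (int j = 0; j < 8; j++)
                    r = (r >> 1) ^ (poly & -(r & 1));
                crc32_table[i] = r;
            }
        }

        static uint32_t crc32(const uint8_t *buf, size_t len)
        {
            uint32_t crc = ~0u;

            while (len--)
                crc = crc32_table[(crc ^ *buf++) & 0xFF] ^ (crc >> 8);
            return ~crc;
        }

        int main(void)
        {
            crc32_init();
            /* The standard check value for "123456789" is cbf43926. */
            printf("%08x\n", crc32((const uint8_t *)"123456789", 9));
            return 0;
        }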
      
      Link: https://lkml.kernel.org/r/20240721133633.47721-10-lasse.collin@tukaani.org
      
      
      Fixes: faa16bc4 ("lib: Use existing define with polynomial")
      Fixes: 242cdad8 ("lib/xz: Put CRC32_POLY_LE in xz_private.h")
      Signed-off-by: Lasse Collin <lasse.collin@tukaani.org>
      Reviewed-by: Sam James <sam@gentoo.org>
      Tested-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc)
      Cc: Krzysztof Kozlowski <krzk@kernel.org>
      Cc: Herbert Xu <herbert@gondor.apana.org.au>
      Cc: Joel Stanley <joel@jms.id.au>
      Cc: Albert Ou <aou@eecs.berkeley.edu>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Emil Renner Berthing <emil.renner.berthing@canonical.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Jubin Zhong <zhongjubin@huawei.com>
      Cc: Jules Maselbas <jmaselbas@zdiv.net>
      Cc: Palmer Dabbelt <palmer@dabbelt.com>
      Cc: Paul Walmsley <paul.walmsley@sifive.com>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Rui Li <me@lirui.org>
      Cc: Simon Glass <sjg@chromium.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      6d52c015
    • lib/generic-radix-tree.c: Fix rare race in __genradix_ptr_alloc() · abb231df
      Kent Overstreet authored and Frieder Schrempf committed
      
      [ Upstream commit b2f11c6f ]
      
      If we need to increase the tree depth, allocate a new node, and then
      race with another thread that increased the tree depth before us, we'll
      still have a preallocated node that might be used later.
      
      If we then use that node for a new non-root node, it'll still have a
      pointer to the old root instead of being zeroed - fix this by zeroing it
      in the cmpxchg failure path.
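
      A userspace sketch of the shape of the fix (illustrative types; C11
      atomics stand in for the kernel's cmpxchg): the loser of the root-install
      race keeps its preallocated node, so it must wipe the stale root pointer
      it wrote into that node before the node can be reused.

        #include <stdatomic.h>
        #include <stdio.h>
        #include <stdlib.h>
        #include <string.h>

        struct node {
            struct node *children[4];
        };

        static struct node *grow_tree(_Atomic(struct node *) *root_p,
                                      struct node *prealloc)
        {
            struct node *old = atomic_load(root_p);

            prealloc->children[0] = old;  /* new root points at the old one */

            if (!atomic_compare_exchange_strong(root_p, &old, prealloc)) {
                /* Lost the race: zero the node so a later non-root use does
                 * not keep a dangling pointer to the old root. */
                memset(prealloc, 0, sizeof(*prealloc));
                return old;               /* the winner's root */
            }
            return prealloc;
        }

        int main(void)
        {
            _Atomic(struct node *) root = NULL;
            struct node *n = calloc(1, sizeof(*n));

            if (!n)
                return 1;
            printf("root is now %p\n", (void *)grow_tree(&root, n));
            free(n);
            return 0;
        }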
      
      Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      abb231df
  3. Sep 17, 2024
    • bitmap: introduce generic optimized bitmap_size() · 3986b50d
      Alexander Lobakin authored
      
      commit a37fbe66 upstream.
      
      The number of times yet another open coded
      `BITS_TO_LONGS(nbits) * sizeof(long)` can be spotted is huge.
      Some generic helper is long overdue.
      
      Add one, bitmap_size(), but with one detail.
      BITS_TO_LONGS() uses DIV_ROUND_UP(). The latter works well when both
      dividend and divisor are compile-time constants or when the divisor
      is not a pow-of-2. When it is, however, the compilers sometimes tend
      to generate suboptimal code (GCC 13):
      
      48 83 c0 3f          	add    $0x3f,%rax
      48 c1 e8 06          	shr    $0x6,%rax
      48 8d 14 c5 00 00 00 00	lea    0x0(,%rax,8),%rdx
      
      %BITS_PER_LONG is always a pow-2 (either 32 or 64), but GCC still does
      full division of `nbits + 63` by it and then multiplication by 8.
      Instead of BITS_TO_LONGS(), use ALIGN() and then divide by 8. GCC:
      
      8d 50 3f             	lea    0x3f(%rax),%edx
      c1 ea 03             	shr    $0x3,%edx
      81 e2 f8 ff ff 1f    	and    $0x1ffffff8,%edx
      
      Now it shifts `nbits + 63` by 3 positions (IOW performs fast division
      by 8) and then masks bits[2:0]. bloat-o-meter:
      
      add/remove: 0/0 grow/shrink: 20/133 up/down: 156/-773 (-617)
      
      Clang does it better and generates the same code before/after starting
      from -O1, except that with the ALIGN() approach it uses %edx and thus
      still saves some bytes:
      
      add/remove: 0/0 grow/shrink: 9/133 up/down: 18/-538 (-520)
      
      Note that we can't expand DIV_ROUND_UP() by adding a check and using
      this approach there, as it's used in array declarations where
      expressions are not allowed.
      Add this helper to tools/ as well.
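
      A userspace sketch of the two formulations (the macro names mirror the
      kernel ones, but this is illustrative code): both compute the byte size
      of an nbits-wide bitmap; the ALIGN()-based form just lets the compiler
      use shift-and-mask instead of a full division.

        #include <stdio.h>

        #define BITS_PER_LONG      (8 * sizeof(long))
        #define DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))
        #define BITS_TO_LONGS(n)   DIV_ROUND_UP(n, BITS_PER_LONG)
        #define ALIGN(x, a)        (((x) + (a) - 1) & ~((a) - 1))

        /* The open-coded form the patch keeps replacing... */
        static unsigned long bitmap_bytes_old(unsigned long nbits)
        {
            return BITS_TO_LONGS(nbits) * sizeof(long);
        }

        /* ...and the bitmap_size()-style form: round nbits up to a multiple
         * of BITS_PER_LONG, then divide by 8. */
        static unsigned long bitmap_bytes_new(unsigned long nbits)
        {
            return ALIGN(nbits, BITS_PER_LONG) / 8;
        }

        int main(void)
        {
            for (unsigned long nbits = 1; nbits <= 130; nbits += 43)
                printf("%3lu bits -> %lu / %lu bytes\n", nbits,
                       bitmap_bytes_old(nbits), bitmap_bytes_new(nbits));
            return 0;
        }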
      
      Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
      Acked-by: Yury Norov <yury.norov@gmail.com>
      Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      3986b50d
  4. Aug 12, 2024
    • kobject_uevent: Fix OOB access within zap_modalias_env() · 75dddcbd
      Zijun Hu authored and Frieder Schrempf committed
      
      commit dd6e9894 upstream.
      
      zap_modalias_env() wrongly calculates the size of the memory block to
      move, which causes an OOB memory access if the MODALIAS variable is not
      the last one within its @env parameter. Fix it by passing the correct
      size to memmove().
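
      A userspace sketch of the corrected size calculation (illustrative buffer
      layout, not the kobject_uevent code): when removing one NUL-terminated
      entry from a packed env buffer, memmove() must be given the number of
      bytes that follow the entry, not the entry's own length.

        #include <stdio.h>
        #include <string.h>

        struct env {
            char buf[128];
            size_t len;         /* bytes used; entries are NUL-separated */
        };

        static void env_del(struct env *env, size_t off, size_t entry_len)
        {
            size_t tail = env->len - off - entry_len;  /* bytes after the entry */

            memmove(env->buf + off, env->buf + off + entry_len, tail);
            env->len -= entry_len;
        }

        int main(void)
        {
            const char *vars[] = { "ACTION=add", "MODALIAS=usb:v1D6Bp0002", "SEQNUM=7" };
            struct env env = { .len = 0 };

            for (size_t i = 0; i < 3; i++) {
                size_t l = strlen(vars[i]) + 1;
                memcpy(env.buf + env.len, vars[i], l);
                env.len += l;
            }

            /* Delete MODALIAS, which is *not* the last entry. */
            env_del(&env, strlen(vars[0]) + 1, strlen(vars[1]) + 1);

            for (size_t off = 0; off < env.len; off += strlen(env.buf + off) + 1)
                puts(env.buf + off);
            return 0;
        }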
      
      Fixes: 9b3fa47d ("kobject: fix suppressing modalias in uevents delivered over netlink")
      Cc: stable@vger.kernel.org
      Signed-off-by: Zijun Hu <quic_zijuhu@quicinc.com>
      Reviewed-by: Lk Sii <lk_sii@163.com>
      Link: https://lore.kernel.org/r/1717074877-11352-1-git-send-email-quic_zijuhu@quicinc.com
      
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      75dddcbd
    • decompress_bunzip2: fix rare decompression failure · 24b42355
      Ross Lagerwall authored and Frieder Schrempf committed
      commit bf6acd5d upstream.
      
      The decompression code parses a huffman tree and counts the number of
      symbols for a given bit length.  In rare cases, there may be >= 256
      symbols with a given bit length, causing the unsigned char to overflow.
      This causes a decompression failure later when the code tries and fails to
      find the bit length for a given symbol.
      
      Since the maximum number of symbols is 258, use unsigned short instead.
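
      A tiny sketch of the overflow class being fixed (illustrative, not the
      bunzip2 code itself): counting 258 symbols in an unsigned char wraps
      around, while an unsigned short keeps the real count.

        #include <stdio.h>

        int main(void)
        {
            unsigned char  narrow = 0;
            unsigned short wide = 0;

            for (int sym = 0; sym < 258; sym++) {  /* 258 symbols, one bit length */
                narrow++;
                wide++;
            }
            printf("unsigned char count:  %u (wrapped)\n", narrow);  /* 2 */
            printf("unsigned short count: %u\n", wide);              /* 258 */
            return 0;
        }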
      
      Link: https://lkml.kernel.org/r/20240717162016.1514077-1-ross.lagerwall@citrix.com
      
      
      Fixes: bc22c17e ("bzip2/lzma: library support for gzip, bzip2 and lzma decompression")
      Signed-off-by: Ross Lagerwall <ross.lagerwall@citrix.com>
      Cc: Alain Knaff <alain@knaff.lu>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      24b42355
    • sbitmap: fix io hung due to race on sbitmap_word::cleared · 6c035efa
      Yang Yang authored and Frieder Schrempf committed
      
      [ Upstream commit 72d04bdc ]
      
      Configuration for sbq:
        depth=64, wake_batch=6, shift=6, map_nr=1
      
      1. There are 64 requests in progress:
        map->word = 0xFFFFFFFFFFFFFFFF
      2. After all the 64 requests complete, and no more requests come:
        map->word = 0xFFFFFFFFFFFFFFFF, map->cleared = 0xFFFFFFFFFFFFFFFF
      3. Now two tasks try to allocate requests:
        T1:                                       T2:
        __blk_mq_get_tag                          .
        __sbitmap_queue_get                       .
        sbitmap_get                               .
        sbitmap_find_bit                          .
        sbitmap_find_bit_in_word                  .
        __sbitmap_get_word  -> nr=-1              __blk_mq_get_tag
        sbitmap_deferred_clear                    __sbitmap_queue_get
        /* map->cleared=0xFFFFFFFFFFFFFFFF */     sbitmap_find_bit
          if (!READ_ONCE(map->cleared))           sbitmap_find_bit_in_word
            return false;                         __sbitmap_get_word -> nr=-1
          mask = xchg(&map->cleared, 0)           sbitmap_deferred_clear
          atomic_long_andnot()                    /* map->cleared=0 */
                                                    if (!(map->cleared))
                                                      return false;
                                           /*
                                            * map->cleared is cleared by T1
                                            * T2 fail to acquire the tag
                                            */
      
      4. T2 is the sole tag waiter. When T1 puts the tag, T2 cannot be woken
      up because wake_batch is set to 6. If no more requests come, T2 will
      wait here indefinitely.
      
      This patch achieves two purposes:
      1. The check on ->cleared and the updates of both ->cleared and ->word
      need to be done atomically; using a spinlock is the simplest solution.
      2. Add an extra check in sbitmap_deferred_clear() to identify whether
      ->word has free bits.
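
      A userspace sketch of that shape (illustrative field names; a pthread
      spinlock stands in for the kernel's swap_lock): the test of ->cleared and
      the updates of ->cleared/->word happen under one lock, and the helper
      also reports success while ->word still has free bits.

        #include <pthread.h>
        #include <stdbool.h>
        #include <stdio.h>

        struct word {
            unsigned long word;      /* allocated bits */
            unsigned long cleared;   /* freed bits not yet folded back */
            unsigned int depth;      /* number of bits in use in this word */
            pthread_spinlock_t swap_lock;
        };

        static bool deferred_clear(struct word *map)
        {
            bool ret = false;

            pthread_spin_lock(&map->swap_lock);
            if (map->cleared) {
                map->word &= ~map->cleared;  /* fold cleared bits back */
                map->cleared = 0;
                ret = true;
            } else {
                /* Extra check: even with nothing to fold, free bits may remain. */
                unsigned long mask = map->depth < 64 ?
                                     (1UL << map->depth) - 1 : ~0UL;
                ret = (map->word & mask) != mask;
            }
            pthread_spin_unlock(&map->swap_lock);
            return ret;
        }

        int main(void)
        {
            struct word map = { .word = ~0UL, .cleared = ~0UL, .depth = 64 };

            pthread_spin_init(&map.swap_lock, PTHREAD_PROCESS_PRIVATE);
            printf("first caller:  free bits available? %d\n", deferred_clear(&map));
            printf("second caller: free bits available? %d\n", deferred_clear(&map));
            pthread_spin_destroy(&map.swap_lock);
            return 0;
        }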
      
      Fixes: ea86ea2c ("sbitmap: ammortize cost of clearing bits")
      Signed-off-by: Yang Yang <yang.yang@vivo.com>
      Reviewed-by: Ming Lei <ming.lei@redhat.com>
      Reviewed-by: Bart Van Assche <bvanassche@acm.org>
      Link: https://lore.kernel.org/r/20240716082644.659566-1-yang.yang@vivo.com
      
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      6c035efa
    • sbitmap: use READ_ONCE to access map->word · dc3ce853
      linke li authored and Frieder Schrempf committed
      
      [ Upstream commit 6ad0d7e0 ]
      
      In __sbitmap_queue_get_batch(), map->word is read several times and
      updated atomically using atomic_long_try_cmpxchg(). But the first two
      reads of map->word are not protected.
      
      This patch moves the statement val = READ_ONCE(map->word) forward,
      eliminating unprotected accesses to map->word within the function.
      It is aimed at reducing the number of benign races reported by KCSAN in
      order to focus future debugging effort on harmful races.
      
      Signed-off-by: linke li <lilinke99@qq.com>
      Link: https://lore.kernel.org/r/tencent_0B517C25E519D3D002194E8445E86C04AD0A@qq.com
      
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Stable-dep-of: 72d04bdc ("sbitmap: fix io hung due to race on sbitmap_word::cleared")
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      dc3ce853
    • sbitmap: rewrite sbitmap_find_bit_in_index to reduce repeat code · 1e2a3d09
      Kemeng Shi authored and Frieder Schrempf committed
      
      [ Upstream commit 08470a98 ]
      
      Rewrite sbitmap_find_bit_in_index as follows:
      1. Rename sbitmap_find_bit_in_index to sbitmap_find_bit_in_word
      2. Accept "struct sbitmap_word *" directly instead of accepting
      "struct sbitmap *" and "int index" to get "struct sbitmap_word *".
      3. Accept depth/shallow_depth and wrap for __sbitmap_get_word from the
      caller to support the needs of both __sbitmap_get_shallow and
      __sbitmap_get.
      
      With the helper function sbitmap_find_bit_in_word, we can remove the
      repeated code in __sbitmap_get_shallow that finds a bit while
      considering the deferred clear.
      
      Reviewed-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
      Link: https://lore.kernel.org/r/20230116205059.3821738-4-shikemeng@huaweicloud.com
      
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Stable-dep-of: 72d04bdc ("sbitmap: fix io hung due to race on sbitmap_word::cleared")
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      1e2a3d09
    • sbitmap: remove unnecessary calculation of alloc_hint in __sbitmap_get_shallow · 6ae74747
      Kemeng Shi authored and Frieder Schrempf committed
      
      [ Upstream commit f1591a8b ]
      
      Updates to alloc_hint in the loop in __sbitmap_get_shallow() are mostly
      pointless and equivalent to setting alloc_hint to zero (because
      SB_NR_TO_BIT() considers only low sb->shift bits from alloc_hint). So
      simplify the logic.
      
      Reviewed-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
      Link: https://lore.kernel.org/r/20230116205059.3821738-2-shikemeng@huaweicloud.com
      
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Stable-dep-of: 72d04bdc ("sbitmap: fix io hung due to race on sbitmap_word::cleared")
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      6ae74747
    • mlxsw: spectrum_acl_erp: Fix object nesting warning · 282435a6
      Ido Schimmel authored and Frieder Schrempf committed
      
      [ Upstream commit 97d833ce ]
      
      ACLs in Spectrum-2 and newer ASICs can reside in the algorithmic TCAM
      (A-TCAM) or in the ordinary circuit TCAM (C-TCAM). The former can
      contain more ACLs (i.e., tc filters), but the number of masks in each
      region (i.e., tc chain) is limited.
      
      In order to mitigate the effects of the above limitation, the device
      allows filters to share a single mask if their masks only differ in up
      to 8 consecutive bits. For example, dst_ip/25 can be represented using
      dst_ip/24 with a delta of 1 bit. The C-TCAM does not have a limit on the
      number of masks being used (and therefore does not support mask
      aggregation), but can contain a limited number of filters.
      
      The driver uses the "objagg" library to perform the mask aggregation by
      passing it objects that consist of the filter's mask and whether the
      filter is to be inserted into the A-TCAM or the C-TCAM since filters in
      different TCAMs cannot share a mask.
      
      The set of created objects is dependent on the insertion order of the
      filters and is not necessarily optimal. Therefore, the driver will
      periodically ask the library to compute a more optimal set ("hints") by
      looking at all the existing objects.
      
      When the library asks the driver whether two objects can be aggregated
      the driver only compares the provided masks and ignores the A-TCAM /
      C-TCAM indication. This is the right thing to do since the goal is to
      move as many filters as possible to the A-TCAM. The driver also forbids
      two identical masks from being aggregated since this can only happen if
      one was intentionally put in the C-TCAM to avoid a conflict in the
      A-TCAM.
      
      The above can result in the following set of hints:
      
      H1: {mask X, A-TCAM} -> H2: {mask Y, A-TCAM} // X is Y + delta
      H3: {mask Y, C-TCAM} -> H4: {mask Z, A-TCAM} // Y is Z + delta
      
      After getting the hints from the library the driver will start migrating
      filters from one region to another while consulting the computed hints
      and instructing the device to perform a lookup in both regions during
      the transition.
      
      Assuming a filter with mask X is being migrated into the A-TCAM in the
      new region, the hints lookup will return H1. Since H2 is the parent of
      H1, the library will try to find the object associated with it and
      create it if necessary in which case another hints lookup (recursive)
      will be performed. This hints lookup for {mask Y, A-TCAM} will either
      return H2 or H3 since the driver passes the library an object comparison
      function that ignores the A-TCAM / C-TCAM indication.
      
      This can eventually lead to nested objects which are not supported by
      the library [1].
      
      Fix by removing the object comparison function from both the driver and
      the library as the driver was the only user. That way the lookup will
      only return exact matches.
      
      I do not have a reliable reproducer that can reproduce the issue in a
      timely manner, but before the fix the issue would reproduce in several
      minutes and with the fix it does not reproduce in over an hour.
      
      Note that the current usefulness of the hints is limited because they
      include the C-TCAM indication and represent aggregation that cannot
      actually happen. This will be addressed in net-next.
      
      [1]
      WARNING: CPU: 0 PID: 153 at lib/objagg.c:170 objagg_obj_parent_assign+0xb5/0xd0
      Modules linked in:
      CPU: 0 PID: 153 Comm: kworker/0:18 Not tainted 6.9.0-rc6-custom-g70fbc2c1c38b #42
      Hardware name: Mellanox Technologies Ltd. MSN3700C/VMOD0008, BIOS 5.11 10/10/2018
      Workqueue: mlxsw_core mlxsw_sp_acl_tcam_vregion_rehash_work
      RIP: 0010:objagg_obj_parent_assign+0xb5/0xd0
      [...]
      Call Trace:
       <TASK>
       __objagg_obj_get+0x2bb/0x580
       objagg_obj_get+0xe/0x80
       mlxsw_sp_acl_erp_mask_get+0xb5/0xf0
       mlxsw_sp_acl_atcam_entry_add+0xe8/0x3c0
       mlxsw_sp_acl_tcam_entry_create+0x5e/0xa0
       mlxsw_sp_acl_tcam_vchunk_migrate_one+0x16b/0x270
       mlxsw_sp_acl_tcam_vregion_rehash_work+0xbe/0x510
       process_one_work+0x151/0x370
      
      Fixes: 9069a381 ("lib: objagg: implement optimization hints assembly and use hints for object creation")
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Reviewed-by: Amit Cohen <amcohen@nvidia.com>
      Tested-by: Alexander Zubkov <green@qrator.net>
      Signed-off-by: Petr Machata <petrm@nvidia.com>
      Reviewed-by: Simon Horman <horms@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      282435a6
    • lib: objagg: Fix general protection fault · 0117d0cb
      Ido Schimmel authored and Frieder Schrempf committed
      
      [ Upstream commit b4a3a89f ]
      
      The library supports aggregation of objects into other objects only if
      the parent object does not have a parent itself. That is, nesting is not
      supported.
      
      Aggregation happens in two cases: Without and with hints, where hints
      are a pre-computed recommendation on how to aggregate the provided
      objects.
      
      Nesting is not possible in the first case due to a check that prevents
      it, but in the second case there is no check because the assumption is
      that nesting cannot happen when creating objects based on hints. The
      violation of this assumption leads to various warnings and eventually to
      a general protection fault [1].
      
      Before fixing the root cause, error out when nesting happens and warn.
      
      [1]
      general protection fault, probably for non-canonical address 0xdead000000000d90: 0000 [#1] PREEMPT SMP PTI
      CPU: 1 PID: 1083 Comm: kworker/1:9 Tainted: G        W          6.9.0-rc6-custom-gd9b4f1cca7fb #7
      Hardware name: Mellanox Technologies Ltd. MSN3700/VMOD0005, BIOS 5.11 01/06/2019
      Workqueue: mlxsw_core mlxsw_sp_acl_tcam_vregion_rehash_work
      RIP: 0010:mlxsw_sp_acl_erp_bf_insert+0x25/0x80
      [...]
      Call Trace:
       <TASK>
       mlxsw_sp_acl_atcam_entry_add+0x256/0x3c0
       mlxsw_sp_acl_tcam_entry_create+0x5e/0xa0
       mlxsw_sp_acl_tcam_vchunk_migrate_one+0x16b/0x270
       mlxsw_sp_acl_tcam_vregion_rehash_work+0xbe/0x510
       process_one_work+0x151/0x370
       worker_thread+0x2cb/0x3e0
       kthread+0xd0/0x100
       ret_from_fork+0x34/0x50
       ret_from_fork_asm+0x1a/0x30
       </TASK>
      
      Fixes: 9069a381 ("lib: objagg: implement optimization hints assembly and use hints for object creation")
      Reported-by: Alexander Zubkov <green@qrator.net>
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Reviewed-by: Amit Cohen <amcohen@nvidia.com>
      Tested-by: Alexander Zubkov <green@qrator.net>
      Signed-off-by: Petr Machata <petrm@nvidia.com>
      Reviewed-by: Simon Horman <horms@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      0117d0cb
    • kunit: Fix timeout message · 9a3f95ca
      Mickaël Salaün authored and Frieder Schrempf committed
      
      [ Upstream commit 53026ff6 ]
      
      The exit code is always checked, so let's properly handle the -ETIMEDOUT
      error code.
      
      Cc: Brendan Higgins <brendanhiggins@google.com>
      Cc: Shuah Khan <skhan@linuxfoundation.org>
      Reviewed-by: Kees Cook <keescook@chromium.org>
      Reviewed-by: David Gow <davidgow@google.com>
      Reviewed-by: Rae Moar <rmoar@google.com>
      Signed-off-by: Mickaël Salaün <mic@digikod.net>
      Link: https://lore.kernel.org/r/20240408074625.65017-4-mic@digikod.net
      
      Signed-off-by: Shuah Khan <skhan@linuxfoundation.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      9a3f95ca
  5. Jul 11, 2024
  6. Jun 12, 2024
  7. May 13, 2024
    • stackdepot: respect __GFP_NOLOCKDEP allocation flag · 47f25124
      Andrey Ryabinin authored and Frieder Schrempf committed
      commit 6fe60465 upstream.
      
      If stack_depot_save_flags() allocates memory it always drops the
      __GFP_NOLOCKDEP flag.  So when KASAN tries to track a __GFP_NOLOCKDEP
      allocation we may end up with a lockdep splat like below:
      
      ======================================================
       WARNING: possible circular locking dependency detected
       6.9.0-rc3+ #49 Not tainted
       ------------------------------------------------------
       kswapd0/149 is trying to acquire lock:
       ffff88811346a920
      (&xfs_nondir_ilock_class){++++}-{4:4}, at: xfs_reclaim_inode+0x3ac/0x590
      [xfs]
      
       but task is already holding lock:
       ffffffff8bb33100 (fs_reclaim){+.+.}-{0:0}, at:
      balance_pgdat+0x5d9/0xad0
      
       which lock already depends on the new lock.
      
       the existing dependency chain (in reverse order) is:
       -> #1 (fs_reclaim){+.+.}-{0:0}:
              __lock_acquire+0x7da/0x1030
              lock_acquire+0x15d/0x400
              fs_reclaim_acquire+0xb5/0x100
       prepare_alloc_pages.constprop.0+0xc5/0x230
              __alloc_pages+0x12a/0x3f0
              alloc_pages_mpol+0x175/0x340
              stack_depot_save_flags+0x4c5/0x510
              kasan_save_stack+0x30/0x40
              kasan_save_track+0x10/0x30
              __kasan_slab_alloc+0x83/0x90
              kmem_cache_alloc+0x15e/0x4a0
              __alloc_object+0x35/0x370
              __create_object+0x22/0x90
       __kmalloc_node_track_caller+0x477/0x5b0
              krealloc+0x5f/0x110
              xfs_iext_insert_raw+0x4b2/0x6e0 [xfs]
              xfs_iext_insert+0x2e/0x130 [xfs]
              xfs_iread_bmbt_block+0x1a9/0x4d0 [xfs]
              xfs_btree_visit_block+0xfb/0x290 [xfs]
              xfs_btree_visit_blocks+0x215/0x2c0 [xfs]
              xfs_iread_extents+0x1a2/0x2e0 [xfs]
       xfs_buffered_write_iomap_begin+0x376/0x10a0 [xfs]
              iomap_iter+0x1d1/0x2d0
       iomap_file_buffered_write+0x120/0x1a0
              xfs_file_buffered_write+0x128/0x4b0 [xfs]
              vfs_write+0x675/0x890
              ksys_write+0xc3/0x160
              do_syscall_64+0x94/0x170
       entry_SYSCALL_64_after_hwframe+0x71/0x79
      
      Always preserve __GFP_NOLOCKDEP to fix this.
      
      Link: https://lkml.kernel.org/r/20240418141133.22950-1-ryabinin.a.a@gmail.com
      
      
      Fixes: cd11016e ("mm, kasan: stackdepot implementation. Enable stackdepot for SLAB")
      Signed-off-by: Andrey Ryabinin <ryabinin.a.a@gmail.com>
      Reported-by: Xiubo Li <xiubli@redhat.com>
      Closes: https://lore.kernel.org/all/a0caa289-ca02-48eb-9bf2-d86fd47b71f4@redhat.com/
      Reported-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
      Closes: https://lore.kernel.org/all/f9ff999a-e170-b66b-7caf-293f2b147ac2@opensource.wdc.com/
      Suggested-by: Dave Chinner <david@fromorbit.com>
      Tested-by: Xiubo Li <xiubli@redhat.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      47f25124
    • bootconfig: use memblock_free_late to free xbc memory to buddy · 4fc96063
      Qiang Zhang authored and Frieder Schrempf committed
      commit 89f9a1e8 upstream.
      
      By the time xbc memory is freed in xbc_exit(), memblock may have already
      handed the memory over to the buddy allocator, so it doesn't make sense
      to free the memory back to memblock. memblock_free() called by xbc_exit()
      even causes UAF bugs on architectures with CONFIG_ARCH_KEEP_MEMBLOCK
      disabled, such as x86. The following KASAN log shows this case.
      
      This patch fixes the xbc memory free problem by calling memblock_free()
      in the early xbc init error rewind path and calling memblock_free_late()
      in the xbc exit path to free memory to the buddy allocator.
      
      [    9.410890] ==================================================================
      [    9.418962] BUG: KASAN: use-after-free in memblock_isolate_range+0x12d/0x260
      [    9.426850] Read of size 8 at addr ffff88845dd30000 by task swapper/0/1
      
      [    9.435901] CPU: 9 PID: 1 Comm: swapper/0 Tainted: G     U             6.9.0-rc3-00208-g586b5dfb51b9 #5
      [    9.446403] Hardware name: Intel Corporation RPLP LP5 (CPU:RaptorLake)/RPLP LP5 (ID:13), BIOS IRPPN02.01.01.00.00.19.015.D-00000000 Dec 28 2023
      [    9.460789] Call Trace:
      [    9.463518]  <TASK>
      [    9.465859]  dump_stack_lvl+0x53/0x70
      [    9.469949]  print_report+0xce/0x610
      [    9.473944]  ? __virt_addr_valid+0xf5/0x1b0
      [    9.478619]  ? memblock_isolate_range+0x12d/0x260
      [    9.483877]  kasan_report+0xc6/0x100
      [    9.487870]  ? memblock_isolate_range+0x12d/0x260
      [    9.493125]  memblock_isolate_range+0x12d/0x260
      [    9.498187]  memblock_phys_free+0xb4/0x160
      [    9.502762]  ? __pfx_memblock_phys_free+0x10/0x10
      [    9.508021]  ? mutex_unlock+0x7e/0xd0
      [    9.512111]  ? __pfx_mutex_unlock+0x10/0x10
      [    9.516786]  ? kernel_init_freeable+0x2d4/0x430
      [    9.521850]  ? __pfx_kernel_init+0x10/0x10
      [    9.526426]  xbc_exit+0x17/0x70
      [    9.529935]  kernel_init+0x38/0x1e0
      [    9.533829]  ? _raw_spin_unlock_irq+0xd/0x30
      [    9.538601]  ret_from_fork+0x2c/0x50
      [    9.542596]  ? __pfx_kernel_init+0x10/0x10
      [    9.547170]  ret_from_fork_asm+0x1a/0x30
      [    9.551552]  </TASK>
      
      [    9.555649] The buggy address belongs to the physical page:
      [    9.561875] page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x1 pfn:0x45dd30
      [    9.570821] flags: 0x200000000000000(node=0|zone=2)
      [    9.576271] page_type: 0xffffffff()
      [    9.580167] raw: 0200000000000000 ffffea0011774c48 ffffea0012ba1848 0000000000000000
      [    9.588823] raw: 0000000000000001 0000000000000000 00000000ffffffff 0000000000000000
      [    9.597476] page dumped because: kasan: bad access detected
      
      [    9.605362] Memory state around the buggy address:
      [    9.610714]  ffff88845dd2ff00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
      [    9.618786]  ffff88845dd2ff80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
      [    9.626857] >ffff88845dd30000: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
      [    9.634930]                    ^
      [    9.638534]  ffff88845dd30080: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
      [    9.646605]  ffff88845dd30100: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
      [    9.654675] ==================================================================
      
      Link: https://lore.kernel.org/all/20240414114944.1012359-1-qiang4.zhang@linux.intel.com/
      
      
      
      Fixes: 40caa127 ("init: bootconfig: Remove all bootconfig data when the init memory is removed")
      Cc: Stable@vger.kernel.org
      Signed-off-by: Qiang Zhang <qiang4.zhang@intel.com>
      Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
      Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      4fc96063
  8. Apr 11, 2024