  1. Aug 02, 2024
  2. Jun 26, 2024
  3. Dec 08, 2023
  4. May 23, 2023
  5. Apr 18, 2023
  6. Apr 04, 2023
  7. Mar 05, 2023
    • cpumask: re-introduce constant-sized cpumask optimizations · 596ff4a0
      Linus Torvalds authored
      
      Commit aa47a7c2 ("lib/cpumask: deprecate nr_cpumask_bits") resulted
      in the cpumask operations potentially becoming hugely less efficient,
      because suddenly the cpumask was always considered to be variable-sized.
      
      The optimization was then later added back in a limited form by commit
      6f9c07be ("lib/cpumask: add FORCE_NR_CPUS config option"), but that
      FORCE_NR_CPUS option is not useful in a generic kernel and is more of a
      special case for embedded situations with fixed hardware.
      
      Instead, just re-introduce the optimization, with some changes.
      
      Instead of depending on CPUMASK_OFFSTACK being false, and then always
      using the full constant cpumask width, this introduces three different
      cpumask "sizes":
      
       - the exact size (nr_cpumask_bits) remains identical to nr_cpu_ids.
      
         This is used for situations where we should use the exact size.
      
       - the "small" size (small_cpumask_bits) is the NR_CPUS constant if it
         fits in a single word and the bitmap operations thus end up able
         to trigger the "small_const_nbits()" optimizations.
      
         This is used for the operations that have optimized single-word
         cases that get inlined, notably the bit find and scanning functions.
      
       - the "large" size (large_cpumask_bits) is the NR_CPUS constant if it
         is an sufficiently small constant that makes simple "copy" and
         "clear" operations more efficient.
      
         This is arbitrarily set at four words or less.
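
      As a rough sketch (not necessarily the verbatim upstream code), the three
      sizes could be selected like this, assuming the usual NR_CPUS and
      BITS_PER_LONG constants:

              #if NR_CPUS <= BITS_PER_LONG
                #define small_cpumask_bits ((unsigned int)NR_CPUS)
                #define large_cpumask_bits ((unsigned int)NR_CPUS)
              #elif NR_CPUS <= 4*BITS_PER_LONG
                #define small_cpumask_bits nr_cpu_ids
                #define large_cpumask_bits ((unsigned int)NR_CPUS)
              #else
                #define small_cpumask_bits nr_cpu_ids
                #define large_cpumask_bits nr_cpu_ids
              #endif
              #define nr_cpumask_bits nr_cpu_ids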
      
      As an example of this situation, without this fixed-size optimization,
      cpumask_clear() will generate code like
      
              movl    nr_cpu_ids(%rip), %edx
              addq    $63, %rdx
              shrq    $3, %rdx
              andl    $-8, %edx
              callq   memset@PLT
      
      on x86-64, because it would calculate the "exact" number of longwords
      that need to be cleared.
      
      In contrast, with this patch, using an NR_CPUS of 64 (which is quite a
      reasonable value to use), the above becomes a single
      
      	movq $0,cpumask
      
      instruction instead, because instead of having to figure out exactly how
      many CPUs the system has, it just knows that the cpumask will be a
      single word and can just clear it all.
      
      Note that this does end up tightening the rules a bit from the original
      version in another way: operations that set bits in the cpumask are now
      limited to the actual nr_cpu_ids limit, whereas we used to do the
      nr_cpumask_bits thing almost everywhere in the cpumask code.
      
      But if you just clear bits, or scan for bits, we can use the simpler
      compile-time constants.
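
      As a simplified sketch of how the helpers can use the different sizes
      (assuming the three size macros described above; not the verbatim
      upstream code):

              /* Clearing extra tail bits is harmless, so the "large" size is fine. */
              static inline void cpumask_clear(struct cpumask *dstp)
              {
                      bitmap_zero(cpumask_bits(dstp), large_cpumask_bits);
              }

              /* Scanning benefits from the single-word small_const_nbits() path. */
              static inline unsigned int cpumask_first(const struct cpumask *srcp)
              {
                      return find_first_bit(cpumask_bits(srcp), small_cpumask_bits);
              }

              /* Setting bits must stay within the real nr_cpu_ids limit. */
              static inline void cpumask_set_cpu(unsigned int cpu, struct cpumask *dstp)
              {
                      set_bit(cpumask_check(cpu), cpumask_bits(dstp));
              }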
      
      In the process, remove 'cpumask_complement()' and 'for_each_cpu_not()'
      which were not useful, and which fundamentally have to be limited to
      'nr_cpu_ids'.  Better remove them now than have somebody introduce use
      of them later.
      
      Of course, on x86-64 with MAXSMP there is no sane small compile-time
      constant for the cpumask sizes, and we end up using the actual CPU bits,
      and will generate the above kind of horrors regardless.  Please don't
      use MAXSMP unless you really expect to have machines with thousands of
      cores.
      
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      596ff4a0
  8. Jan 22, 2023
  9. Dec 02, 2022
  10. Dec 01, 2022
    • inet: ping: use hlist_nulls rcu iterator during lookup · c25b7a7a
      Florian Westphal authored
      
      ping_lookup() does not acquire the table spinlock, so iteration should
      use hlist_nulls_for_each_entry_rcu().
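
      A minimal sketch of the intended lookup loop (hslot stands in for the
      per-bucket list head, and the sk_nulls_node member is an assumption
      about how the sockets are chained):

              struct sock *sk;
              struct hlist_nulls_node *hnode;

              rcu_read_lock();
              hlist_nulls_for_each_entry_rcu(sk, hnode, hslot, sk_nulls_node) {
                      /* compare ident and addresses, take a reference on a match */
              }
              rcu_read_unlock();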
      
      Spotted during code review.
      
      Fixes: dbca1596 ("ping: convert to RCU lookups, get rid of rwlock")
      Cc: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Link: https://lore.kernel.org/r/20221129140644.28525-1-fw@strlen.de
      
      
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      c25b7a7a
    • iommufd: Data structure to provide IOVA to PFN mapping · 51fe6141
      Jason Gunthorpe authored
      This is the remainder of the IOAS data structure. Provide an object called
      an io_pagetable that is composed of iopt_areas pointing at iopt_pages,
      along with a list of iommu_domains that mirror the IOVA to PFN map.
      
      At the top this is a simple interval tree of iopt_areas indicating the map
      of IOVA to iopt_pages. An xarray keeps track of a list of domains. Based
      on the attached domains there is a minimum alignment for areas (which may
      be smaller than PAGE_SIZE), an interval tree of reserved IOVA that can't
      be mapped, and an interval tree of allowed IOVA that can always be mapped.
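
      A simplified sketch of the object described above (field names are
      illustrative rather than the exact upstream layout):

              struct io_pagetable {
                      struct rw_semaphore domains_rwsem;
                      struct xarray domains;                /* attached iommu_domains */

                      struct rw_semaphore iova_rwsem;
                      struct rb_root_cached area_itree;     /* iopt_areas: IOVA -> iopt_pages */
                      struct rb_root_cached reserved_itree; /* IOVA that can never be mapped */
                      struct rb_root_cached allowed_itree;  /* IOVA that is always mappable */
                      unsigned long iova_alignment;         /* minimum from attached domains */
              };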
      
      The concept of an 'access' refers to something like a VFIO mdev that is
      accessing the IOVA and using a 'struct page *' for CPU based access.
      
      Externally an API is provided that matches the requirements of the IOCTL
      interface for map/unmap and domain attachment.
      
      The API provides a 'copy' primitive to establish a new IOVA map in a
      different IOAS from an existing mapping by re-using the iopt_pages. This
      is the basic mechanism to provide single pinning.
      
      This is designed to support a pre-registration flow where userspace would
      set up a dummy IOAS with no domains, map in memory and then establish an
      access to pin all PFNs into the xarray.
      
      Copy can then be used to create new IOVA mappings in a different IOAS,
      with iommu_domains attached. Upon copy the PFNs will be read out of the
      xarray and mapped into the iommu_domains, avoiding any pin_user_pages()
      overheads.
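
      A hedged userspace sketch of that flow; the ioctl and field names below
      (IOMMU_IOAS_MAP, IOMMU_IOAS_COPY and their structs) are assumptions used
      only to illustrate the map-once-then-copy idea:

              #include <stddef.h>
              #include <stdint.h>
              #include <sys/ioctl.h>
              #include <linux/iommufd.h>      /* assumed UAPI header */

              /* Map a buffer once into a domain-less IOAS, then copy the mapping
               * into an IOAS with iommu_domains attached, reusing the pinned PFNs. */
              static int demo_preregister_and_copy(int iommufd, uint32_t prereg_ioas,
                                                   uint32_t dma_ioas, void *buf, size_t len)
              {
                      struct iommu_ioas_map map = {
                              .size = sizeof(map),
                              .flags = IOMMU_IOAS_MAP_READABLE | IOMMU_IOAS_MAP_WRITEABLE,
                              .ioas_id = prereg_ioas,
                              .user_va = (uintptr_t)buf,
                              .length = len,
                      };
                      if (ioctl(iommufd, IOMMU_IOAS_MAP, &map))      /* pins the PFNs once */
                              return -1;

                      struct iommu_ioas_copy copy = {
                              .size = sizeof(copy),
                              .flags = IOMMU_IOAS_MAP_READABLE | IOMMU_IOAS_MAP_WRITEABLE,
                              .src_ioas_id = prereg_ioas,
                              .src_iova = map.iova,
                              .dst_ioas_id = dma_ioas,
                              .length = len,
                      };
                      return ioctl(iommufd, IOMMU_IOAS_COPY, &copy); /* no pin_user_pages() here */
              }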
      
      Link: https://lore.kernel.org/r/10-v6-a196d26f289e+11787-iommufd_jgg@nvidia.com
      
      
      Tested-by: Nicolin Chen <nicolinc@nvidia.com>
      Tested-by: Yi Liu <yi.l.liu@intel.com>
      Tested-by: Lixiao Yang <lixiao.yang@intel.com>
      Tested-by: Matthew Rosato <mjrosato@linux.ibm.com>
      Reviewed-by: Kevin Tian <kevin.tian@intel.com>
      Signed-off-by: Yi Liu <yi.l.liu@intel.com>
      Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
      Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
      51fe6141
    • iommufd: PFN handling for iopt_pages · f394576e
      Jason Gunthorpe authored
      The top of the data structure provides an IO Address Space (IOAS) that is
      similar to a VFIO container. The IOAS allows map/unmap of memory into
      ranges of IOVA called iopt_areas. Multiple IOMMU domains (IO page tables)
      and in-kernel accesses (like VFIO mdevs) can be attached to the IOAS to
      access the PFNs that those IOVA areas cover.
      
      The IO Address Space (IOAS) data structure is composed of:
       - struct io_pagetable holding the IOVA map
       - struct iopt_areas representing populated portions of IOVA
       - struct iopt_pages representing the storage of PFNs
       - struct iommu_domain representing each IO page table in the system IOMMU
       - struct iopt_pages_access representing in-kernel accesses of PFNs (i.e.
         VFIO mdevs)
       - struct xarray pinned_pfns holding a list of pages pinned by in-kernel
         accesses
      
      This patch introduces the lowest part of the data structure - the movement
      of PFNs in a tiered storage scheme:
       1) iopt_pages::pinned_pfns xarray
       2) Multiple iommu_domains
       3) The origin of the PFNs, i.e. the userspace pointer
      
      PFNs have to be copied between all combinations of tiers, depending on the
      configuration.
      
      The interface is an iterator called a 'pfn_reader' which determines in
      which tier each PFN is stored and loads it into a list of PFNs held in a
      struct pfn_batch.
      
      Each step of the iterator will fill up the pfn_batch, then the caller can
      use the pfn_batch to send the PFNs to the required destination. Repeating
      this loop will read all the PFNs in an IOVA range.
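
      A sketch of that pattern (pfn_reader and pfn_batch come from the
      description above; the helper names pfn_reader_first/next/done and the
      pages/start_index/last_index arguments are assumptions):

              struct pfn_reader pfns;
              int rc;

              for (rc = pfn_reader_first(&pfns, pages, start_index, last_index);
                   !rc && !pfn_reader_done(&pfns); rc = pfn_reader_next(&pfns)) {
                      /* consume pfns.batch, e.g. map it into an iommu_domain */
              }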
      
      The pfn_reader and pfn_batch also keep track of the pinned page accounting.
      
      While PFNs are always stored and accessed as full PAGE_SIZE units, the
      iommu_domain tier can store them with a sub-page offset/length to support
      IOMMUs with a smaller IOPTE size than PAGE_SIZE.
      
      Link: https://lore.kernel.org/r/8-v6-a196d26f289e+11787-iommufd_jgg@nvidia.com
      
      
      Reviewed-by: Kevin Tian <kevin.tian@intel.com>
      Tested-by: Nicolin Chen <nicolinc@nvidia.com>
      Tested-by: Yi Liu <yi.l.liu@intel.com>
      Tested-by: Lixiao Yang <lixiao.yang@intel.com>
      Tested-by: Matthew Rosato <mjrosato@linux.ibm.com>
      Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
      f394576e
  11. Nov 29, 2022
  12. Jul 19, 2022
  13. May 20, 2022
  14. Dec 16, 2021
  15. May 12, 2021
  16. Feb 17, 2021
    • cxl/mem: Add basic IOCTL interface · 583fa5e7
      Ben Widawsky authored
      
      Add a straightforward IOCTL that provides a mechanism for userspace to
      query the supported memory device commands. CXL commands as they appear
      to userspace are described as part of the UAPI kerneldoc. The command
      list returned via this IOCTL will contain the full set of commands that
      the driver supports; however, some of those commands may not be
      available for use by userspace.
      
      Memory device commands first appear in the CXL 2.0 specification. They
      are submitted through a mailbox mechanism specified in the CXL 2.0
      specification.
      
      The send command allows userspace to issue mailbox commands directly to
      the hardware. The list of available commands to send is the output of
      the query command. The driver verifies basic properties of the command
      and may inspect the input (or output) payload to determine whether
      the command is allowed (or might taint the kernel).
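
      A hedged userspace sketch of the query flow; the CXL_MEM_QUERY_COMMANDS
      ioctl and the structure fields used below are assumptions made only for
      illustration:

              #include <stdlib.h>
              #include <sys/ioctl.h>
              #include <linux/cxl_mem.h>      /* assumed UAPI header */

              /* Call once with n_commands == 0 to learn the count, then again
               * with a buffer sized for the full command list. */
              static struct cxl_mem_query_commands *query_commands(int memdev_fd)
              {
                      struct cxl_mem_query_commands probe = { .n_commands = 0 };
                      struct cxl_mem_query_commands *cmds;

                      if (ioctl(memdev_fd, CXL_MEM_QUERY_COMMANDS, &probe))
                              return NULL;

                      cmds = calloc(1, sizeof(*cmds) +
                                       probe.n_commands * sizeof(cmds->commands[0]));
                      if (!cmds)
                              return NULL;
                      cmds->n_commands = probe.n_commands;

                      if (ioctl(memdev_fd, CXL_MEM_QUERY_COMMANDS, cmds)) {
                              free(cmds);
                              return NULL;
                      }
                      return cmds;    /* caller inspects cmds->commands[i].id etc. */
              }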
      
      Reported-by: kernel test robot <lkp@intel.com> # bug in earlier revision
      Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: Ben Widawsky <ben.widawsky@intel.com>
      Reviewed-by: Dan Williams <dan.j.williams@intel.com> (v2)
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Link: https://lore.kernel.org/r/20210217040958.1354670-5-ben.widawsky@intel.com
      
      
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      583fa5e7
  17. Jan 29, 2021
  18. Oct 14, 2020
    • memblock: use separate iterators for memory and reserved regions · cc6de168
      Mike Rapoport authored
      
      for_each_memblock() is used to iterate over memblock.memory in a few
      places that use data from memblock_region rather than the memory ranges.
      
      Introduce separate for_each_mem_region() and
      for_each_reserved_mem_region() to improve encapsulation of memblock
      internals from its users.
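
      A short sketch of a caller that genuinely needs the region descriptor,
      using the new iterator:

              struct memblock_region *rgn;

              for_each_mem_region(rgn) {
                      /* per-region work that needs memblock_region data */
                      pr_debug("memory region: %pa + %pa\n", &rgn->base, &rgn->size);
              }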
      
      Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Baoquan He <bhe@redhat.com>
      Acked-by: Ingo Molnar <mingo@kernel.org>			[x86]
      Acked-by: Thomas Bogendoerfer <tsbogend@alpha.franken.de>	[MIPS]
      Acked-by: Miguel Ojeda <miguel.ojeda.sandonis@gmail.com>	[.clang-format]
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Daniel Axtens <dja@axtens.net>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Emil Renner Berthing <kernel@esmil.dk>
      Cc: Hari Bathini <hbathini@linux.ibm.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
      Cc: Marek Szyprowski <m.szyprowski@samsung.com>
      Cc: Max Filippov <jcmvbkbc@gmail.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Michal Simek <monstr@monstr.eu>
      Cc: Palmer Dabbelt <palmer@dabbelt.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Paul Walmsley <paul.walmsley@sifive.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Stafford Horne <shorne@gmail.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will@kernel.org>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Link: https://lkml.kernel.org/r/20200818151634.14343-18-rppt@kernel.org
      
      
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      cc6de168
    • memblock: implement for_each_reserved_mem_region() using __next_mem_region() · 9f3d5eaa
      Mike Rapoport authored
      
      Iteration over memblock.reserved with for_each_reserved_mem_region() used
      __next_reserved_mem_region() that implemented a subset of
      __next_mem_region().
      
      Use __for_each_mem_range() and, essentially, __next_mem_region() with
      appropriate parameters to reduce code duplication.
      
      While at it, rename for_each_reserved_mem_region() to
      for_each_reserved_mem_range() for consistency.
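
      A sketch of the renamed iterator in use (the range form yields start/end
      addresses rather than a memblock_region):

              phys_addr_t start, end;
              u64 i;

              for_each_reserved_mem_range(i, &start, &end) {
                      /* e.g. mark [start, end) as in use in another allocator's map */
                      pr_debug("reserved: [%pa-%pa]\n", &start, &end);
              }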
      
      Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Miguel Ojeda <miguel.ojeda.sandonis@gmail.com>	[.clang-format]
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Daniel Axtens <dja@axtens.net>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Emil Renner Berthing <kernel@esmil.dk>
      Cc: Hari Bathini <hbathini@linux.ibm.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
      Cc: Marek Szyprowski <m.szyprowski@samsung.com>
      Cc: Max Filippov <jcmvbkbc@gmail.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Michal Simek <monstr@monstr.eu>
      Cc: Palmer Dabbelt <palmer@dabbelt.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Paul Walmsley <paul.walmsley@sifive.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Stafford Horne <shorne@gmail.com>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will@kernel.org>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Link: https://lkml.kernel.org/r/20200818151634.14343-17-rppt@kernel.org
      
      
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9f3d5eaa
    • memblock: reduce number of parameters in for_each_mem_range() · 6e245ad4
      Mike Rapoport authored
      
      Currently for_each_mem_range() and for_each_mem_range_rev() iterators are
      the most generic way to traverse memblock regions.  As such, they have 8
      parameters and they are hardly convenient to users.  Most users choose to
      utilize one of their wrappers and the only user that actually needs most
      of the parameters is memblock itself.
      
      To avoid yet another naming for memblock iterators, rename the existing
      for_each_mem_range[_rev]() to __for_each_mem_range[_rev]() and add new
      for_each_mem_range[_rev]() wrappers with only index, start and end
      parameters.
      
      The new wrapper nicely fits into init_unavailable_mem() and will be used
      in upcoming changes to simplify memblock traversals.
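
      A sketch of the new wrapper in use; only the index and the start/end
      output pointers remain visible to the caller:

              phys_addr_t start, end;
              u64 i;

              for_each_mem_range(i, &start, &end) {
                      /* e.g. hand each usable memory range to another subsystem */
                      pr_debug("memory: [%pa-%pa]\n", &start, &end);
              }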
      
      Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Thomas Bogendoerfer <tsbogend@alpha.franken.de>	[MIPS]
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Daniel Axtens <dja@axtens.net>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Emil Renner Berthing <kernel@esmil.dk>
      Cc: Hari Bathini <hbathini@linux.ibm.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
      Cc: Marek Szyprowski <m.szyprowski@samsung.com>
      Cc: Max Filippov <jcmvbkbc@gmail.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Michal Simek <monstr@monstr.eu>
      Cc: Miguel Ojeda <miguel.ojeda.sandonis@gmail.com>
      Cc: Palmer Dabbelt <palmer@dabbelt.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Paul Walmsley <paul.walmsley@sifive.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Stafford Horne <shorne@gmail.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will@kernel.org>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Link: https://lkml.kernel.org/r/20200818151634.14343-11-rppt@kernel.org
      
      
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6e245ad4
  19. Sep 09, 2020
  20. Sep 01, 2020
  21. May 25, 2020
  22. Apr 18, 2020
  23. Mar 06, 2020
  24. Aug 31, 2019
  25. Apr 12, 2019
  26. Mar 21, 2019
  27. Feb 19, 2019
  28. Feb 11, 2019
    • lib/scatterlist: Provide a DMA page iterator · d901b276
      Jason Gunthorpe authored
      
      Commit 2db76d7c ("lib/scatterlist: sg_page_iter: support sg lists w/o
      backing pages") introduced the sg_page_iter_dma_address() function without
      providing a way to use it in the general case. If the sg_dma_len() is not
      equal to the sg length, callers cannot safely use the
      for_each_sg_page/sg_page_iter_dma_address combination.
      
      Resolve this API mistake by providing a DMA specific iterator,
      for_each_sg_dma_page(), that uses the right length so
      sg_page_iter_dma_address() works as expected with all sglists.
      
      A new iterator type is introduced to provide compile-time safety against
      wrongly mixing accessors and iterators.
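
      A sketch of the new iterator in use (sgt is an assumed, already
      DMA-mapped sg_table):

              struct sg_dma_page_iter dma_iter;

              for_each_sg_dma_page(sgt->sgl, &dma_iter, sgt->nents, 0) {
                      dma_addr_t dma = sg_page_iter_dma_address(&dma_iter);

                      /* program 'dma' into the device's descriptors / page tables */
              }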
      
      Acked-by: Christoph Hellwig <hch@lst.de> (for scatterlist)
      Acked-by: Thomas Hellstrom <thellstrom@vmware.com>
      Acked-by: Sakari Ailus <sakari.ailus@linux.intel.com> (ipu3-cio2)
      Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
      d901b276