  1. Dec 02, 2022
    • panic: Introduce warn_limit · 9fc9e278
      Kees Cook authored
      
      Like oops_limit, add warn_limit for limiting the number of warnings when
      panic_on_warn is not set.
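
      A minimal sketch of the idea (names and messages are illustrative, not the
      kernel's verbatim code); a limit of 0 means the check is disabled:

      static atomic_t warn_count = ATOMIC_INIT(0);
      int warn_limit;   /* exposed as the kernel.warn_limit sysctl; 0 disables the check */

      static void check_warn_limit(void)
      {
              if (warn_limit && atomic_inc_return(&warn_count) >= warn_limit)
                      panic("warn_limit (%d) reached", warn_limit);
      }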
      
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
      Cc: "Jason A. Donenfeld" <Jason@zx2c4.com>
      Cc: Eric Biggers <ebiggers@google.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Petr Mladek <pmladek@suse.com>
      Cc: tangmeng <tangmeng@uniontech.com>
      Cc: "Guilherme G. Piccoli" <gpiccoli@igalia.com>
      Cc: Tiezhu Yang <yangtiezhu@loongson.cn>
      Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Cc: linux-doc@vger.kernel.org
      Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Link: https://lore.kernel.org/r/20221117234328.594699-5-keescook@chromium.org
    • panic: Consolidate open-coded panic_on_warn checks · 79cc1ba7
      Kees Cook authored
      
      Several run-time checkers (KASAN, UBSAN, KFENCE, KCSAN, sched) roll
      their own warnings, and each check "panic_on_warn". Consolidate this
      into a single function so that future instrumentation can be added in
      a single location.
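
      A hedged sketch of the consolidation; the helper name follows the commit's
      intent, but the exact signature and message are illustrative:

      /* before: each checker open-codes the test */
      if (panic_on_warn)
              panic("panic_on_warn set ...\n");

      /* after: one shared helper that every checker calls */
      void check_panic_on_warn(const char *origin)
      {
              if (panic_on_warn)
                      panic("%s: panic_on_warn set ...\n", origin);
      }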
      
      Cc: Marco Elver <elver@google.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Juri Lelli <juri.lelli@redhat.com>
      Cc: Vincent Guittot <vincent.guittot@linaro.org>
      Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Ben Segall <bsegall@google.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
      Cc: Valentin Schneider <vschneid@redhat.com>
      Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Andrey Konovalov <andreyknvl@gmail.com>
      Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: David Gow <davidgow@google.com>
      Cc: tangmeng <tangmeng@uniontech.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: Shuah Khan <skhan@linuxfoundation.org>
      Cc: Petr Mladek <pmladek@suse.com>
      Cc: "Paul E. McKenney" <paulmck@kernel.org>
      Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Cc: "Guilherme G. Piccoli" <gpiccoli@igalia.com>
      Cc: Tiezhu Yang <yangtiezhu@loongson.cn>
      Cc: kasan-dev@googlegroups.com
      Cc: linux-mm@kvack.org
      Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Reviewed-by: Marco Elver <elver@google.com>
      Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com>
      Link: https://lore.kernel.org/r/20221117234328.594699-4-keescook@chromium.org
    • exit: Allow oops_limit to be disabled · de92f657
      Kees Cook authored
      
      In preparation for keeping oops_limit logic in sync with warn_limit,
      have oops_limit == 0 disable checking the Oops counter.
      
      Cc: Jann Horn <jannh@google.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
      Cc: "Jason A. Donenfeld" <Jason@zx2c4.com>
      Cc: Eric Biggers <ebiggers@google.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: linux-doc@vger.kernel.org
      Signed-off-by: Kees Cook <keescook@chromium.org>
  2. Dec 01, 2022
    • exit: Expose "oops_count" to sysfs · 9db89b41
      Kees Cook authored
      
      Since the Oops count is now tracked and is a fairly interesting signal, add
      the entry /sys/kernel/oops_count to expose it to userspace.
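
      A sketch of how such a read-only attribute is typically wired up (assuming
      an atomic_t oops_count; registering the attribute against kernel_kobj makes
      it appear as /sys/kernel/oops_count):

      static ssize_t oops_count_show(struct kobject *kobj, struct kobj_attribute *attr,
                                     char *page)
      {
              return sysfs_emit(page, "%d\n", atomic_read(&oops_count));
      }

      static struct kobj_attribute oops_count_attr = __ATTR_RO(oops_count);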
      
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Link: https://lore.kernel.org/r/20221117234328.594699-3-keescook@chromium.org
    • exit: Put an upper limit on how often we can oops · d4ccd54d
      Jann Horn authored
      
      Many Linux systems are configured to not panic on oops; but allowing an
      attacker to oops the system **really** often can make even bugs that look
      completely unexploitable exploitable (like NULL dereferences and such) if
      each crash elevates a refcount by one or a lock is taken in read mode, and
      this causes a counter to eventually overflow.
      
      The most interesting counters for this are 32 bits wide (like open-coded
      refcounts that don't use refcount_t). (The ldsem reader count on 32-bit
      platforms is just 16 bits, but probably nobody cares about 32-bit platforms
      that much nowadays.)
      
      So let's panic the system if the kernel is constantly oopsing.
      
      The speed of oopsing 2^32 times probably depends on several factors, like
      how long the stack trace is and which unwinder you're using; an empirically
      important one is whether your console is showing a graphical environment or
      a text console that oopses will be printed to.
      In a quick single-threaded benchmark, it looks like oopsing in a vfork()
      child with a very short stack trace only takes ~510 microseconds per run
      when a graphical console is active; but switching to a text console that
      oopses are printed to slows it down around 87x, to ~45 milliseconds per
      run.
      (Adding more threads makes this faster, but the actual oops printing
      happens under &die_lock on x86, so you can maybe speed this up by a factor
      of around 2 and then any further improvement gets eaten up by lock
      contention.)
      
      It looks like it would take around 8-12 days to overflow a 32-bit counter
      with repeated oopsing on a multi-core X86 system running a graphical
      environment; both me (in an X86 VM) and Seth (with a distro kernel on
      normal hardware in a standard configuration) got numbers in that ballpark.
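
      A rough back-of-the-envelope check using the numbers above:

      2^32 oopses x ~510 us/oops  ~= 4.3e9 x 5.1e-4 s  ~= 2.2e6 s  ~= 25 days (single-threaded)
      25 days / ~2 (the die_lock-limited parallel speedup)          ~= 12-13 days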
      
      12 days aren't *that* short on a desktop system, and you'd likely need much
      longer on a typical server system (assuming that people don't run graphical
      desktop environments on their servers), and this is a *very* noisy and
      violent approach to exploiting the kernel; and it also seems to take orders
      of magnitude longer on some machines, probably because stuff like EFI
      pstore will slow it down a ton if that's active.
      
      Signed-off-by: Jann Horn <jannh@google.com>
      Link: https://lore.kernel.org/r/20221107201317.324457-1-jannh@google.com
      Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Link: https://lore.kernel.org/r/20221117234328.594699-2-keescook@chromium.org
    • panic: Separate sysctl logic from CONFIG_SMP · 9360d035
      Kees Cook authored
      
      In preparation for adding more sysctls directly in kernel/panic.c, split
      CONFIG_SMP from the logic that adds sysctls.
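
      A hedged sketch of the resulting shape (table contents are illustrative);
      new panic-related entries can then be added regardless of CONFIG_SMP:

      static struct ctl_table kern_panic_table[] = {
      #ifdef CONFIG_SMP
              {
                      .procname       = "oops_all_cpu_backtrace",
                      .data           = &sysctl_oops_all_cpu_backtrace,
                      .maxlen         = sizeof(int),
                      .mode           = 0644,
                      .proc_handler   = proc_dointvec_minmax,
                      .extra1         = SYSCTL_ZERO,
                      .extra2         = SYSCTL_ONE,
              },
      #endif
              /* oops_limit, warn_limit, etc. can be added here */
              { }
      };

      static int __init kernel_panic_sysctls_init(void)
      {
              register_sysctl_init("kernel", kern_panic_table);
              return 0;
      }
      late_initcall(kernel_panic_sysctls_init);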
      
      Cc: Petr Mladek <pmladek@suse.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: tangmeng <tangmeng@uniontech.com>
      Cc: "Guilherme G. Piccoli" <gpiccoli@igalia.com>
      Cc: Tiezhu Yang <yangtiezhu@loongson.cn>
      Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Link: https://lore.kernel.org/r/20221117234328.594699-1-keescook@chromium.org
  3. Nov 01, 2022
    • cred: Do not default to init_cred in prepare_kernel_cred() · 5a17f040
      Kees Cook authored
      A common exploit pattern for ROP attacks is to abuse prepare_kernel_cred()
      in order to construct escalated privileges[1]. Instead of accepting a
      short-hand NULL for the "daemon" argument to indicate that init_cred should
      be used as the base cred, require that "daemon" always point to an actual
      task. Replace all existing callers that were passing NULL with &init_task.
      
      Future attacks will need read/write primitives powerful enough to locate
      an appropriately privileged task and write it to the ROP stack as the
      argument in order to succeed. That is comparable in difficulty to what was
      needed to escalate privileges before struct cred existed: locate the
      current cred and overwrite the uid member.
      
      This has the added benefit of meaning that prepare_kernel_cred() can no
      longer exceed the privileges of the init task, which may have changed from
      the original init_cred (e.g. dropping capabilities from the bounding set).
      
      [1] https://google.com/search?q=commit_creds(prepare_kernel_cred(0))
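
      Illustrative before/after at a call site (the call site itself is hypothetical):

      /* before: NULL implicitly meant "base the new cred on init_cred" */
      cred = prepare_kernel_cred(NULL);

      /* after: the task to inherit from must be named explicitly */
      cred = prepare_kernel_cred(&init_task);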
      
      
      
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: "Rafael J. Wysocki" <rafael@kernel.org>
      Cc: Steve French <sfrench@samba.org>
      Cc: Ronnie Sahlberg <lsahlber@redhat.com>
      Cc: Shyam Prasad N <sprasad@microsoft.com>
      Cc: Tom Talpey <tom@talpey.com>
      Cc: Namjae Jeon <linkinjeon@kernel.org>
      Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
      Cc: Anna Schumaker <anna@kernel.org>
      Cc: Chuck Lever <chuck.lever@oracle.com>
      Cc: Jeff Layton <jlayton@kernel.org>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Jakub Kicinski <kuba@kernel.org>
      Cc: Paolo Abeni <pabeni@redhat.com>
      Cc: "Michal Koutný" <mkoutny@suse.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: linux-cifs@vger.kernel.org
      Cc: samba-technical@lists.samba.org
      Cc: linux-nfs@vger.kernel.org
      Cc: netdev@vger.kernel.org
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Acked-by: Luis Chamberlain <mcgrof@kernel.org>
      Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org>
      Acked-by: Russ Weight <russell.h.weight@intel.com>
      Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Acked-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
      Link: https://lore.kernel.org/r/20221026232943.never.775-kees@kernel.org
  4. Oct 12, 2022
  5. Oct 11, 2022
    • treewide: use get_random_bytes() when possible · 197173db
      Jason A. Donenfeld authored
      
      The prandom_bytes() function has been a deprecated inline wrapper around
      get_random_bytes() for several releases now, and compiles down to the
      exact same code. Replace the deprecated wrapper with a direct call to
      the real function. This was done as a basic find and replace.
      
      Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Reviewed-by: Kees Cook <keescook@chromium.org>
      Reviewed-by: Yury Norov <yury.norov@gmail.com>
      Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu> # powerpc
      Acked-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
    • treewide: use get_random_u32() when possible · a251c17a
      Jason A. Donenfeld authored
      
      The prandom_u32() function has been a deprecated inline wrapper around
      get_random_u32() for several releases now, and compiles down to the
      exact same code. Replace the deprecated wrapper with a direct call to
      the real function. The same also applies to get_random_int(), which is
      just a wrapper around get_random_u32(). This was done as a basic find
      and replace.
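
      For illustration, a typical replacement (the variable is hypothetical):

      -       id = prandom_u32();
      +       id = get_random_u32();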
      
      Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Reviewed-by: Kees Cook <keescook@chromium.org>
      Reviewed-by: Yury Norov <yury.norov@gmail.com>
      Reviewed-by: Jan Kara <jack@suse.cz> # for ext4
      Acked-by: Toke Høiland-Jørgensen <toke@toke.dk> # for sch_cake
      Acked-by: Chuck Lever <chuck.lever@oracle.com> # for nfsd
      Acked-by: Jakub Kicinski <kuba@kernel.org>
      Acked-by: Mika Westerberg <mika.westerberg@linux.intel.com> # for thunderbolt
      Acked-by: Darrick J. Wong <djwong@kernel.org> # for xfs
      Acked-by: Helge Deller <deller@gmx.de> # for parisc
      Acked-by: Heiko Carstens <hca@linux.ibm.com> # for s390
      Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
    • treewide: use prandom_u32_max() when possible, part 1 · 81895a65
      Jason A. Donenfeld authored
      
      Rather than incurring a division or requesting too many random bytes for
      the given range, use the prandom_u32_max() function, which only takes
      the minimum required bytes from the RNG and avoids divisions. This was
      done mechanically with this coccinelle script:
      
      @basic@
      expression E;
      type T;
      identifier get_random_u32 =~ "get_random_int|prandom_u32|get_random_u32";
      typedef u64;
      @@
      (
      - ((T)get_random_u32() % (E))
      + prandom_u32_max(E)
      |
      - ((T)get_random_u32() & ((E) - 1))
      + prandom_u32_max(E * XXX_MAKE_SURE_E_IS_POW2)
      |
      - ((u64)(E) * get_random_u32() >> 32)
      + prandom_u32_max(E)
      |
      - ((T)get_random_u32() & ~PAGE_MASK)
      + prandom_u32_max(PAGE_SIZE)
      )
      
      @multi_line@
      identifier get_random_u32 =~ "get_random_int|prandom_u32|get_random_u32";
      identifier RAND;
      expression E;
      @@
      
      -       RAND = get_random_u32();
              ... when != RAND
      -       RAND %= (E);
      +       RAND = prandom_u32_max(E);
      
      // Find a potential literal
      @literal_mask@
      expression LITERAL;
      type T;
      identifier get_random_u32 =~ "get_random_int|prandom_u32|get_random_u32";
      position p;
      @@
      
              ((T)get_random_u32()@p & (LITERAL))
      
      // Add one to the literal.
      @script:python add_one@
      literal << literal_mask.LITERAL;
      RESULT;
      @@
      
      value = None
      if literal.startswith('0x'):
              value = int(literal, 16)
      elif literal[0] in '123456789':
              value = int(literal, 10)
      if value is None:
              print("I don't know how to handle %s" % (literal))
              cocci.include_match(False)
      elif value == 2**32 - 1 or value == 2**31 - 1 or value == 2**24 - 1 or value == 2**16 - 1 or value == 2**8 - 1:
              print("Skipping 0x%x for cleanup elsewhere" % (value))
              cocci.include_match(False)
      elif value & (value + 1) != 0:
              print("Skipping 0x%x because it's not a power of two minus one" % (value))
              cocci.include_match(False)
      elif literal.startswith('0x'):
              coccinelle.RESULT = cocci.make_expr("0x%x" % (value + 1))
      else:
              coccinelle.RESULT = cocci.make_expr("%d" % (value + 1))
      
      // Replace the literal mask with the calculated result.
      @plus_one@
      expression literal_mask.LITERAL;
      position literal_mask.p;
      expression add_one.RESULT;
      identifier FUNC;
      @@
      
      -       (FUNC()@p & (LITERAL))
      +       prandom_u32_max(RESULT)
      
      @collapse_ret@
      type T;
      identifier VAR;
      expression E;
      @@
      
       {
      -       T VAR;
      -       VAR = (E);
      -       return VAR;
      +       return E;
       }
      
      @drop_var@
      type T;
      identifier VAR;
      @@
      
       {
      -       T VAR;
              ... when != VAR
       }
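
      For illustration, a typical transformation the script above produces (the
      call site is hypothetical):

      -       index = get_random_u32() % nr_slots;
      +       index = prandom_u32_max(nr_slots);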
      
      Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Reviewed-by: Kees Cook <keescook@chromium.org>
      Reviewed-by: Yury Norov <yury.norov@gmail.com>
      Reviewed-by: KP Singh <kpsingh@kernel.org>
      Reviewed-by: Jan Kara <jack@suse.cz> # for ext4 and sbitmap
      Reviewed-by: Christoph Böhmwalder <christoph.boehmwalder@linbit.com> # for drbd
      Acked-by: Jakub Kicinski <kuba@kernel.org>
      Acked-by: Heiko Carstens <hca@linux.ibm.com> # for s390
      Acked-by: Ulf Hansson <ulf.hansson@linaro.org> # for mmc
      Acked-by: Darrick J. Wong <djwong@kernel.org> # for xfs
      Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
  6. Oct 10, 2022
  7. Oct 06, 2022
    • sched/core: Merge cpumask_andnot()+for_each_cpu() into for_each_cpu_andnot() · 585463f0
      Valentin Schneider authored
      
      This removes the second use of the sched_core_mask temporary mask.
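
      Illustrative shape of the change (the mask and helper names here are
      placeholders, not the scheduler's actual variables):

      /* before: a temporary mask is required */
      cpumask_andnot(tmp_mask, cpu_possible_mask, excluded_mask);
      for_each_cpu(cpu, tmp_mask)
              do_work(cpu);

      /* after: iterate directly over (cpu_possible_mask & ~excluded_mask) */
      for_each_cpu_andnot(cpu, cpu_possible_mask, excluded_mask)
              do_work(cpu);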
      
      Suggested-by: Yury Norov <yury.norov@gmail.com>
      Signed-off-by: Valentin Schneider <vschneid@redhat.com>
    • tracing: Do not free snapshot if tracer is on cmdline · a541a955
      Steven Rostedt (Google) authored
      The ftrace_boot_snapshot and alloc_snapshot cmdline options allocate the
      snapshot buffer at boot up for use later. The ftrace_boot_snapshot in
      particular requires the snapshot to be allocated because it will take a
      snapshot at the end of boot up, allowing one to see the traces that happened
      during boot so that they are not lost when user space takes over.
      
      When a tracer is registered (started) there's a path that checks if it
      requires the snapshot buffer or not, and if it does not and it was
      allocated it will do a synchronization and free the snapshot buffer.
      
      This is only required if the previous tracer was using the snapshot buffer
      for "max latency" snapshots (like the irqoff tracer and friends), as it
      needs to make sure all max snapshots are complete before freeing. It does
      not make sense to free the buffer if the previous tracer was not using it
      and the snapshot was allocated by the cmdline parameters; that basically
      takes away the point of allocating it in the first place!
      
      Note, the allocated snapshot worked fine for just trace events, but fails
      when a tracer is enabled on the cmdline.
      
      Further investigation shows this goes back even further, and a tracer on
      the cmdline is not required to trigger the failure. Simply enable snapshots
      and then enable a tracer, and it will remove the snapshot.
      
      Link: https://lkml.kernel.org/r/20221005113757.041df7fe@gandalf.local.home
      
      
      
      Cc: Masami Hiramatsu <mhiramat@kernel.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: stable@vger.kernel.org
      Fixes: 45ad21ca ("tracing: Have trace_array keep track if snapshot buffer is allocated")
      Reported-by: Ross Zwisler <zwisler@kernel.org>
      Tested-by: Ross Zwisler <zwisler@kernel.org>
      Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
    • ftrace: Still disable enabled records marked as disabled · cf04f2d5
      Steven Rostedt (Google) authored
      Weak functions started causing havoc as they showed up in the
      "available_filter_functions" and this confused people as to why some
      functions marked as "notrace" were listed, but when enabled they did
      nothing. This was because weak functions can still have fentry calls, and
      these addresses get added to the "available_filter_functions" file.
      kallsyms is what converts those addresses to names, and since the weak
      functions are not listed in kallsyms, it would just pick the function
      before that.
      
      To solve this, there was a trick to detect weak functions listed, and
      these records would be marked as DISABLED so that they do not get enabled
      and are mostly ignored. As the processing of the list of all functions to
      figure out what is weak or not can take a long time, this process is put
      off into a kernel thread and run in parallel with the rest of start up.
      
      Now the issue happens when function tracing is enabled via the kernel
      command line. As it starts very early in boot up, it can be enabled before
      the records that are weak are marked to be disabled. This causes an issue
      in the accounting, as the weak records are enabled by the command line
      function tracing, but after boot up, they are not disabled.
      
      The ftrace records have several accounting flags and a ref count. The
      DISABLED flag is just one. If the record is enabled before it is marked
      DISABLED it will get an ENABLED flag and also have its ref counter
      incremented. After it is marked for DISABLED, neither the ENABLED flag nor
      the ref counter is cleared. There's sanity checks on the records that are
      performed after an ftrace function is registered or unregistered, and this
      detected that there were records marked as ENABLED with ref counter that
      should not have been.
      
      Note, the module loading code uses the DISABLED flag as well to keep its
      functions from being modified while it is being loaded, and some of these
      flags may get set in this process. So changing the verification code to
      ignore DISABLED records is a no go, as it still needs to verify that the
      module records are working too.
      
      Also, the weak functions still are calling a trampoline. Even though they
      should never be called, it is dangerous to leave these weak functions
      calling a trampoline that is freed, so they should still be set back to
      nops.
      
      There are two places that must not skip records that have the ENABLED
      and the DISABLED flags set. That is where the ftrace_ops is processed and
      sets the records ref counts, and then later when the function itself is to
      be updated, and the ENABLED flag gets removed. Add a helper function
      "skip_record()" that returns true if the record has the DISABLED flag set
      but not the ENABLED flag.
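
      A sketch of what such a helper can look like (flag names as used by ftrace;
      treat the exact code as illustrative):

      static bool skip_record(struct dyn_ftrace *rec)
      {
              /* Skip DISABLED records, unless they still carry ENABLED state
               * that has to be cleaned up first. */
              return rec->flags & FTRACE_FL_DISABLED &&
                     !(rec->flags & FTRACE_FL_ENABLED);
      }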
      
      Link: https://lkml.kernel.org/r/20221005003809.27d2b97b@gandalf.local.home
      
      
      
      Cc: Masami Hiramatsu <mhiramat@kernel.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: stable@vger.kernel.org
      Fixes: b39181f7 ("ftrace: Add FTRACE_MCOUNT_MAX_OFFSET to avoid adding weak function")
      Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
  8. Oct 04, 2022
  9. Oct 03, 2022
    • relay: use kvcalloc to alloc page array in relay_alloc_page_array · 83d87a4d
      wuchi authored
      kvcalloc() is safer because it checks for integer overflow, and using it
      simplifies the allocation-size logic.
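
      Illustrative before/after (the original call site is not reproduced here):

      /* before: open-coded size calculation can overflow */
      array = kzalloc(n_pages * sizeof(struct page *), GFP_KERNEL);

      /* after: kvcalloc() checks the multiplication and can fall back to vmalloc */
      array = kvcalloc(n_pages, sizeof(struct page *), GFP_KERNEL);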
      
      Link: https://lkml.kernel.org/r/20220909101025.82955-1-wuchi.zero@gmail.com
      
      
      Signed-off-by: wuchi <wuchi.zero@gmail.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/madvise: add file and shmem support to MADV_COLLAPSE · 34488399
      Zach O'Keefe authored
      Add support for MADV_COLLAPSE to collapse shmem-backed and file-backed
      memory into THPs (requires CONFIG_READ_ONLY_THP_FOR_FS=y).
      
      On success, the backing memory will be a hugepage.  For the memory range
      and process provided, the page tables will synchronously have a huge pmd
      installed, mapping the THP.  Other mappings of the file extent mapped by
      the memory range may be added to a set of entries that khugepaged will
      later process and attempt to update their page tables to map the THP by a
      pmd.
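
      A minimal userspace sketch of invoking the new mode (requires uapi headers
      that define MADV_COLLAPSE, i.e. 6.1+; addr/len are the caller's mapping):

      #include <stdio.h>
      #include <sys/mman.h>

      static void collapse_range(void *addr, size_t len)
      {
              /* addr/len should cover hugepage-aligned, hugepage-sized extents */
              if (madvise(addr, len, MADV_COLLAPSE))
                      perror("madvise(MADV_COLLAPSE)");
      }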
      
      This functionality unlocks two important uses:
      
      (1)	Immediately back executable text by THPs.  Current support provided
      	by CONFIG_READ_ONLY_THP_FOR_FS may take a long time on a large
      	system which might impair services from serving at their full rated
      	load after (re)starting.  Tricks like mremap(2)'ing text onto
      	anonymous memory to immediately realize iTLB performance prevents
      	page sharing and demand paging, both of which increase steady state
      	memory footprint.  Now, we can have the best of both worlds: Peak
      	upfront performance and lower RAM footprints.
      
      (2)	userfaultfd-based live migration of virtual machines satisfy UFFD
      	faults by fetching native-sized pages over the network (to avoid
      	latency of transferring an entire hugepage).  However, after guest
      	memory has been fully copied to the new host, MADV_COLLAPSE can
      	be used to immediately increase guest performance.
      
      Since khugepaged is single threaded, this change now introduces the
      possibility of collapse contexts racing in the file collapse path.  There
      are a few important places to consider:
      
      (1)	hpage_collapse_scan_file(), when we xas_pause() and drop RCU.
      	We could have the memory collapsed out from under us, but
      	the next xas_for_each() iteration will correctly pick up the
      	hugepage.  The hugepage might not be up to date (insofar as
      	copying of small page contents might not have completed - the
      	page still may be locked), but regardless what small page index
      	we were iterating over, we'll find the hugepage and identify it
      	as a suitably aligned compound page of order HPAGE_PMD_ORDER.
      
      	In khugepaged path, we locklessly check the value of the pmd,
      	and only add it to deferred collapse array if we find pmd
      	mapping pte table. This is fine, since other values that could
      	have raced in right afterwards denote failure, or that the
      	memory was successfully collapsed, so we don't need further
      	processing.
      
      	In madvise path, we'll take mmap_lock() in write to serialize
      	against page table updates and will know what to do based on the
      	true value of the pmd: recheck all ptes if we point to a pte table,
      	directly install the pmd, if the pmd has been cleared, but
      	memory not yet faulted, or nothing at all if we find a huge pmd.
      
      	It's worth putting emphasis here on how we treat the none pmd
      	here.  If khugepaged has processed this mm's page tables
      	already, it will have left the pmd cleared (ready for refault by
      	the process).  Depending on the VMA flags and sysfs settings,
      	amount of RAM on the machine, and the current load, could be a
      	relatively common occurrence - and as such is one we'd like to
      	handle successfully in MADV_COLLAPSE.  When we see the none pmd
      	in collapse_pte_mapped_thp(), we've locked mmap_lock in write
      	and checked (a) hugepage_vma_check() to see if the backing
      	memory is appropriate still, along with VMA sizing and
      	appropriate hugepage alignment within the file, and (b) we've
      	found a hugepage head of order HPAGE_PMD_ORDER at the offset
      	in the file mapped by our hugepage-aligned virtual address.
      	Even though the common-case is likely race with khugepaged,
      	given these checks (regardless how we got here - we could be
      	operating on a completely different file than originally checked
      	in hpage_collapse_scan_file() for all we know) it should be safe
      	to directly make the pmd a huge pmd pointing to this hugepage.
      
      (2)	collapse_file() is mostly serialized on the same file extent by
      	lock sequence:
      
      		|	lock hugepage
      		|		lock mapping->i_pages
      		|			lock 1st page
      		|		unlock mapping->i_pages
      		|				<page checks>
      		|		lock mapping->i_pages
      		|				page_ref_freeze(3)
      		|				xas_store(hugepage)
      		|		unlock mapping->i_pages
      		|				page_ref_unfreeze(1)
      		|			unlock 1st page
      		V	unlock hugepage
      
      	Once a context (who already has their fresh hugepage locked)
      	locks mapping->i_pages exclusively, it will hold said lock
      	until it locks the first page, and it will hold that lock until
      	after the hugepage has been added to the page cache (and
      	will unlock the hugepage after page table update, though that
      	isn't important here).
      
      	A racing context that loses the race for mapping->i_pages will
      	then lose the race to locking the first page.  Here - depending
      	on how far the other racing context has gotten - we might find
      	the new hugepage (in which case we'll exit cleanly when we
      	check PageTransCompound()), or we'll find the "old" 1st small
      	page (in which we'll exit cleanly when we discover unexpected
      	refcount of 2 after isolate_lru_page()).  This is assuming we
      	are able to successfully lock the page we find - in shmem path,
      	we could just fail the trylock and exit cleanly anyways.
      
      	Failure path in collapse_file() is similar: once we hold lock
      	on 1st small page, we are serialized against other collapse
      	contexts.  Before the 1st small page is unlocked, we add it
      	back to the pagecache and unfreeze the refcount appropriately.
      	Contexts who lost the race to the 1st small page will then find
      	the same 1st small page with the correct refcount and will be
      	able to proceed.
      
      [zokeefe@google.com: don't check pmd value twice in collapse_pte_mapped_thp()]
        Link: https://lkml.kernel.org/r/20220927033854.477018-1-zokeefe@google.com
      [shy828301@gmail.com: Delete hugepage_vma_revalidate_anon(), remove
      	check for multi-add in khugepaged_add_pte_mapped_thp()]
        Link: https://lore.kernel.org/linux-mm/CAHbLzkrtpM=ic7cYAHcqkubah5VTR8N5=k5RT8MTvv5rN1Y91w@mail.gmail.com/
      Link: https://lkml.kernel.org/r/20220907144521.3115321-4-zokeefe@google.com
      Link: https://lkml.kernel.org/r/20220922224046.1143204-4-zokeefe@google.com
      
      
      Signed-off-by: Zach O'Keefe <zokeefe@google.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Chris Kennelly <ckennelly@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: James Houghton <jthoughton@google.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Rongwei Wang <rongwei.wang@linux.alibaba.com>
      Cc: SeongJae Park <sj@kernel.org>
      Cc: Song Liu <songliubraving@fb.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Yang Shi <shy828301@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • bpf: kmsan: initialize BPF registers with zeroes · a6a7aaba
      Alexander Potapenko authored
      When executing BPF programs, certain registers may get passed
      uninitialized to helper functions.  E.g.  when performing a JMP_CALL,
      registers BPF_R1-BPF_R5 are always passed to the helper, no matter how
      many of them are actually used.
      
      Passing uninitialized values as function parameters is technically
      undefined behavior, so we work around it by always initializing the
      registers.
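
      A sketch of the general idea (the interpreter's real variable layout differs):

      u64 regs[MAX_BPF_REG] = { };    /* zero-init: helpers never observe stale stack data */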
      
      Link: https://lkml.kernel.org/r/20220915150417.722975-42-glider@google.com
      
      
      Signed-off-by: Alexander Potapenko <glider@google.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Alexei Starovoitov <ast@kernel.org>
      Cc: Andrey Konovalov <andreyknvl@gmail.com>
      Cc: Andrey Konovalov <andreyknvl@google.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Eric Biggers <ebiggers@google.com>
      Cc: Eric Biggers <ebiggers@kernel.org>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Herbert Xu <herbert@gondor.apana.org.au>
      Cc: Ilya Leoshkevich <iii@linux.ibm.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Marco Elver <elver@google.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michael S. Tsirkin <mst@redhat.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Petr Mladek <pmladek@suse.com>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Vegard Nossum <vegard.nossum@oracle.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • entry: kmsan: introduce kmsan_unpoison_entry_regs() · 6cae637f
      Alexander Potapenko authored
      struct pt_regs passed into IRQ entry code is set up by uninstrumented asm
      functions, therefore KMSAN may not notice the registers are initialized.
      
      kmsan_unpoison_entry_regs() unpoisons the contents of struct pt_regs,
      preventing potential false positives.  Unlike kmsan_unpoison_memory(), it
      can be called under kmsan_in_runtime(), which is often the case in IRQ
      entry code.
      
      Link: https://lkml.kernel.org/r/20220915150417.722975-41-glider@google.com
      
      
      Signed-off-by: Alexander Potapenko <glider@google.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Alexei Starovoitov <ast@kernel.org>
      Cc: Andrey Konovalov <andreyknvl@gmail.com>
      Cc: Andrey Konovalov <andreyknvl@google.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Eric Biggers <ebiggers@google.com>
      Cc: Eric Biggers <ebiggers@kernel.org>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Herbert Xu <herbert@gondor.apana.org.au>
      Cc: Ilya Leoshkevich <iii@linux.ibm.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Marco Elver <elver@google.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michael S. Tsirkin <mst@redhat.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Petr Mladek <pmladek@suse.com>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Vegard Nossum <vegard.nossum@oracle.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • kcov: kmsan: unpoison area->list in kcov_remote_area_put() · 74d89909
      Alexander Potapenko authored
      KMSAN does not instrument kernel/kcov.c for performance reasons (with
      CONFIG_KCOV=y virtually every place in the kernel invokes kcov
      instrumentation).  Therefore the tool may miss writes from kcov.c that
      initialize memory.
      
      When CONFIG_DEBUG_LIST is enabled, list pointers from kernel/kcov.c are
      passed to instrumented helpers in lib/list_debug.c, resulting in false
      positives.
      
      To work around these reports, we unpoison the contents of area->list after
      initializing it.
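
      A sketch of the workaround, assuming 'area' carries a struct list_head
      member named 'list':

      INIT_LIST_HEAD(&area->list);
      kmsan_unpoison_memory(&area->list, sizeof(area->list));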
      
      Link: https://lkml.kernel.org/r/20220915150417.722975-30-glider@google.com
      
      
      Signed-off-by: Alexander Potapenko <glider@google.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Alexei Starovoitov <ast@kernel.org>
      Cc: Andrey Konovalov <andreyknvl@gmail.com>
      Cc: Andrey Konovalov <andreyknvl@google.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Eric Biggers <ebiggers@google.com>
      Cc: Eric Biggers <ebiggers@kernel.org>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Herbert Xu <herbert@gondor.apana.org.au>
      Cc: Ilya Leoshkevich <iii@linux.ibm.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Marco Elver <elver@google.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michael S. Tsirkin <mst@redhat.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Petr Mladek <pmladek@suse.com>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Vegard Nossum <vegard.nossum@oracle.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • dma: kmsan: unpoison DMA mappings · 7ade4f10
      Alexander Potapenko authored
      KMSAN doesn't know about DMA memory writes performed by devices.  We
      unpoison such memory when it's mapped to avoid false positive reports.
      
      Link: https://lkml.kernel.org/r/20220915150417.722975-22-glider@google.com
      
      
      Signed-off-by: Alexander Potapenko <glider@google.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Alexei Starovoitov <ast@kernel.org>
      Cc: Andrey Konovalov <andreyknvl@gmail.com>
      Cc: Andrey Konovalov <andreyknvl@google.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Eric Biggers <ebiggers@google.com>
      Cc: Eric Biggers <ebiggers@kernel.org>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Herbert Xu <herbert@gondor.apana.org.au>
      Cc: Ilya Leoshkevich <iii@linux.ibm.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Marco Elver <elver@google.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michael S. Tsirkin <mst@redhat.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Petr Mladek <pmladek@suse.com>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Vegard Nossum <vegard.nossum@oracle.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • kmsan: handle task creation and exiting · 50b5e49c
      Alexander Potapenko authored
      Tell KMSAN that a new task is created, so the tool creates a backing
      metadata structure for that task.
      
      Link: https://lkml.kernel.org/r/20220915150417.722975-17-glider@google.com
      
      
      Signed-off-by: Alexander Potapenko <glider@google.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Alexei Starovoitov <ast@kernel.org>
      Cc: Andrey Konovalov <andreyknvl@gmail.com>
      Cc: Andrey Konovalov <andreyknvl@google.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Eric Biggers <ebiggers@google.com>
      Cc: Eric Biggers <ebiggers@kernel.org>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Herbert Xu <herbert@gondor.apana.org.au>
      Cc: Ilya Leoshkevich <iii@linux.ibm.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Marco Elver <elver@google.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michael S. Tsirkin <mst@redhat.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Petr Mladek <pmladek@suse.com>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Vegard Nossum <vegard.nossum@oracle.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • kmsan: disable instrumentation of unsupported common kernel code · 79dbd006
      Alexander Potapenko authored
      EFI stub cannot be linked with KMSAN runtime, so we disable
      instrumentation for it.
      
      Instrumenting kcov, stackdepot or lockdep leads to infinite recursion
      caused by instrumentation hooks calling instrumented code again.
      
      Link: https://lkml.kernel.org/r/20220915150417.722975-13-glider@google.com
      
      
      Signed-off-by: Alexander Potapenko <glider@google.com>
      Reviewed-by: Marco Elver <elver@google.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Alexei Starovoitov <ast@kernel.org>
      Cc: Andrey Konovalov <andreyknvl@gmail.com>
      Cc: Andrey Konovalov <andreyknvl@google.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Eric Biggers <ebiggers@google.com>
      Cc: Eric Biggers <ebiggers@kernel.org>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Herbert Xu <herbert@gondor.apana.org.au>
      Cc: Ilya Leoshkevich <iii@linux.ibm.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michael S. Tsirkin <mst@redhat.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Petr Mladek <pmladek@suse.com>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Vegard Nossum <vegard.nossum@oracle.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • hugetlb: add vma based lock for pmd sharing · 8d9bfb26
      Mike Kravetz authored
      Allocate a new hugetlb_vma_lock structure and hang off vm_private_data for
      synchronization use by vmas that could be involved in pmd sharing.  This
      data structure contains a rw semaphore that is the primary tool used for
      synchronization.
      
      This new structure is ref counted, so that it can exist when NOT attached
      to a vma.  This is only helpful in resolving lock ordering issues where
      code may need to obtain the vma_lock when there is no guarantee that the vma
      will not go away.  By obtaining a ref on the structure, it can be guaranteed
      that at least the rw semaphore will not go away.
      
      Only add infrastructure for the new lock here.  Actual use will be added
      in subsequent patches.
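
      A sketch of the structure described above (member names are illustrative):

      struct hugetlb_vma_lock {
              struct kref refs;               /* lets the lock outlive the vma when needed */
              struct rw_semaphore rw_sema;    /* the primary synchronization tool */
              struct vm_area_struct *vma;
      };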
      
      [mike.kravetz@oracle.com: fix build issue for missing hugetlb_vma_lock_release]
        Link: https://lkml.kernel.org/r/YyNUtA1vRASOE4+M@monkey
      Link: https://lkml.kernel.org/r/20220914221810.95771-7-mike.kravetz@oracle.com
      
      
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: James Houghton <jthoughton@google.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mina Almasry <almasrymina@google.com>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
      Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Prakash Sangappa <prakash.sangappa@oracle.com>
      Cc: Sven Schnelle <svens@linux.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • tracing/user_events: Move pages/locks into groups to prepare for namespaces · e5d27181
      Beau Belgrave authored
      In order to enable namespaces or any sort of isolation within
      user_events the register lock and pages need to be broken up into
      groups. Each event and file now has a group pointer which stores the
      actual pages to map, lookup data and synchronization objects.
      
      This only enables a single group that maps to init_user_ns, as the IMA
      namespace work has done. This enables user_events to start the work of
      supporting namespaces by walking the namespaces up to the init_user_ns.
      Future patches will address other user namespaces and will align to the
      approaches the IMA namespace uses.
      
      Link: https://lore.kernel.org/linux-kernel/20220915193221.1728029-15-stefanb@linux.ibm.com/#t
      Link: https://lkml.kernel.org/r/20221001001016.2832-2-beaub@linux.microsoft.com
      
      
      
      Signed-off-by: Beau Belgrave <beaub@linux.microsoft.com>
      Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
    • tracing: Remove unused variable 'dups' · ed87277f
      Chen Zhongjin authored
      Reported by Clang [-Wunused-but-set-variable]
      
      Commit c193707d ("tracing: Remove code which merges duplicates")
      removed the code which merges duplicates in detect_dups(), but forgot
      to delete the variable 'dups' which was used to merge duplicates in
      the loop.
      
      Now only 'total_dups' is needed, remove 'dups' for clean code.
      
      Link: https://lkml.kernel.org/r/20220930103236.253985-1-chenzhongjin@huawei.com
      
      
      
      Signed-off-by: Chen Zhongjin <chenzhongjin@huawei.com>
      Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
  10. Sep 30, 2022
  11. Sep 29, 2022
    • utsname: contribute changes to RNG · 37608ba3
      Jason A. Donenfeld authored
      
      On some small machines with little entropy, a quasi-unique hostname is
      sometimes a relevant factor. I've seen, for example, 8 character
      alpha-numeric serial numbers. In addition, the time at which the hostname
      is set is usually a decent measurement of how long early boot took. So,
      call add_device_randomness() on new hostnames, which feeds its arguments
      to the RNG in addition to a fresh cycle counter.
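
      A sketch of the hook (assuming 'u' points at the utsname being updated in
      sethostname()/setdomainname()):

      add_device_randomness(u->nodename, sizeof(u->nodename));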
      
      Low cost hooks like this never hurt and can only ever help, and since
      this costs basically nothing for an operation that is never a fast path,
      this is an overall easy win.
      
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Dominik Brodowski <linux@dominikbrodowski.net>
      Reviewed-by: Kees Cook <keescook@chromium.org>
      Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
    • bpf: Add __bpf_prog_{enter,exit}_struct_ops for struct_ops trampoline · 64696c40
      Martin KaFai Lau authored
      
      The struct_ops prog is to allow using bpf to implement the functions in
      a struct (eg. kernel module).  The current usage is to implement the
      tcp_congestion.  The kernel does not call the tcp-cc's ops (ie.
      the bpf prog) in a recursive way.
      
      The struct_ops is sharing the tracing-trampoline's enter/exit
      function which tracks prog->active to avoid recursion.  It is
      needed for tracing progs.  However, it turns out the struct_ops
      bpf prog will hit this prog->active check and be unnecessarily
      skipped.  eg.  The '.ssthresh' may run in_task() and then be
      interrupted by a softirq that runs the same '.ssthresh'.
      Skipping the '.ssthresh' will end up returning a random value
      to the caller.
      
      The patch adds __bpf_prog_{enter,exit}_struct_ops for the
      struct_ops trampoline.  They do not track the prog->active
      to detect recursion.
      
      One exception is when the tcp_congestion's '.init' ops is doing
      bpf_setsockopt(TCP_CONGESTION) and then recurs to the same
      '.init' ops.  This will be addressed in the following patches.
      
      Fixes: ca06f55b ("bpf: Add per-program recursion prevention mechanism")
      Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
      Link: https://lore.kernel.org/r/20220929070407.965581-2-martin.lau@linux.dev
      
      
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • ring-buffer: Fix race between reset page and reading page · a0fcaaed
      Steven Rostedt (Google) authored
      The ring buffer is broken up into sub buffers (currently of page size).
      Each sub buffer has a pointer to its "tail" (the last event written to the
      sub buffer). When a new event is requested, the tail is locally
      incremented to cover the size of the new event. This is done in a way that
      there is no need for locking.
      
      If the tail goes past the end of the sub buffer, the process of moving to
      the next sub buffer takes place. After setting the current sub buffer to
      the next one, the previous one that had the tail go past the end of the
      sub buffer needs to be reset back to the original tail location (before
      the new event was requested) and the rest of the sub buffer needs to be
      "padded".
      
      The race happens when a reader takes control of the sub buffer. As readers
      do a "swap" of sub buffers from the ring buffer to get exclusive access to
      the sub buffer, it replaces the "head" sub buffer with an empty sub buffer
      that goes back into the writable portion of the ring buffer. This swap can
      happen as soon as the writer moves to the next sub buffer and before it
      updates the last sub buffer with padding.
      
      Because the sub buffer can be released to the reader while the writer is
      still updating the padding, it is possible for the reader to see the event
      that goes past the end of the sub buffer. This can cause obvious issues.
      
      To fix this, add a few memory barriers so that the reader definitely sees
      the updates to the sub buffer, and also waits until the writer has put
      back the "tail" of the sub buffer back to the last event that was written
      on it.
      
      To be paranoid, it will only spin for 1 second, otherwise it will
      warn and shut down the ring buffer code. 1 second should be enough as
      the writer does have preemption disabled. If the writer doesn't move
      within 1 second (with preemption disabled) something is horribly
      wrong. No interrupt should last 1 second!
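
      A sketch of a bounded wait like the one described (writer_done() is a
      stand-in for the real condition; the actual ring buffer code differs):

      unsigned long timeout = jiffies + HZ;            /* give the writer at most ~1 second */

      while (!writer_done(cpu_buffer)) {
              if (time_after(jiffies, timeout)) {
                      RB_WARN_ON(cpu_buffer, 1);       /* warn and shut down the ring buffer */
                      break;
              }
              cpu_relax();
      }
      smp_rmb();      /* pair with the writer's barriers so the padded sub buffer is visible */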
      
      Link: https://lore.kernel.org/all/20220830120854.7545-1-jiazi.li@transsion.com/
      Link: https://bugzilla.kernel.org/show_bug.cgi?id=216369
      Link: https://lkml.kernel.org/r/20220929104909.0650a36c@gandalf.local.home
      
      
      
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: stable@vger.kernel.org
      Fixes: c7b09308 ("ring-buffer: prevent adding write in discarded area")
      Reported-by: Jiazi.Li <jiazi.li@transsion.com>
      Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>