- Dec 02, 2022
-
-
Kees Cook authored
Like oops_limit, add warn_limit for limiting the number of warnings when panic_on_warn is not set. Cc: Jonathan Corbet <corbet@lwn.net> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: "Jason A. Donenfeld" <Jason@zx2c4.com> Cc: Eric Biggers <ebiggers@google.com> Cc: Huang Ying <ying.huang@intel.com> Cc: Petr Mladek <pmladek@suse.com> Cc: tangmeng <tangmeng@uniontech.com> Cc: "Guilherme G. Piccoli" <gpiccoli@igalia.com> Cc: Tiezhu Yang <yangtiezhu@loongson.cn> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: linux-doc@vger.kernel.org Reviewed-by:
Luis Chamberlain <mcgrof@kernel.org> Signed-off-by:
Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/r/20221117234328.594699-5-keescook@chromium.org
-
Kees Cook authored
Several run-time checkers (KASAN, UBSAN, KFENCE, KCSAN, sched) roll their own warnings, and each check "panic_on_warn". Consolidate this into a single function so that future instrumentation can be added in a single location. Cc: Marco Elver <elver@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Ben Segall <bsegall@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Daniel Bristot de Oliveira <bristot@redhat.com> Cc: Valentin Schneider <vschneid@redhat.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: David Gow <davidgow@google.com> Cc: tangmeng <tangmeng@uniontech.com> Cc: Jann Horn <jannh@google.com> Cc: Shuah Khan <skhan@linuxfoundation.org> Cc: Petr Mladek <pmladek@suse.com> Cc: "Paul E. McKenney" <paulmck@kernel.org> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: "Guilherme G. Piccoli" <gpiccoli@igalia.com> Cc: Tiezhu Yang <yangtiezhu@loongson.cn> Cc: kasan-dev@googlegroups.com Cc: linux-mm@kvack.org Reviewed-by:
Luis Chamberlain <mcgrof@kernel.org> Signed-off-by:
Kees Cook <keescook@chromium.org> Reviewed-by:
Marco Elver <elver@google.com> Reviewed-by:
Andrey Konovalov <andreyknvl@gmail.com> Link: https://lore.kernel.org/r/20221117234328.594699-4-keescook@chromium.org
-
Kees Cook authored
In preparation for keeping oops_limit logic in sync with warn_limit, have oops_limit == 0 disable checking the Oops counter. Cc: Jann Horn <jannh@google.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: "Jason A. Donenfeld" <Jason@zx2c4.com> Cc: Eric Biggers <ebiggers@google.com> Cc: Huang Ying <ying.huang@intel.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: linux-doc@vger.kernel.org Signed-off-by:
Kees Cook <keescook@chromium.org>
-
- Dec 01, 2022
-
-
Kees Cook authored
Since Oops count is now tracked and is a fairly interesting signal, add the entry /sys/kernel/oops_count to expose it to userspace. Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Jann Horn <jannh@google.com> Cc: Arnd Bergmann <arnd@arndb.de> Reviewed-by:
Luis Chamberlain <mcgrof@kernel.org> Signed-off-by:
Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/r/20221117234328.594699-3-keescook@chromium.org
-
Jann Horn authored
Many Linux systems are configured to not panic on oops; but allowing an attacker to oops the system **really** often can make even bugs that look completely unexploitable exploitable (like NULL dereferences and such) if each crash elevates a refcount by one or a lock is taken in read mode, and this causes a counter to eventually overflow. The most interesting counters for this are 32 bits wide (like open-coded refcounts that don't use refcount_t). (The ldsem reader count on 32-bit platforms is just 16 bits, but probably nobody cares about 32-bit platforms that much nowadays.) So let's panic the system if the kernel is constantly oopsing. The speed of oopsing 2^32 times probably depends on several factors, like how long the stack trace is and which unwinder you're using; an empirically important one is whether your console is showing a graphical environment or a text console that oopses will be printed to. In a quick single-threaded benchmark, it looks like oopsing in a vfork() child with a very short stack trace only takes ~510 microseconds per run when a graphical console is active; but switching to a text console that oopses are printed to slows it down around 87x, to ~45 milliseconds per run. (Adding more threads makes this faster, but the actual oops printing happens under &die_lock on x86, so you can maybe speed this up by a factor of around 2 and then any further improvement gets eaten up by lock contention.) It looks like it would take around 8-12 days to overflow a 32-bit counter with repeated oopsing on a multi-core X86 system running a graphical environment; both me (in an X86 VM) and Seth (with a distro kernel on normal hardware in a standard configuration) got numbers in that ballpark. 12 days aren't *that* short on a desktop system, and you'd likely need much longer on a typical server system (assuming that people don't run graphical desktop environments on their servers), and this is a *very* noisy and violent approach to exploiting the kernel; and it also seems to take orders of magnitude longer on some machines, probably because stuff like EFI pstore will slow it down a ton if that's active. Signed-off-by:
Jann Horn <jannh@google.com> Link: https://lore.kernel.org/r/20221107201317.324457-1-jannh@google.com Reviewed-by:
Luis Chamberlain <mcgrof@kernel.org> Signed-off-by:
Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/r/20221117234328.594699-2-keescook@chromium.org
-
Kees Cook authored
In preparation for adding more sysctls directly in kernel/panic.c, split CONFIG_SMP from the logic that adds sysctls. Cc: Petr Mladek <pmladek@suse.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: tangmeng <tangmeng@uniontech.com> Cc: "Guilherme G. Piccoli" <gpiccoli@igalia.com> Cc: Tiezhu Yang <yangtiezhu@loongson.cn> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Reviewed-by:
Luis Chamberlain <mcgrof@kernel.org> Signed-off-by:
Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/r/20221117234328.594699-1-keescook@chromium.org
-
- Nov 01, 2022
-
-
Kees Cook authored
A common exploit pattern for ROP attacks is to abuse prepare_kernel_cred() in order to construct escalated privileges[1]. Instead of providing a short-hand argument (NULL) to the "daemon" argument to indicate using init_cred as the base cred, require that "daemon" is always set to an actual task. Replace all existing callers that were passing NULL with &init_task. Future attacks will need to have sufficiently powerful read/write primitives to have found an appropriately privileged task and written it to the ROP stack as an argument to succeed, which is similarly difficult to the prior effort needed to escalate privileges before struct cred existed: locate the current cred and overwrite the uid member. This has the added benefit of meaning that prepare_kernel_cred() can no longer exceed the privileges of the init task, which may have changed from the original init_cred (e.g. dropping capabilities from the bounding set). [1] https://google.com/search?q=commit_creds(prepare_kernel_cred(0)) Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: David Howells <dhowells@redhat.com> Cc: "Rafael J. Wysocki" <rafael@kernel.org> Cc: Steve French <sfrench@samba.org> Cc: Ronnie Sahlberg <lsahlber@redhat.com> Cc: Shyam Prasad N <sprasad@microsoft.com> Cc: Tom Talpey <tom@talpey.com> Cc: Namjae Jeon <linkinjeon@kernel.org> Cc: Trond Myklebust <trond.myklebust@hammerspace.com> Cc: Anna Schumaker <anna@kernel.org> Cc: Chuck Lever <chuck.lever@oracle.com> Cc: Jeff Layton <jlayton@kernel.org> Cc: "David S. Miller" <davem@davemloft.net> Cc: Eric Dumazet <edumazet@google.com> Cc: Jakub Kicinski <kuba@kernel.org> Cc: Paolo Abeni <pabeni@redhat.com> Cc: "Michal Koutný" <mkoutny@suse.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: linux-cifs@vger.kernel.org Cc: samba-technical@lists.samba.org Cc: linux-nfs@vger.kernel.org Cc: netdev@vger.kernel.org Signed-off-by:
Kees Cook <keescook@chromium.org> Acked-by:
Luis Chamberlain <mcgrof@kernel.org> Reviewed-by:
Sergey Senozhatsky <senozhatsky@chromium.org> Acked-by:
Russ Weight <russell.h.weight@intel.com> Acked-by:
Greg Kroah-Hartman <gregkh@linuxfoundation.org> Acked-by:
Paulo Alcantara (SUSE) <pc@cjr.nz> Link: https://lore.kernel.org/r/20221026232943.never.775-kees@kernel.org
-
- Oct 12, 2022
-
-
Steven Rostedt (Google) authored
The follow commands caused a crash: # cd /sys/kernel/tracing # echo 's:open char file[]' > dynamic_events # echo 'hist:keys=common_pid:file=filename:onchange($file).trace(open,$file)' > events/syscalls/sys_enter_openat/trigger' # echo 1 > events/synthetic/open/enable BOOM! The problem is that the synthetic event field "char file[]" will read the value given to it as a string without any memory checks to make sure the address is valid. The above example will pass in the user space address and the sythetic event code will happily call strlen() on it and then strscpy() where either one will cause an oops when accessing user space addresses. Use the helper functions from trace_kprobe and trace_eprobe that can read strings safely (and actually succeed when the address is from user space and the memory is mapped in). Now the above can show: packagekitd-1721 [000] ...2. 104.597170: open: file=/usr/lib/rpm/fileattrs/cmake.attr in:imjournal-978 [006] ...2. 104.599642: open: file=/var/lib/rsyslog/imjournal.state.tmp packagekitd-1721 [000] ...2. 104.626308: open: file=/usr/lib/rpm/fileattrs/debuginfo.attr Link: https://lkml.kernel.org/r/20221012104534.826549315@goodmis.org Cc: stable@vger.kernel.org Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Tom Zanussi <zanussi@kernel.org> Acked-by:
Masami Hiramatsu (Google) <mhiramat@kernel.org> Reviewed-by:
Tom Zanussi <zanussi@kernel.org> Fixes: bd82631d ("tracing: Add support for dynamic strings to synthetic events") Signed-off-by:
Steven Rostedt (Google) <rostedt@goodmis.org>
-
Steven Rostedt (Google) authored
Have the specific functions for kernel probes that read strings to inject the "(fault)" name directly. trace_probes.c does this too (for uprobes) but as the code to read strings are going to be used by synthetic events (and perhaps other utilities), it simplifies the code by making sure those other uses do not need to implement the "(fault)" name injection as well. Link: https://lkml.kernel.org/r/20221012104534.644803645@goodmis.org Cc: stable@vger.kernel.org Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Tom Zanussi <zanussi@kernel.org> Acked-by:
Masami Hiramatsu (Google) <mhiramat@kernel.org> Reviewed-by:
Tom Zanussi <zanussi@kernel.org> Fixes: bd82631d ("tracing: Add support for dynamic strings to synthetic events") Signed-off-by:
Steven Rostedt (Google) <rostedt@goodmis.org>
-
Steven Rostedt (Google) authored
The functions: fetch_store_strlen_user() fetch_store_strlen() fetch_store_string_user() fetch_store_string() are identical in both trace_kprobe.c and trace_eprobe.c. Move them into a new header file trace_probe_kernel.h to share it. This code will later be used by the synthetic events as well. Marked for stable as a fix for a crash in synthetic events requires it. Link: https://lkml.kernel.org/r/20221012104534.467668078@goodmis.org Cc: stable@vger.kernel.org Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Tom Zanussi <zanussi@kernel.org> Acked-by:
Masami Hiramatsu (Google) <mhiramat@kernel.org> Reviewed-by:
Tom Zanussi <zanussi@kernel.org> Fixes: bd82631d ("tracing: Add support for dynamic strings to synthetic events") Signed-off-by:
Steven Rostedt (Google) <rostedt@goodmis.org>
-
Jiapeng Chong authored
kernel/trace/ring_buffer.c:895: warning: expecting prototype for ring_buffer_nr_pages_dirty(). Prototype was for ring_buffer_nr_dirty_pages() instead. kernel/trace/ring_buffer.c:5313: warning: expecting prototype for ring_buffer_reset_cpu(). Prototype was for ring_buffer_reset_online_cpus() instead. kernel/trace/ring_buffer.c:5382: warning: expecting prototype for rind_buffer_empty(). Prototype was for ring_buffer_empty() instead. Link: https://bugzilla.openanolis.cn/show_bug.cgi?id=2340 Link: https://lkml.kernel.org/r/20221009020642.12506-1-jiapeng.chong@linux.alibaba.com Reported-by:
Abaci Robot <abaci@linux.alibaba.com> Signed-off-by:
Jiapeng Chong <jiapeng.chong@linux.alibaba.com> Signed-off-by:
Steven Rostedt (Google) <rostedt@goodmis.org>
-
Zheng Yejian authored
When ftrace bug happened, following log shows every hex data in problematic ip address: actual: ffffffe8:6b:ffffffd9:01:21 But so many 'f's seem a little confusing, and that is because format '%x' being used to print signed chars in array 'ins'. As suggested by Joe, change to use format "%*phC" to print array 'ins'. After this patch, the log is like: actual: e8:6b:d9:01:21 Link: https://lkml.kernel.org/r/20221011120352.1878494-1-zhengyejian1@huawei.com Fixes: 6c14133d ("ftrace: Do not blindly read the ip address in ftrace_bug()") Suggested-by:
Joe Perches <joe@perches.com> Signed-off-by:
Zheng Yejian <zhengyejian1@huawei.com> Signed-off-by:
Steven Rostedt (Google) <rostedt@goodmis.org>
-
Xu Panda authored
linux/sched/mm.h is included more than once. Link: https://lkml.kernel.org/r/20220912071556.16811-1-xu.panda@zte.com.cn Signed-off-by:
Xu Panda <xu.panda@zte.com.cn> Reported-by:
Zeal Robot <zealci@zte.com.cn> Cc: Andy Lutomirski <luto@kernel.org> Cc: Christian Brauner (Microsoft) <brauner@kernel.org> Cc: "Eric W . Biederman" <ebiederm@xmission.com> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org>
-
- Oct 11, 2022
-
-
Jason A. Donenfeld authored
The prandom_bytes() function has been a deprecated inline wrapper around get_random_bytes() for several releases now, and compiles down to the exact same code. Replace the deprecated wrapper with a direct call to the real function. This was done as a basic find and replace. Reviewed-by:
Greg Kroah-Hartman <gregkh@linuxfoundation.org> Reviewed-by:
Kees Cook <keescook@chromium.org> Reviewed-by:
Yury Norov <yury.norov@gmail.com> Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu> # powerpc Acked-by:
Jakub Kicinski <kuba@kernel.org> Signed-off-by:
Jason A. Donenfeld <Jason@zx2c4.com>
-
Jason A. Donenfeld authored
The prandom_u32() function has been a deprecated inline wrapper around get_random_u32() for several releases now, and compiles down to the exact same code. Replace the deprecated wrapper with a direct call to the real function. The same also applies to get_random_int(), which is just a wrapper around get_random_u32(). This was done as a basic find and replace. Reviewed-by:
Greg Kroah-Hartman <gregkh@linuxfoundation.org> Reviewed-by:
Kees Cook <keescook@chromium.org> Reviewed-by:
Yury Norov <yury.norov@gmail.com> Reviewed-by: Jan Kara <jack@suse.cz> # for ext4 Acked-by: Toke Høiland-Jørgensen <toke@toke.dk> # for sch_cake Acked-by: Chuck Lever <chuck.lever@oracle.com> # for nfsd Acked-by:
Jakub Kicinski <kuba@kernel.org> Acked-by: Mika Westerberg <mika.westerberg@linux.intel.com> # for thunderbolt Acked-by: Darrick J. Wong <djwong@kernel.org> # for xfs Acked-by: Helge Deller <deller@gmx.de> # for parisc Acked-by: Heiko Carstens <hca@linux.ibm.com> # for s390 Signed-off-by:
Jason A. Donenfeld <Jason@zx2c4.com>
-
Jason A. Donenfeld authored
Rather than incurring a division or requesting too many random bytes for the given range, use the prandom_u32_max() function, which only takes the minimum required bytes from the RNG and avoids divisions. This was done mechanically with this coccinelle script: @basic@ expression E; type T; identifier get_random_u32 =~ "get_random_int|prandom_u32|get_random_u32"; typedef u64; @@ ( - ((T)get_random_u32() % (E)) + prandom_u32_max(E) | - ((T)get_random_u32() & ((E) - 1)) + prandom_u32_max(E * XXX_MAKE_SURE_E_IS_POW2) | - ((u64)(E) * get_random_u32() >> 32) + prandom_u32_max(E) | - ((T)get_random_u32() & ~PAGE_MASK) + prandom_u32_max(PAGE_SIZE) ) @multi_line@ identifier get_random_u32 =~ "get_random_int|prandom_u32|get_random_u32"; identifier RAND; expression E; @@ - RAND = get_random_u32(); ... when != RAND - RAND %= (E); + RAND = prandom_u32_max(E); // Find a potential literal @literal_mask@ expression LITERAL; type T; identifier get_random_u32 =~ "get_random_int|prandom_u32|get_random_u32"; position p; @@ ((T)get_random_u32()@p & (LITERAL)) // Add one to the literal. @script:python add_one@ literal << literal_mask.LITERAL; RESULT; @@ value = None if literal.startswith('0x'): value = int(literal, 16) elif literal[0] in '123456789': value = int(literal, 10) if value is None: print("I don't know how to handle %s" % (literal)) cocci.include_match(False) elif value == 2**32 - 1 or value == 2**31 - 1 or value == 2**24 - 1 or value == 2**16 - 1 or value == 2**8 - 1: print("Skipping 0x%x for cleanup elsewhere" % (value)) cocci.include_match(False) elif value & (value + 1) != 0: print("Skipping 0x%x because it's not a power of two minus one" % (value)) cocci.include_match(False) elif literal.startswith('0x'): coccinelle.RESULT = cocci.make_expr("0x%x" % (value + 1)) else: coccinelle.RESULT = cocci.make_expr("%d" % (value + 1)) // Replace the literal mask with the calculated result. @plus_one@ expression literal_mask.LITERAL; position literal_mask.p; expression add_one.RESULT; identifier FUNC; @@ - (FUNC()@p & (LITERAL)) + prandom_u32_max(RESULT) @collapse_ret@ type T; identifier VAR; expression E; @@ { - T VAR; - VAR = (E); - return VAR; + return E; } @drop_var@ type T; identifier VAR; @@ { - T VAR; ... when != VAR } Reviewed-by:
Greg Kroah-Hartman <gregkh@linuxfoundation.org> Reviewed-by:
Kees Cook <keescook@chromium.org> Reviewed-by:
Yury Norov <yury.norov@gmail.com> Reviewed-by:
KP Singh <kpsingh@kernel.org> Reviewed-by: Jan Kara <jack@suse.cz> # for ext4 and sbitmap Reviewed-by: Christoph Böhmwalder <christoph.boehmwalder@linbit.com> # for drbd Acked-by:
Jakub Kicinski <kuba@kernel.org> Acked-by: Heiko Carstens <hca@linux.ibm.com> # for s390 Acked-by: Ulf Hansson <ulf.hansson@linaro.org> # for mmc Acked-by: Darrick J. Wong <djwong@kernel.org> # for xfs Signed-off-by:
Jason A. Donenfeld <Jason@zx2c4.com>
-
- Oct 10, 2022
-
-
Aaron Tomlin authored
This ensures that no module record/or entry is added to the unloaded_tainted_modules list if it does not carry a taint. Reported-by:
Alexey Dobriyan <adobriyan@gmail.com> Fixes: 99bd9956 ("module: Introduce module unload taint tracking") Signed-off-by:
Aaron Tomlin <atomlin@redhat.com> Acked-by:
Luis Chamberlain <mcgrof@kernel.org> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
-
- Oct 06, 2022
-
-
Valentin Schneider authored
This removes the second use of the sched_core_mask temporary mask. Suggested-by:
Yury Norov <yury.norov@gmail.com> Signed-off-by:
Valentin Schneider <vschneid@redhat.com>
-
Steven Rostedt (Google) authored
The ftrace_boot_snapshot and alloc_snapshot cmdline options allocate the snapshot buffer at boot up for use later. The ftrace_boot_snapshot in particular requires the snapshot to be allocated because it will take a snapshot at the end of boot up allowing to see the traces that happened during boot so that it's not lost when user space takes over. When a tracer is registered (started) there's a path that checks if it requires the snapshot buffer or not, and if it does not and it was allocated it will do a synchronization and free the snapshot buffer. This is only required if the previous tracer was using it for "max latency" snapshots, as it needs to make sure all max snapshots are complete before freeing. But this is only needed if the previous tracer was using the snapshot buffer for latency (like irqoff tracer and friends). But it does not make sense to free it, if the previous tracer was not using it, and the snapshot was allocated by the cmdline parameters. This basically takes away the point of allocating it in the first place! Note, the allocated snapshot worked fine for just trace events, but fails when a tracer is enabled on the cmdline. Further investigation, this goes back even further and it does not require a tracer on the cmdline to fail. Simply enable snapshots and then enable a tracer, and it will remove the snapshot. Link: https://lkml.kernel.org/r/20221005113757.041df7fe@gandalf.local.home Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: stable@vger.kernel.org Fixes: 45ad21ca ("tracing: Have trace_array keep track if snapshot buffer is allocated") Reported-by:
Ross Zwisler <zwisler@kernel.org> Tested-by:
Ross Zwisler <zwisler@kernel.org> Signed-off-by:
Steven Rostedt (Google) <rostedt@goodmis.org>
-
Steven Rostedt (Google) authored
Weak functions started causing havoc as they showed up in the "available_filter_functions" and this confused people as to why some functions marked as "notrace" were listed, but when enabled they did nothing. This was because weak functions can still have fentry calls, and these addresses get added to the "available_filter_functions" file. kallsyms is what converts those addresses to names, and since the weak functions are not listed in kallsyms, it would just pick the function before that. To solve this, there was a trick to detect weak functions listed, and these records would be marked as DISABLED so that they do not get enabled and are mostly ignored. As the processing of the list of all functions to figure out what is weak or not can take a long time, this process is put off into a kernel thread and run in parallel with the rest of start up. Now the issue happens whet function tracing is enabled via the kernel command line. As it starts very early in boot up, it can be enabled before the records that are weak are marked to be disabled. This causes an issue in the accounting, as the weak records are enabled by the command line function tracing, but after boot up, they are not disabled. The ftrace records have several accounting flags and a ref count. The DISABLED flag is just one. If the record is enabled before it is marked DISABLED it will get an ENABLED flag and also have its ref counter incremented. After it is marked for DISABLED, neither the ENABLED flag nor the ref counter is cleared. There's sanity checks on the records that are performed after an ftrace function is registered or unregistered, and this detected that there were records marked as ENABLED with ref counter that should not have been. Note, the module loading code uses the DISABLED flag as well to keep its functions from being modified while its being loaded and some of these flags may get set in this process. So changing the verification code to ignore DISABLED records is a no go, as it still needs to verify that the module records are working too. Also, the weak functions still are calling a trampoline. Even though they should never be called, it is dangerous to leave these weak functions calling a trampoline that is freed, so they should still be set back to nops. There's two places that need to not skip records that have the ENABLED and the DISABLED flags set. That is where the ftrace_ops is processed and sets the records ref counts, and then later when the function itself is to be updated, and the ENABLED flag gets removed. Add a helper function "skip_record()" that returns true if the record has the DISABLED flag set but not the ENABLED flag. Link: https://lkml.kernel.org/r/20221005003809.27d2b97b@gandalf.local.home Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: stable@vger.kernel.org Fixes: b39181f7 ("ftrace: Add FTRACE_MCOUNT_MAX_OFFSET to avoid adding weak function") Signed-off-by:
Steven Rostedt (Google) <rostedt@goodmis.org>
-
- Oct 04, 2022
-
-
Kai-Heng Feng authored
Add SYS_OFF_MODE_RESTART_PREPARE callbacks to be invoked before a system restart. Suggested-by:
Dmitry Osipenko <dmitry.osipenko@collabora.com> Reviewed-by:
Dmitry Osipenko <dmitry.osipenko@collabora.com> Signed-off-by:
Kai-Heng Feng <kai.heng.feng@canonical.com> [ rjw: Changelog edits ] Signed-off-by:
Rafael J. Wysocki <rafael.j.wysocki@intel.com>
-
Peter Zijlstra authored
Perf fuzzer gifted a lockdep splat: perf_event_init_context() mutex_lock(parent_ctx->mutex); (B) inherit_task_group() inherit_group() inherit_event() perf_event_alloc() perf_try_init_event() := hw_breakpoint_event_init() register_perf_hw_breakpoint() mutex_lock(child->perf_event_mutex); (A) Which is against the normal (documented) order. Now, this is a false positive in that child is not published yet, but also inherited events never end up on ->perf_event_list. Annotate this one away. Fixes: 0912037f ("perf/hw_breakpoint: Reduce contention with large number of tasks") Signed-off-by:
Peter Zijlstra (Intel) <peterz@infradead.org>
-
Peter Zijlstra authored
Mark reported that the new for_each_sibling_event() assertion triggers in pmu_filter_match() -- which isn't always called with IRQs disabled or ctx->mutex held. Fixes: f3c0eba2 ("perf: Add a few assertions") Reported-by:
Mark Rutland <mark.rutland@arm.com> Signed-off-by:
Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/YvvJq2f/7eFVcnNy@FVFF77S0Q05N
-
- Oct 03, 2022
-
-
wuchi authored
kvcalloc() is safer because it will check the integer overflows, and using it will simple the logic of allocation size. Link: https://lkml.kernel.org/r/20220909101025.82955-1-wuchi.zero@gmail.com Signed-off-by:
wuchi <wuchi.zero@gmail.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org>
-
Zach O'Keefe authored
Add support for MADV_COLLAPSE to collapse shmem-backed and file-backed memory into THPs (requires CONFIG_READ_ONLY_THP_FOR_FS=y). On success, the backing memory will be a hugepage. For the memory range and process provided, the page tables will synchronously have a huge pmd installed, mapping the THP. Other mappings of the file extent mapped by the memory range may be added to a set of entries that khugepaged will later process and attempt update their page tables to map the THP by a pmd. This functionality unlocks two important uses: (1) Immediately back executable text by THPs. Current support provided by CONFIG_READ_ONLY_THP_FOR_FS may take a long time on a large system which might impair services from serving at their full rated load after (re)starting. Tricks like mremap(2)'ing text onto anonymous memory to immediately realize iTLB performance prevents page sharing and demand paging, both of which increase steady state memory footprint. Now, we can have the best of both worlds: Peak upfront performance and lower RAM footprints. (2) userfaultfd-based live migration of virtual machines satisfy UFFD faults by fetching native-sized pages over the network (to avoid latency of transferring an entire hugepage). However, after guest memory has been fully copied to the new host, MADV_COLLAPSE can be used to immediately increase guest performance. Since khugepaged is single threaded, this change now introduces possibility of collapse contexts racing in file collapse path. There a important few places to consider: (1) hpage_collapse_scan_file(), when we xas_pause() and drop RCU. We could have the memory collapsed out from under us, but the next xas_for_each() iteration will correctly pick up the hugepage. The hugepage might not be up to date (insofar as copying of small page contents might not have completed - the page still may be locked), but regardless what small page index we were iterating over, we'll find the hugepage and identify it as a suitably aligned compound page of order HPAGE_PMD_ORDER. In khugepaged path, we locklessly check the value of the pmd, and only add it to deferred collapse array if we find pmd mapping pte table. This is fine, since other values that could have raced in right afterwards denote failure, or that the memory was successfully collapsed, so we don't need further processing. In madvise path, we'll take mmap_lock() in write to serialize against page table updates and will know what to do based on the true value of the pmd: recheck all ptes if we point to a pte table, directly install the pmd, if the pmd has been cleared, but memory not yet faulted, or nothing at all if we find a huge pmd. It's worth putting emphasis here on how we treat the none pmd here. If khugepaged has processed this mm's page tables already, it will have left the pmd cleared (ready for refault by the process). Depending on the VMA flags and sysfs settings, amount of RAM on the machine, and the current load, could be a relatively common occurrence - and as such is one we'd like to handle successfully in MADV_COLLAPSE. When we see the none pmd in collapse_pte_mapped_thp(), we've locked mmap_lock in write and checked (a) huepaged_vma_check() to see if the backing memory is appropriate still, along with VMA sizing and appropriate hugepage alignment within the file, and (b) we've found a hugepage head of order HPAGE_PMD_ORDER at the offset in the file mapped by our hugepage-aligned virtual address. Even though the common-case is likely race with khugepaged, given these checks (regardless how we got here - we could be operating on a completely different file than originally checked in hpage_collapse_scan_file() for all we know) it should be safe to directly make the pmd a huge pmd pointing to this hugepage. (2) collapse_file() is mostly serialized on the same file extent by lock sequence: | lock hupepage | lock mapping->i_pages | lock 1st page | unlock mapping->i_pages | <page checks> | lock mapping->i_pages | page_ref_freeze(3) | xas_store(hugepage) | unlock mapping->i_pages | page_ref_unfreeze(1) | unlock 1st page V unlock hugepage Once a context (who already has their fresh hugepage locked) locks mapping->i_pages exclusively, it will hold said lock until it locks the first page, and it will hold that lock until the after the hugepage has been added to the page cache (and will unlock the hugepage after page table update, though that isn't important here). A racing context that loses the race for mapping->i_pages will then lose the race to locking the first page. Here - depending on how far the other racing context has gotten - we might find the new hugepage (in which case we'll exit cleanly when we check PageTransCompound()), or we'll find the "old" 1st small page (in which we'll exit cleanly when we discover unexpected refcount of 2 after isolate_lru_page()). This is assuming we are able to successfully lock the page we find - in shmem path, we could just fail the trylock and exit cleanly anyways. Failure path in collapse_file() is similar: once we hold lock on 1st small page, we are serialized against other collapse contexts. Before the 1st small page is unlocked, we add it back to the pagecache and unfreeze the refcount appropriately. Contexts who lost the race to the 1st small page will then find the same 1st small page with the correct refcount and will be able to proceed. [zokeefe@google.com: don't check pmd value twice in collapse_pte_mapped_thp()] Link: https://lkml.kernel.org/r/20220927033854.477018-1-zokeefe@google.com [shy828301@gmail.com: Delete hugepage_vma_revalidate_anon(), remove check for multi-add in khugepaged_add_pte_mapped_thp()] Link: https://lore.kernel.org/linux-mm/CAHbLzkrtpM=ic7cYAHcqkubah5VTR8N5=k5RT8MTvv5rN1Y91w@mail.gmail.com/ Link: https://lkml.kernel.org/r/20220907144521.3115321-4-zokeefe@google.com Link: https://lkml.kernel.org/r/20220922224046.1143204-4-zokeefe@google.com Signed-off-by:
Zach O'Keefe <zokeefe@google.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Chris Kennelly <ckennelly@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: James Houghton <jthoughton@google.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Peter Xu <peterx@redhat.com> Cc: Rongwei Wang <rongwei.wang@linux.alibaba.com> Cc: SeongJae Park <sj@kernel.org> Cc: Song Liu <songliubraving@fb.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Yang Shi <shy828301@gmail.com> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org>
-
Alexander Potapenko authored
When executing BPF programs, certain registers may get passed uninitialized to helper functions. E.g. when performing a JMP_CALL, registers BPF_R1-BPF_R5 are always passed to the helper, no matter how many of them are actually used. Passing uninitialized values as function parameters is technically undefined behavior, so we work around it by always initializing the registers. Link: https://lkml.kernel.org/r/20220915150417.722975-42-glider@google.com Signed-off-by:
Alexander Potapenko <glider@google.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Andrey Konovalov <andreyknvl@google.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Borislav Petkov <bp@alien8.de> Cc: Christoph Hellwig <hch@lst.de> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Eric Biggers <ebiggers@google.com> Cc: Eric Biggers <ebiggers@kernel.org> Cc: Eric Dumazet <edumazet@google.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: Ilya Leoshkevich <iii@linux.ibm.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Kees Cook <keescook@chromium.org> Cc: Marco Elver <elver@google.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michael S. Tsirkin <mst@redhat.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Petr Mladek <pmladek@suse.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Vegard Nossum <vegard.nossum@oracle.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org>
-
Alexander Potapenko authored
struct pt_regs passed into IRQ entry code is set up by uninstrumented asm functions, therefore KMSAN may not notice the registers are initialized. kmsan_unpoison_entry_regs() unpoisons the contents of struct pt_regs, preventing potential false positives. Unlike kmsan_unpoison_memory(), it can be called under kmsan_in_runtime(), which is often the case in IRQ entry code. Link: https://lkml.kernel.org/r/20220915150417.722975-41-glider@google.com Signed-off-by:
Alexander Potapenko <glider@google.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Andrey Konovalov <andreyknvl@google.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Borislav Petkov <bp@alien8.de> Cc: Christoph Hellwig <hch@lst.de> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Eric Biggers <ebiggers@google.com> Cc: Eric Biggers <ebiggers@kernel.org> Cc: Eric Dumazet <edumazet@google.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: Ilya Leoshkevich <iii@linux.ibm.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Kees Cook <keescook@chromium.org> Cc: Marco Elver <elver@google.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michael S. Tsirkin <mst@redhat.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Petr Mladek <pmladek@suse.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Vegard Nossum <vegard.nossum@oracle.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org>
-
Alexander Potapenko authored
KMSAN does not instrument kernel/kcov.c for performance reasons (with CONFIG_KCOV=y virtually every place in the kernel invokes kcov instrumentation). Therefore the tool may miss writes from kcov.c that initialize memory. When CONFIG_DEBUG_LIST is enabled, list pointers from kernel/kcov.c are passed to instrumented helpers in lib/list_debug.c, resulting in false positives. To work around these reports, we unpoison the contents of area->list after initializing it. Link: https://lkml.kernel.org/r/20220915150417.722975-30-glider@google.com Signed-off-by:
Alexander Potapenko <glider@google.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Andrey Konovalov <andreyknvl@google.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Borislav Petkov <bp@alien8.de> Cc: Christoph Hellwig <hch@lst.de> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Eric Biggers <ebiggers@google.com> Cc: Eric Biggers <ebiggers@kernel.org> Cc: Eric Dumazet <edumazet@google.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: Ilya Leoshkevich <iii@linux.ibm.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Kees Cook <keescook@chromium.org> Cc: Marco Elver <elver@google.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michael S. Tsirkin <mst@redhat.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Petr Mladek <pmladek@suse.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Vegard Nossum <vegard.nossum@oracle.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org>
-
Alexander Potapenko authored
KMSAN doesn't know about DMA memory writes performed by devices. We unpoison such memory when it's mapped to avoid false positive reports. Link: https://lkml.kernel.org/r/20220915150417.722975-22-glider@google.com Signed-off-by:
Alexander Potapenko <glider@google.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Andrey Konovalov <andreyknvl@google.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Borislav Petkov <bp@alien8.de> Cc: Christoph Hellwig <hch@lst.de> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Eric Biggers <ebiggers@google.com> Cc: Eric Biggers <ebiggers@kernel.org> Cc: Eric Dumazet <edumazet@google.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: Ilya Leoshkevich <iii@linux.ibm.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Kees Cook <keescook@chromium.org> Cc: Marco Elver <elver@google.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michael S. Tsirkin <mst@redhat.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Petr Mladek <pmladek@suse.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Vegard Nossum <vegard.nossum@oracle.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org>
-
Alexander Potapenko authored
Tell KMSAN that a new task is created, so the tool creates a backing metadata structure for that task. Link: https://lkml.kernel.org/r/20220915150417.722975-17-glider@google.com Signed-off-by:
Alexander Potapenko <glider@google.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Andrey Konovalov <andreyknvl@google.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Borislav Petkov <bp@alien8.de> Cc: Christoph Hellwig <hch@lst.de> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Eric Biggers <ebiggers@google.com> Cc: Eric Biggers <ebiggers@kernel.org> Cc: Eric Dumazet <edumazet@google.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: Ilya Leoshkevich <iii@linux.ibm.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Kees Cook <keescook@chromium.org> Cc: Marco Elver <elver@google.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michael S. Tsirkin <mst@redhat.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Petr Mladek <pmladek@suse.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Vegard Nossum <vegard.nossum@oracle.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org>
-
Alexander Potapenko authored
EFI stub cannot be linked with KMSAN runtime, so we disable instrumentation for it. Instrumenting kcov, stackdepot or lockdep leads to infinite recursion caused by instrumentation hooks calling instrumented code again. Link: https://lkml.kernel.org/r/20220915150417.722975-13-glider@google.com Signed-off-by:
Alexander Potapenko <glider@google.com> Reviewed-by:
Marco Elver <elver@google.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Andrey Konovalov <andreyknvl@google.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Borislav Petkov <bp@alien8.de> Cc: Christoph Hellwig <hch@lst.de> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Eric Biggers <ebiggers@google.com> Cc: Eric Biggers <ebiggers@kernel.org> Cc: Eric Dumazet <edumazet@google.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: Ilya Leoshkevich <iii@linux.ibm.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Kees Cook <keescook@chromium.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michael S. Tsirkin <mst@redhat.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Petr Mladek <pmladek@suse.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Vegard Nossum <vegard.nossum@oracle.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org>
-
Mike Kravetz authored
Allocate a new hugetlb_vma_lock structure and hang off vm_private_data for synchronization use by vmas that could be involved in pmd sharing. This data structure contains a rw semaphore that is the primary tool used for synchronization. This new structure is ref counted, so that it can exist when NOT attached to a vma. This is only helpful in resolving lock ordering issues where code may need to obtain the vma_lock while there are no guarantees the vma may go away. By obtaining a ref on the structure, it can be guaranteed that at least the rw semaphore will not go away. Only add infrastructure for the new lock here. Actual use will be added in subsequent patches. [mike.kravetz@oracle.com: fix build issue for missing hugetlb_vma_lock_release] Link: https://lkml.kernel.org/r/YyNUtA1vRASOE4+M@monkey Link: https://lkml.kernel.org/r/20220914221810.95771-7-mike.kravetz@oracle.com Signed-off-by:
Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by:
Miaohe Lin <linmiaohe@huawei.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: James Houghton <jthoughton@google.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mina Almasry <almasrymina@google.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Peter Xu <peterx@redhat.com> Cc: Prakash Sangappa <prakash.sangappa@oracle.com> Cc: Sven Schnelle <svens@linux.ibm.com> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org>
-
Matthew Wilcox (Oracle) authored
Saves several calls to compound_head(). Link: https://lkml.kernel.org/r/20220902194653.1739778-57-willy@infradead.org Signed-off-by:
Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org>
-
Matthew Wilcox (Oracle) authored
Remove a few hidden calls to compound_head(). Link: https://lkml.kernel.org/r/20220902194653.1739778-45-willy@infradead.org Signed-off-by:
Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org>
-
Beau Belgrave authored
In order to enable namespaces or any sort of isolation within user_events the register lock and pages need to be broken up into groups. Each event and file now has a group pointer which stores the actual pages to map, lookup data and synchronization objects. This only enables a single group that maps to init_user_ns, as IMA namespace has done. This enables user_events to start the work of supporting namespaces by walking the namespaces up to the init_user_ns. Future patches will address other user namespaces and will align to the approaches the IMA namespace uses. Link: https://lore.kernel.org/linux-kernel/20220915193221.1728029-15-stefanb@linux.ibm.com/#t Link: https://lkml.kernel.org/r/20221001001016.2832-2-beaub@linux.microsoft.com Signed-off-by:
Beau Belgrave <beaub@linux.microsoft.com> Signed-off-by:
Steven Rostedt (Google) <rostedt@goodmis.org>
-
Chen Zhongjin authored
Reported by Clang [-Wunused-but-set-variable] 'commit c193707d ("tracing: Remove code which merges duplicates")' This commit removed the code which merges duplicates in detect_dups(), but forgot to delete the variable 'dups' which used to merge duplicates in the loop. Now only 'total_dups' is needed, remove 'dups' for clean code. Link: https://lkml.kernel.org/r/20220930103236.253985-1-chenzhongjin@huawei.com Signed-off-by:
Chen Zhongjin <chenzhongjin@huawei.com> Signed-off-by:
Steven Rostedt (Google) <rostedt@goodmis.org>
-
- Sep 30, 2022
-
-
Peter Zijlstra authored
Boris reported hung_task splats after commit 5aec788a ("sched: Fix TASK_state comparisons"). Upon closer consideration of that change it doesn't only exclude TASK_KILLABLE, but also TASK_IDLE. Update the comment to reflect this fact and add an additional TASK_NOLOAD test to exclude them. Additionally, remove the TASK_FREEZABLE early exit from check_hung_task(), a freezable task is not a frozen task. Fixes: 5aec788a ("sched: Fix TASK_state comparisons") Reported-by:
Borislav Petkov <bp@alien8.de> Signed-off-by:
Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by:
Borislav Petkov <bp@alien8.de>
-
- Sep 29, 2022
-
-
Jason A. Donenfeld authored
On some small machines with little entropy, a quasi-unique hostname is sometimes a relevant factor. I've seen, for example, 8 character alpha-numeric serial numbers. In addition, the time at which the hostname is set is usually a decent measurement of how long early boot took. So, call add_device_randomness() on new hostnames, which feeds its arguments to the RNG in addition to a fresh cycle counter. Low cost hooks like this never hurt and can only ever help, and since this costs basically nothing for an operation that is never a fast path, this is an overall easy win. Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Dominik Brodowski <linux@dominikbrodowski.net> Reviewed-by:
Kees Cook <keescook@chromium.org> Signed-off-by:
Jason A. Donenfeld <Jason@zx2c4.com>
-
Martin KaFai Lau authored
The struct_ops prog is to allow using bpf to implement the functions in a struct (eg. kernel module). The current usage is to implement the tcp_congestion. The kernel does not call the tcp-cc's ops (ie. the bpf prog) in a recursive way. The struct_ops is sharing the tracing-trampoline's enter/exit function which tracks prog->active to avoid recursion. It is needed for tracing prog. However, it turns out the struct_ops bpf prog will hit this prog->active and unnecessarily skipped running the struct_ops prog. eg. The '.ssthresh' may run in_task() and then interrupted by softirq that runs the same '.ssthresh'. Skip running the '.ssthresh' will end up returning random value to the caller. The patch adds __bpf_prog_{enter,exit}_struct_ops for the struct_ops trampoline. They do not track the prog->active to detect recursion. One exception is when the tcp_congestion's '.init' ops is doing bpf_setsockopt(TCP_CONGESTION) and then recurs to the same '.init' ops. This will be addressed in the following patches. Fixes: ca06f55b ("bpf: Add per-program recursion prevention mechanism") Signed-off-by:
Martin KaFai Lau <martin.lau@kernel.org> Link: https://lore.kernel.org/r/20220929070407.965581-2-martin.lau@linux.dev Signed-off-by:
Alexei Starovoitov <ast@kernel.org>
-
Steven Rostedt (Google) authored
The ring buffer is broken up into sub buffers (currently of page size). Each sub buffer has a pointer to its "tail" (the last event written to the sub buffer). When a new event is requested, the tail is locally incremented to cover the size of the new event. This is done in a way that there is no need for locking. If the tail goes past the end of the sub buffer, the process of moving to the next sub buffer takes place. After setting the current sub buffer to the next one, the previous one that had the tail go passed the end of the sub buffer needs to be reset back to the original tail location (before the new event was requested) and the rest of the sub buffer needs to be "padded". The race happens when a reader takes control of the sub buffer. As readers do a "swap" of sub buffers from the ring buffer to get exclusive access to the sub buffer, it replaces the "head" sub buffer with an empty sub buffer that goes back into the writable portion of the ring buffer. This swap can happen as soon as the writer moves to the next sub buffer and before it updates the last sub buffer with padding. Because the sub buffer can be released to the reader while the writer is still updating the padding, it is possible for the reader to see the event that goes past the end of the sub buffer. This can cause obvious issues. To fix this, add a few memory barriers so that the reader definitely sees the updates to the sub buffer, and also waits until the writer has put back the "tail" of the sub buffer back to the last event that was written on it. To be paranoid, it will only spin for 1 second, otherwise it will warn and shutdown the ring buffer code. 1 second should be enough as the writer does have preemption disabled. If the writer doesn't move within 1 second (with preemption disabled) something is horribly wrong. No interrupt should last 1 second! Link: https://lore.kernel.org/all/20220830120854.7545-1-jiazi.li@transsion.com/ Link: https://bugzilla.kernel.org/show_bug.cgi?id=216369 Link: https://lkml.kernel.org/r/20220929104909.0650a36c@gandalf.local.home Cc: Ingo Molnar <mingo@kernel.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: stable@vger.kernel.org Fixes: c7b09308 ("ring-buffer: prevent adding write in discarded area") Reported-by:
Jiazi.Li <jiazi.li@transsion.com> Signed-off-by:
Steven Rostedt (Google) <rostedt@goodmis.org>
-