- Apr 07, 2020
-
-
Gustavo A. R. Silva authored
The current codebase makes use of the zero-length array language extension to the C90 standard, but the preferred mechanism to declare variable-length types such as these ones is a flexible array member[1][2], introduced in C99: struct foo { int stuff; struct boo array[]; }; By making use of the mechanism above, we will get a compiler warning in case the flexible array does not occur last in the structure, which will help us prevent some kind of undefined behavior bugs from being inadvertently introduced[3] to the codebase from now on. Also, notice that, dynamic memory allocations won't be affected by this change: "Flexible array members have incomplete type, and so the sizeof operator may not be applied. As a quirk of the original implementation of zero-length arrays, sizeof evaluates to zero."[1] This issue was found with the help of Coccinelle. [1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html [2] https://github.com/KSPP/linux/issues/21 [3] commit 76497732 ("cxgb3/l2t: Fix undefined behaviour") Signed-off-by:
Gustavo A. R. Silva <gustavo@embeddedor.com> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Cc: Peter Oberparleiter <oberpar@linux.ibm.com> Link: http://lkml.kernel.org/r/20200302224851.GA26467@embeddedor Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
-
Gustavo A. R. Silva authored
The current codebase makes use of the zero-length array language extension to the C90 standard, but the preferred mechanism to declare variable-length types such as these ones is a flexible array member[1][2], introduced in C99: struct foo { int stuff; struct boo array[]; }; By making use of the mechanism above, we will get a compiler warning in case the flexible array does not occur last in the structure, which will help us prevent some kind of undefined behavior bugs from being inadvertently introduced[3] to the codebase from now on. Also, notice that, dynamic memory allocations won't be affected by this change: "Flexible array members have incomplete type, and so the sizeof operator may not be applied. As a quirk of the original implementation of zero-length arrays, sizeof evaluates to zero."[1] This issue was found with the help of Coccinelle. [1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html [2] https://github.com/KSPP/linux/issues/21 [3] commit 76497732 ("cxgb3/l2t: Fix undefined behaviour") Signed-off-by:
Gustavo A. R. Silva <gustavo@embeddedor.com> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Cc: Peter Oberparleiter <oberpar@linux.ibm.com> Link: http://lkml.kernel.org/r/20200302224501.GA14175@embeddedor Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
-
Gustavo A. R. Silva authored
The current codebase makes use of the zero-length array language extension to the C90 standard, but the preferred mechanism to declare variable-length types such as these ones is a flexible array member[1][2], introduced in C99: struct foo { int stuff; struct boo array[]; }; By making use of the mechanism above, we will get a compiler warning in case the flexible array does not occur last in the structure, which will help us prevent some kind of undefined behavior bugs from being inadvertently introduced[3] to the codebase from now on. Also, notice that, dynamic memory allocations won't be affected by this change: "Flexible array members have incomplete type, and so the sizeof operator may not be applied. As a quirk of the original implementation of zero-length arrays, sizeof evaluates to zero."[1] This issue was found with the help of Coccinelle. [1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html [2] https://github.com/KSPP/linux/issues/21 [3] commit 76497732 ("cxgb3/l2t: Fix undefined behaviour") Signed-off-by:
Gustavo A. R. Silva <gustavo@embeddedor.com> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Acked-by:
Peter Oberparleiter <oberpar@linux.ibm.com> Link: http://lkml.kernel.org/r/20200213152241.GA877@embeddedor Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
-
Qiujun Huang authored
There is a typo in comment. Fix it. s/assuems/assumes/ Signed-off-by:
Qiujun Huang <hqjagain@gmail.com> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Acked-by:
Luis Chamberlain <mcgrof@kernel.org> Link: http://lkml.kernel.org/r/1585891029-6450-1-git-send-email-hqjagain@gmail.com Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
-
Will Deacon authored
kallsyms_lookup_name() and kallsyms_on_each_symbol() are exported to modules despite having no in-tree users and being wide open to abuse by out-of-tree modules that can use them as a method to invoke arbitrary non-exported kernel functions. Unexport kallsyms_lookup_name() and kallsyms_on_each_symbol(). Signed-off-by:
Will Deacon <will@kernel.org> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Reviewed-by:
Greg Kroah-Hartman <gregkh@linuxfoundation.org> Reviewed-by:
Christoph Hellwig <hch@lst.de> Reviewed-by:
Masami Hiramatsu <mhiramat@kernel.org> Reviewed-by:
Quentin Perret <qperret@google.com> Acked-by:
Alexei Starovoitov <ast@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Frederic Weisbecker <frederic@kernel.org> Cc: K.Prasad <prasad@linux.vnet.ibm.com> Cc: Miroslav Benes <mbenes@suse.cz> Cc: Petr Mladek <pmladek@suse.com> Cc: Joe Lawrence <joe.lawrence@redhat.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: http://lkml.kernel.org/r/20200221114404.14641-4-will@kernel.org Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
-
Masahiro Yamada authored
Commit ac7c3e4f ("compiler: enable CONFIG_OPTIMIZE_INLINING forcibly") made this always-on option. We released v5.4 and v5.5 including that commit. Remove the CONFIG option and clean up the code now. Signed-off-by:
Masahiro Yamada <masahiroy@kernel.org> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Reviewed-by:
Miguel Ojeda <miguel.ojeda.sandonis@gmail.com> Reviewed-by:
Nathan Chancellor <natechancellor@gmail.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Borislav Petkov <bp@alien8.de> Cc: David Miller <davem@davemloft.net> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20200220110807.32534-2-masahiroy@kernel.org Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
-
Nathan Chancellor authored
Clang warns: ../kernel/extable.c:37:52: warning: array comparison always evaluates to a constant [-Wtautological-compare] if (main_extable_sort_needed && __stop___ex_table > __start___ex_table) { ^ 1 warning generated. These are not true arrays, they are linker defined symbols, which are just addresses. Using the address of operator silences the warning and does not change the resulting assembly with either clang/ld.lld or gcc/ld (tested with diff + objdump -Dr). Suggested-by:
Nick Desaulniers <ndesaulniers@google.com> Signed-off-by:
Nathan Chancellor <natechancellor@gmail.com> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Reviewed-by:
Andrew Morton <akpm@linux-foundation.org> Link: https://github.com/ClangBuiltLinux/linux/issues/892 Link: http://lkml.kernel.org/r/20200219202036.45702-1-natechancellor@gmail.com Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
-
Alexey Dobriyan authored
Now that "struct proc_ops" exist we can start putting there stuff which could not fly with VFS "struct file_operations"... Most of fs/proc/inode.c file is dedicated to make open/read/.../close reliable in the event of disappearing /proc entries which usually happens if module is getting removed. Files like /proc/cpuinfo which never disappear simply do not need such protection. Save 2 atomic ops, 1 allocation, 1 free per open/read/close sequence for such "permanent" files. Enable "permanent" flag for /proc/cpuinfo /proc/kmsg /proc/modules /proc/slabinfo /proc/stat /proc/sysvipc/* /proc/swaps More will come once I figure out foolproof way to prevent out module authors from marking their stuff "permanent" for performance reasons when it is not. This should help with scalability: benchmark is "read /proc/cpuinfo R times by N threads scattered over the system". N R t, s (before) t, s (after) ----------------------------------------------------- 64 4096 1.582458 1.530502 -3.2% 256 4096 6.371926 6.125168 -3.9% 1024 4096 25.64888 24.47528 -4.6% Benchmark source: #include <chrono> #include <iostream> #include <thread> #include <vector> #include <sys/types.h> #include <sys/stat.h> #include <fcntl.h> #include <unistd.h> const int NR_CPUS = sysconf(_SC_NPROCESSORS_ONLN); int N; const char *filename; int R; int xxx = 0; int glue(int n) { cpu_set_t m; CPU_ZERO(&m); CPU_SET(n, &m); return sched_setaffinity(0, sizeof(cpu_set_t), &m); } void f(int n) { glue(n % NR_CPUS); while (*(volatile int *)&xxx == 0) { } for (int i = 0; i < R; i++) { int fd = open(filename, O_RDONLY); char buf[4096]; ssize_t rv = read(fd, buf, sizeof(buf)); asm volatile ("" :: "g" (rv)); close(fd); } } int main(int argc, char *argv[]) { if (argc < 4) { std::cerr << "usage: " << argv[0] << ' ' << "N /proc/filename R "; return 1; } N = atoi(argv[1]); filename = argv[2]; R = atoi(argv[3]); for (int i = 0; i < NR_CPUS; i++) { if (glue(i) == 0) break; } std::vector<std::thread> T; T.reserve(N); for (int i = 0; i < N; i++) { T.emplace_back(f, i); } auto t0 = std::chrono::system_clock::now(); { *(volatile int *)&xxx = 1; for (auto& t: T) { t.join(); } } auto t1 = std::chrono::system_clock::now(); std::chrono::duration<double> dt = t1 - t0; std::cout << dt.count() << ' '; return 0; } P.S.: Explicit randomization marker is added because adding non-function pointer will silently disable structure layout randomization. [akpm@linux-foundation.org: coding style fixes] Reported-by:
kbuild test robot <lkp@intel.com> Reported-by:
Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by:
Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Joe Perches <joe@perches.com> Link: http://lkml.kernel.org/r/20200222201539.GA22576@avx2 Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
-
Anshuman Khandual authored
This replaces all remaining open encodings with is_vm_hugetlb_page(). Signed-off-by:
Anshuman Khandual <anshuman.khandual@arm.com> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Acked-by:
Vlastimil Babka <vbabka@suse.cz> Cc: Paul Mackerras <paulus@ozlabs.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Will Deacon <will@kernel.org> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> Cc: Nick Piggin <npiggin@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Guo Ren <guoren@kernel.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Paul Burton <paulburton@kernel.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Rich Felker <dalias@libc.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Link: http://lkml.kernel.org/r/1582520593-30704-4-git-send-email-anshuman.khandual@arm.com Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
-
Anshuman Khandual authored
Lets move vma_is_accessible() helper to include/linux/mm.h which makes it available for general use. While here, this replaces all remaining open encodings for VMA access check with vma_is_accessible(). Signed-off-by:
Anshuman Khandual <anshuman.khandual@arm.com> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Acked-by:
Geert Uytterhoeven <geert@linux-m68k.org> Acked-by:
Guo Ren <guoren@kernel.org> Acked-by:
Vlastimil Babka <vbabka@suse.cz> Cc: Guo Ren <guoren@kernel.org> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Paul Burton <paulburton@kernel.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Cc: Rich Felker <dalias@libc.org> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Nick Piggin <npiggin@gmail.com> Cc: Paul Mackerras <paulus@ozlabs.org> Cc: Will Deacon <will@kernel.org> Link: http://lkml.kernel.org/r/1582520593-30704-3-git-send-email-anshuman.khandual@arm.com Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
-
Li Xinhai authored
Set ->vm_next and ->vm_prev to NULL to prevent potential misuse from the new duplicated vma. Currently, only in fork path there are misuse for handling anon_vma. No other bugs been revealed with this patch applied. Signed-off-by:
Li Xinhai <lixinhai.lxh@gmail.com> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Acked-by:
Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Rik van Riel <riel@redhat.com> Link: http://lkml.kernel.org/r/1581150928-3214-4-git-send-email-lixinhai.lxh@gmail.com Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
-
Li Xinhai authored
Patch series "mm: Fix misuse of parent anon_vma in dup_mmap path". This patchset fixes the misuse of parenet anon_vma, which mainly caused by child vma's vm_next and vm_prev are left same as its parent after duplicate vma. Finally, code reached parent vma's neighbor by referring pointer of child vma and executed wrong logic. The first two patches fix relevant issues, and the third patch sets vm_next and vm_prev to NULL when duplicate vma to prevent potential misuse in future. Effects of the first bug is that causes rmap code to check both parent and child's page table, although a page couldn't be mapped by both parent and child, because child vma has WIPEONFORK so all pages mapped by child are 'new' and not relevant to parent. Effects of the second bug is that the relationship of anon_vma of parent and child are totallyconvoluted. It would cause 'son', 'grandson', ..., etc, to share 'parent' anon_vma, which disobey the design rule of reusing anon_vma (the rule to be followed is that reusing should among vma of same process, and vma should not gone through fork). So, both issues should cause unnecessary rmap walking and have unexpected complexity. These two issues would not be directly visible, I used debugging code to check the anon_vma pointers of parent and child when inspecting the suspicious implementation of issue #2, then find the problem. This patch (of 3): In dup_mmap(), anon_vma_prepare() is called for vma has VM_WIPEONFORK, and parameter 'tmp' (i.e., the new vma of child) has same ->vm_next and ->vm_prev as its parent vma. That allows anon_vma used by parent been mistakenly shared by child (find_mergeable_anon_vma() will do this reuse work). Besides this issue, call anon_vma_prepare() should be avoided because we don't copy page for this vma. Preparing anon_vma will be handled during fault. Fixes: d2cd9ede ("mm,fork: introduce MADV_WIPEONFORK") Signed-off-by:
Li Xinhai <lixinhai.lxh@gmail.com> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Acked-by:
Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Rik van Riel <riel@redhat.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Link: http://lkml.kernel.org/r/1581150928-3214-2-git-send-email-lixinhai.lxh@gmail.com Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
-
- Apr 03, 2020
-
-
Waiman Long authored
The cpuset in cgroup v1 accepts a special "cpuset_v2_mode" mount option that make cpuset.cpus and cpuset.mems behave more like those in cgroup v2. Document it to make other people more aware of this feature that can be useful in some circumstances. Signed-off-by:
Waiman Long <longman@redhat.com> Signed-off-by:
Tejun Heo <tj@kernel.org>
-
Tejun Heo authored
This reverts commit a49e4629 ("cpuset: Make cpuset hotplug synchronous") as it may deadlock with cpu hotplug path. Link: http://lkml.kernel.org/r/F0388D99-84D7-453B-9B6B-EEFF0E7BE4CC@lca.pw Signed-off-by:
Tejun Heo <tj@kernel.org> Reported-by:
Qian Cai <cai@lca.pw> Cc: Prateek Sood <prsood@codeaurora.org>
-
Steven Rostedt (VMware) authored
When dumping out the trace data in latency format, a check is made to peek at the next event to compare its timestamp to the current one, and if the delta is of a greater size, it will add a marker showing so. But to do this, it needs to save the current event otherwise peeking at the next event will remove the current event. To save the event, a temp buffer is used, and if the event is bigger than the temp buffer, the temp buffer is freed and a bigger buffer is allocated. This allocation is a problem when called in atomic context. The only way this gets called via atomic context is via ftrace_dump(). Thus, use a static buffer of 128 bytes (which covers most events), and if the event is bigger than that, simply return NULL. The callers of trace_find_next_entry() need to handle a NULL case, as that's what would happen if the allocation failed. Link: https://lore.kernel.org/r/20200326091256.GR11705@shao2-debian Fixes: ff895103 ("tracing: Save off entry when peeking at next entry") Reported-by:
kernel test robot <rong.a.chen@intel.com> Signed-off-by:
Steven Rostedt (VMware) <rostedt@goodmis.org>
-
- Apr 02, 2020
-
-
Sebastian Andrzej Siewior authored
Since commit 5bbe3547 ("mm: allow compaction of unevictable pages") it is allowed to examine mlocked pages and compact them by default. On -RT even minor pagefaults are problematic because it may take a few 100us to resolve them and until then the task is blocked. Make compact_unevictable_allowed = 0 default and issue a warning on RT if it is changed. [bigeasy@linutronix.de: v5] Link: https://lore.kernel.org/linux-mm/20190710144138.qyn4tuttdq6h7kqx@linutronix.de/ Link: http://lkml.kernel.org/r/20200319165536.ovi75tsr2seared4@linutronix.de Signed-off-by:
Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Reviewed-by:
Andrew Morton <akpm@linux-foundation.org> Acked-by:
Mel Gorman <mgorman@techsingularity.net> Acked-by:
Vlastimil Babka <vbabka@suse.cz> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Luis Chamberlain <mcgrof@kernel.org> Cc: Kees Cook <keescook@chromium.org> Cc: Iurii Zaikin <yzaikin@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Link: https://lore.kernel.org/linux-mm/20190710144138.qyn4tuttdq6h7kqx@linutronix.de/ Link: http://lkml.kernel.org/r/20200303202225.nhqc3v5gwlb7x6et@linutronix.de Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
-
Sebastian Andrzej Siewior authored
The proc file `compact_unevictable_allowed' should allow 0 and 1 only, the `extra*' attribues have been set properly but without proc_dointvec_minmax() as the `proc_handler' the limit will not be enforced. Use proc_dointvec_minmax() as the `proc_handler' to enfoce the valid specified range. Signed-off-by:
Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Reviewed-by:
Andrew Morton <akpm@linux-foundation.org> Acked-by:
Vlastimil Babka <vbabka@suse.cz> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Luis Chamberlain <mcgrof@kernel.org> Cc: Kees Cook <keescook@chromium.org> Cc: Iurii Zaikin <yzaikin@google.com> Cc: Mel Gorman <mgorman@techsingularity.net> Link: http://lkml.kernel.org/r/20200303202054.gsosv7fsx2ma3cic@linutronix.de Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
-
Johannes Weiner authored
Right now, the effective protection of any given cgroup is capped by its own explicit memory.low setting, regardless of what the parent says. The reasons for this are mostly historical and ease of implementation: to make delegation of memory.low safe, effective protection is the min() of all memory.low up the tree. Unfortunately, this limitation makes it impossible to protect an entire subtree from another without forcing the user to make explicit protection allocations all the way to the leaf cgroups - something that is highly undesirable in real life scenarios. Consider memory in a data center host. At the cgroup top level, we have a distinction between system management software and the actual workload the system is executing. Both branches are further subdivided into individual services, job components etc. We want to protect the workload as a whole from the system management software, but that doesn't mean we want to protect and prioritize individual workload wrt each other. Their memory demand can vary over time, and we'd want the VM to simply cache the hottest data within the workload subtree. Yet, the current memory.low limitations force us to allocate a fixed amount of protection to each workload component in order to get protection from system management software in general. This results in very inefficient resource distribution. Another concern with mandating downward allocation is that, as the complexity of the cgroup tree grows, it gets harder for the lower levels to be informed about decisions made at the host-level. Consider a container inside a namespace that in turn creates its own nested tree of cgroups to run multiple workloads. It'd be extremely difficult to configure memory.low parameters in those leaf cgroups that on one hand balance pressure among siblings as the container desires, while also reflecting the host-level protection from e.g. rpm upgrades, that lie beyond one or more delegation and namespacing points in the tree. It's highly unusual from a cgroup interface POV that nested levels have to be aware of and reflect decisions made at higher levels for them to be effective. To enable such use cases and scale configurability for complex trees, this patch implements a resource inheritance model for memory that is similar to how the CPU and the IO controller implement work-conserving resource allocations: a share of a resource allocated to a subree always applies to the entire subtree recursively, while allowing, but not mandating, children to further specify distribution rules. That means that if protection is explicitly allocated among siblings, those configured shares are being followed during page reclaim just like they are now. However, if the memory.low set at a higher level is not fully claimed by the children in that subtree, the "floating" remainder is applied to each cgroup in the tree in proportion to its size. Since reclaim pressure is applied in proportion to size as well, each child in that tree gets the same boost, and the effect is neutral among siblings - with respect to each other, they behave as if no memory control was enabled at all, and the VM simply balances the memory demands optimally within the subtree. But collectively those cgroups enjoy a boost over the cgroups in neighboring trees. E.g. a leaf cgroup with a memory.low setting of 0 no longer means that it's not getting a share of the hierarchically assigned resource, just that it doesn't claim a fixed amount of it to protect from its siblings. This allows us to recursively protect one subtree (workload) from another (system management), while letting subgroups compete freely among each other - without having to assign fixed shares to each leaf, and without nested groups having to echo higher-level settings. The floating protection composes naturally with fixed protection. Consider the following example tree: A A: low = 2G / \ A1: low = 1G A1 A2 A2: low = 0G As outside pressure is applied to this tree, A1 will enjoy a fixed protection from A2 of 1G, but the remaining, unclaimed 1G from A is split evenly among A1 and A2, coming out to 1.5G and 0.5G. There is a slight risk of regressing theoretical setups where the top-level cgroups don't know about the true budgeting and set bogusly high "bypass" values that are meaningfully allocated down the tree. Such setups would rely on unclaimed protection to be discarded, and distributing it would change the intended behavior. Be safe and hide the new behavior behind a mount option, 'memory_recursiveprot'. Signed-off-by:
Johannes Weiner <hannes@cmpxchg.org> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Acked-by:
Tejun Heo <tj@kernel.org> Acked-by:
Roman Gushchin <guro@fb.com> Acked-by:
Chris Down <chris@chrisdown.name> Cc: Michal Hocko <mhocko@suse.com> Cc: Michal Koutný <mkoutny@suse.com> Link: http://lkml.kernel.org/r/20200227195606.46212-4-hannes@cmpxchg.org Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
-
Roman Gushchin authored
Rename (__)memcg_kmem_(un)charge() into (__)memcg_kmem_(un)charge_page() to better reflect what they are actually doing: 1) call __memcg_kmem_(un)charge_memcg() to actually charge or uncharge the current memcg 2) set or clear the PageKmemcg flag Signed-off-by:
Roman Gushchin <guro@fb.com> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Reviewed-by:
Shakeel Butt <shakeelb@google.com> Acked-by:
Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Link: http://lkml.kernel.org/r/20200109202659.752357-4-guro@fb.com Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
-
Chen Yu authored
Debug messages from the system suspend/hibernation infrastructure are disabled by default, and can only be enabled after the system has boot up via /sys/power/pm_debug_messages. This makes the hibernation resume hard to track as it involves system boot up across hibernation. There's no chance for software_resume() to track the resume process, for example. Add a kernel command line option to set pm_debug_messages during boot up. Signed-off-by:
Chen Yu <yu.c.chen@intel.com> [ rjw: Subject & changelog ] Signed-off-by:
Rafael J. Wysocki <rafael.j.wysocki@intel.com>
-
- Apr 01, 2020
-
-
Eric W. Biederman authored
Replace the 32bit exec_id with a 64bit exec_id to make it impossible to wrap the exec_id counter. With care an attacker can cause exec_id wrap and send arbitrary signals to a newly exec'd parent. This bypasses the signal sending checks if the parent changes their credentials during exec. The severity of this problem can been seen that in my limited testing of a 32bit exec_id it can take as little as 19s to exec 65536 times. Which means that it can take as little as 14 days to wrap a 32bit exec_id. Adam Zabrocki has succeeded wrapping the self_exe_id in 7 days. Even my slower timing is in the uptime of a typical server. Which means self_exec_id is simply a speed bump today, and if exec gets noticably faster self_exec_id won't even be a speed bump. Extending self_exec_id to 64bits introduces a problem on 32bit architectures where reading self_exec_id is no longer atomic and can take two read instructions. Which means that is is possible to hit a window where the read value of exec_id does not match the written value. So with very lucky timing after this change this still remains expoiltable. I have updated the update of exec_id on exec to use WRITE_ONCE and the read of exec_id in do_notify_parent to use READ_ONCE to make it clear that there is no locking between these two locations. Link: https://lore.kernel.org/kernel-hardening/20200324215049.GA3710@pi3.com.pl Fixes: 2.3.23pre2 Cc: stable@vger.kernel.org Signed-off-by:
"Eric W. Biederman" <ebiederm@xmission.com>
-
Daniel Thompson authored
Currently the PROMPT variable could be abused to provoke the printf() machinery to read outside the current stack frame. Normally this doesn't matter becaues md is already a much better tool for reading from memory. However the md command can be disabled by not setting KDB_ENABLE_MEM_READ. Let's also prevent PROMPT from being modified in these circumstances. Whilst adding a comment to help future code reviewers we also remove the #ifdef where PROMPT in consumed. There is no problem passing an unused (0) to snprintf when !CONFIG_SMP. argument Reported-by:
Wang Xiayang <xywang.sjtu@sjtu.edu.cn> Signed-off-by:
Daniel Thompson <daniel.thompson@linaro.org> Reviewed-by:
Douglas Anderson <dianders@chromium.org>
-
Daniel Thompson authored
Currently the code to manage the kdb history buffer uses strncpy() to copy strings to/and from the history and exhibits the classic "but nobody ever told me that strncpy() doesn't always terminate strings" bug. Modern gcc compilers recognise this bug and issue a warning. In reality these calls will only abridge the copied string if kdb_read() has *already* overflowed the command buffer. Thus the use of counted copies here is only used to reduce the secondary effects of a bug elsewhere in the code. Therefore transitioning these calls into strscpy() (without checking the return code) is appropriate. Signed-off-by:
Daniel Thompson <daniel.thompson@linaro.org> Reviewed-by:
Douglas Anderson <dianders@chromium.org>
-
Sebastian Andrzej Siewior authored
A hrtimer can be released in its callback, but lockdep_hrtimer_exit() dereferences the pointer after the callback returns, i.e. a potential use after free. Retrieve the context in which the hrtimer expires before the callback is invoked and use it in lockdep_hrtimer_exit(). Fixes: 40db1739 ("lockdep: Add hrtimer context tracing bits") Reported-by:
<syzbot+62c155c276e580cfb606@syzkaller.appspotmail.com> Signed-off-by:
Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by:
Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/20200331201849.fkp2siy3vcdqvqlz@linutronix.de
-
Dexuan Cui authored
If hibernation_restore() fails, the 'error' should not be zero. Signed-off-by:
Dexuan Cui <decui@microsoft.com> Signed-off-by:
Rafael J. Wysocki <rafael.j.wysocki@intel.com>
-
- Mar 31, 2020
-
-
Andrii Nakryiko authored
Add new operation (LINK_UPDATE), which allows to replace active bpf_prog from under given bpf_link. Currently this is only supported for bpf_cgroup_link, but will be extended to other kinds of bpf_links in follow-up patches. For bpf_cgroup_link, implemented functionality matches existing semantics for direct bpf_prog attachment (including BPF_F_REPLACE flag). User can either unconditionally set new bpf_prog regardless of which bpf_prog is currently active under given bpf_link, or, optionally, can specify expected active bpf_prog. If active bpf_prog doesn't match expected one, no changes are performed, old bpf_link stays intact and attached, operation returns a failure. cgroup_bpf_replace() operation is resolving race between auto-detachment and bpf_prog update in the same fashion as it's done for bpf_link detachment, except in this case update has no way of succeeding because of target cgroup marked as dying. So in this case error is returned. Signed-off-by:
Andrii Nakryiko <andriin@fb.com> Signed-off-by:
Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20200330030001.2312810-3-andriin@fb.com
-
Andrii Nakryiko authored
Implement new sub-command to attach cgroup BPF programs and return FD-based bpf_link back on success. bpf_link, once attached to cgroup, cannot be replaced, except by owner having its FD. Cgroup bpf_link supports only BPF_F_ALLOW_MULTI semantics. Both link-based and prog-based BPF_F_ALLOW_MULTI attachments can be freely intermixed. To prevent bpf_cgroup_link from keeping cgroup alive past the point when no BPF program can be executed, implement auto-detachment of link. When cgroup_bpf_release() is called, all attached bpf_links are forced to release cgroup refcounts, but they leave bpf_link otherwise active and allocated, as well as still owning underlying bpf_prog. This is because user-space might still have FDs open and active, so bpf_link as a user-referenced object can't be freed yet. Once last active FD is closed, bpf_link will be freed and underlying bpf_prog refcount will be dropped. But cgroup refcount won't be touched, because cgroup is released already. The inherent race between bpf_cgroup_link release (from closing last FD) and cgroup_bpf_release() is resolved by both operations taking cgroup_mutex. So the only additional check required is when bpf_cgroup_link attempts to detach itself from cgroup. At that time we need to check whether there is still cgroup associated with that link. And if not, exit with success, because bpf_cgroup_link was already successfully detached. Signed-off-by:
Andrii Nakryiko <andriin@fb.com> Signed-off-by:
Alexei Starovoitov <ast@kernel.org> Acked-by:
Roman Gushchin <guro@fb.com> Link: https://lore.kernel.org/bpf/20200330030001.2312810-2-andriin@fb.com
-
- Mar 30, 2020
-
-
John Fastabend authored
Further refine return values range in do_refine_retval_range by noting these are int return types (We will assume here that int is a 32-bit type). Two reasons to pull this out of original patch. First it makes the original fix impossible to backport. And second I've not seen this as being problematic in practice unlike the other case. Fixes: 849fa506 ("bpf/verifier: refine retval R0 state for bpf_get_stack helper") Signed-off-by:
John Fastabend <john.fastabend@gmail.com> Signed-off-by:
Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/158560421952.10843.12496354931526965046.stgit@john-Precision-5820-Tower
-
John Fastabend authored
It is not possible for the current verifier to track ALU32 and JMP ops correctly. This can result in the verifier aborting with errors even though the program should be verifiable. BPF codes that hit this can work around it by changin int variables to 64-bit types, marking variables volatile, etc. But this is all very ugly so it would be better to avoid these tricks. But, the main reason to address this now is do_refine_retval_range() was assuming return values could not be negative. Once we fixed this code that was previously working will no longer work. See do_refine_retval_range() patch for details. And we don't want to suddenly cause programs that used to work to fail. The simplest example code snippet that illustrates the problem is likely this, 53: w8 = w0 // r8 <- [0, S32_MAX], // w8 <- [-S32_MIN, X] 54: w8 <s 0 // r8 <- [0, U32_MAX] // w8 <- [0, X] The expected 64-bit and 32-bit bounds after each line are shown on the right. The current issue is without the w* bounds we are forced to use the worst case bound of [0, U32_MAX]. To resolve this type of case, jmp32 creating divergent 32-bit bounds from 64-bit bounds, we add explicit 32-bit register bounds s32_{min|max}_value and u32_{min|max}_value. Then from branch_taken logic creating new bounds we can track 32-bit bounds explicitly. The next case we observed is ALU ops after the jmp32, 53: w8 = w0 // r8 <- [0, S32_MAX], // w8 <- [-S32_MIN, X] 54: w8 <s 0 // r8 <- [0, U32_MAX] // w8 <- [0, X] 55: w8 += 1 // r8 <- [0, U32_MAX+1] // w8 <- [0, X+1] In order to keep the bounds accurate at this point we also need to track ALU32 ops. To do this we add explicit ALU32 logic for each of the ALU ops, mov, add, sub, etc. Finally there is a question of how and when to merge bounds. The cases enumerate here, 1. MOV ALU32 - zext 32-bit -> 64-bit 2. MOV ALU64 - copy 64-bit -> 32-bit 3. op ALU32 - zext 32-bit -> 64-bit 4. op ALU64 - n/a 5. jmp ALU32 - 64-bit: var32_off | upper_32_bits(var64_off) 6. jmp ALU64 - 32-bit: (>> (<< var64_off)) Details for each case, For "MOV ALU32" BPF arch zero extends so we simply copy the bounds from 32-bit into 64-bit ensuring we truncate var_off and 64-bit bounds correctly. See zext_32_to_64. For "MOV ALU64" copy all bounds including 32-bit into new register. If the src register had 32-bit bounds the dst register will as well. For "op ALU32" zero extend 32-bit into 64-bit the same as move, see zext_32_to_64. For "op ALU64" calculate both 32-bit and 64-bit bounds no merging is done here. Except we have a special case. When RSH or ARSH is done we can't simply ignore shifting bits from 64-bit reg into the 32-bit subreg. So currently just push bounds from 64-bit into 32-bit. This will be correct in the sense that they will represent a valid state of the register. However we could lose some accuracy if an ARSH is following a jmp32 operation. We can handle this special case in a follow up series. For "jmp ALU32" mark 64-bit reg unknown and recalculate 64-bit bounds from tnum by setting var_off to ((<<(>>var_off)) | var32_off). We special case if 64-bit bounds has zero'd upper 32bits at which point we can simply copy 32-bit bounds into 64-bit register. This catches a common compiler trick where upper 32-bits are zeroed and then 32-bit ops are used followed by a 64-bit compare or 64-bit op on a pointer. See __reg_combine_64_into_32(). For "jmp ALU64" cast the bounds of the 64bit to their 32-bit counterpart. For example s32_min_value = (s32)reg->smin_value. For tnum use only the lower 32bits via, (>>(<<var_off)). See __reg_combine_64_into_32(). Signed-off-by:
John Fastabend <john.fastabend@gmail.com> Signed-off-by:
Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/158560419880.10843.11448220440809118343.stgit@john-Precision-5820-Tower
-
John Fastabend authored
do_refine_retval_range() is called to refine return values from specified helpers, probe_read_str and get_stack at the moment, the reasoning is because both have a max value as part of their input arguments and because the helper ensure the return value will not be larger than this we can set smax values of the return register, r0. However, the return value is a signed integer so setting umax is incorrect It leads to further confusion when the do_refine_retval_range() then calls, __reg_deduce_bounds() which will see a umax value as meaning the value is unsigned and then assuming it is unsigned set the smin = umin which in this case results in 'smin = 0' and an 'smax = X' where X is the input argument from the helper call. Here are the comments from _reg_deduce_bounds() on why this would be safe to do. /* Learn sign from unsigned bounds. Signed bounds cross the sign * boundary, so we must be careful. */ if ((s64)reg->umax_value >= 0) { /* Positive. We can't learn anything from the smin, but smax * is positive, hence safe. */ reg->smin_value = reg->umin_value; reg->smax_value = reg->umax_value = min_t(u64, reg->smax_value, reg->umax_value); But now we incorrectly have a return value with type int with the signed bounds (0,X). Suppose the return value is negative, which is possible the we have the verifier and reality out of sync. Among other things this may result in any error handling code being falsely detected as dead-code and removed. For instance the example below shows using bpf_probe_read_str() causes the error path to be identified as dead code and removed. >From the 'llvm-object -S' dump, r2 = 100 call 45 if r0 s< 0 goto +4 r4 = *(u32 *)(r7 + 0) But from dump xlate (b7) r2 = 100 (85) call bpf_probe_read_compat_str#-96768 (61) r4 = *(u32 *)(r7 +0) <-- dropped if goto Due to verifier state after call being R0=inv(id=0,umax_value=100,var_off=(0x0; 0x7f)) To fix omit setting the umax value because its not safe. The only actual bounds we know is the smax. This results in the correct bounds (SMIN, X) where X is the max length from the helper. After this the new verifier state looks like the following after call 45. R0=inv(id=0,smax_value=100) Then xlated version no longer removed dead code giving the expected result, (b7) r2 = 100 (85) call bpf_probe_read_compat_str#-96768 (c5) if r0 s< 0x0 goto pc+4 (61) r4 = *(u32 *)(r7 +0) Note, bpf_probe_read_* calls are root only so we wont hit this case with non-root bpf users. v3: comment had some documentation about meta set to null case which is not relevant here and confusing to include in the comment. v2 note: In original version we set msize_smax_value from check_func_arg() and propagated this into smax of retval. The logic was smax is the bound on the retval we set and because the type in the helper is ARG_CONST_SIZE we know that the reg is a positive tnum_const() so umax=smax. Alexei pointed out though this is a bit odd to read because the register in check_func_arg() has a C type of u32 and the umax bound would be the normally relavent bound here. Pulling in extra knowledge about future checks makes reading the code a bit tricky. Further having a signed meta data that can only ever be positive is also a bit odd. So dropped the msize_smax_value metadata and made it a u64 msize_max_value to indicate its unsigned. And additionally save bound from umax value in check_arg_funcs which is the same as smax due to as noted above tnumx_cont and negative check but reads better. By my analysis nothing functionally changes in v2 but it does get easier to read so that is win. Fixes: 849fa506 ("bpf/verifier: refine retval R0 state for bpf_get_stack helper") Signed-off-by:
John Fastabend <john.fastabend@gmail.com> Signed-off-by:
Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/158560417900.10843.14351995140624628941.stgit@john-Precision-5820-Tower
-
KP Singh authored
The bounds checking for the arguments accessed in the BPF program breaks when the expected_attach_type is not BPF_TRACE_FEXIT, BPF_LSM_MAC or BPF_MODIFY_RETURN resulting in no check being done for the default case (the programs which do not receive the return value of the attached function in its arguments) when the index of the argument being accessed is equal to the number of arguments (nr_args). This was a result of a misplaced "else if" block introduced by the Commit 6ba43b76 ("bpf: Attachment verification for BPF_MODIFY_RETURN") Fixes: 6ba43b76 ("bpf: Attachment verification for BPF_MODIFY_RETURN") Reported-by:
Jann Horn <jannh@google.com> Signed-off-by:
KP Singh <kpsingh@google.com> Signed-off-by:
Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20200330144246.338-1-kpsingh@chromium.org
-
Jann Horn authored
reg_set_min_max_inv() contains exactly the same logic as reg_set_min_max(), just flipped around. While this makes sense in a cBPF verifier (where ALU operations are not symmetric), it does not make sense for eBPF. Replace reg_set_min_max_inv() with a helper that flips the opcode around, then lets reg_set_min_max() do the complicated work. Signed-off-by:
Jann Horn <jannh@google.com> Signed-off-by:
Daniel Borkmann <daniel@iogearbox.net> Signed-off-by:
Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20200330160324.15259-4-daniel@iogearbox.net
-
Jann Horn authored
The BPF verifier tried to track values based on 32-bit comparisons by (ab)using the tnum state via 581738a6 ("bpf: Provide better register bounds after jmp32 instructions"). The idea is that after a check like this: if ((u32)r0 > 3) exit We can't meaningfully constrain the arithmetic-range-based tracking, but we can update the tnum state to (value=0,mask=0xffff'ffff'0000'0003). However, the implementation from 581738a6 didn't compute the tnum constraint based on the fixed operand, but instead derives it from the arithmetic-range-based tracking. This means that after the following sequence of operations: if (r0 >= 0x1'0000'0001) exit if ((u32)r0 > 7) exit The verifier assumed that the lower half of r0 is in the range (0, 0) and apply the tnum constraint (value=0,mask=0xffff'ffff'0000'0000) thus causing the overall tnum to be (value=0,mask=0x1'0000'0000), which was incorrect. Provide a fixed implementation. Fixes: 581738a6 ("bpf: Provide better register bounds after jmp32 instructions") Signed-off-by:
Jann Horn <jannh@google.com> Signed-off-by:
Daniel Borkmann <daniel@iogearbox.net> Signed-off-by:
Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20200330160324.15259-3-daniel@iogearbox.net
-
Daniel Borkmann authored
Anatoly has been fuzzing with kBdysch harness and reported a hang in one of the outcomes: 0: (b7) r0 = 808464432 1: (7f) r0 >>= r0 2: (14) w0 -= 808464432 3: (07) r0 += 808464432 4: (b7) r1 = 808464432 5: (de) if w1 s<= w0 goto pc+0 R0_w=invP(id=0,umin_value=808464432,umax_value=5103431727,var_off=(0x30303020;0x10000001f)) R1_w=invP808464432 R10=fp0 6: (07) r0 += -2144337872 7: (14) w0 -= -1607454672 8: (25) if r0 > 0x30303030 goto pc+0 R0_w=invP(id=0,umin_value=271581184,umax_value=271581311,var_off=(0x10300000;0x7f)) R1_w=invP808464432 R10=fp0 9: (76) if w0 s>= 0x303030 goto pc+2 12: (95) exit from 8 to 9: safe from 5 to 6: R0_w=invP(id=0,umin_value=808464432,umax_value=5103431727,var_off=(0x30303020;0x10000001f)) R1_w=invP808464432 R10=fp0 6: (07) r0 += -2144337872 7: (14) w0 -= -1607454672 8: (25) if r0 > 0x30303030 goto pc+0 R0_w=invP(id=0,umin_value=271581184,umax_value=271581311,var_off=(0x10300000;0x7f)) R1_w=invP808464432 R10=fp0 9: safe from 8 to 9: safe verification time 589 usec stack depth 0 processed 17 insns (limit 1000000) [...] The underlying program was xlated as follows: # bpftool p d x i 9 0: (b7) r0 = 808464432 1: (7f) r0 >>= r0 2: (14) w0 -= 808464432 3: (07) r0 += 808464432 4: (b7) r1 = 808464432 5: (de) if w1 s<= w0 goto pc+0 6: (07) r0 += -2144337872 7: (14) w0 -= -1607454672 8: (25) if r0 > 0x30303030 goto pc+0 9: (76) if w0 s>= 0x303030 goto pc+2 10: (05) goto pc-1 11: (05) goto pc-1 12: (95) exit The verifier rewrote original instructions it recognized as dead code with 'goto pc-1', but reality differs from verifier simulation in that we're actually able to trigger a hang due to hitting the 'goto pc-1' instructions. Taking different examples to make the issue more obvious: in this example we're probing bounds on a completely unknown scalar variable in r1: [...] 5: R0_w=inv1 R1_w=inv(id=0) R10=fp0 5: (18) r2 = 0x4000000000 7: R0_w=inv1 R1_w=inv(id=0) R2_w=inv274877906944 R10=fp0 7: (18) r3 = 0x2000000000 9: R0_w=inv1 R1_w=inv(id=0) R2_w=inv274877906944 R3_w=inv137438953472 R10=fp0 9: (18) r4 = 0x400 11: R0_w=inv1 R1_w=inv(id=0) R2_w=inv274877906944 R3_w=inv137438953472 R4_w=inv1024 R10=fp0 11: (18) r5 = 0x200 13: R0_w=inv1 R1_w=inv(id=0) R2_w=inv274877906944 R3_w=inv137438953472 R4_w=inv1024 R5_w=inv512 R10=fp0 13: (2d) if r1 > r2 goto pc+4 R0_w=inv1 R1_w=inv(id=0,umax_value=274877906944,var_off=(0x0; 0x7fffffffff)) R2_w=inv274877906944 R3_w=inv137438953472 R4_w=inv1024 R5_w=inv512 R10=fp0 14: R0_w=inv1 R1_w=inv(id=0,umax_value=274877906944,var_off=(0x0; 0x7fffffffff)) R2_w=inv274877906944 R3_w=inv137438953472 R4_w=inv1024 R5_w=inv512 R10=fp0 14: (ad) if r1 < r3 goto pc+3 R0_w=inv1 R1_w=inv(id=0,umin_value=137438953472,umax_value=274877906944,var_off=(0x0; 0x7fffffffff)) R2_w=inv274877906944 R3_w=inv137438953472 R4_w=inv1024 R5_w=inv512 R10=fp0 15: R0=inv1 R1=inv(id=0,umin_value=137438953472,umax_value=274877906944,var_off=(0x0; 0x7fffffffff)) R2=inv274877906944 R3=inv137438953472 R4=inv1024 R5=inv512 R10=fp0 15: (2e) if w1 > w4 goto pc+2 R0=inv1 R1=inv(id=0,umin_value=137438953472,umax_value=274877906944,var_off=(0x0; 0x7f00000000)) R2=inv274877906944 R3=inv137438953472 R4=inv1024 R5=inv512 R10=fp0 16: R0=inv1 R1=inv(id=0,umin_value=137438953472,umax_value=274877906944,var_off=(0x0; 0x7f00000000)) R2=inv274877906944 R3=inv137438953472 R4=inv1024 R5=inv512 R10=fp0 16: (ae) if w1 < w5 goto pc+1 R0=inv1 R1=inv(id=0,umin_value=137438953472,umax_value=274877906944,var_off=(0x0; 0x7f00000000)) R2=inv274877906944 R3=inv137438953472 R4=inv1024 R5=inv512 R10=fp0 [...] We're first probing lower/upper bounds via jmp64, later we do a similar check via jmp32 and examine the resulting var_off there. After fall-through in insn 14, we get the following bounded r1 with 0x7fffffffff unknown marked bits in the variable section. Thus, after knowing r1 <= 0x4000000000 and r1 >= 0x2000000000: max: 0b100000000000000000000000000000000000000 / 0x4000000000 var: 0b111111111111111111111111111111111111111 / 0x7fffffffff min: 0b010000000000000000000000000000000000000 / 0x2000000000 Now, in insn 15 and 16, we perform a similar probe with lower/upper bounds in jmp32. Thus, after knowing r1 <= 0x4000000000 and r1 >= 0x2000000000 and w1 <= 0x400 and w1 >= 0x200: max: 0b100000000000000000000000000000000000000 / 0x4000000000 var: 0b111111100000000000000000000000000000000 / 0x7f00000000 min: 0b010000000000000000000000000000000000000 / 0x2000000000 The lower/upper bounds haven't changed since they have high bits set in u64 space and the jmp32 tests can only refine bounds in the low bits. However, for the var part the expectation would have been 0x7f000007ff or something less precise up to 0x7fffffffff. A outcome of 0x7f00000000 is not correct since it would contradict the earlier probed bounds where we know that the result should have been in [0x200,0x400] in u32 space. Therefore, tests with such info will lead to wrong verifier assumptions later on like falsely predicting conditional jumps to be always taken, etc. The issue here is that __reg_bound_offset32()'s implementation from commit 581738a6 ("bpf: Provide better register bounds after jmp32 instructions") makes an incorrect range assumption: static void __reg_bound_offset32(struct bpf_reg_state *reg) { u64 mask = 0xffffFFFF; struct tnum range = tnum_range(reg->umin_value & mask, reg->umax_value & mask); struct tnum lo32 = tnum_cast(reg->var_off, 4); struct tnum hi32 = tnum_lshift(tnum_rshift(reg->var_off, 32), 32); reg->var_off = tnum_or(hi32, tnum_intersect(lo32, range)); } In the above walk-through example, __reg_bound_offset32() as-is chose a range after masking with 0xffffffff of [0x0,0x0] since umin:0x2000000000 and umax:0x4000000000 and therefore the lo32 part was clamped to 0x0 as well. However, in the umin:0x2000000000 and umax:0x4000000000 range above we'd end up with an actual possible interval of [0x0,0xffffffff] for u32 space instead. In case of the original reproducer, the situation looked as follows at insn 5 for r0: [...] 5: R0_w=invP(id=0,umin_value=808464432,umax_value=5103431727,var_off=(0x0; 0x1ffffffff)) R1_w=invP808464432 R10=fp0 0x30303030 0x13030302f 5: (de) if w1 s<= w0 goto pc+0 R0_w=invP(id=0,umin_value=808464432,umax_value=5103431727,var_off=(0x30303020; 0x10000001f)) R1_w=invP808464432 R10=fp0 0x30303030 0x13030302f [...] After the fall-through, we similarly forced the var_off result into the wrong range [0x30303030,0x3030302f] suggesting later on that fixed bits must only be of 0x30303020 with 0x10000001f unknowns whereas such assumption can only be made when both bounds in hi32 range match. Originally, I was thinking to fix this by moving reg into a temp reg and use proper coerce_reg_to_size() helper on the temp reg where we can then based on that define the range tnum for later intersection: static void __reg_bound_offset32(struct bpf_reg_state *reg) { struct bpf_reg_state tmp = *reg; struct tnum lo32, hi32, range; coerce_reg_to_size(&tmp, 4); range = tnum_range(tmp.umin_value, tmp.umax_value); lo32 = tnum_cast(reg->var_off, 4); hi32 = tnum_lshift(tnum_rshift(reg->var_off, 32), 32); reg->var_off = tnum_or(hi32, tnum_intersect(lo32, range)); } In the case of the concrete example, this gives us a more conservative unknown section. Thus, after knowing r1 <= 0x4000000000 and r1 >= 0x2000000000 and w1 <= 0x400 and w1 >= 0x200: max: 0b100000000000000000000000000000000000000 / 0x4000000000 var: 0b111111111111111111111111111111111111111 / 0x7fffffffff min: 0b010000000000000000000000000000000000000 / 0x2000000000 However, above new __reg_bound_offset32() has no effect on refining the knowledge of the register contents. Meaning, if the bounds in hi32 range mismatch we'll get the identity function given the range reg spans [0x0,0xffffffff] and we cast var_off into lo32 only to later on binary or it again with the hi32. Likewise, if the bounds in hi32 range match, then we mask both bounds with 0xffffffff, use the resulting umin/umax for the range to later intersect the lo32 with it. However, _prior_ called __reg_bound_offset() did already such intersection on the full reg and we therefore would only repeat the same operation on the lo32 part twice. Given this has no effect and the original commit had false assumptions, this patch reverts the code entirely which is also more straight forward for stable trees: apparently 581738a6 got auto-selected by Sasha's ML system and misclassified as a fix, so it got sucked into v5.4 where it should never have landed. A revert is low-risk also from a user PoV since it requires a recent kernel and llc to opt-into -mcpu=v3 BPF CPU to generate jmp32 instructions. A proper bounds refinement would need a significantly more complex approach which is currently being worked, but no stable material [0]. Hence revert is best option for stable. After the revert, the original reported program gets rejected as follows: 1: (7f) r0 >>= r0 2: (14) w0 -= 808464432 3: (07) r0 += 808464432 4: (b7) r1 = 808464432 5: (de) if w1 s<= w0 goto pc+0 R0_w=invP(id=0,umin_value=808464432,umax_value=5103431727,var_off=(0x0; 0x1ffffffff)) R1_w=invP808464432 R10=fp0 6: (07) r0 += -2144337872 7: (14) w0 -= -1607454672 8: (25) if r0 > 0x30303030 goto pc+0 R0_w=invP(id=0,umax_value=808464432,var_off=(0x0; 0x3fffffff)) R1_w=invP808464432 R10=fp0 9: (76) if w0 s>= 0x303030 goto pc+2 R0=invP(id=0,umax_value=3158063,var_off=(0x0; 0x3fffff)) R1=invP808464432 R10=fp0 10: (30) r0 = *(u8 *)skb[808464432] BPF_LD_[ABS|IND] uses reserved fields processed 11 insns (limit 1000000) [...] [0] https://lore.kernel.org/bpf/158507130343.15666.8018068546764556975.stgit@john-Precision-5820-Tower/T/ Fixes: 581738a6 ("bpf: Provide better register bounds after jmp32 instructions") Reported-by:
Anatoly Trosinenko <anatoly.trosinenko@gmail.com> Signed-off-by:
Daniel Borkmann <daniel@iogearbox.net> Signed-off-by:
Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20200330160324.15259-2-daniel@iogearbox.net
-
Sven Schnelle authored
Executing the seccomp_bpf testsuite under a 64-bit kernel with 32-bit userland (both s390 and x86) doesn't work because there's no compat_ioctl handler defined. Add the handler. Signed-off-by:
Sven Schnelle <svens@linux.ibm.com> Fixes: 6a21cc50 ("seccomp: add a return code to trap to userspace") Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20200310123332.42255-1-svens@linux.ibm.com Signed-off-by:
Kees Cook <keescook@chromium.org>
-
- Mar 29, 2020
-
-
KP Singh authored
JITed BPF programs are dynamically attached to the LSM hooks using BPF trampolines. The trampoline prologue generates code to handle conversion of the signature of the hook to the appropriate BPF context. The allocated trampoline programs are attached to the nop functions initialized as LSM hooks. BPF_PROG_TYPE_LSM programs must have a GPL compatible license and and need CAP_SYS_ADMIN (required for loading eBPF programs). Upon attachment: * A BPF fexit trampoline is used for LSM hooks with a void return type. * A BPF fmod_ret trampoline is used for LSM hooks which return an int. The attached programs can override the return value of the bpf LSM hook to indicate a MAC Policy decision. Signed-off-by:
KP Singh <kpsingh@google.com> Signed-off-by:
Daniel Borkmann <daniel@iogearbox.net> Reviewed-by:
Brendan Jackman <jackmanb@google.com> Reviewed-by:
Florent Revest <revest@google.com> Acked-by:
Andrii Nakryiko <andriin@fb.com> Acked-by:
James Morris <jamorris@linux.microsoft.com> Link: https://lore.kernel.org/bpf/20200329004356.27286-5-kpsingh@chromium.org
-
KP Singh authored
When CONFIG_BPF_LSM is enabled, nop functions, bpf_lsm_<hook_name>, are generated for each LSM hook. These functions are initialized as LSM hooks in a subsequent patch. Signed-off-by:
KP Singh <kpsingh@google.com> Signed-off-by:
Daniel Borkmann <daniel@iogearbox.net> Reviewed-by:
Brendan Jackman <jackmanb@google.com> Reviewed-by:
Florent Revest <revest@google.com> Reviewed-by:
Kees Cook <keescook@chromium.org> Acked-by:
Yonghong Song <yhs@fb.com> Acked-by:
James Morris <jamorris@linux.microsoft.com> Link: https://lore.kernel.org/bpf/20200329004356.27286-4-kpsingh@chromium.org
-
KP Singh authored
Introduce types and configs for bpf programs that can be attached to LSM hooks. The programs can be enabled by the config option CONFIG_BPF_LSM. Signed-off-by:
KP Singh <kpsingh@google.com> Signed-off-by:
Daniel Borkmann <daniel@iogearbox.net> Reviewed-by:
Brendan Jackman <jackmanb@google.com> Reviewed-by:
Florent Revest <revest@google.com> Reviewed-by:
Thomas Garnier <thgarnie@google.com> Acked-by:
Yonghong Song <yhs@fb.com> Acked-by:
Andrii Nakryiko <andriin@fb.com> Acked-by:
James Morris <jamorris@linux.microsoft.com> Link: https://lore.kernel.org/bpf/20200329004356.27286-2-kpsingh@chromium.org
-
Roman Gushchin authored
Depending on CONFIG_VMAP_STACK and the THREAD_SIZE / PAGE_SIZE ratio the space for task stacks can be allocated using __vmalloc_node_range(), alloc_pages_node() and kmem_cache_alloc_node(). In the first and the second cases page->mem_cgroup pointer is set, but in the third it's not: memcg membership of a slab page should be determined using the memcg_from_slab_page() function, which looks at page->slab_cache->memcg_params.memcg . In this case, using mod_memcg_page_state() (as in account_kernel_stack()) is incorrect: page->mem_cgroup pointer is NULL even for pages charged to a non-root memory cgroup. It can lead to kernel_stack per-memcg counters permanently showing 0 on some architectures (depending on the configuration). In order to fix it, let's introduce a mod_memcg_obj_state() helper, which takes a pointer to a kernel object as a first argument, uses mem_cgroup_from_obj() to get a RCU-protected memcg pointer and calls mod_memcg_state(). It allows to handle all possible configurations (CONFIG_VMAP_STACK and various THREAD_SIZE/PAGE_SIZE values) without spilling any memcg/kmem specifics into fork.c . Note: This is a special version of the patch created for stable backports. It contains code from the following two patches: - mm: memcg/slab: introduce mem_cgroup_from_obj() - mm: fork: fix kernel_stack memcg stats for various stack implementations [guro@fb.com: introduce mem_cgroup_from_obj()] Link: http://lkml.kernel.org/r/20200324004221.GA36662@carbon.dhcp.thefacebook.com Fixes: 4d96ba35 ("mm: memcg/slab: stop setting page->mem_cgroup pointer for slab pages") Signed-off-by:
Roman Gushchin <guro@fb.com> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Reviewed-by:
Shakeel Butt <shakeelb@google.com> Acked-by:
Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Bharata B Rao <bharata@linux.ibm.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: <stable@vger.kernel.org> Link: http://lkml.kernel.org/r/20200303233550.251375-1-guro@fb.com Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
-
- Mar 28, 2020
-
-
Thomas Gleixner authored
A recent change to freeze_secondary_cpus() which added an early abort if a wakeup is pending missed the fact that the function is also invoked for shutdown, reboot and kexec via disable_nonboot_cpus(). In case of disable_nonboot_cpus() the wakeup event needs to be ignored as the purpose is to terminate the currently running kernel. Add a 'suspend' argument which is only set when the freeze is in context of a suspend operation. If not set then an eventually pending wakeup event is ignored. Fixes: a66d955e ("cpu/hotplug: Abort disabling secondary CPUs if wakeup is pending") Reported-by:
Boqun Feng <boqun.feng@gmail.com> Signed-off-by:
Thomas Gleixner <tglx@linutronix.de> Cc: Pavankumar Kondeti <pkondeti@codeaurora.org> Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/874kuaxdiz.fsf@nanos.tec.linutronix.de
-