Skip to content
Snippets Groups Projects
  1. Apr 07, 2020
  2. Apr 03, 2020
  3. Apr 02, 2020
    • Sebastian Andrzej Siewior's avatar
      mm/compaction: Disable compact_unevictable_allowed on RT · 6923aa0d
      Sebastian Andrzej Siewior authored
      Since commit 5bbe3547 ("mm: allow compaction of unevictable pages")
      it is allowed to examine mlocked pages and compact them by default.  On
      -RT even minor pagefaults are problematic because it may take a few 100us
      to resolve them and until then the task is blocked.
      
      Make compact_unevictable_allowed = 0 default and issue a warning on RT if
      it is changed.
      
      [bigeasy@linutronix.de: v5]
        Link: https://lore.kernel.org/linux-mm/20190710144138.qyn4tuttdq6h7kqx@linutronix.de/
        Link: http://lkml.kernel.org/r/20200319165536.ovi75tsr2seared4@linutronix.de
      
      
      Signed-off-by: default avatarSebastian Andrzej Siewior <bigeasy@linutronix.de>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Reviewed-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Acked-by: default avatarMel Gorman <mgorman@techsingularity.net>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Luis Chamberlain <mcgrof@kernel.org>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Iurii Zaikin <yzaikin@google.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Link: https://lore.kernel.org/linux-mm/20190710144138.qyn4tuttdq6h7kqx@linutronix.de/
      Link: http://lkml.kernel.org/r/20200303202225.nhqc3v5gwlb7x6et@linutronix.de
      
      
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      6923aa0d
    • Sebastian Andrzej Siewior's avatar
      mm/compaction: really limit compact_unevictable_allowed to 0 and 1 · 964b692d
      Sebastian Andrzej Siewior authored
      
      The proc file `compact_unevictable_allowed' should allow 0 and 1 only, the
      `extra*' attribues have been set properly but without
      proc_dointvec_minmax() as the `proc_handler' the limit will not be
      enforced.
      
      Use proc_dointvec_minmax() as the `proc_handler' to enfoce the valid
      specified range.
      
      Signed-off-by: default avatarSebastian Andrzej Siewior <bigeasy@linutronix.de>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Reviewed-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Luis Chamberlain <mcgrof@kernel.org>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Iurii Zaikin <yzaikin@google.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Link: http://lkml.kernel.org/r/20200303202054.gsosv7fsx2ma3cic@linutronix.de
      
      
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      964b692d
    • Johannes Weiner's avatar
      mm: memcontrol: recursive memory.low protection · 8a931f80
      Johannes Weiner authored
      
      Right now, the effective protection of any given cgroup is capped by its
      own explicit memory.low setting, regardless of what the parent says.  The
      reasons for this are mostly historical and ease of implementation: to make
      delegation of memory.low safe, effective protection is the min() of all
      memory.low up the tree.
      
      Unfortunately, this limitation makes it impossible to protect an entire
      subtree from another without forcing the user to make explicit protection
      allocations all the way to the leaf cgroups - something that is highly
      undesirable in real life scenarios.
      
      Consider memory in a data center host.  At the cgroup top level, we have a
      distinction between system management software and the actual workload the
      system is executing.  Both branches are further subdivided into individual
      services, job components etc.
      
      We want to protect the workload as a whole from the system management
      software, but that doesn't mean we want to protect and prioritize
      individual workload wrt each other.  Their memory demand can vary over
      time, and we'd want the VM to simply cache the hottest data within the
      workload subtree.  Yet, the current memory.low limitations force us to
      allocate a fixed amount of protection to each workload component in order
      to get protection from system management software in general.  This
      results in very inefficient resource distribution.
      
      Another concern with mandating downward allocation is that, as the
      complexity of the cgroup tree grows, it gets harder for the lower levels
      to be informed about decisions made at the host-level.  Consider a
      container inside a namespace that in turn creates its own nested tree of
      cgroups to run multiple workloads.  It'd be extremely difficult to
      configure memory.low parameters in those leaf cgroups that on one hand
      balance pressure among siblings as the container desires, while also
      reflecting the host-level protection from e.g.  rpm upgrades, that lie
      beyond one or more delegation and namespacing points in the tree.
      
      It's highly unusual from a cgroup interface POV that nested levels have to
      be aware of and reflect decisions made at higher levels for them to be
      effective.
      
      To enable such use cases and scale configurability for complex trees, this
      patch implements a resource inheritance model for memory that is similar
      to how the CPU and the IO controller implement work-conserving resource
      allocations: a share of a resource allocated to a subree always applies to
      the entire subtree recursively, while allowing, but not mandating,
      children to further specify distribution rules.
      
      That means that if protection is explicitly allocated among siblings,
      those configured shares are being followed during page reclaim just like
      they are now.  However, if the memory.low set at a higher level is not
      fully claimed by the children in that subtree, the "floating" remainder is
      applied to each cgroup in the tree in proportion to its size.  Since
      reclaim pressure is applied in proportion to size as well, each child in
      that tree gets the same boost, and the effect is neutral among siblings -
      with respect to each other, they behave as if no memory control was
      enabled at all, and the VM simply balances the memory demands optimally
      within the subtree.  But collectively those cgroups enjoy a boost over the
      cgroups in neighboring trees.
      
      E.g.  a leaf cgroup with a memory.low setting of 0 no longer means that
      it's not getting a share of the hierarchically assigned resource, just
      that it doesn't claim a fixed amount of it to protect from its siblings.
      
      This allows us to recursively protect one subtree (workload) from another
      (system management), while letting subgroups compete freely among each
      other - without having to assign fixed shares to each leaf, and without
      nested groups having to echo higher-level settings.
      
      The floating protection composes naturally with fixed protection.
      Consider the following example tree:
      
      		A            A: low = 2G
                     / \          A1: low = 1G
                    A1 A2         A2: low = 0G
      
      As outside pressure is applied to this tree, A1 will enjoy a fixed
      protection from A2 of 1G, but the remaining, unclaimed 1G from A is split
      evenly among A1 and A2, coming out to 1.5G and 0.5G.
      
      There is a slight risk of regressing theoretical setups where the
      top-level cgroups don't know about the true budgeting and set bogusly high
      "bypass" values that are meaningfully allocated down the tree.  Such
      setups would rely on unclaimed protection to be discarded, and
      distributing it would change the intended behavior.  Be safe and hide the
      new behavior behind a mount option, 'memory_recursiveprot'.
      
      Signed-off-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Acked-by: default avatarTejun Heo <tj@kernel.org>
      Acked-by: default avatarRoman Gushchin <guro@fb.com>
      Acked-by: default avatarChris Down <chris@chrisdown.name>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Michal Koutný <mkoutny@suse.com>
      Link: http://lkml.kernel.org/r/20200227195606.46212-4-hannes@cmpxchg.org
      
      
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      8a931f80
    • Roman Gushchin's avatar
      mm: kmem: rename memcg_kmem_(un)charge() into memcg_kmem_(un)charge_page() · f4b00eab
      Roman Gushchin authored
      
      Rename (__)memcg_kmem_(un)charge() into (__)memcg_kmem_(un)charge_page()
      to better reflect what they are actually doing:
      
      1) call __memcg_kmem_(un)charge_memcg() to actually charge or uncharge
         the current memcg
      
      2) set or clear the PageKmemcg flag
      
      Signed-off-by: default avatarRoman Gushchin <guro@fb.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Reviewed-by: default avatarShakeel Butt <shakeelb@google.com>
      Acked-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Link: http://lkml.kernel.org/r/20200109202659.752357-4-guro@fb.com
      
      
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      f4b00eab
    • Chen Yu's avatar
      PM: sleep: Add pm_debug_messages kernel command line option · db96a759
      Chen Yu authored
      
      Debug messages from the system suspend/hibernation infrastructure
      are disabled by default, and can only be enabled after the system
      has boot up via /sys/power/pm_debug_messages.
      
      This makes the hibernation resume hard to track as it involves system
      boot up across hibernation.  There's no chance for software_resume()
      to track the resume process, for example.
      
      Add a kernel command line option to set pm_debug_messages during
      boot up.
      
      Signed-off-by: default avatarChen Yu <yu.c.chen@intel.com>
      [ rjw: Subject & changelog ]
      Signed-off-by: default avatarRafael J. Wysocki <rafael.j.wysocki@intel.com>
      db96a759
  4. Apr 01, 2020
  5. Mar 31, 2020
    • Andrii Nakryiko's avatar
      bpf: Implement bpf_prog replacement for an active bpf_cgroup_link · 0c991ebc
      Andrii Nakryiko authored
      
      Add new operation (LINK_UPDATE), which allows to replace active bpf_prog from
      under given bpf_link. Currently this is only supported for bpf_cgroup_link,
      but will be extended to other kinds of bpf_links in follow-up patches.
      
      For bpf_cgroup_link, implemented functionality matches existing semantics for
      direct bpf_prog attachment (including BPF_F_REPLACE flag). User can either
      unconditionally set new bpf_prog regardless of which bpf_prog is currently
      active under given bpf_link, or, optionally, can specify expected active
      bpf_prog. If active bpf_prog doesn't match expected one, no changes are
      performed, old bpf_link stays intact and attached, operation returns
      a failure.
      
      cgroup_bpf_replace() operation is resolving race between auto-detachment and
      bpf_prog update in the same fashion as it's done for bpf_link detachment,
      except in this case update has no way of succeeding because of target cgroup
      marked as dying. So in this case error is returned.
      
      Signed-off-by: default avatarAndrii Nakryiko <andriin@fb.com>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20200330030001.2312810-3-andriin@fb.com
      0c991ebc
    • Andrii Nakryiko's avatar
      bpf: Implement bpf_link-based cgroup BPF program attachment · af6eea57
      Andrii Nakryiko authored
      
      Implement new sub-command to attach cgroup BPF programs and return FD-based
      bpf_link back on success. bpf_link, once attached to cgroup, cannot be
      replaced, except by owner having its FD. Cgroup bpf_link supports only
      BPF_F_ALLOW_MULTI semantics. Both link-based and prog-based BPF_F_ALLOW_MULTI
      attachments can be freely intermixed.
      
      To prevent bpf_cgroup_link from keeping cgroup alive past the point when no
      BPF program can be executed, implement auto-detachment of link. When
      cgroup_bpf_release() is called, all attached bpf_links are forced to release
      cgroup refcounts, but they leave bpf_link otherwise active and allocated, as
      well as still owning underlying bpf_prog. This is because user-space might
      still have FDs open and active, so bpf_link as a user-referenced object can't
      be freed yet. Once last active FD is closed, bpf_link will be freed and
      underlying bpf_prog refcount will be dropped. But cgroup refcount won't be
      touched, because cgroup is released already.
      
      The inherent race between bpf_cgroup_link release (from closing last FD) and
      cgroup_bpf_release() is resolved by both operations taking cgroup_mutex. So
      the only additional check required is when bpf_cgroup_link attempts to detach
      itself from cgroup. At that time we need to check whether there is still
      cgroup associated with that link. And if not, exit with success, because
      bpf_cgroup_link was already successfully detached.
      
      Signed-off-by: default avatarAndrii Nakryiko <andriin@fb.com>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Acked-by: default avatarRoman Gushchin <guro@fb.com>
      Link: https://lore.kernel.org/bpf/20200330030001.2312810-2-andriin@fb.com
      af6eea57
  6. Mar 30, 2020
    • John Fastabend's avatar
      bpf: Verifier, refine 32bit bound in do_refine_retval_range · fa123ac0
      John Fastabend authored
      
      Further refine return values range in do_refine_retval_range by noting
      these are int return types (We will assume here that int is a 32-bit type).
      
      Two reasons to pull this out of original patch. First it makes the original
      fix impossible to backport. And second I've not seen this as being problematic
      in practice unlike the other case.
      
      Fixes: 849fa506 ("bpf/verifier: refine retval R0 state for bpf_get_stack helper")
      Signed-off-by: default avatarJohn Fastabend <john.fastabend@gmail.com>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/158560421952.10843.12496354931526965046.stgit@john-Precision-5820-Tower
      fa123ac0
    • John Fastabend's avatar
      bpf: Verifier, do explicit ALU32 bounds tracking · 3f50f132
      John Fastabend authored
      
      It is not possible for the current verifier to track ALU32 and JMP ops
      correctly. This can result in the verifier aborting with errors even though
      the program should be verifiable. BPF codes that hit this can work around
      it by changin int variables to 64-bit types, marking variables volatile,
      etc. But this is all very ugly so it would be better to avoid these tricks.
      
      But, the main reason to address this now is do_refine_retval_range() was
      assuming return values could not be negative. Once we fixed this code that
      was previously working will no longer work. See do_refine_retval_range()
      patch for details. And we don't want to suddenly cause programs that used
      to work to fail.
      
      The simplest example code snippet that illustrates the problem is likely
      this,
      
       53: w8 = w0                    // r8 <- [0, S32_MAX],
                                      // w8 <- [-S32_MIN, X]
       54: w8 <s 0                    // r8 <- [0, U32_MAX]
                                      // w8 <- [0, X]
      
      The expected 64-bit and 32-bit bounds after each line are shown on the
      right. The current issue is without the w* bounds we are forced to use
      the worst case bound of [0, U32_MAX]. To resolve this type of case,
      jmp32 creating divergent 32-bit bounds from 64-bit bounds, we add explicit
      32-bit register bounds s32_{min|max}_value and u32_{min|max}_value. Then
      from branch_taken logic creating new bounds we can track 32-bit bounds
      explicitly.
      
      The next case we observed is ALU ops after the jmp32,
      
       53: w8 = w0                    // r8 <- [0, S32_MAX],
                                      // w8 <- [-S32_MIN, X]
       54: w8 <s 0                    // r8 <- [0, U32_MAX]
                                      // w8 <- [0, X]
       55: w8 += 1                    // r8 <- [0, U32_MAX+1]
                                      // w8 <- [0, X+1]
      
      In order to keep the bounds accurate at this point we also need to track
      ALU32 ops. To do this we add explicit ALU32 logic for each of the ALU
      ops, mov, add, sub, etc.
      
      Finally there is a question of how and when to merge bounds. The cases
      enumerate here,
      
      1. MOV ALU32   - zext 32-bit -> 64-bit
      2. MOV ALU64   - copy 64-bit -> 32-bit
      3. op  ALU32   - zext 32-bit -> 64-bit
      4. op  ALU64   - n/a
      5. jmp ALU32   - 64-bit: var32_off | upper_32_bits(var64_off)
      6. jmp ALU64   - 32-bit: (>> (<< var64_off))
      
      Details for each case,
      
      For "MOV ALU32" BPF arch zero extends so we simply copy the bounds
      from 32-bit into 64-bit ensuring we truncate var_off and 64-bit
      bounds correctly. See zext_32_to_64.
      
      For "MOV ALU64" copy all bounds including 32-bit into new register. If
      the src register had 32-bit bounds the dst register will as well.
      
      For "op ALU32" zero extend 32-bit into 64-bit the same as move,
      see zext_32_to_64.
      
      For "op ALU64" calculate both 32-bit and 64-bit bounds no merging
      is done here. Except we have a special case. When RSH or ARSH is
      done we can't simply ignore shifting bits from 64-bit reg into the
      32-bit subreg. So currently just push bounds from 64-bit into 32-bit.
      This will be correct in the sense that they will represent a valid
      state of the register. However we could lose some accuracy if an
      ARSH is following a jmp32 operation. We can handle this special
      case in a follow up series.
      
      For "jmp ALU32" mark 64-bit reg unknown and recalculate 64-bit bounds
      from tnum by setting var_off to ((<<(>>var_off)) | var32_off). We
      special case if 64-bit bounds has zero'd upper 32bits at which point
      we can simply copy 32-bit bounds into 64-bit register. This catches
      a common compiler trick where upper 32-bits are zeroed and then
      32-bit ops are used followed by a 64-bit compare or 64-bit op on
      a pointer. See __reg_combine_64_into_32().
      
      For "jmp ALU64" cast the bounds of the 64bit to their 32-bit
      counterpart. For example s32_min_value = (s32)reg->smin_value. For
      tnum use only the lower 32bits via, (>>(<<var_off)). See
      __reg_combine_64_into_32().
      
      Signed-off-by: default avatarJohn Fastabend <john.fastabend@gmail.com>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/158560419880.10843.11448220440809118343.stgit@john-Precision-5820-Tower
      3f50f132
    • John Fastabend's avatar
      bpf: Verifier, do_refine_retval_range may clamp umin to 0 incorrectly · 10060503
      John Fastabend authored
      
      do_refine_retval_range() is called to refine return values from specified
      helpers, probe_read_str and get_stack at the moment, the reasoning is
      because both have a max value as part of their input arguments and
      because the helper ensure the return value will not be larger than this
      we can set smax values of the return register, r0.
      
      However, the return value is a signed integer so setting umax is incorrect
      It leads to further confusion when the do_refine_retval_range() then calls,
      __reg_deduce_bounds() which will see a umax value as meaning the value is
      unsigned and then assuming it is unsigned set the smin = umin which in this
      case results in 'smin = 0' and an 'smax = X' where X is the input argument
      from the helper call.
      
      Here are the comments from _reg_deduce_bounds() on why this would be safe
      to do.
      
       /* Learn sign from unsigned bounds.  Signed bounds cross the sign
        * boundary, so we must be careful.
        */
       if ((s64)reg->umax_value >= 0) {
      	/* Positive.  We can't learn anything from the smin, but smax
      	 * is positive, hence safe.
      	 */
      	reg->smin_value = reg->umin_value;
      	reg->smax_value = reg->umax_value = min_t(u64, reg->smax_value,
      						  reg->umax_value);
      
      But now we incorrectly have a return value with type int with the
      signed bounds (0,X). Suppose the return value is negative, which is
      possible the we have the verifier and reality out of sync. Among other
      things this may result in any error handling code being falsely detected
      as dead-code and removed. For instance the example below shows using
      bpf_probe_read_str() causes the error path to be identified as dead
      code and removed.
      
      >From the 'llvm-object -S' dump,
      
       r2 = 100
       call 45
       if r0 s< 0 goto +4
       r4 = *(u32 *)(r7 + 0)
      
      But from dump xlate
      
        (b7) r2 = 100
        (85) call bpf_probe_read_compat_str#-96768
        (61) r4 = *(u32 *)(r7 +0)  <-- dropped if goto
      
      Due to verifier state after call being
      
       R0=inv(id=0,umax_value=100,var_off=(0x0; 0x7f))
      
      To fix omit setting the umax value because its not safe. The only
      actual bounds we know is the smax. This results in the correct bounds
      (SMIN, X) where X is the max length from the helper. After this the
      new verifier state looks like the following after call 45.
      
      R0=inv(id=0,smax_value=100)
      
      Then xlated version no longer removed dead code giving the expected
      result,
      
        (b7) r2 = 100
        (85) call bpf_probe_read_compat_str#-96768
        (c5) if r0 s< 0x0 goto pc+4
        (61) r4 = *(u32 *)(r7 +0)
      
      Note, bpf_probe_read_* calls are root only so we wont hit this case
      with non-root bpf users.
      
      v3: comment had some documentation about meta set to null case which
      is not relevant here and confusing to include in the comment.
      
      v2 note: In original version we set msize_smax_value from check_func_arg()
      and propagated this into smax of retval. The logic was smax is the bound
      on the retval we set and because the type in the helper is ARG_CONST_SIZE
      we know that the reg is a positive tnum_const() so umax=smax. Alexei
      pointed out though this is a bit odd to read because the register in
      check_func_arg() has a C type of u32 and the umax bound would be the
      normally relavent bound here. Pulling in extra knowledge about future
      checks makes reading the code a bit tricky. Further having a signed
      meta data that can only ever be positive is also a bit odd. So dropped
      the msize_smax_value metadata and made it a u64 msize_max_value to
      indicate its unsigned. And additionally save bound from umax value in
      check_arg_funcs which is the same as smax due to as noted above tnumx_cont
      and negative check but reads better. By my analysis nothing functionally
      changes in v2 but it does get easier to read so that is win.
      
      Fixes: 849fa506 ("bpf/verifier: refine retval R0 state for bpf_get_stack helper")
      Signed-off-by: default avatarJohn Fastabend <john.fastabend@gmail.com>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/158560417900.10843.14351995140624628941.stgit@john-Precision-5820-Tower
      10060503
    • KP Singh's avatar
      bpf: btf: Fix arg verification in btf_ctx_access() · f50b49a0
      KP Singh authored
      
      The bounds checking for the arguments accessed in the BPF program breaks
      when the expected_attach_type is not BPF_TRACE_FEXIT, BPF_LSM_MAC or
      BPF_MODIFY_RETURN resulting in no check being done for the default case
      (the programs which do not receive the return value of the attached
      function in its arguments) when the index of the argument being accessed
      is equal to the number of arguments (nr_args).
      
      This was a result of a misplaced "else if" block  introduced by the
      Commit 6ba43b76 ("bpf: Attachment verification for
      BPF_MODIFY_RETURN")
      
      Fixes: 6ba43b76 ("bpf: Attachment verification for BPF_MODIFY_RETURN")
      Reported-by: default avatarJann Horn <jannh@google.com>
      Signed-off-by: default avatarKP Singh <kpsingh@google.com>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20200330144246.338-1-kpsingh@chromium.org
      f50b49a0
    • Jann Horn's avatar
      bpf: Simplify reg_set_min_max_inv handling · 0fc31b10
      Jann Horn authored
      
      reg_set_min_max_inv() contains exactly the same logic as reg_set_min_max(),
      just flipped around. While this makes sense in a cBPF verifier (where ALU
      operations are not symmetric), it does not make sense for eBPF.
      
      Replace reg_set_min_max_inv() with a helper that flips the opcode around,
      then lets reg_set_min_max() do the complicated work.
      
      Signed-off-by: default avatarJann Horn <jannh@google.com>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20200330160324.15259-4-daniel@iogearbox.net
      0fc31b10
    • Jann Horn's avatar
      bpf: Fix tnum constraints for 32-bit comparisons · 604dca5e
      Jann Horn authored
      
      The BPF verifier tried to track values based on 32-bit comparisons by
      (ab)using the tnum state via 581738a6 ("bpf: Provide better register
      bounds after jmp32 instructions"). The idea is that after a check like
      this:
      
          if ((u32)r0 > 3)
            exit
      
      We can't meaningfully constrain the arithmetic-range-based tracking, but
      we can update the tnum state to (value=0,mask=0xffff'ffff'0000'0003).
      However, the implementation from 581738a6 didn't compute the tnum
      constraint based on the fixed operand, but instead derives it from the
      arithmetic-range-based tracking. This means that after the following
      sequence of operations:
      
          if (r0 >= 0x1'0000'0001)
            exit
          if ((u32)r0 > 7)
            exit
      
      The verifier assumed that the lower half of r0 is in the range (0, 0)
      and apply the tnum constraint (value=0,mask=0xffff'ffff'0000'0000) thus
      causing the overall tnum to be (value=0,mask=0x1'0000'0000), which was
      incorrect. Provide a fixed implementation.
      
      Fixes: 581738a6 ("bpf: Provide better register bounds after jmp32 instructions")
      Signed-off-by: default avatarJann Horn <jannh@google.com>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20200330160324.15259-3-daniel@iogearbox.net
      604dca5e
    • Daniel Borkmann's avatar
      bpf: Undo incorrect __reg_bound_offset32 handling · f2d67fec
      Daniel Borkmann authored
      Anatoly has been fuzzing with kBdysch harness and reported a hang in
      one of the outcomes:
      
        0: (b7) r0 = 808464432
        1: (7f) r0 >>= r0
        2: (14) w0 -= 808464432
        3: (07) r0 += 808464432
        4: (b7) r1 = 808464432
        5: (de) if w1 s<= w0 goto pc+0
         R0_w=invP(id=0,umin_value=808464432,umax_value=5103431727,var_off=(0x30303020;0x10000001f)) R1_w=invP808464432 R10=fp0
        6: (07) r0 += -2144337872
        7: (14) w0 -= -1607454672
        8: (25) if r0 > 0x30303030 goto pc+0
         R0_w=invP(id=0,umin_value=271581184,umax_value=271581311,var_off=(0x10300000;0x7f)) R1_w=invP808464432 R10=fp0
        9: (76) if w0 s>= 0x303030 goto pc+2
        12: (95) exit
      
        from 8 to 9: safe
      
        from 5 to 6: R0_w=invP(id=0,umin_value=808464432,umax_value=5103431727,var_off=(0x30303020;0x10000001f)) R1_w=invP808464432 R10=fp0
        6: (07) r0 += -2144337872
        7: (14) w0 -= -1607454672
        8: (25) if r0 > 0x30303030 goto pc+0
         R0_w=invP(id=0,umin_value=271581184,umax_value=271581311,var_off=(0x10300000;0x7f)) R1_w=invP808464432 R10=fp0
        9: safe
      
        from 8 to 9: safe
        verification time 589 usec
        stack depth 0
        processed 17 insns (limit 1000000) [...]
      
      The underlying program was xlated as follows:
      
        # bpftool p d x i 9
         0: (b7) r0 = 808464432
         1: (7f) r0 >>= r0
         2: (14) w0 -= 808464432
         3: (07) r0 += 808464432
         4: (b7) r1 = 808464432
         5: (de) if w1 s<= w0 goto pc+0
         6: (07) r0 += -2144337872
         7: (14) w0 -= -1607454672
         8: (25) if r0 > 0x30303030 goto pc+0
         9: (76) if w0 s>= 0x303030 goto pc+2
        10: (05) goto pc-1
        11: (05) goto pc-1
        12: (95) exit
      
      The verifier rewrote original instructions it recognized as dead code with
      'goto pc-1', but reality differs from verifier simulation in that we're
      actually able to trigger a hang due to hitting the 'goto pc-1' instructions.
      
      Taking different examples to make the issue more obvious: in this example
      we're probing bounds on a completely unknown scalar variable in r1:
      
        [...]
        5: R0_w=inv1 R1_w=inv(id=0) R10=fp0
        5: (18) r2 = 0x4000000000
        7: R0_w=inv1 R1_w=inv(id=0) R2_w=inv274877906944 R10=fp0
        7: (18) r3 = 0x2000000000
        9: R0_w=inv1 R1_w=inv(id=0) R2_w=inv274877906944 R3_w=inv137438953472 R10=fp0
        9: (18) r4 = 0x400
        11: R0_w=inv1 R1_w=inv(id=0) R2_w=inv274877906944 R3_w=inv137438953472 R4_w=inv1024 R10=fp0
        11: (18) r5 = 0x200
        13: R0_w=inv1 R1_w=inv(id=0) R2_w=inv274877906944 R3_w=inv137438953472 R4_w=inv1024 R5_w=inv512 R10=fp0
        13: (2d) if r1 > r2 goto pc+4
         R0_w=inv1 R1_w=inv(id=0,umax_value=274877906944,var_off=(0x0; 0x7fffffffff)) R2_w=inv274877906944 R3_w=inv137438953472 R4_w=inv1024 R5_w=inv512 R10=fp0
        14: R0_w=inv1 R1_w=inv(id=0,umax_value=274877906944,var_off=(0x0; 0x7fffffffff)) R2_w=inv274877906944 R3_w=inv137438953472 R4_w=inv1024 R5_w=inv512 R10=fp0
        14: (ad) if r1 < r3 goto pc+3
         R0_w=inv1 R1_w=inv(id=0,umin_value=137438953472,umax_value=274877906944,var_off=(0x0; 0x7fffffffff)) R2_w=inv274877906944 R3_w=inv137438953472 R4_w=inv1024 R5_w=inv512 R10=fp0
        15: R0=inv1 R1=inv(id=0,umin_value=137438953472,umax_value=274877906944,var_off=(0x0; 0x7fffffffff)) R2=inv274877906944 R3=inv137438953472 R4=inv1024 R5=inv512 R10=fp0
        15: (2e) if w1 > w4 goto pc+2
         R0=inv1 R1=inv(id=0,umin_value=137438953472,umax_value=274877906944,var_off=(0x0; 0x7f00000000)) R2=inv274877906944 R3=inv137438953472 R4=inv1024 R5=inv512 R10=fp0
        16: R0=inv1 R1=inv(id=0,umin_value=137438953472,umax_value=274877906944,var_off=(0x0; 0x7f00000000)) R2=inv274877906944 R3=inv137438953472 R4=inv1024 R5=inv512 R10=fp0
        16: (ae) if w1 < w5 goto pc+1
         R0=inv1 R1=inv(id=0,umin_value=137438953472,umax_value=274877906944,var_off=(0x0; 0x7f00000000)) R2=inv274877906944 R3=inv137438953472 R4=inv1024 R5=inv512 R10=fp0
        [...]
      
      We're first probing lower/upper bounds via jmp64, later we do a similar
      check via jmp32 and examine the resulting var_off there. After fall-through
      in insn 14, we get the following bounded r1 with 0x7fffffffff unknown marked
      bits in the variable section.
      
      Thus, after knowing r1 <= 0x4000000000 and r1 >= 0x2000000000:
      
        max: 0b100000000000000000000000000000000000000 / 0x4000000000
        var: 0b111111111111111111111111111111111111111 / 0x7fffffffff
        min: 0b010000000000000000000000000000000000000 / 0x2000000000
      
      Now, in insn 15 and 16, we perform a similar probe with lower/upper bounds
      in jmp32.
      
      Thus, after knowing r1 <= 0x4000000000 and r1 >= 0x2000000000 and
                          w1 <= 0x400        and w1 >= 0x200:
      
        max: 0b100000000000000000000000000000000000000 / 0x4000000000
        var: 0b111111100000000000000000000000000000000 / 0x7f00000000
        min: 0b010000000000000000000000000000000000000 / 0x2000000000
      
      The lower/upper bounds haven't changed since they have high bits set in
      u64 space and the jmp32 tests can only refine bounds in the low bits.
      
      However, for the var part the expectation would have been 0x7f000007ff
      or something less precise up to 0x7fffffffff. A outcome of 0x7f00000000
      is not correct since it would contradict the earlier probed bounds
      where we know that the result should have been in [0x200,0x400] in u32
      space. Therefore, tests with such info will lead to wrong verifier
      assumptions later on like falsely predicting conditional jumps to be
      always taken, etc.
      
      The issue here is that __reg_bound_offset32()'s implementation from
      commit 581738a6 ("bpf: Provide better register bounds after jmp32
      instructions") makes an incorrect range assumption:
      
        static void __reg_bound_offset32(struct bpf_reg_state *reg)
        {
              u64 mask = 0xffffFFFF;
              struct tnum range = tnum_range(reg->umin_value & mask,
                                             reg->umax_value & mask);
              struct tnum lo32 = tnum_cast(reg->var_off, 4);
              struct tnum hi32 = tnum_lshift(tnum_rshift(reg->var_off, 32), 32);
      
              reg->var_off = tnum_or(hi32, tnum_intersect(lo32, range));
        }
      
      In the above walk-through example, __reg_bound_offset32() as-is chose
      a range after masking with 0xffffffff of [0x0,0x0] since umin:0x2000000000
      and umax:0x4000000000 and therefore the lo32 part was clamped to 0x0 as
      well. However, in the umin:0x2000000000 and umax:0x4000000000 range above
      we'd end up with an actual possible interval of [0x0,0xffffffff] for u32
      space instead.
      
      In case of the original reproducer, the situation looked as follows at
      insn 5 for r0:
      
        [...]
        5: R0_w=invP(id=0,umin_value=808464432,umax_value=5103431727,var_off=(0x0; 0x1ffffffff)) R1_w=invP808464432 R10=fp0
                                     0x30303030           0x13030302f
        5: (de) if w1 s<= w0 goto pc+0
         R0_w=invP(id=0,umin_value=808464432,umax_value=5103431727,var_off=(0x30303020; 0x10000001f)) R1_w=invP808464432 R10=fp0
                                   0x30303030           0x13030302f
        [...]
      
      After the fall-through, we similarly forced the var_off result into
      the wrong range [0x30303030,0x3030302f] suggesting later on that fixed
      bits must only be of 0x30303020 with 0x10000001f unknowns whereas such
      assumption can only be made when both bounds in hi32 range match.
      
      Originally, I was thinking to fix this by moving reg into a temp reg and
      use proper coerce_reg_to_size() helper on the temp reg where we can then
      based on that define the range tnum for later intersection:
      
        static void __reg_bound_offset32(struct bpf_reg_state *reg)
        {
              struct bpf_reg_state tmp = *reg;
              struct tnum lo32, hi32, range;
      
              coerce_reg_to_size(&tmp, 4);
              range = tnum_range(tmp.umin_value, tmp.umax_value);
              lo32 = tnum_cast(reg->var_off, 4);
              hi32 = tnum_lshift(tnum_rshift(reg->var_off, 32), 32);
              reg->var_off = tnum_or(hi32, tnum_intersect(lo32, range));
        }
      
      In the case of the concrete example, this gives us a more conservative unknown
      section. Thus, after knowing r1 <= 0x4000000000 and r1 >= 0x2000000000 and
                                   w1 <= 0x400        and w1 >= 0x200:
      
        max: 0b100000000000000000000000000000000000000 / 0x4000000000
        var: 0b111111111111111111111111111111111111111 / 0x7fffffffff
        min: 0b010000000000000000000000000000000000000 / 0x2000000000
      
      However, above new __reg_bound_offset32() has no effect on refining the
      knowledge of the register contents. Meaning, if the bounds in hi32 range
      mismatch we'll get the identity function given the range reg spans
      [0x0,0xffffffff] and we cast var_off into lo32 only to later on binary
      or it again with the hi32.
      
      Likewise, if the bounds in hi32 range match, then we mask both bounds
      with 0xffffffff, use the resulting umin/umax for the range to later
      intersect the lo32 with it. However, _prior_ called __reg_bound_offset()
      did already such intersection on the full reg and we therefore would only
      repeat the same operation on the lo32 part twice.
      
      Given this has no effect and the original commit had false assumptions,
      this patch reverts the code entirely which is also more straight forward
      for stable trees: apparently 581738a6 got auto-selected by Sasha's
      ML system and misclassified as a fix, so it got sucked into v5.4 where
      it should never have landed. A revert is low-risk also from a user PoV
      since it requires a recent kernel and llc to opt-into -mcpu=v3 BPF CPU
      to generate jmp32 instructions. A proper bounds refinement would need a
      significantly more complex approach which is currently being worked, but
      no stable material [0]. Hence revert is best option for stable. After the
      revert, the original reported program gets rejected as follows:
      
        1: (7f) r0 >>= r0
        2: (14) w0 -= 808464432
        3: (07) r0 += 808464432
        4: (b7) r1 = 808464432
        5: (de) if w1 s<= w0 goto pc+0
         R0_w=invP(id=0,umin_value=808464432,umax_value=5103431727,var_off=(0x0; 0x1ffffffff)) R1_w=invP808464432 R10=fp0
        6: (07) r0 += -2144337872
        7: (14) w0 -= -1607454672
        8: (25) if r0 > 0x30303030 goto pc+0
         R0_w=invP(id=0,umax_value=808464432,var_off=(0x0; 0x3fffffff)) R1_w=invP808464432 R10=fp0
        9: (76) if w0 s>= 0x303030 goto pc+2
         R0=invP(id=0,umax_value=3158063,var_off=(0x0; 0x3fffff)) R1=invP808464432 R10=fp0
        10: (30) r0 = *(u8 *)skb[808464432]
        BPF_LD_[ABS|IND] uses reserved fields
        processed 11 insns (limit 1000000) [...]
      
        [0] https://lore.kernel.org/bpf/158507130343.15666.8018068546764556975.stgit@john-Precision-5820-Tower/T/
      
      
      
      Fixes: 581738a6 ("bpf: Provide better register bounds after jmp32 instructions")
      Reported-by: default avatarAnatoly Trosinenko <anatoly.trosinenko@gmail.com>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20200330160324.15259-2-daniel@iogearbox.net
      f2d67fec
    • Sven Schnelle's avatar
      seccomp: Add missing compat_ioctl for notify · 3db81afd
      Sven Schnelle authored
      
      Executing the seccomp_bpf testsuite under a 64-bit kernel with 32-bit
      userland (both s390 and x86) doesn't work because there's no compat_ioctl
      handler defined. Add the handler.
      
      Signed-off-by: default avatarSven Schnelle <svens@linux.ibm.com>
      Fixes: 6a21cc50 ("seccomp: add a return code to trap to userspace")
      Cc: stable@vger.kernel.org
      Link: https://lore.kernel.org/r/20200310123332.42255-1-svens@linux.ibm.com
      
      
      Signed-off-by: default avatarKees Cook <keescook@chromium.org>
      3db81afd
  7. Mar 29, 2020
  8. Mar 28, 2020
Loading