  1. Mar 03, 2025
    • netem: Update sch->q.qlen before qdisc_tree_reduce_backlog() · b174bb76
      Cong Wang authored and Frieder Schrempf committed
      
      [ Upstream commit 638ba5089324796c2ee49af10427459c2de35f71 ]
      
      qdisc_tree_reduce_backlog() notifies parent qdisc only if child
      qdisc becomes empty, therefore we need to reduce the backlog of the
      child qdisc before calling it. Otherwise it would miss the opportunity
      to call cops->qlen_notify(), in the case of DRR, it resulted in UAF
      since DRR uses ->qlen_notify() to maintain its active list.
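
The ordering issue can be illustrated with a toy model. This is only a sketch: the struct and function names below are hypothetical stand-ins, not the kernel's definitions, and the notification rule is reduced to "notify only if the child is already empty".

```c
#include <stdbool.h>

/* Toy stand-ins for kernel structures; all names here are hypothetical. */
struct toy_qdisc {
	unsigned int qlen;     /* packets currently held by this child */
	bool parent_notified;  /* set once the parent learns the child is empty */
};

/* Mimics the rule from the commit message: the parent is notified
 * only if the child is already empty when this function runs. */
static void toy_tree_reduce_backlog(struct toy_qdisc *child)
{
	if (child->qlen == 0)
		child->parent_notified = true;
}

/* Buggy order: notify first, update qlen afterwards. */
static bool drop_last_packet_buggy(struct toy_qdisc *child)
{
	toy_tree_reduce_backlog(child); /* qlen is still 1: no notification */
	child->qlen--;
	return child->parent_notified;
}

/* Fixed order: update qlen first, then notify. */
static bool drop_last_packet_fixed(struct toy_qdisc *child)
{
	child->qlen--;
	toy_tree_reduce_backlog(child); /* qlen is 0: parent is notified */
	return child->parent_notified;
}
```

With a child holding one packet, the buggy variant leaves the parent unaware that the child became empty, which is the missed qlen_notify() opportunity described above.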
      
      Fixes: f8d4bc455047 ("net/sched: netem: account for backlog updates from child qdisc")
      Cc: Martin Ottens <martin.ottens@fau.de>
      Reported-by: Mingi Cho <mincho@theori.io>
      Signed-off-by: Cong Wang <cong.wang@bytedance.com>
      Link: https://patch.msgid.link/20250204005841.223511-4-xiyou.wangcong@gmail.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • net: sched: Disallow replacing of child qdisc from one parent to another · 0a456cb9
      Jamal Hadi Salim authored and Frieder Schrempf committed
      
      [ Upstream commit bc50835e83f60f56e9bec2b392fb5544f250fb6f ]
      
      Lion Ackermann was able to create a UAF which can be abused for privilege
      escalation with the following script
      
      Step 1. create root qdisc
      tc qdisc add dev lo root handle 1:0 drr
      
      step2. a class for packet aggregation to demonstrate uaf
      tc class add dev lo classid 1:1 drr
      
      step3. a class for nesting
      tc class add dev lo classid 1:2 drr
      
      step4. a class to graft qdisc to
      tc class add dev lo classid 1:3 drr
      
      step5.
      tc qdisc add dev lo parent 1:1 handle 2:0 plug limit 1024
      
      step6.
      tc qdisc add dev lo parent 1:2 handle 3:0 drr
      
      step7.
      tc class add dev lo classid 3:1 drr
      
      step 8.
      tc qdisc add dev lo parent 3:1 handle 4:0 pfifo
      
      step 9. Display the class/qdisc layout
      
      tc class ls dev lo
       class drr 1:1 root leaf 2: quantum 64Kb
       class drr 1:2 root leaf 3: quantum 64Kb
       class drr 3:1 root leaf 4: quantum 64Kb
      
      tc qdisc ls
       qdisc drr 1: dev lo root refcnt 2
       qdisc plug 2: dev lo parent 1:1
       qdisc pfifo 4: dev lo parent 3:1 limit 1000p
       qdisc drr 3: dev lo parent 1:2
      
      step10. trigger the bug <=== prevented by this patch
      tc qdisc replace dev lo parent 1:3 handle 4:0
      
      step 11. Redisplay again the qdiscs/classes
      
      tc class ls dev lo
       class drr 1:1 root leaf 2: quantum 64Kb
       class drr 1:2 root leaf 3: quantum 64Kb
       class drr 1:3 root leaf 4: quantum 64Kb
       class drr 3:1 root leaf 4: quantum 64Kb
      
      tc qdisc ls
       qdisc drr 1: dev lo root refcnt 2
       qdisc plug 2: dev lo parent 1:1
       qdisc pfifo 4: dev lo parent 3:1 refcnt 2 limit 1000p
       qdisc drr 3: dev lo parent 1:2
      
      Observe that a) parent for 4:0 does not change despite the replace request.
      There can only be one parent.  b) refcount has gone up to two for 4:0 and
      c) both class 1:3 and 3:1 are pointing to it.
      
      Step 12.  send one packet to plug
      echo "" | socat -u STDIN UDP4-DATAGRAM:127.0.0.1:8888,priority=$((0x10001))
      step13.  send one packet to the grafted fifo
      echo "" | socat -u STDIN UDP4-DATAGRAM:127.0.0.1:8888,priority=$((0x10003))
      
      step14. lets trigger the uaf
      tc class delete dev lo classid 1:3
      tc class delete dev lo classid 1:1
      
      The semantics of "replace" is for a del/add _on the same node_ and not
      a delete from one node(3:1) and add to another node (1:3) as in step10.
      While we could "fix" with a more complex approach there could be
      consequences to expectations so the patch takes the preventive approach of
      "disallow such config".
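
The preventive check can be sketched as follows. This is a simplified illustration, not the patch itself: the struct, macro, and function names are hypothetical, and handles are written in the usual major:minor form packed into 32 bits.

```c
#include <errno.h>
#include <stdint.h>

/* Pack a "major:minor" handle into 32 bits (hypothetical helper). */
#define TOY_HANDLE(maj, min) (((uint32_t)(maj) << 16) | (uint32_t)(min))

/* Toy qdisc: only the fields needed for the check. */
struct toy_qdisc {
	uint32_t handle; /* e.g. 4:0 */
	uint32_t parent; /* class the qdisc is currently attached to */
};

/* "replace" is only accepted on the node the qdisc already hangs off.
 * A request naming a different parent is rejected instead of grafting
 * the same qdisc under two classes. */
static int toy_check_replace(const struct toy_qdisc *existing,
			     uint32_t requested_parent)
{
	if (existing->parent != requested_parent)
		return -EINVAL;
	return 0;
}
```

In the scenario above, qdisc 4:0 sits under 3:1, so a replace request naming parent 1:3 (step10) would be refused, while a replace naming 3:1 would pass this check.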
      
      Joint work with Lion Ackermann <nnamrec@gmail.com>
      Fixes: 1da177e4 ("Linux-2.6.12-rc2")
      Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
      Reviewed-by: Simon Horman <horms@kernel.org>
      Link: https://patch.msgid.link/20250116013713.900000-1-kuba@kernel.org
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • net_sched: sch_sfq: don't allow 1 packet limit · a7dafaf0
      Octavian Purdila authored and Frieder Schrempf committed
      
      [ Upstream commit 10685681bafce6febb39770f3387621bf5d67d0b ]
      
      The current implementation does not work correctly with a limit of
      1. iproute2 actually checks for this and this patch adds the check in
      kernel as well.
      
      This fixes the following syzkaller reported crash:
      
      UBSAN: array-index-out-of-bounds in net/sched/sch_sfq.c:210:6
      index 65535 is out of range for type 'struct sfq_head[128]'
      CPU: 0 PID: 2569 Comm: syz-executor101 Not tainted 5.10.0-smp-DEV #1
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 09/13/2024
      Call Trace:
        __dump_stack lib/dump_stack.c:79 [inline]
        dump_stack+0x125/0x19f lib/dump_stack.c:120
        ubsan_epilogue lib/ubsan.c:148 [inline]
        __ubsan_handle_out_of_bounds+0xed/0x120 lib/ubsan.c:347
        sfq_link net/sched/sch_sfq.c:210 [inline]
        sfq_dec+0x528/0x600 net/sched/sch_sfq.c:238
        sfq_dequeue+0x39b/0x9d0 net/sched/sch_sfq.c:500
        sfq_reset+0x13/0x50 net/sched/sch_sfq.c:525
        qdisc_reset+0xfe/0x510 net/sched/sch_generic.c:1026
        tbf_reset+0x3d/0x100 net/sched/sch_tbf.c:319
        qdisc_reset+0xfe/0x510 net/sched/sch_generic.c:1026
        dev_reset_queue+0x8c/0x140 net/sched/sch_generic.c:1296
        netdev_for_each_tx_queue include/linux/netdevice.h:2350 [inline]
        dev_deactivate_many+0x6dc/0xc20 net/sched/sch_generic.c:1362
        __dev_close_many+0x214/0x350 net/core/dev.c:1468
        dev_close_many+0x207/0x510 net/core/dev.c:1506
        unregister_netdevice_many+0x40f/0x16b0 net/core/dev.c:10738
        unregister_netdevice_queue+0x2be/0x310 net/core/dev.c:10695
        unregister_netdevice include/linux/netdevice.h:2893 [inline]
        __tun_detach+0x6b6/0x1600 drivers/net/tun.c:689
        tun_detach drivers/net/tun.c:705 [inline]
        tun_chr_close+0x104/0x1b0 drivers/net/tun.c:3640
        __fput+0x203/0x840 fs/file_table.c:280
        task_work_run+0x129/0x1b0 kernel/task_work.c:185
        exit_task_work include/linux/task_work.h:33 [inline]
        do_exit+0x5ce/0x2200 kernel/exit.c:931
        do_group_exit+0x144/0x310 kernel/exit.c:1046
        __do_sys_exit_group kernel/exit.c:1057 [inline]
        __se_sys_exit_group kernel/exit.c:1055 [inline]
        __x64_sys_exit_group+0x3b/0x40 kernel/exit.c:1055
       do_syscall_64+0x6c/0xd0
       entry_SYSCALL_64_after_hwframe+0x61/0xcb
      RIP: 0033:0x7fe5e7b52479
      Code: Unable to access opcode bytes at RIP 0x7fe5e7b5244f.
      RSP: 002b:00007ffd3c800398 EFLAGS: 00000246 ORIG_RAX: 00000000000000e7
      RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fe5e7b52479
      RDX: 000000000000003c RSI: 00000000000000e7 RDI: 0000000000000000
      RBP: 00007fe5e7bcd2d0 R08: ffffffffffffffb8 R09: 0000000000000014
      R10: 0000000000000000 R11: 0000000000000246 R12: 00007fe5e7bcd2d0
      R13: 0000000000000000 R14: 00007fe5e7bcdd20 R15: 00007fe5e7b24270
      
      The crash can also be reproduced with the following (with a tc
      recompiled to allow for sfq limits of 1):
      
      tc qdisc add dev dummy0 handle 1: root tbf rate 1Kbit burst 100b lat 1s
      ../iproute2-6.9.0/tc/tc qdisc add dev dummy0 handle 2: parent 1:10 sfq limit 1
      ifconfig dummy0 up
      ping -I dummy0 -f -c2 -W0.1 8.8.8.8
      sleep 1
      
      Scenario that triggers the crash:
      
      * the first packet is sent and queued in TBF and SFQ; qdisc qlen is 1
      
      * TBF dequeues: it peeks from SFQ which moves the packet to the
        gso_skb list and keeps qdisc qlen set to 1. TBF is out of tokens so
        it schedules itself for later.
      
      * the second packet is sent and TBF tries to queue it to SFQ. qdisc
        qlen is now 2 and because the SFQ limit is 1 the packet is dropped
        by SFQ. At this point qlen is 1, and all of the SFQ slots are empty,
        however q->tail is not NULL.
      
      At this point, assuming no more packets are queued, when sch_dequeue
      runs again it will decrement the qlen for the current empty slot
      causing an underflow and the subsequent out of bounds access.
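
The underflow and the added check can be sketched as below. The assumptions are loud ones: the per-slot queue length is modelled as a 16-bit unsigned value (which is consistent with the reported index 65535), and both function names are hypothetical rather than taken from sch_sfq.c.

```c
#include <errno.h>
#include <stdint.h>

/* Decrementing an empty 16-bit slot length wraps around instead of
 * going negative; the wrapped value is later used as an array index. */
static uint16_t toy_dec_slot_qlen(uint16_t qlen)
{
	return (uint16_t)(qlen - 1);
}

/* Sketch of the kernel-side validation: a limit of exactly 1 is
 * refused, mirroring the check iproute2 already performs. Other
 * values are not modelled here. */
static int toy_sfq_check_limit(unsigned int limit)
{
	if (limit == 1)
		return -EINVAL;
	return 0;
}
```

Rejecting the configuration up front means the dequeue path never reaches the state where an empty slot is decremented.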
      
      Reported-by: syzbot <syzkaller@googlegroups.com>
      Fixes: 1da177e4 ("Linux-2.6.12-rc2")
      Signed-off-by: Octavian Purdila <tavip@google.com>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Link: https://patch.msgid.link/20241204030520.2084663-2-tavip@google.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • net_sched: sch_sfq: handle bigger packets · 5ccb8efc
      Eric Dumazet authored and Frieder Schrempf committed
      
      [ Upstream commit e4650d7ae4252f67e997a632adfae0dd74d3a99a ]
      
      SFQ assumes that it only deals with packets smaller than 64KB.
      
      Even before BIG TCP, TCA_STAB could provide arbitrarily big values
      in qdisc_pkt_len(skb).
      
      It is time to switch (struct sfq_slot)->allot to a 32bit field.
      
      sizeof(struct sfq_slot) is now 64 bytes, giving better cache locality.
      
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
      Link: https://patch.msgid.link/20241008111603.653140-1-edumazet@google.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      Stable-dep-of: 10685681bafc ("net_sched: sch_sfq: don't allow 1 packet limit")
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • net_sched: sch_sfq: annotate data-races around q->perturb_period · 404a198a
      Eric Dumazet authored and Frieder Schrempf committed
      
      [ Upstream commit a17ef9e6 ]
      
      sfq_perturbation() reads q->perturb_period locklessly.
      Add annotations to fix potential issues.
      
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reviewed-by: Simon Horman <horms@kernel.org>
      Link: https://lore.kernel.org/r/20240430180015.3111398-1-edumazet@google.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      Stable-dep-of: 10685681bafc ("net_sched: sch_sfq: don't allow 1 packet limit")
      Signed-off-by: Sasha Levin <sashal@kernel.org>
  2. Feb 03, 2025
    • net: sched: fix ets qdisc OOB Indexing · 1060f352
      Jamal Hadi Salim authored and Frieder Schrempf committed
      
      commit d62b04fca4340a0d468d7853bd66e511935a18cb upstream.
      
      Haowei Yan <g1042620637@gmail.com> found that ets_class_from_arg() can
      index an out-of-bounds class when passed a clid of 0. The overflow
      may lead to local privilege escalation.
      
       [   18.852298] ------------[ cut here ]------------
       [   18.853271] UBSAN: array-index-out-of-bounds in net/sched/sch_ets.c:93:20
       [   18.853743] index 18446744073709551615 is out of range for type 'ets_class [16]'
       [   18.854254] CPU: 0 UID: 0 PID: 1275 Comm: poc Not tainted 6.12.6-dirty #17
       [   18.854821] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014
       [   18.856532] Call Trace:
       [   18.857441]  <TASK>
       [   18.858227]  dump_stack_lvl+0xc2/0xf0
       [   18.859607]  dump_stack+0x10/0x20
       [   18.860908]  __ubsan_handle_out_of_bounds+0xa7/0xf0
       [   18.864022]  ets_class_change+0x3d6/0x3f0
       [   18.864322]  tc_ctl_tclass+0x251/0x910
       [   18.864587]  ? lock_acquire+0x5e/0x140
       [   18.865113]  ? __mutex_lock+0x9c/0xe70
       [   18.866009]  ? __mutex_lock+0xa34/0xe70
       [   18.866401]  rtnetlink_rcv_msg+0x170/0x6f0
       [   18.866806]  ? __lock_acquire+0x578/0xc10
       [   18.867184]  ? __pfx_rtnetlink_rcv_msg+0x10/0x10
       [   18.867503]  netlink_rcv_skb+0x59/0x110
       [   18.867776]  rtnetlink_rcv+0x15/0x30
       [   18.868159]  netlink_unicast+0x1c3/0x2b0
       [   18.868440]  netlink_sendmsg+0x239/0x4b0
       [   18.868721]  ____sys_sendmsg+0x3e2/0x410
       [   18.869012]  ___sys_sendmsg+0x88/0xe0
       [   18.869276]  ? rseq_ip_fixup+0x198/0x260
       [   18.869563]  ? rseq_update_cpu_node_id+0x10a/0x190
       [   18.869900]  ? trace_hardirqs_off+0x5a/0xd0
       [   18.870196]  ? syscall_exit_to_user_mode+0xcc/0x220
       [   18.870547]  ? do_syscall_64+0x93/0x150
       [   18.870821]  ? __memcg_slab_free_hook+0x69/0x290
       [   18.871157]  __sys_sendmsg+0x69/0xd0
       [   18.871416]  __x64_sys_sendmsg+0x1d/0x30
       [   18.871699]  x64_sys_call+0x9e2/0x2670
       [   18.871979]  do_syscall_64+0x87/0x150
       [   18.873280]  ? do_syscall_64+0x93/0x150
       [   18.874742]  ? lock_release+0x7b/0x160
       [   18.876157]  ? do_user_addr_fault+0x5ce/0x8f0
       [   18.877833]  ? irqentry_exit_to_user_mode+0xc2/0x210
       [   18.879608]  ? irqentry_exit+0x77/0xb0
       [   18.879808]  ? clear_bhb_loop+0x15/0x70
       [   18.880023]  ? clear_bhb_loop+0x15/0x70
       [   18.880223]  ? clear_bhb_loop+0x15/0x70
       [   18.880426]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
       [   18.880683] RIP: 0033:0x44a957
       [   18.880851] Code: ff ff e8 fc 00 00 00 66 2e 0f 1f 84 00 00 00 00 00 66 90 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 2e 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 89 54 24 1c 48 89 74 24 10
       [   18.881766] RSP: 002b:00007ffcdd00fad8 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
       [   18.882149] RAX: ffffffffffffffda RBX: 00007ffcdd010db8 RCX: 000000000044a957
       [   18.882507] RDX: 0000000000000000 RSI: 00007ffcdd00fb70 RDI: 0000000000000003
       [   18.885037] RBP: 00007ffcdd010bc0 R08: 000000000703c770 R09: 000000000703c7c0
       [   18.887203] R10: 0000000000000080 R11: 0000000000000246 R12: 0000000000000001
       [   18.888026] R13: 00007ffcdd010da8 R14: 00000000004ca7d0 R15: 0000000000000001
       [   18.888395]  </TASK>
       [   18.888610] ---[ end trace ]---
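
The index reported in the trace is what an unsigned "clid minus one" yields for a clid of 0 on a 64-bit build. The sketch below shows that wrap and one way to guard the lookup; the types and names are hypothetical simplifications, not the definitions in sch_ets.c.

```c
#include <limits.h>
#include <stddef.h>

#define TOY_MAX_BANDS 16

struct toy_class {
	int quantum;
};

struct toy_sched {
	unsigned int nbands; /* number of configured bands */
	struct toy_class classes[TOY_MAX_BANDS];
};

/* Without a check, a class id of 0 becomes index ULONG_MAX. */
static unsigned long toy_unchecked_index(unsigned long arg)
{
	return arg - 1;
}

/* Guarded lookup: class ids are 1-based, so 0 and anything beyond
 * the configured band count yield NULL instead of a stray pointer. */
static struct toy_class *toy_class_from_arg(struct toy_sched *q,
					    unsigned long arg)
{
	if (arg == 0 || arg > q->nbands)
		return NULL;
	return &q->classes[arg - 1];
}
```

Callers of such a guarded lookup then need to handle the NULL result, for example by rejecting the request.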
      
      Fixes: dcc68b4d ("net: sch_ets: Add a new Qdisc")
      Reported-by: Haowei Yan <g1042620637@gmail.com>
      Suggested-by: Haowei Yan <g1042620637@gmail.com>
      Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Reviewed-by: Petr Machata <petrm@nvidia.com>
      Link: https://patch.msgid.link/20250111145740.74755-1-jhs@mojatatu.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • sched: sch_cake: add bounds checks to host bulk flow fairness counts · 678b1769
      Toke Høiland-Jørgensen authored and Frieder Schrempf committed
      
      [ Upstream commit 737d4d91d35b5f7fa5bb442651472277318b0bfd ]
      
      Even though we fixed a logic error in the commit cited below, syzbot
      still managed to trigger an underflow of the per-host bulk flow
      counters, leading to an out of bounds memory access.
      
      To avoid any such logic errors causing out of bounds memory accesses,
      this commit factors out all accesses to the per-host bulk flow counters
      to a series of helpers that perform bounds-checking before any
      increments and decrements. This also has the benefit of improving
      readability by moving the conditional checks for the flow mode into
      these helpers, instead of having them spread out throughout the
      code (which was the cause of the original logic error).
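
A bounds-checked counter helper of the kind described might look roughly like this. It is a minimal sketch: the helper names and the TOY_MAX_FLOWS cap are hypothetical, and the flow-mode conditions that the real helpers also evaluate are left out.

```c
#include <stdint.h>

/* Hypothetical upper bound for the per-host bulk flow count. */
#define TOY_MAX_FLOWS 1024

/* Decrement only if the counter is non-zero, so a logic error in a
 * caller cannot make it wrap to a huge value. */
static void toy_bulk_count_dec(uint16_t *count)
{
	if (*count > 0)
		(*count)--;
}

/* Increment only while below the cap, so the counter always stays a
 * valid index into tables sized for TOY_MAX_FLOWS + 1 entries. */
static void toy_bulk_count_inc(uint16_t *count)
{
	if (*count < TOY_MAX_FLOWS)
		(*count)++;
}
```

Routing every access through helpers like these keeps the counter within range even if a caller's bookkeeping is wrong.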
      
      As part of this change, the flow quantum calculation is consolidated
      into a helper function, which means that the dithering applied to the
      host load scaling is now applied both in the DRR rotation and when a
      sparse flow's quantum is first initiated. The only user-visible effect
      of this is that the maximum packet size that can be sent while a flow
      stays sparse will now vary with +/- one byte in some cases. This should
      not make a noticeable difference in practice, and thus it's not worth
      complicating the code to preserve the old behaviour.
      
      Fixes: 546ea84d ("sched: sch_cake: fix bulk flow accounting logic for host fairness")
      Reported-by: <syzbot+f63600d288bfb7057424@syzkaller.appspotmail.com>
      Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
      Acked-by: Dave Taht <dave.taht@gmail.com>
      Link: https://patch.msgid.link/20250107120105.70685-1-toke@redhat.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • net_sched: cls_flow: validate TCA_FLOW_RSHIFT attribute · 7aaa58bd
      Eric Dumazet authored and Frieder Schrempf committed
      
      [ Upstream commit a039e54397c6a75b713b9ce7894a62e06956aa92 ]
      
      syzbot found that TCA_FLOW_RSHIFT attribute was not validated.
      Right shifting a 32-bit integer is undefined for shift values of 32 or more.
      
      UBSAN: shift-out-of-bounds in net/sched/cls_flow.c:329:23
      shift exponent 9445 is too large for 32-bit type 'u32' (aka 'unsigned int')
      CPU: 1 UID: 0 PID: 54 Comm: kworker/u8:3 Not tainted 6.13.0-rc3-syzkaller-00180-g4f619d518db9 #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 09/13/2024
      Workqueue: ipv6_addrconf addrconf_dad_work
      Call Trace:
       <TASK>
        __dump_stack lib/dump_stack.c:94 [inline]
        dump_stack_lvl+0x241/0x360 lib/dump_stack.c:120
        ubsan_epilogue lib/ubsan.c:231 [inline]
        __ubsan_handle_shift_out_of_bounds+0x3c8/0x420 lib/ubsan.c:468
        flow_classify+0x24d5/0x25b0 net/sched/cls_flow.c:329
        tc_classify include/net/tc_wrapper.h:197 [inline]
        __tcf_classify net/sched/cls_api.c:1771 [inline]
        tcf_classify+0x420/0x1160 net/sched/cls_api.c:1867
        sfb_classify net/sched/sch_sfb.c:260 [inline]
        sfb_enqueue+0x3ad/0x18b0 net/sched/sch_sfb.c:318
        dev_qdisc_enqueue+0x4b/0x290 net/core/dev.c:3793
        __dev_xmit_skb net/core/dev.c:3889 [inline]
        __dev_queue_xmit+0xf0e/0x3f50 net/core/dev.c:4400
        dev_queue_xmit include/linux/netdevice.h:3168 [inline]
        neigh_hh_output include/net/neighbour.h:523 [inline]
        neigh_output include/net/neighbour.h:537 [inline]
        ip_finish_output2+0xd41/0x1390 net/ipv4/ip_output.c:236
        iptunnel_xmit+0x55d/0x9b0 net/ipv4/ip_tunnel_core.c:82
        udp_tunnel_xmit_skb+0x262/0x3b0 net/ipv4/udp_tunnel_core.c:173
        geneve_xmit_skb drivers/net/geneve.c:916 [inline]
        geneve_xmit+0x21dc/0x2d00 drivers/net/geneve.c:1039
        __netdev_start_xmit include/linux/netdevice.h:5002 [inline]
        netdev_start_xmit include/linux/netdevice.h:5011 [inline]
        xmit_one net/core/dev.c:3590 [inline]
        dev_hard_start_xmit+0x27a/0x7d0 net/core/dev.c:3606
        __dev_queue_xmit+0x1b73/0x3f50 net/core/dev.c:4434
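
The validation amounts to refusing shift counts that a 32-bit value cannot take. A minimal sketch, with hypothetical function names and no claim about how the kernel patch expresses the check:

```c
#include <errno.h>
#include <stdint.h>

/* A 32-bit value may only be shifted by 0..31 bits. */
static int toy_check_rshift(uint32_t rshift)
{
	return rshift > 31 ? -EINVAL : 0;
}

/* Apply the shift only after the count has been validated. */
static int toy_apply_rshift(uint32_t value, uint32_t rshift, uint32_t *out)
{
	int err = toy_check_rshift(rshift);

	if (err)
		return err;
	*out = value >> rshift;
	return 0;
}
```

With this check, the shift exponent 9445 from the report above is rejected before it ever reaches the shift.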
      
      Fixes: e5dfb815 ("[NET_SCHED]: Add flow classifier")
      Reported-by: <syzbot+1dbb57d994e54aaa04d2@syzkaller.appspotmail.com>
      Closes: https://lore.kernel.org/netdev/6777bf49.050a0220.178762.0040.GAE@google.com/T/#u
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Link: https://patch.msgid.link/20250103104546.3714168-1-edumazet@google.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
  3. Jan 14, 2025
    • net: sched: fix ordering of qlen adjustment · 1cc6a9d7
      Lion Ackermann authored and Frieder Schrempf committed
      
      commit 5eb7de8cd58e73851cd37ff8d0666517d9926948 upstream.
      
      Changes to sch->q.qlen around qdisc_tree_reduce_backlog() need to happen
      _before_ a call to said function because otherwise it may fail to notify
      parent qdiscs when the child is about to become empty.
      
      Signed-off-by: Lion Ackermann <nnamrec@gmail.com>
      Acked-by: Toke Høiland-Jørgensen <toke@toke.dk>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Cc: Artem Metla <ametla@google.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • net/sched: netem: account for backlog updates from child qdisc · 8c9243af
      Martin Ottens authored and Frieder Schrempf committed
      
      [ Upstream commit f8d4bc455047cf3903cd6f85f49978987dbb3027 ]
      
      In general, 'qlen' of any classful qdisc should keep track of the
      number of packets that the qdisc itself and all of its children hold.
      In case of netem, 'qlen' only accounts for the packets in its internal
      tfifo. When netem is used with a child qdisc, the child qdisc can use
      'qdisc_tree_reduce_backlog' to inform its parent, netem, about created
      or dropped SKBs. This function updates 'qlen' and the backlog statistics
      of netem, but netem does not account for changes made by a child qdisc.
      'qlen' then indicates the wrong number of packets in the tfifo.
      If a child qdisc creates new SKBs during enqueue and informs its parent
      about this, netem's 'qlen' value is increased. When netem dequeues the
      newly created SKBs from the child, the 'qlen' in netem is not updated.
      If 'qlen' reaches the configured sch->limit, the enqueue function stops
      working, even though the tfifo is not full.
      
      Reproduce the bug:
      Ensure that the sender machine has GSO enabled. Configure netem as root
      qdisc and tbf as its child on the outgoing interface of the machine
      as follows:
      $ tc qdisc add dev <oif> root handle 1: netem delay 100ms limit 100
      $ tc qdisc add dev <oif> parent 1:0 tbf rate 50Mbit burst 1542 latency 50ms
      
      Send bulk TCP traffic out via this interface, e.g., by running an iPerf3
      client on the machine. Check the qdisc statistics:
      $ tc -s qdisc show dev <oif>
      
      Statistics after 10s of iPerf3 TCP test before the fix (note that
      netem's backlog > limit, netem stopped accepting packets):
      qdisc netem 1: root refcnt 2 limit 1000 delay 100ms
       Sent 2767766 bytes 1848 pkt (dropped 652, overlimits 0 requeues 0)
       backlog 4294528236b 1155p requeues 0
      qdisc tbf 10: parent 1:1 rate 50Mbit burst 1537b lat 50ms
       Sent 2767766 bytes 1848 pkt (dropped 327, overlimits 7601 requeues 0)
       backlog 0b 0p requeues 0
      
      Statistics after the fix:
      qdisc netem 1: root refcnt 2 limit 1000 delay 100ms
       Sent 37766372 bytes 24974 pkt (dropped 9, overlimits 0 requeues 0)
       backlog 0b 0p requeues 0
      qdisc tbf 10: parent 1:1 rate 50Mbit burst 1537b lat 50ms
       Sent 37766372 bytes 24974 pkt (dropped 327, overlimits 96017 requeues 0)
       backlog 0b 0p requeues 0
      
      tbf segments the GSO SKBs (tbf_segment) and updates the netem's 'qlen'.
      The interface fully stops transferring packets and "locks". In this case,
      the child qdisc and tfifo are empty, but 'qlen' indicates the tfifo is at
      its limit and no more packets are accepted.
      
      This patch adds a counter for the entries in the tfifo. Netem's 'qlen' is
      only decreased when a packet is returned by its dequeue function, and not
      during enqueuing into the child qdisc. External updates to 'qlen' are thus
      accounted for and only the behavior of the backlog statistics changes. As
      in other qdiscs, 'qlen' then keeps track of how many packets are held in
      netem and all of its children. As before, sch->limit remains as the
      maximum number of packets in the tfifo. The same applies to netem's
      backlog statistics.
      
      Fixes: 50612537 ("netem: fix classful handling")
      Signed-off-by: Martin Ottens <martin.ottens@fau.de>
      Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
      Link: https://patch.msgid.link/20241210131412.1837202-1-martin.ottens@fau.de
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • net/sched: cbs: Fix integer overflow in cbs_set_port_rate() · 005cf224
      Elena Salomatkina authored and Frieder Schrempf committed
      
      [ Upstream commit 397006ba5d918f9b74e734867e8fddbc36dc2282 ]
      
      The subsequent calculation of port_rate = speed * 1000 * BYTES_PER_KBIT,
      where BYTES_PER_KBIT is of type LL, may cause an overflow.
      At least when speed = SPEED_20000, the value of the expression assigned
      to port_rate will be greater than INT_MAX.
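
The arithmetic can be checked with a small sketch. The assumptions are that BYTES_PER_KBIT equals 1000LL / 8, that SPEED_20000 has the numeric value 20000 (speed in Mbit/s), and that int is 32 bits wide; the TOY_ names and functions are hypothetical.

```c
#include <limits.h>
#include <stdint.h>

/* Assumed to match the kernel constant: bytes per kilobit, as long long. */
#define TOY_BYTES_PER_KBIT (1000LL / 8)

/* Computes the rate in a 64-bit type so that the result is not
 * truncated when stored. */
static int64_t toy_port_rate(int speed_mbps)
{
	return speed_mbps * 1000 * TOY_BYTES_PER_KBIT;
}

/* Would the value survive being stored in a plain int? */
static int toy_fits_in_int(int64_t v)
{
	return v >= INT_MIN && v <= INT_MAX;
}
```

For a speed of 20000 the result is 2,500,000,000, which no longer fits in a 32-bit int, while 10000 gives 1,250,000,000 and still fits. Keeping the result in a 64-bit variable avoids the truncation.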
      
      Found by Linux Verification Center (linuxtesting.org) with SVACE.
      
      Signed-off-by: Elena Salomatkina <esalomatkina@ispras.ru>
      Link: https://patch.msgid.link/20241013124529.1043-1-esalomatkina@ispras.ru
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • net: sched: fix erspan_opt settings in cls_flower · 461b8616
      Xin Long authored and Frieder Schrempf committed
      
      [ Upstream commit 292207809486d99c78068d3f459cbbbffde88415 ]
      
      When matching erspan_opt in cls_flower, only the (version, dir, hwid)
      fields are relevant. However, in fl_set_erspan_opt() it initializes
      all bits of erspan_opt and its mask to 1. This inadvertently requires
      packets to match not only the (version, dir, hwid) fields but also the
      other fields that are unexpectedly set to 1.
      
      This patch resolves the issue by ensuring that only the (version, dir,
      hwid) fields are configured in fl_set_erspan_opt(), leaving the other
      fields to 0 in erspan_opt.
      
      Fixes: 79b1011c ("net: sched: allow flower to match erspan options")
      Reported-by: Shuang Li <shuali@redhat.com>
      Signed-off-by: Xin Long <lucien.xin@gmail.com>
      Reviewed-by: Cong Wang <cong.wang@bytedance.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • net/sched: tbf: correct backlog statistic for GSO packets · cb81e6af
      Martin Ottens authored and Frieder Schrempf committed
      
      [ Upstream commit 1596a135e3180c92e42dd1fbcad321f4fb3e3b17 ]
      
      When the length of a GSO packet in the tbf qdisc is larger than the
      configured burst size, the packet will be segmented by the tbf_segment function.
      Whenever this function is used to enqueue SKBs, the backlog statistic of
      the tbf is not increased correctly. This can lead to underflows of the
      'backlog' byte-statistic value when these packets are dequeued from tbf.
      
      Reproduce the bug:
      Ensure that the sender machine has GSO enabled. Configure the tbf on
      the outgoing interface of the machine as follows (burstsize = 1 MTU):
      $ tc qdisc add dev <oif> root handle 1: tbf rate 50Mbit burst 1514 latency 50ms
      
      Send bulk TCP traffic out via this interface, e.g., by running an iPerf3
      client on this machine. Check the qdisc statistics:
      $ tc -s qdisc show dev <oif>
      
      The 'backlog' byte-statistic has incorrect values while traffic is
      transferred, e.g., high values due to u32 underflows. When the transfer
      is stopped, the value is != 0, which should never happen.
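
The effect of under-counting on enqueue can be reproduced with plain unsigned arithmetic. This is a sketch with hypothetical names and byte counts, treating the backlog as a 32-bit unsigned value as the text above implies.

```c
#include <stdint.h>

/* Models a byte backlog that starts empty, is credited `added` bytes on
 * enqueue and debited `removed` bytes on dequeue. If fewer bytes were
 * credited than are later debited, the u32 wraps to a huge value. */
static uint32_t toy_backlog_after(uint32_t added, uint32_t removed)
{
	uint32_t backlog = 0;

	backlog += added;
	backlog -= removed;
	return backlog;
}
```

Crediting 1514 bytes but debiting 3028 leaves 4294965782 instead of a small number, which matches the kind of "high values" described above. Balanced accounting returns to 0.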
      
      This patch fixes this bug by updating the statistics correctly, even if
      single SKBs of a GSO SKB cannot be enqueued.
      
      Fixes: e43ac79a ("sch_tbf: segment too big GSO packets")
      Signed-off-by: Martin Ottens <martin.ottens@fau.de>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Link: https://patch.msgid.link/20241125174608.1484356-1-martin.ottens@fau.de
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • net: use unrcu_pointer() helper · e33bd588
      Eric Dumazet authored and Frieder Schrempf committed
      
      [ Upstream commit b4cb4a13 ]
      
      Toke pointed out that unrcu_pointer() exists, which allows us
      to remove some of the ugly casts we have when using
      xchg() for RCU-protected pointers.
      
      Also make inet_rcv_compat const.
      
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Toke Høiland-Jørgensen <toke@redhat.com>
      Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
      Link: https://lore.kernel.org/r/20240604111603.45871-1-edumazet@google.com
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Stable-dep-of: eb02688c5c45 ("ipv6: release nexthop on device removal")
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • net/sched: taprio: extend minimum interval restriction to entire cycle too · 883213b9
      Vladimir Oltean authored and Frieder Schrempf committed
      
      [ Upstream commit fb66df20 ]
      
      It is possible for syzbot to side-step the restriction imposed by the
      blamed commit in the Fixes: tag, because the taprio UAPI permits a
      cycle-time different from (and potentially shorter than) the sum of
      entry intervals.
      
      We need one more restriction, which is that the cycle time itself must
      be larger than N * ETH_ZLEN bit times, where N is the number of schedule
      entries. This restriction needs to apply regardless of whether the cycle
      time came from the user or was the implicit, auto-calculated value, so
      we move the existing "cycle == 0" check outside the "if (!new->cycle_time)"
      branch. This way, it covers both conditions and both scenarios.
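
The new restriction can be sketched numerically. The assumptions: ETH_ZLEN is 60 bytes, the link speed is given in Mbit/s and is non-zero, and "larger than" is taken strictly, so a cycle equal to the minimum is rejected here (the text does not say more about the boundary). All names are hypothetical.

```c
#include <errno.h>
#include <stdint.h>

#define TOY_ETH_ZLEN 60 /* minimum frame length in bytes, without FCS */

/* Time in nanoseconds needed to transmit num_entries minimum-sized
 * frames: bits * 1000 / (Mbit/s) gives nanoseconds. */
static int64_t toy_min_cycle_ns(unsigned int num_entries,
				unsigned int speed_mbps)
{
	return (int64_t)num_entries * TOY_ETH_ZLEN * 8 * 1000 / speed_mbps;
}

/* Applies to the cycle time whether it was supplied by the user or
 * computed as the sum of the entry intervals. */
static int toy_check_cycle(int64_t cycle_ns, unsigned int num_entries,
			   unsigned int speed_mbps)
{
	if (cycle_ns <= toy_min_cycle_ns(num_entries, speed_mbps))
		return -EINVAL;
	return 0;
}
```

For two entries on a 1000 Mbit/s link the threshold works out to 960 ns under these assumptions.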
      
      Add a selftest which illustrates the issue triggered by syzbot.
      
      Fixes: b5b73b26 ("taprio: Fix allowing too small intervals")
      Reported-by: <syzbot+a7d2b1d5d1af83035567@syzkaller.appspotmail.com>
      Closes: https://lore.kernel.org/netdev/0000000000007d66bc06196e7c66@google.com/
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Link: https://lore.kernel.org/r/20240527153955.553333-2-vladimir.oltean@nxp.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      Signed-off-by: Xiangyu Chen <xiangyu.chen@windriver.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      883213b9
    • Alexandre Ferrieux's avatar
      net: sched: cls_u32: Fix u32's systematic failure to free IDR entries for hnodes. · 0ada4d9e
      Alexandre Ferrieux authored and Frieder Schrempf committed
      
      [ Upstream commit 73af53d8 ]
      
      To generate hnode handles (in gen_new_htid()), u32 uses IDR and
      encodes the returned small integer into a structured 32-bit
      word. Unfortunately, at disposal time, the needed decoding
      is not done. As a result, idr_remove() fails, and the IDR
      fills up. Since its size is 2048, the following script ends up
      with "Filter already exists":
      
        tc filter add dev myve $FILTER1
        tc filter add dev myve $FILTER2
        for i in {1..2048}
        do
          echo $i
          tc filter del dev myve $FILTER2
          tc filter add dev myve $FILTER2
        done
      
      This patch adds the missing decoding logic for handles that
      deserve it.
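
      A minimal sketch of the encode/decode pair this message talks about.
      The bit layout used here (IDR id OR-ed with 0x800 and shifted into bits
      20..31) is an assumption made for illustration, and id_to_handle() and
      handle_to_id() are hypothetical names, not necessarily those used in
      cls_u32.

```c
#include <assert.h>
#include <stdint.h>

/* Encode a small IDR id into a structured 32-bit hnode handle.
 * Assumed layout: (id | 0x800) stored in bits 20..31. */
static uint32_t id_to_handle(uint32_t id)
{
	assert(id < 0x800U); /* the id must fit in 11 bits */
	return (id | 0x800U) << 20;
}

/* Decode a handle back into the id that must be passed to idr_remove().
 * Handles without the top bit set were never encoded and pass through. */
static uint32_t handle_to_id(uint32_t handle)
{
	return (handle & 0x80000000U) ? ((handle >> 20) & 0x7FFU) : handle;
}
```

      Without the decoding step, the removal is attempted with the encoded
      word rather than the small id, so the entry is never found and the IDR
      slowly fills up.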
      
      Fixes: e7614370 ("net_sched: use idr to allocate u32 filter handles")
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
      Signed-off-by: Alexandre Ferrieux <alexandre.ferrieux@orange.com>
      Tested-by: Victor Nogueira <victor@mojatatu.com>
      Link: https://patch.msgid.link/20241110172836.331319-1-alexandre.ferrieux@orange.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      0ada4d9e
    • Pedro Tammela's avatar
      net/sched: cls_u32: replace int refcounts with proper refcounts · c7ab7d7a
      Pedro Tammela authored and Frieder Schrempf committed
      
      [ Upstream commit 6b78debe ]
      
      Proper refcounts always emit a warning splat when something goes wrong,
      be it underflow, saturation or object resurrection. As these are always
      a source of bugs, use them in cls_u32 as a safeguard to prevent/catch issues.
      Another benefit is that the refcount API self documents the code, making
      clear when transitions to dead are expected.
      
      For such an update we had to make minor adaptations on u32 to fit the refcount
      API. First we set explicitly to '1' when objects are created, then the
      objects are alive until a 1 -> 0 happens, which is then released appropriately.
      
      The above made clear some redundant operations in the u32 code
      around the root_ht handling that were removed. The root_ht is created
      with a refcnt set to 1. Then when it's associated with tcf_proto it increments the refcnt to 2.
      Throughout the entire code the root_ht is an exceptional case and can never be referenced,
      therefore the refcnt is never incremented/decremented.
      Its lifetime is always bound to tcf_proto, meaning if you delete tcf_proto
      the root_ht is deleted as well. The code made up for the fact that root_ht refcnt is 2 and did
      a double decrement to free it, which is not a fit for the refcount API.
      
      Even though refcount_t is implemented using atomics, we should observe
      a negligible control plane impact.
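
      The lifecycle described above (set to 1 at creation, alive until the
      1 -> 0 transition, warnings on misuse) can be modelled with a toy,
      single-threaded counter. This is not the kernel's refcount_t: the
      toy_ref names are hypothetical, nothing here is atomic, and saturation
      is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

struct toy_ref {
	unsigned int count;
};

/* Objects start their life with an explicit count, typically 1. */
static void toy_ref_set(struct toy_ref *r, unsigned int n)
{
	assert(r != NULL);
	r->count = n;
}

/* Returns false and warns when asked to resurrect a dead (count == 0) object. */
static bool toy_ref_inc(struct toy_ref *r)
{
	if (r->count == 0) {
		fprintf(stderr, "warning: increment on dead object\n");
		return false;
	}
	r->count++;
	return true;
}

/* Returns true exactly when the 1 -> 0 transition happens; the caller then
 * releases the object. Warns instead of underflowing. */
static bool toy_ref_dec_and_test(struct toy_ref *r)
{
	if (r->count == 0) {
		fprintf(stderr, "warning: decrement would underflow\n");
		return false;
	}
	return --r->count == 0;
}
```

      A "double decrement to free" pattern like the one removed from the
      root_ht handling would trip the underflow warning in this model on any
      object whose count is really 1.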
      
      Signed-off-by: Pedro Tammela <pctammela@mojatatu.com>
      Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
      Link: https://lore.kernel.org/r/20231114141856.974326-2-pctammela@mojatatu.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      Stable-dep-of: 73af53d8 ("net: sched: cls_u32: Fix u32's systematic failure to free IDR entries for hnodes.")
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      c7ab7d7a
    • Dmitry Antipov's avatar
      net: sched: use RCU read-side critical section in taprio_dump() · 44285381
      Dmitry Antipov authored and Frieder Schrempf committed
      commit b22db8b8 upstream.
      
      Fix possible use-after-free in 'taprio_dump()' by adding RCU
      read-side critical section there. Never seen on x86 but
      found on a KASAN-enabled arm64 system when investigating
      https://syzkaller.appspot.com/bug?extid=b65e0af58423fc8a73aa:
      
      [T15862] BUG: KASAN: slab-use-after-free in taprio_dump+0xa0c/0xbb0
      [T15862] Read of size 4 at addr ffff0000d4bb88f8 by task repro/15862
      [T15862]
      [T15862] CPU: 0 UID: 0 PID: 15862 Comm: repro Not tainted 6.11.0-rc1-00293-gdefaf1a2113a-dirty #2
      [T15862] Hardware name: QEMU QEMU Virtual Machine, BIOS edk2-20240524-5.fc40 05/24/2024
      [T15862] Call trace:
      [T15862]  dump_backtrace+0x20c/0x220
      [T15862]  show_stack+0x2c/0x40
      [T15862]  dump_stack_lvl+0xf8/0x174
      [T15862]  print_report+0x170/0x4d8
      [T15862]  kasan_report+0xb8/0x1d4
      [T15862]  __asan_report_load4_noabort+0x20/0x2c
      [T15862]  taprio_dump+0xa0c/0xbb0
      [T15862]  tc_fill_qdisc+0x540/0x1020
      [T15862]  qdisc_notify.isra.0+0x330/0x3a0
      [T15862]  tc_modify_qdisc+0x7b8/0x1838
      [T15862]  rtnetlink_rcv_msg+0x3c8/0xc20
      [T15862]  netlink_rcv_skb+0x1f8/0x3d4
      [T15862]  rtnetlink_rcv+0x28/0x40
      [T15862]  netlink_unicast+0x51c/0x790
      [T15862]  netlink_sendmsg+0x79c/0xc20
      [T15862]  __sock_sendmsg+0xe0/0x1a0
      [T15862]  ____sys_sendmsg+0x6c0/0x840
      [T15862]  ___sys_sendmsg+0x1ac/0x1f0
      [T15862]  __sys_sendmsg+0x110/0x1d0
      [T15862]  __arm64_sys_sendmsg+0x74/0xb0
      [T15862]  invoke_syscall+0x88/0x2e0
      [T15862]  el0_svc_common.constprop.0+0xe4/0x2a0
      [T15862]  do_el0_svc+0x44/0x60
      [T15862]  el0_svc+0x50/0x184
      [T15862]  el0t_64_sync_handler+0x120/0x12c
      [T15862]  el0t_64_sync+0x190/0x194
      [T15862]
      [T15862] Allocated by task 15857:
      [T15862]  kasan_save_stack+0x3c/0x70
      [T15862]  kasan_save_track+0x20/0x3c
      [T15862]  kasan_save_alloc_info+0x40/0x60
      [T15862]  __kasan_kmalloc+0xd4/0xe0
      [T15862]  __kmalloc_cache_noprof+0x194/0x334
      [T15862]  taprio_change+0x45c/0x2fe0
      [T15862]  tc_modify_qdisc+0x6a8/0x1838
      [T15862]  rtnetlink_rcv_msg+0x3c8/0xc20
      [T15862]  netlink_rcv_skb+0x1f8/0x3d4
      [T15862]  rtnetlink_rcv+0x28/0x40
      [T15862]  netlink_unicast+0x51c/0x790
      [T15862]  netlink_sendmsg+0x79c/0xc20
      [T15862]  __sock_sendmsg+0xe0/0x1a0
      [T15862]  ____sys_sendmsg+0x6c0/0x840
      [T15862]  ___sys_sendmsg+0x1ac/0x1f0
      [T15862]  __sys_sendmsg+0x110/0x1d0
      [T15862]  __arm64_sys_sendmsg+0x74/0xb0
      [T15862]  invoke_syscall+0x88/0x2e0
      [T15862]  el0_svc_common.constprop.0+0xe4/0x2a0
      [T15862]  do_el0_svc+0x44/0x60
      [T15862]  el0_svc+0x50/0x184
      [T15862]  el0t_64_sync_handler+0x120/0x12c
      [T15862]  el0t_64_sync+0x190/0x194
      [T15862]
      [T15862] Freed by task 6192:
      [T15862]  kasan_save_stack+0x3c/0x70
      [T15862]  kasan_save_track+0x20/0x3c
      [T15862]  kasan_save_free_info+0x4c/0x80
      [T15862]  poison_slab_object+0x110/0x160
      [T15862]  __kasan_slab_free+0x3c/0x74
      [T15862]  kfree+0x134/0x3c0
      [T15862]  taprio_free_sched_cb+0x18c/0x220
      [T15862]  rcu_core+0x920/0x1b7c
      [T15862]  rcu_core_si+0x10/0x1c
      [T15862]  handle_softirqs+0x2e8/0xd64
      [T15862]  __do_softirq+0x14/0x20
      
      Fixes: 18cdd2f0 ("net/sched: taprio: taprio_dump and taprio_change are protected by rtnl_mutex")
      Acked-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
      Signed-off-by: Dmitry Antipov <dmantipov@yandex.ru>
      Link: https://patch.msgid.link/20241018051339.418890-2-dmantipov@yandex.ru
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      [Lee: Backported from linux-6.6.y to linux-6.1.y and fixed conflicts]
      Signed-off-by: Lee Jones <lee@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      44285381
    • Pedro Tammela's avatar
      net/sched: stop qdisc_tree_reduce_backlog on TC_H_ROOT · bff077a6
      Pedro Tammela authored and Frieder Schrempf committed
      
      [ Upstream commit 2e95c438 ]
      
      In qdisc_tree_reduce_backlog, Qdiscs with major handle ffff: are assumed
      to be either root or ingress. This assumption is bogus since it's valid
      to create egress qdiscs with major handle ffff:
      Budimir Markovic found that for qdiscs like DRR that maintain an active
      class list, it will cause a UAF with a dangling class pointer.
      
      In 066a3b5b, the concern was to avoid iterating over the ingress
      qdisc since its parent is itself. The proper fix is to stop when parent
      TC_H_ROOT is reached because the only way to retrieve ingress is when a
      hierarchy which does not contain a ffff: major handle calls into
      qdisc_lookup with TC_H_MAJ(TC_H_ROOT).
      
      In the scenario where major ffff: is an egress qdisc in any of the tree
      levels, the updates will also propagate to TC_H_ROOT, at which point the
      iteration must stop.
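
      The difference between the two stop conditions can be illustrated with
      a toy model of the upward walk. The table-based lookup and all toy_
      names are hypothetical; only TC_H_ROOT and TC_H_MAJ() mirror the UAPI
      definitions. The flag selects between the old rule (stop on any ffff:
      major) and the rule described above (stop only on parent TC_H_ROOT).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TC_H_ROOT   0xFFFFFFFFU
#define TC_H_MAJ(h) ((h) & 0xFFFF0000U)

struct toy_qdisc {
	uint32_t handle;   /* major:0 */
	uint32_t parent;   /* class id of the parent, or TC_H_ROOT */
	unsigned int qlen;
};

static struct toy_qdisc *toy_lookup(struct toy_qdisc *tab, size_t n,
				    uint32_t handle)
{
	for (size_t i = 0; i < n; i++)
		if (tab[i].handle == handle)
			return &tab[i];
	return NULL;
}

static bool toy_should_stop(uint32_t parentid, bool stop_on_ffff_major)
{
	if (stop_on_ffff_major)
		return TC_H_MAJ(parentid) == TC_H_MAJ(TC_H_ROOT);
	return parentid == TC_H_ROOT;
}

/* Walk from sch towards the root, reducing each ancestor's qlen by pkts.
 * Returns the number of ancestors that were updated. */
static unsigned int toy_reduce_backlog(struct toy_qdisc *tab, size_t n,
				       const struct toy_qdisc *sch,
				       unsigned int pkts,
				       bool stop_on_ffff_major)
{
	unsigned int updated = 0;
	uint32_t parentid;

	assert(sch != NULL);
	parentid = sch->parent;
	while (parentid != 0 && !toy_should_stop(parentid, stop_on_ffff_major)) {
		struct toy_qdisc *p = toy_lookup(tab, n, TC_H_MAJ(parentid));

		if (!p)
			break;
		p->qlen -= pkts;
		updated++;
		parentid = p->parent;
	}
	return updated;
}
```

      With an egress root qdisc at ffff: and a child attached to class
      ffff:1, the old rule stops before touching the root, so its qlen is
      never reduced; the new rule updates the root and then stops at
      TC_H_ROOT.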
      
      Fixes: 066a3b5b ("[NET_SCHED] sch_api: fix qdisc_tree_decrease_qlen() loop")
      Reported-by: Budimir Markovic <markovicbudimir@gmail.com>
      Suggested-by: Jamal Hadi Salim <jhs@mojatatu.com>
      Tested-by: Victor Nogueira <victor@mojatatu.com>
      Signed-off-by: Pedro Tammela <pctammela@mojatatu.com>
      Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
      
       net/sched/sch_api.c | 2 +-
       1 file changed, 1 insertion(+), 1 deletion(-)
      Reviewed-by: Simon Horman <horms@kernel.org>
      Link: https://patch.msgid.link/20241024165547.418570-1-jhs@mojatatu.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      bff077a6
    • Dmitry Antipov's avatar
      net: sched: fix use-after-free in taprio_change() · cccc9d59
      Dmitry Antipov authored and Frieder Schrempf committed
      
      [ Upstream commit f5044659 ]
      
      In 'taprio_change()', the 'admin' pointer may become dangling due to a sched
      switch / removal caused by 'advance_sched()', and the critical section
      protected by 'q->current_entry_lock' is too small to prevent such
      a scenario (which causes a use-after-free detected by KASAN). Fix this
      by preferring 'rcu_replace_pointer()' over 'rcu_assign_pointer()' to update
      'admin' immediately before an attempt to schedule freeing.
      
      Fixes: a3d43c0d ("taprio: Add support adding an admin schedule")
      Reported-by: <syzbot+b65e0af58423fc8a73aa@syzkaller.appspotmail.com>
      Closes: https://syzkaller.appspot.com/bug?extid=b65e0af58423fc8a73aa
      Acked-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
      Signed-off-by: Dmitry Antipov <dmantipov@yandex.ru>
      Link: https://patch.msgid.link/20241018051339.418890-1-dmantipov@yandex.ru
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      cccc9d59
    • Vladimir Oltean's avatar
      net/sched: act_api: deny mismatched skip_sw/skip_hw flags for actions created by classifiers · 965fa903
      Vladimir Oltean authored and Frieder Schrempf committed
      [ Upstream commit 34d35b4e ]
      
      tcf_action_init() has logic for checking mismatches between action and
      filter offload flags (skip_sw/skip_hw). AFAIU, this is intended to run
      on the transition between the new tc_act_bind(flags) returning true (aka
      now gets bound to classifier) and tc_act_bind(act->tcfa_flags) returning
      false (aka action was not bound to classifier before). Otherwise, the
      check is skipped.
      
      For the case where an action is not standalone, but rather it was
      created by a classifier and is bound to it, tcf_action_init() skips the
      check entirely, and this means it allows mismatched flags to occur.
      
      Taking the matchall classifier code path as an example (with mirred as
      an action), the reason is the following:
      
       1 | mall_change()
       2 | -> mall_replace_hw_filter()
       3 |   -> tcf_exts_validate_ex()
       4 |      -> flags |= TCA_ACT_FLAGS_BIND;
       5 |      -> tcf_action_init()
       6 |         -> tcf_action_init_1()
       7 |            -> a_o->init()
       8 |               -> tcf_mirred_init()
       9 |                  -> tcf_idr_create_from_flags()
      10 |                     -> tcf_idr_create()
      11 |                        -> p->tcfa_flags = flags;
      12 |         -> tc_act_bind(flags)
      13 |         -> tc_act_bind(act->tcfa_flags)
      
      When invoked from tcf_exts_validate_ex() like matchall does (but other
      classifiers validate their extensions as well), tcf_action_init() runs
      in a call path where "flags" always contains TCA_ACT_FLAGS_BIND (set by
      line 4). So line 12 is always true, and line 13 is always true as well.
      No transition ever takes place, and the check is skipped.
      
      The code was added in this form in commit c86e0209 ("flow_offload:
      validate flags of filter and actions"), but I'm attributing the blame
      even earlier in that series, to when TCA_ACT_FLAGS_SKIP_HW and
      TCA_ACT_FLAGS_SKIP_SW were added to the UAPI.
      
      Following the development process of this change, the check did not
      always exist in this form. A change took place between v3 [1] and v4 [2],
      AFAIU due to review feedback that it doesn't make sense for action flags
      to be different than classifier flags. I think I agree with that
      feedback, but it was translated into code that omits enforcing this for
      "classic" actions created at the same time with the filters themselves.
      
      There are 3 more important cases to discuss. First there is this command:
      
      $ tc qdisc add dev eth0 clsact
      $ tc filter add dev eth0 ingress matchall skip_sw \
      	action mirred ingress mirror dev eth1
      
      which should be allowed, because prior to the concept of dedicated
      action flags, it used to work and it used to mean the action inherited
      the skip_sw/skip_hw flags from the classifier. It's not a mismatch.
      
      Then we have this command:
      
      $ tc qdisc add dev eth0 clsact
      $ tc filter add dev eth0 ingress matchall skip_sw \
      	action mirred ingress mirror dev eth1 skip_hw
      
      where there is a mismatch and it should be rejected.
      
      Finally, we have:
      
      $ tc qdisc add dev eth0 clsact
      $ tc filter add dev eth0 ingress matchall skip_sw \
      	action mirred ingress mirror dev eth1 skip_sw
      
      where the offload flags coincide, and this should be treated the same as
      the first command based on inheritance, and accepted.
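
      The three cases above boil down to one rule: an action created by a
      classifier may not request a skip flag its filter lacks, and otherwise
      inherits the filter's flags. A rough sketch of that rule follows. The
      TOY_ flag values and toy_act_flags_ok() are hypothetical and do not
      correspond to the kernel's flag bits or to the actual code in
      tcf_action_init().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flag bits for this sketch only. */
#define TOY_SKIP_HW 0x1U
#define TOY_SKIP_SW 0x2U
#define TOY_SKIP_MASK (TOY_SKIP_HW | TOY_SKIP_SW)

/* Returns false on a mismatch. On success, *effective holds the flags the
 * action ends up with after inheriting from the filter. */
static bool toy_act_flags_ok(unsigned int filter_flags,
			     unsigned int act_flags,
			     unsigned int *effective)
{
	assert(effective != NULL);

	/* Mismatch: the action asks for a skip flag the filter does not have. */
	if (act_flags & ~filter_flags & TOY_SKIP_MASK)
		return false;

	/* No flags or coinciding flags: inherit from the classifier. */
	*effective = act_flags | (filter_flags & TOY_SKIP_MASK);
	return true;
}
```

      Under this rule the first and third commands are accepted with the
      action ending up skip_sw, and the second command is rejected.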
      
      [1]: https://lore.kernel.org/netdev/20211028110646.13791-9-simon.horman@corigine.com/
      [2]: https://lore.kernel.org/netdev/20211118130805.23897-10-simon.horman@corigine.com/
      
      
      Fixes: 7adc5765 ("flow_offload: add skip_hw and skip_sw to control if offload the action")
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: Simon Horman <horms@kernel.org>
      Reviewed-by: Ido Schimmel <idosch@nvidia.com>
      Tested-by: Ido Schimmel <idosch@nvidia.com>
      Link: https://patch.msgid.link/20241017161049.3570037-1-vladimir.oltean@nxp.com
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      965fa903
    • Eric Dumazet's avatar
      net: fix races in netdev_tx_sent_queue()/dev_watchdog() · 59d342c3
      Eric Dumazet authored and Frieder Schrempf committed
      
      [ Upstream commit 95ecba62 ]
      
      Some workloads hit the infamous dev_watchdog() message:
      
      "NETDEV WATCHDOG: eth0 (xxxx): transmit queue XX timed out"
      
      It seems possible to hit this even for perfectly normal
      BQL enabled drivers:
      
      1) Assume a TX queue was idle for more than dev->watchdog_timeo
         (5 seconds unless changed by the driver)
      
      2) Assume a big packet is sent, exceeding current BQL limit.
      
      3) Driver ndo_start_xmit() puts the packet in TX ring,
         and netdev_tx_sent_queue() is called.
      
      4) QUEUE_STATE_STACK_XOFF could be set from netdev_tx_sent_queue()
         before txq->trans_start has been written.
      
      5) txq->trans_start is written later, from netdev_start_xmit()
      
          if (rc == NETDEV_TX_OK)
                  txq_trans_update(txq);
      
      dev_watchdog() running on another cpu could read the old
      txq->trans_start, and then see QUEUE_STATE_STACK_XOFF, because 5)
      did not happen yet.
      
      To solve the issue, write txq->trans_start right before one XOFF bit
      is set:
      
      - __QUEUE_STATE_DRV_XOFF from netif_tx_stop_queue()
      - __QUEUE_STATE_STACK_XOFF from netdev_tx_sent_queue()
      
      From dev_watchdog(), we have to read txq->state before txq->trans_start.
      
      Add memory barriers to enforce correct ordering.
      
      In the future, we could avoid writing over txq->trans_start for normal
      operations, and rename this field to txq->xoff_start_time.
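
      The ordering requirement can be sketched in userspace with C11 atomics.
      This substitutes release/acquire ordering for the kernel's memory
      barriers and is not the kernel implementation; struct toy_txq, TOY_XOFF
      and the toy_ functions are hypothetical names used only here.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_XOFF 0x1U

struct toy_txq {
	atomic_ulong trans_start; /* time of the last update, in ticks */
	atomic_uint state;        /* TOY_XOFF is set while the queue is stopped */
};

static void toy_txq_init(struct toy_txq *q)
{
	assert(q != NULL);
	atomic_init(&q->trans_start, 0UL);
	atomic_init(&q->state, 0U);
}

/* Writer side: publish the timestamp first, then set the XOFF bit with
 * release semantics so the timestamp is visible to whoever sees the bit. */
static void toy_stop_queue(struct toy_txq *q, unsigned long now)
{
	atomic_store_explicit(&q->trans_start, now, memory_order_relaxed);
	atomic_fetch_or_explicit(&q->state, TOY_XOFF, memory_order_release);
}

/* Watchdog side: read the state first (acquire), then the timestamp. */
static bool toy_timed_out(struct toy_txq *q, unsigned long now,
			  unsigned long timeo)
{
	unsigned int st = atomic_load_explicit(&q->state, memory_order_acquire);
	unsigned long start;

	if (!(st & TOY_XOFF))
		return false;
	start = atomic_load_explicit(&q->trans_start, memory_order_relaxed);
	return now - start > timeo; /* unsigned math tolerates counter wrap */
}
```

      If the XOFF bit could be observed together with a stale timestamp from
      a long-idle queue, toy_timed_out() would report a timeout right after
      the queue was stopped, which is the false positive the commit
      describes.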
      
      Fixes: bec251bc ("net: no longer stop all TX queues in dev_watchdog()")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reviewed-by: Willem de Bruijn <willemb@google.com>
      Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
      Link: https://patch.msgid.link/20241015194118.3951657-1-edumazet@google.com
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      59d342c3
    • Praveen Kumar Kannoju's avatar
      net/sched: adjust device watchdog timer to detect stopped queue at right time · 984ac9aa
      Praveen Kumar Kannoju authored and Frieder Schrempf committed
      
      [ Upstream commit 33fb988b ]
      
      Applications are sensitive to long network latency, particularly
      heartbeat monitoring ones. The longer the tx timeout recovery takes, the
      higher the risk for such applications on production machines. This patch
      remedies that while still honoring the device-set tx timeout.
      
      Modify the next watchdog timeout to be shorter than the device-specified
      one. Compute the next timeout as the device watchdog timeout less the
      time elapsed since the queue was stopped. At the next watchdog timeout,
      the tx timeout handler is called if the queue is still stopped. Whether
      or not it is called, restore the watchdog timeout to the device-specified
      value.
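
      The timeout computation described above amounts to simple arithmetic on
      tick counters. A minimal sketch, with toy_next_timeout() as a
      hypothetical helper; returning 0 for an already overdue queue is an
      assumption of this sketch, not something the message states.

```c
#include <assert.h>

/* Ticks to wait before the watchdog should look at the queue again:
 * the device timeout less the time the queue has already been stopped. */
static unsigned long toy_next_timeout(unsigned long now,
				      unsigned long stopped_at,
				      unsigned long watchdog_timeo)
{
	/* Unsigned subtraction stays correct across counter wrap-around. */
	unsigned long elapsed = now - stopped_at;

	assert(watchdog_timeo > 0);
	if (elapsed >= watchdog_timeo)
		return 0; /* assumption: overdue queues are handled right away */
	return watchdog_timeo - elapsed;
}
```

      For a 500-tick device timeout and a queue stopped 200 ticks ago, the
      next check is due in 300 ticks rather than a full 500.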
      
      Signed-off-by: Praveen Kumar Kannoju <praveen.kannoju@oracle.com>
      Link: https://lore.kernel.org/r/20240508133617.4424-1-praveen.kannoju@oracle.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      Stable-dep-of: 95ecba62 ("net: fix races in netdev_tx_sent_queue()/dev_watchdog()")
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      984ac9aa
    • Eric Dumazet's avatar
      net/sched: accept TCA_STAB only for root qdisc · 5bc26477
      Eric Dumazet authored and Frieder Schrempf committed
      
      [ Upstream commit 3cb7cf15 ]
      
      Most qdiscs maintain their backlog using qdisc_pkt_len(skb)
      on the assumption it is invariant between the enqueue()
      and dequeue() handlers.
      
      Unfortunately syzbot can crash a host rather easily using
      a TBF + SFQ combination, with a STAB on SFQ [1].
      
      We can't support TCA_STAB at an arbitrary level; this would
      require maintaining per-qdisc storage.
      
      [1]
      [   88.796496] BUG: kernel NULL pointer dereference, address: 0000000000000000
      [   88.798611] #PF: supervisor read access in kernel mode
      [   88.799014] #PF: error_code(0x0000) - not-present page
      [   88.799506] PGD 0 P4D 0
      [   88.799829] Oops: Oops: 0000 [#1] SMP NOPTI
      [   88.800569] CPU: 14 UID: 0 PID: 2053 Comm: b371744477 Not tainted 6.12.0-rc1-virtme #1117
      [   88.801107] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
      [   88.801779] RIP: 0010:sfq_dequeue (net/sched/sch_sfq.c:272 net/sched/sch_sfq.c:499) sch_sfq
      [ 88.802544] Code: 0f b7 50 12 48 8d 04 d5 00 00 00 00 48 89 d6 48 29 d0 48 8b 91 c0 01 00 00 48 c1 e0 03 48 01 c2 66 83 7a 1a 00 7e c0 48 8b 3a <4c> 8b 07 4c 89 02 49 89 50 08 48 c7 47 08 00 00 00 00 48 c7 07 00
      All code
      ========
         0:	0f b7 50 12          	movzwl 0x12(%rax),%edx
         4:	48 8d 04 d5 00 00 00 	lea    0x0(,%rdx,8),%rax
         b:	00
         c:	48 89 d6             	mov    %rdx,%rsi
         f:	48 29 d0             	sub    %rdx,%rax
        12:	48 8b 91 c0 01 00 00 	mov    0x1c0(%rcx),%rdx
        19:	48 c1 e0 03          	shl    $0x3,%rax
        1d:	48 01 c2             	add    %rax,%rdx
        20:	66 83 7a 1a 00       	cmpw   $0x0,0x1a(%rdx)
        25:	7e c0                	jle    0xffffffffffffffe7
        27:	48 8b 3a             	mov    (%rdx),%rdi
        2a:*	4c 8b 07             	mov    (%rdi),%r8		<-- trapping instruction
        2d:	4c 89 02             	mov    %r8,(%rdx)
        30:	49 89 50 08          	mov    %rdx,0x8(%r8)
        34:	48 c7 47 08 00 00 00 	movq   $0x0,0x8(%rdi)
        3b:	00
        3c:	48                   	rex.W
        3d:	c7                   	.byte 0xc7
        3e:	07                   	(bad)
      	...
      
      Code starting with the faulting instruction
      ===========================================
         0:	4c 8b 07             	mov    (%rdi),%r8
         3:	4c 89 02             	mov    %r8,(%rdx)
         6:	49 89 50 08          	mov    %rdx,0x8(%r8)
         a:	48 c7 47 08 00 00 00 	movq   $0x0,0x8(%rdi)
        11:	00
        12:	48                   	rex.W
        13:	c7                   	.byte 0xc7
        14:	07                   	(bad)
      	...
      [   88.803721] RSP: 0018:ffff9a1f892b7d58 EFLAGS: 00000206
      [   88.804032] RAX: 0000000000000000 RBX: ffff9a1f8420c800 RCX: ffff9a1f8420c800
      [   88.804560] RDX: ffff9a1f81bc1440 RSI: 0000000000000000 RDI: 0000000000000000
      [   88.805056] RBP: ffffffffc04bb0e0 R08: 0000000000000001 R09: 00000000ff7f9a1f
      [   88.805473] R10: 000000000001001b R11: 0000000000009a1f R12: 0000000000000140
      [   88.806194] R13: 0000000000000001 R14: ffff9a1f886df400 R15: ffff9a1f886df4ac
      [   88.806734] FS:  00007f445601a740(0000) GS:ffff9a2e7fd80000(0000) knlGS:0000000000000000
      [   88.807225] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [   88.807672] CR2: 0000000000000000 CR3: 000000050cc46000 CR4: 00000000000006f0
      [   88.808165] Call Trace:
      [   88.808459]  <TASK>
      [   88.808710] ? __die (arch/x86/kernel/dumpstack.c:421 arch/x86/kernel/dumpstack.c:434)
      [   88.809261] ? page_fault_oops (arch/x86/mm/fault.c:715)
      [   88.809561] ? exc_page_fault (./arch/x86/include/asm/irqflags.h:26 ./arch/x86/include/asm/irqflags.h:87 ./arch/x86/include/asm/irqflags.h:147 arch/x86/mm/fault.c:1489 arch/x86/mm/fault.c:1539)
      [   88.809806] ? asm_exc_page_fault (./arch/x86/include/asm/idtentry.h:623)
      [   88.810074] ? sfq_dequeue (net/sched/sch_sfq.c:272 net/sched/sch_sfq.c:499) sch_sfq
      [   88.810411] sfq_reset (net/sched/sch_sfq.c:525) sch_sfq
      [   88.810671] qdisc_reset (./include/linux/skbuff.h:2135 ./include/linux/skbuff.h:2441 ./include/linux/skbuff.h:3304 ./include/linux/skbuff.h:3310 net/sched/sch_generic.c:1036)
      [   88.810950] tbf_reset (./include/linux/timekeeping.h:169 net/sched/sch_tbf.c:334) sch_tbf
      [   88.811208] qdisc_reset (./include/linux/skbuff.h:2135 ./include/linux/skbuff.h:2441 ./include/linux/skbuff.h:3304 ./include/linux/skbuff.h:3310 net/sched/sch_generic.c:1036)
      [   88.811484] netif_set_real_num_tx_queues (./include/linux/spinlock.h:396 ./include/net/sch_generic.h:768 net/core/dev.c:2958)
      [   88.811870] __tun_detach (drivers/net/tun.c:590 drivers/net/tun.c:673)
      [   88.812271] tun_chr_close (drivers/net/tun.c:702 drivers/net/tun.c:3517)
      [   88.812505] __fput (fs/file_table.c:432 (discriminator 1))
      [   88.812735] task_work_run (kernel/task_work.c:230)
      [   88.813016] do_exit (kernel/exit.c:940)
      [   88.813372] ? trace_hardirqs_on (kernel/trace/trace_preemptirq.c:58 (discriminator 4))
      [   88.813639] ? handle_mm_fault (./arch/x86/include/asm/irqflags.h:42 ./arch/x86/include/asm/irqflags.h:97 ./arch/x86/include/asm/irqflags.h:155 ./include/linux/memcontrol.h:1022 ./include/linux/memcontrol.h:1045 ./include/linux/memcontrol.h:1052 mm/memory.c:5928 mm/memory.c:6088)
      [   88.813867] do_group_exit (kernel/exit.c:1070)
      [   88.814138] __x64_sys_exit_group (kernel/exit.c:1099)
      [   88.814490] x64_sys_call (??:?)
      [   88.814791] do_syscall_64 (arch/x86/entry/common.c:52 (discriminator 1) arch/x86/entry/common.c:83 (discriminator 1))
      [   88.815012] entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:130)
      [   88.815495] RIP: 0033:0x7f44560f1975
      
      Fixes: 175f9c1b ("net_sched: Add size table for qdiscs")
      Reported-by: syzbot <syzkaller@googlegroups.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Daniel Borkmann <daniel@iogearbox.net>
      Link: https://patch.msgid.link/20241007184130.3960565-1-edumazet@google.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      5bc26477
    • Dmitry Antipov's avatar
      net: sched: consistently use rcu_replace_pointer() in taprio_change() · 30f1fb19
      Dmitry Antipov authored and Frieder Schrempf committed
      [ Upstream commit d5c45460 ]
      
      According to Vinicius (and after carefully looking through the whole of
      https://syzkaller.appspot.com/bug?extid=b65e0af58423fc8a73aa
      once again), the txtime branch of 'taprio_change()' is not going to
      race against 'advance_sched()'. But using 'rcu_replace_pointer()'
      in the former may be a good idea as well.
      
      Suggested-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
      Signed-off-by: Dmitry Antipov <dmantipov@yandex.ru>
      Acked-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      30f1fb19
    • Toke Høiland-Jørgensen's avatar
      sched: sch_cake: fix bulk flow accounting logic for host fairness · 442bf6f7
      Toke Høiland-Jørgensen authored and Frieder Schrempf committed
      
      commit 546ea84d upstream.
      
      In sch_cake, we keep track of the count of active bulk flows per host,
      when running in dst/src host fairness mode, which is used as the
      round-robin weight when iterating through flows. The count of active
      bulk flows is updated whenever a flow changes state.
      
      This has a peculiar interaction with the hash collision handling: when a
      hash collision occurs (after the set-associative hashing), the state of
      the hash bucket is simply updated to match the new packet that collided,
      and if host fairness is enabled, that also means assigning new per-host
      state to the flow. For this reason, the bulk flow counters of the
      host(s) assigned to the flow are decremented, before new state is
      assigned (and the counters, which may not belong to the same host
      anymore, are incremented again).
      
      Back when this code was introduced, the host fairness mode was always
      enabled, so the decrement was unconditional. When the configuration
      flags were introduced the *increment* was made conditional, but
      the *decrement* was not, which of course can lead to a spurious
      decrement (and associated wrap-around to U16_MAX).
      
      AFAICT, when host fairness is disabled, the decrement and wrap-around
      happens as soon as a hash collision occurs (which is not that common in
      itself, due to the set-associative hashing). However, in most cases this
      is harmless, as the value is only used when host fairness mode is
      enabled. So in order to trigger an array overflow, sch_cake has to first
      be configured with host fairness disabled, and while running in this
      mode, a hash collision has to occur to cause the overflow. Then, the
      qdisc has to be reconfigured to enable host fairness, which leads to the
      array out-of-bounds because the wrapped-around value is retained and
      used as an array index. It seems that syzbot managed to trigger this,
      which is quite impressive in its own right.
      
      This patch fixes the issue by introducing the same conditional check on
      decrement as is used on increment.
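
      The wrap-around is easy to reproduce with a 16-bit counter. The sketch
      below is a toy model of the accounting only, with hypothetical names
      (struct toy_host and the toy_ functions) and a plain boolean standing
      in for the host fairness mode flags.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct toy_host {
	uint16_t bulk_flow_count;
};

/* Increment is conditional on host fairness being enabled. */
static void toy_inc(struct toy_host *h, bool host_fairness)
{
	assert(h != NULL);
	if (host_fairness)
		h->bulk_flow_count++;
}

/* Buggy shape: decrement even when the matching increment never happened.
 * Decrementing a uint16_t that holds 0 wraps to 65535. */
static void toy_dec_unconditional(struct toy_host *h)
{
	assert(h != NULL);
	h->bulk_flow_count--;
}

/* Fixed shape: the same condition guards both directions. */
static void toy_dec(struct toy_host *h, bool host_fairness)
{
	assert(h != NULL);
	if (host_fairness)
		h->bulk_flow_count--;
}
```

      With host fairness off, toy_inc() followed by toy_dec_unconditional()
      leaves the counter at 65535, and that stale value is what later gets
      used as an index once host fairness is switched on.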
      
      The original bug predates the upstreaming of cake, but the commit listed
      in the Fixes tag touched that code, meaning that this patch won't apply
      before that.
      
      Fixes: 71263992 ("sch_cake: Make the dual modes fairer")
      Reported-by: <syzbot+7fe7b81d602cc1e6b94d@syzkaller.appspotmail.com>
      Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
      Link: https://patch.msgid.link/20240903160846.20909-1-toke@redhat.com
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      442bf6f7
    • Stephen Hemminger's avatar
      sch/netem: fix use after free in netem_dequeue · 0dba6f65
      Stephen Hemminger authored and Frieder Schrempf committed
      
      commit 3b3a2a9c upstream.
      
      If netem_dequeue() enqueues a packet to the inner qdisc and that qdisc
      returns __NET_XMIT_STOLEN, the packet is dropped but
      qdisc_tree_reduce_backlog() is not called to update the parent's
      q.qlen, leading to a use-after-free similar to the one fixed by commit
      e04991a48dbaf382 ("netem: fix return value if duplicate enqueue
      fails").
      
      Commands to trigger KASAN UaF:
      
      ip link add type dummy
      ip link set lo up
      ip link set dummy0 up
      tc qdisc add dev lo parent root handle 1: drr
      tc filter add dev lo parent 1: basic classid 1:1
      tc class add dev lo classid 1:1 drr
      tc qdisc add dev lo parent 1:1 handle 2: netem
      tc qdisc add dev lo parent 2: handle 3: drr
      tc filter add dev lo parent 3: basic classid 3:1 action mirred egress \
      	redirect dev dummy0
      tc class add dev lo classid 3:1 drr
      ping -c1 -W0.01 localhost # Trigger bug
      tc class del dev lo classid 1:1
      tc class add dev lo classid 1:1 drr
      ping -c1 -W0.01 localhost # UaF
      
      Fixes: 50612537 ("netem: fix classful handling")
      Reported-by: Budimir Markovic <markovicbudimir@gmail.com>
      Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
      Link: https://patch.msgid.link/20240901182438.4992-1-stephen@networkplumber.org
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      0dba6f65
  4. Sep 17, 2024
    • netem: fix return value if duplicate enqueue fails · 4c65bc58
      Stephen Hemminger authored
      
      [ Upstream commit c07ff859 ]
      
      There is a bug in netem_enqueue() introduced by
      commit 5845f706 ("net: netem: fix skb length BUG_ON in __skb_to_sgvec")
      that can lead to a use-after-free.
      
      This commit made netem_enqueue() always return NET_XMIT_SUCCESS
      when a packet is duplicated, which can cause the parent qdisc's q.qlen
      to be mistakenly incremented. When this happens qlen_notify() may be
      skipped on the parent during destruction, leaving a dangling pointer
      for some classful qdiscs like DRR.
      
      There are two ways for the bug to happen:
      
      - If the duplicated packet is dropped by rootq->enqueue() and then
        the original packet is also dropped.
      - If rootq->enqueue() sends the duplicated packet to a different qdisc
        and the original packet is dropped.
      
      In both cases NET_XMIT_SUCCESS is returned even though no packets
      are enqueued at the netem qdisc.
      
      The fix is to defer the enqueue of the duplicate packet until after
      the original packet has been guaranteed to return NET_XMIT_SUCCESS.
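
      The ordering described above can be sketched with a toy userspace model.
      This is only an illustration under simplified assumptions, not the
      netem code: toy_queue, toy_add() and toy_enqueue() are hypothetical, and
      XMIT_SUCCESS / XMIT_DROP stand in for the kernel's NET_XMIT_* values.

```c
#include <assert.h>
#include <stdbool.h>

enum { XMIT_SUCCESS = 0, XMIT_DROP = 1 }; /* stand-ins for NET_XMIT_* */

/* Toy queue that only tracks how many packets it holds. */
struct toy_queue {
    int len;
    int limit;
};

/* Accept one packet unless the queue is already at its limit. */
static int toy_add(struct toy_queue *q)
{
    if (q->len >= q->limit)
        return XMIT_DROP;
    q->len++;
    return XMIT_SUCCESS;
}

/*
 * Queue the original packet first. The duplicate is handed to the root
 * queue only after the original is known to be queued, so the return
 * value always matches what this queue actually holds.
 */
static int toy_enqueue(struct toy_queue *self, struct toy_queue *root,
                       bool duplicate)
{
    int ret;

    assert(self && root);
    ret = toy_add(self);
    if (ret != XMIT_SUCCESS)
        return ret; /* nothing queued here, so report the drop */
    if (duplicate)
        (void)toy_add(root); /* the duplicate's fate does not change ret */
    return XMIT_SUCCESS;
}
```

      In this model a caller that sees XMIT_SUCCESS can safely account one
      packet to this queue, and a dropped original never leaves a duplicate
      behind in the root queue.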
      
      Fixes: 5845f706 ("net: netem: fix skb length BUG_ON in __skb_to_sgvec")
      Reported-by: Budimir Markovic <markovicbudimir@gmail.com>
      Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
      Reviewed-by: Simon Horman <horms@kernel.org>
      Link: https://patch.msgid.link/20240819175753.5151-1-stephen@networkplumber.org
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      4c65bc58
    • net: don't dump stack on queue timeout · 4261ae16
      Jakub Kicinski authored
      
      [ Upstream commit e316dd1c ]
      
      The top syzbot report for networking (#14 for the entire kernel)
      is the queue timeout splat. We kept it around for a long time,
      because in real life it provides pretty strong signal that
      something is wrong with the driver or the device.
      
      Removing it is also likely to break monitoring for those who
      track it as a kernel warning.
      
      Nevertheless, WARN()ings are best suited for catching kernel
      programming bugs. If a Tx queue gets starved due to a pause
      storm, priority configuration, or other weirdness - that's
      obviously a problem, but not a problem we can fix at
      the kernel level.
      
      Bite the bullet and convert the WARN() to a print.
      
      Before:
      
        NETDEV WATCHDOG: eni1np1 (netdevsim): transmit queue 0 timed out 1975 ms
        WARNING: CPU: 0 PID: 0 at net/sched/sch_generic.c:525 dev_watchdog+0x39e/0x3b0
        [... completely pointless stack trace of a timer follows ...]
      
      Now:
      
        netdevsim netdevsim1 eni1np1: NETDEV WATCHDOG: CPU: 0: transmit queue 0 timed out 1769 ms
      
      Alternatively we could mark the drivers which syzbot has
      learned to abuse as "print-instead-of-WARN" selectively.
      
      Reported-by: <syzbot+d55372214aff0faa1f1f@syzkaller.appspotmail.com>
      Reviewed-by: Jiri Pirko <jiri@nvidia.com>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Reviewed-by: Jamal Hadi Salim <jhs@mojatatu.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      4261ae16
    • net: sched: Print msecs when transmit queue time out · f28db58b
      Yajun Deng authored
      
      [ Upstream commit 2f0f9465 ]
      
      The kernel will print several warnings in a short period of time
      when it stalls. Like this:
      
      First warning:
      [ 7100.097547] ------------[ cut here ]------------
      [ 7100.097550] NETDEV WATCHDOG: eno2 (xxx): transmit queue 8 timed out
      [ 7100.097571] WARNING: CPU: 8 PID: 0 at net/sched/sch_generic.c:467
                             dev_watchdog+0x260/0x270
      ...
      
      Second warning:
      [ 7147.756952] rcu: INFO: rcu_preempt self-detected stall on CPU
      [ 7147.756958] rcu:   24-....: (59999 ticks this GP) idle=546/1/0x400000000000000
                            softirq=367      3137/3673146 fqs=13844
      [ 7147.756960]        (t=60001 jiffies g=4322709 q=133381)
      [ 7147.756962] NMI backtrace for cpu 24
      ...
      
      Working back from watchdog_timeo, the transmit queue stall should have
      started before 7095s, while the rcu stall started at 7087s. These two
      times are close together, so it is difficult to confirm which happened
      first.
      
      To let users know the exact time the stall started, print msecs when
      the transmit queue times out.
      
      Signed-off-by: Yajun Deng <yajun.deng@linux.dev>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Stable-dep-of: e316dd1c ("net: don't dump stack on queue timeout")
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      f28db58b
    • sched: act_ct: take care of padding in struct zones_ht_key · ee992f25
      Eric Dumazet authored
      
      [ Upstream commit 2191a54f ]
      
      Blamed commit increased lookup key size from 2 bytes to 16 bytes,
      because zones_ht_key got a struct net pointer.
      
      Make sure rhashtable_lookup() is not using the padding bytes
      which are not initialized.
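
      One way to keep uninitialized padding out of a lookup is to limit the
      hashed and compared length to the end of the last real field. The
      userspace sketch below illustrates that idea only; it is not a quote of
      the patch, and zone_key, zone_key_len and zone_key_equal() are
      hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical key: a pointer followed by a 16-bit id leaves tail padding. */
struct zone_key {
    const void *net;
    uint16_t zone;
    uint8_t pad[]; /* marks where the meaningful bytes end; must stay last */
};

/* Number of leading bytes that actually identify an entry. */
static const size_t zone_key_len = offsetof(struct zone_key, pad);

static_assert(offsetof(struct zone_key, pad) <= sizeof(struct zone_key),
              "key length must fit inside the struct");

/* Compare only the meaningful prefix, never the compiler-inserted padding. */
static bool zone_key_equal(const struct zone_key *a, const struct zone_key *b)
{
    return memcmp(a, b, zone_key_len) == 0;
}
```

      With this layout, two keys that carry the same net and zone compare
      equal even if the bytes after zone differ. Zeroing the whole key before
      filling it in would be an alternative with the same effect.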
      
       BUG: KMSAN: uninit-value in rht_ptr_rcu include/linux/rhashtable.h:376 [inline]
       BUG: KMSAN: uninit-value in __rhashtable_lookup include/linux/rhashtable.h:607 [inline]
       BUG: KMSAN: uninit-value in rhashtable_lookup include/linux/rhashtable.h:646 [inline]
       BUG: KMSAN: uninit-value in rhashtable_lookup_fast include/linux/rhashtable.h:672 [inline]
       BUG: KMSAN: uninit-value in tcf_ct_flow_table_get+0x611/0x2260 net/sched/act_ct.c:329
        rht_ptr_rcu include/linux/rhashtable.h:376 [inline]
        __rhashtable_lookup include/linux/rhashtable.h:607 [inline]
        rhashtable_lookup include/linux/rhashtable.h:646 [inline]
        rhashtable_lookup_fast include/linux/rhashtable.h:672 [inline]
        tcf_ct_flow_table_get+0x611/0x2260 net/sched/act_ct.c:329
        tcf_ct_init+0xa67/0x2890 net/sched/act_ct.c:1408
        tcf_action_init_1+0x6cc/0xb30 net/sched/act_api.c:1425
        tcf_action_init+0x458/0xf00 net/sched/act_api.c:1488
        tcf_action_add net/sched/act_api.c:2061 [inline]
        tc_ctl_action+0x4be/0x19d0 net/sched/act_api.c:2118
        rtnetlink_rcv_msg+0x12fc/0x1410 net/core/rtnetlink.c:6647
        netlink_rcv_skb+0x375/0x650 net/netlink/af_netlink.c:2550
        rtnetlink_rcv+0x34/0x40 net/core/rtnetlink.c:6665
        netlink_unicast_kernel net/netlink/af_netlink.c:1331 [inline]
        netlink_unicast+0xf52/0x1260 net/netlink/af_netlink.c:1357
        netlink_sendmsg+0x10da/0x11e0 net/netlink/af_netlink.c:1901
        sock_sendmsg_nosec net/socket.c:730 [inline]
        __sock_sendmsg+0x30f/0x380 net/socket.c:745
        ____sys_sendmsg+0x877/0xb60 net/socket.c:2597
        ___sys_sendmsg+0x28d/0x3c0 net/socket.c:2651
        __sys_sendmsg net/socket.c:2680 [inline]
        __do_sys_sendmsg net/socket.c:2689 [inline]
        __se_sys_sendmsg net/socket.c:2687 [inline]
        __x64_sys_sendmsg+0x307/0x4a0 net/socket.c:2687
        x64_sys_call+0x2dd6/0x3c10 arch/x86/include/generated/asm/syscalls_64.h:47
        do_syscall_x64 arch/x86/entry/common.c:52 [inline]
        do_syscall_64+0xcd/0x1e0 arch/x86/entry/common.c:83
       entry_SYSCALL_64_after_hwframe+0x77/0x7f
      
      Local variable key created at:
        tcf_ct_flow_table_get+0x4a/0x2260 net/sched/act_ct.c:324
        tcf_ct_init+0xa67/0x2890 net/sched/act_ct.c:1408
      
      Fixes: 88c67aeb ("sched: act_ct: add netns into the key of tcf_ct_flow_table")
      Reported-by: <syzbot+1b5e4e187cc586d05ea0@syzkaller.appspotmail.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Xin Long <lucien.xin@gmail.com>
      Reviewed-by: Simon Horman <horms@kernel.org>
      Reviewed-by: Xin Long <lucien.xin@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      ee992f25
  5. Aug 12, 2024
    • net/sched: Fix UAF when resolving a clash · ad4e1d9c
      Chengen Du authored and Frieder Schrempf committed
      
      [ Upstream commit 26488172 ]
      
      KASAN reports the following UAF:
      
       BUG: KASAN: slab-use-after-free in tcf_ct_flow_table_process_conn+0x12b/0x380 [act_ct]
       Read of size 1 at addr ffff888c07603600 by task handler130/6469
      
       Call Trace:
        <IRQ>
        dump_stack_lvl+0x48/0x70
        print_address_description.constprop.0+0x33/0x3d0
        print_report+0xc0/0x2b0
        kasan_report+0xd0/0x120
        __asan_load1+0x6c/0x80
        tcf_ct_flow_table_process_conn+0x12b/0x380 [act_ct]
        tcf_ct_act+0x886/0x1350 [act_ct]
        tcf_action_exec+0xf8/0x1f0
        fl_classify+0x355/0x360 [cls_flower]
        __tcf_classify+0x1fd/0x330
        tcf_classify+0x21c/0x3c0
        sch_handle_ingress.constprop.0+0x2c5/0x500
        __netif_receive_skb_core.constprop.0+0xb25/0x1510
        __netif_receive_skb_list_core+0x220/0x4c0
        netif_receive_skb_list_internal+0x446/0x620
        napi_complete_done+0x157/0x3d0
        gro_cell_poll+0xcf/0x100
        __napi_poll+0x65/0x310
        net_rx_action+0x30c/0x5c0
        __do_softirq+0x14f/0x491
        __irq_exit_rcu+0x82/0xc0
        irq_exit_rcu+0xe/0x20
        common_interrupt+0xa1/0xb0
        </IRQ>
        <TASK>
        asm_common_interrupt+0x27/0x40
      
       Allocated by task 6469:
        kasan_save_stack+0x38/0x70
        kasan_set_track+0x25/0x40
        kasan_save_alloc_info+0x1e/0x40
        __kasan_krealloc+0x133/0x190
        krealloc+0xaa/0x130
        nf_ct_ext_add+0xed/0x230 [nf_conntrack]
        tcf_ct_act+0x1095/0x1350 [act_ct]
        tcf_action_exec+0xf8/0x1f0
        fl_classify+0x355/0x360 [cls_flower]
        __tcf_classify+0x1fd/0x330
        tcf_classify+0x21c/0x3c0
        sch_handle_ingress.constprop.0+0x2c5/0x500
        __netif_receive_skb_core.constprop.0+0xb25/0x1510
        __netif_receive_skb_list_core+0x220/0x4c0
        netif_receive_skb_list_internal+0x446/0x620
        napi_complete_done+0x157/0x3d0
        gro_cell_poll+0xcf/0x100
        __napi_poll+0x65/0x310
        net_rx_action+0x30c/0x5c0
        __do_softirq+0x14f/0x491
      
       Freed by task 6469:
        kasan_save_stack+0x38/0x70
        kasan_set_track+0x25/0x40
        kasan_save_free_info+0x2b/0x60
        ____kasan_slab_free+0x180/0x1f0
        __kasan_slab_free+0x12/0x30
        slab_free_freelist_hook+0xd2/0x1a0
        __kmem_cache_free+0x1a2/0x2f0
        kfree+0x78/0x120
        nf_conntrack_free+0x74/0x130 [nf_conntrack]
        nf_ct_destroy+0xb2/0x140 [nf_conntrack]
        __nf_ct_resolve_clash+0x529/0x5d0 [nf_conntrack]
        nf_ct_resolve_clash+0xf6/0x490 [nf_conntrack]
        __nf_conntrack_confirm+0x2c6/0x770 [nf_conntrack]
        tcf_ct_act+0x12ad/0x1350 [act_ct]
        tcf_action_exec+0xf8/0x1f0
        fl_classify+0x355/0x360 [cls_flower]
        __tcf_classify+0x1fd/0x330
        tcf_classify+0x21c/0x3c0
        sch_handle_ingress.constprop.0+0x2c5/0x500
        __netif_receive_skb_core.constprop.0+0xb25/0x1510
        __netif_receive_skb_list_core+0x220/0x4c0
        netif_receive_skb_list_internal+0x446/0x620
        napi_complete_done+0x157/0x3d0
        gro_cell_poll+0xcf/0x100
        __napi_poll+0x65/0x310
        net_rx_action+0x30c/0x5c0
        __do_softirq+0x14f/0x491
      
      The ct may be dropped if a clash has been resolved but is still passed to
      the tcf_ct_flow_table_process_conn function for further usage. This issue
      can be fixed by retrieving ct from skb again after confirming conntrack.
      
      Fixes: 0cc254e5 ("net/sched: act_ct: Offload connections with commit action")
      Co-developed-by: Gerald Yang <gerald.yang@canonical.com>
      Signed-off-by: Gerald Yang <gerald.yang@canonical.com>
      Signed-off-by: Chengen Du <chengen.du@canonical.com>
      Link: https://patch.msgid.link/20240710053747.13223-1-chengen.du@canonical.com
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      ad4e1d9c
  6. Jul 11, 2024
    • net/sched: unregister lockdep keys in qdisc_create/qdisc_alloc error path · 5bf3df45
      Davide Caratti authored and Frieder Schrempf committed
      
      commit 86735b57 upstream.
      
      Naresh and Eric report several errors (corrupted elements in the dynamic
      key hash list), when running tdc.py or syzbot. The error path of
      qdisc_alloc() and qdisc_create() frees the qdisc memory, but it forgets
      to unregister the lockdep key, thus causing use-after-free like the
      following one:
      
       ==================================================================
       BUG: KASAN: slab-use-after-free in lockdep_register_key+0x5f2/0x700
       Read of size 8 at addr ffff88811236f2a8 by task ip/7925
      
       CPU: 26 PID: 7925 Comm: ip Kdump: loaded Not tainted 6.9.0-rc2+ #648
       Hardware name: Supermicro SYS-6027R-72RF/X9DRH-7TF/7F/iTF/iF, BIOS 3.0  07/26/2013
       Call Trace:
        <TASK>
        dump_stack_lvl+0x7c/0xc0
        print_report+0xc9/0x610
        kasan_report+0x89/0xc0
        lockdep_register_key+0x5f2/0x700
        qdisc_alloc+0x21d/0xb60
        qdisc_create_dflt+0x63/0x3c0
        attach_one_default_qdisc.constprop.37+0x8e/0x170
        dev_activate+0x4bd/0xc30
        __dev_open+0x275/0x380
        __dev_change_flags+0x3f1/0x570
        dev_change_flags+0x7c/0x160
        do_setlink+0x1ea1/0x34b0
        __rtnl_newlink+0x8c9/0x1510
        rtnl_newlink+0x61/0x90
        rtnetlink_rcv_msg+0x2f0/0xbc0
        netlink_rcv_skb+0x120/0x380
        netlink_unicast+0x420/0x630
        netlink_sendmsg+0x732/0xbc0
        __sock_sendmsg+0x1ea/0x280
        ____sys_sendmsg+0x5a9/0x990
        ___sys_sendmsg+0xf1/0x180
        __sys_sendmsg+0xd3/0x180
        do_syscall_64+0x96/0x180
        entry_SYSCALL_64_after_hwframe+0x71/0x79
       RIP: 0033:0x7f9503f4fa07
       Code: 0a 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b9 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 2e 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 89 54 24 1c 48 89 74 24 10
       RSP: 002b:00007fff6c729068 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
       RAX: ffffffffffffffda RBX: 000000006630c681 RCX: 00007f9503f4fa07
       RDX: 0000000000000000 RSI: 00007fff6c7290d0 RDI: 0000000000000003
       RBP: 0000000000000000 R08: 0000000000000001 R09: 0000000000000078
       R10: 000000000000009b R11: 0000000000000246 R12: 0000000000000001
       R13: 00007fff6c729180 R14: 0000000000000000 R15: 000055bf67dd9040
        </TASK>
      
       Allocated by task 7745:
        kasan_save_stack+0x1c/0x40
        kasan_save_track+0x10/0x30
        __kasan_kmalloc+0x7b/0x90
        __kmalloc_node+0x1ff/0x460
        qdisc_alloc+0xae/0xb60
        qdisc_create+0xdd/0xfb0
        tc_modify_qdisc+0x37e/0x1960
        rtnetlink_rcv_msg+0x2f0/0xbc0
        netlink_rcv_skb+0x120/0x380
        netlink_unicast+0x420/0x630
        netlink_sendmsg+0x732/0xbc0
        __sock_sendmsg+0x1ea/0x280
        ____sys_sendmsg+0x5a9/0x990
        ___sys_sendmsg+0xf1/0x180
        __sys_sendmsg+0xd3/0x180
        do_syscall_64+0x96/0x180
        entry_SYSCALL_64_after_hwframe+0x71/0x79
      
       Freed by task 7745:
        kasan_save_stack+0x1c/0x40
        kasan_save_track+0x10/0x30
        kasan_save_free_info+0x36/0x60
        __kasan_slab_free+0xfe/0x180
        kfree+0x113/0x380
        qdisc_create+0xafb/0xfb0
        tc_modify_qdisc+0x37e/0x1960
        rtnetlink_rcv_msg+0x2f0/0xbc0
        netlink_rcv_skb+0x120/0x380
        netlink_unicast+0x420/0x630
        netlink_sendmsg+0x732/0xbc0
        __sock_sendmsg+0x1ea/0x280
        ____sys_sendmsg+0x5a9/0x990
        ___sys_sendmsg+0xf1/0x180
        __sys_sendmsg+0xd3/0x180
        do_syscall_64+0x96/0x180
        entry_SYSCALL_64_after_hwframe+0x71/0x79
      
      Fix this by ensuring that lockdep_unregister_key() is called before the
      qdisc struct is freed, including in the error paths of qdisc_create()
      and qdisc_alloc().
      
      Fixes: af0cb3fa ("net/sched: fix false lockdep warning on qdisc root lock")
      Reported-by: Linux Kernel Functional Testing <lkft@linaro.org>
      Closes: https://lore.kernel.org/netdev/20240429221706.1492418-1-naresh.kamboju@linaro.org/
      Signed-off-by: Davide Caratti <dcaratti@redhat.com>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Reviewed-by: Ido Schimmel <idosch@nvidia.com>
      Tested-by: Naresh Kamboju <naresh.kamboju@linaro.org>
      Tested-by: Ido Schimmel <idosch@nvidia.com>
      Link: https://lore.kernel.org/r/2aa1ca0c0a3aa0acc15925c666c777a4b5de553c.1714496886.git.dcaratti@redhat.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      5bf3df45
    • sched: act_ct: add netns into the key of tcf_ct_flow_table · 3d892903
      Xin Long authored and Frieder Schrempf committed
      
      [ Upstream commit 88c67aeb ]
      
      zones_ht is a global hashtable for flow_table with zone as key. However,
      it does not consider netns when getting a flow_table from zones_ht in
      tcf_ct_init(), and it means an act_ct action in netns A may get a
      flow_table that belongs to netns B if it has the same zone value.
      
      In Shuang's test with the TOPO:
      
        tcf2_c <---> tcf2_sw1 <---> tcf2_sw2 <---> tcf2_s
      
      tcf2_sw1 and tcf2_sw2 saw the same flow and used the same flow table,
      which caused their ct entries to enter unexpected states and left the
      TCP connection unable to end normally.
      
      This patch fixes the issue simply by adding netns into the key of
      tcf_ct_flow_table so that an act_ct action gets a flow_table that
      belongs to its own netns in tcf_ct_init().
      
      Note that, to keep the code simple, we don't use
      tcf_ct_flow_table.nf_ft.net: the ct_ft is initialized only after it is
      inserted into the hashtable in tcf_ct_flow_table_get(), and using it
      would also require implementing several functions in rhashtable_params,
      including hashfn, obj_hashfn and obj_cmpfn.
      
      Fixes: 64ff70b8 ("net/sched: act_ct: Offload established connections to flow table")
      Reported-by: Shuang Li <shuali@redhat.com>
      Signed-off-by: Xin Long <lucien.xin@gmail.com>
      Reviewed-by: Simon Horman <horms@kernel.org>
      Link: https://lore.kernel.org/r/1db5b6cc6902c5fc6f8c6cbd85494a2008087be5.1718488050.git.lucien.xin@gmail.com
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      3d892903
    • net/sched: act_api: fix possible infinite loop in tcf_idr_check_alloc() · e4eab91c
      David Ruth authored and Frieder Schrempf committed
      
      [ Upstream commit d8643198 ]
      
      syzbot found hanging tasks waiting on rtnl_lock [1]
      
      A reproducer is available in the syzbot bug.
      
      When a request to add multiple actions with the same index is sent, the
      second request will block forever on the first request. This holds
      rtnl_lock, and causes tasks to hang.
      
      Return -EAGAIN to prevent infinite looping, while keeping documented
      behavior.
      
      [1]
      
      INFO: task kworker/1:0:5088 blocked for more than 143 seconds.
      Not tainted 6.9.0-rc4-syzkaller-00173-g3cdb45594619 #0
      "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      task:kworker/1:0 state:D stack:23744 pid:5088 tgid:5088 ppid:2 flags:0x00004000
      Workqueue: events_power_efficient reg_check_chans_work
      Call Trace:
      <TASK>
      context_switch kernel/sched/core.c:5409 [inline]
      __schedule+0xf15/0x5d00 kernel/sched/core.c:6746
      __schedule_loop kernel/sched/core.c:6823 [inline]
      schedule+0xe7/0x350 kernel/sched/core.c:6838
      schedule_preempt_disabled+0x13/0x30 kernel/sched/core.c:6895
      __mutex_lock_common kernel/locking/mutex.c:684 [inline]
      __mutex_lock+0x5b8/0x9c0 kernel/locking/mutex.c:752
      wiphy_lock include/net/cfg80211.h:5953 [inline]
      reg_leave_invalid_chans net/wireless/reg.c:2466 [inline]
      reg_check_chans_work+0x10a/0x10e0 net/wireless/reg.c:2481
      
      Fixes: 0190c1d4 ("net: sched: atomically check-allocate action")
      Reported-by: <syzbot+b87c222546179f4513a7@syzkaller.appspotmail.com>
      Closes: https://syzkaller.appspot.com/bug?extid=b87c222546179f4513a7
      Signed-off-by: David Ruth <druth@chromium.org>
      Reviewed-by: Jamal Hadi Salim <jhs@mojatatu.com>
      Link: https://lore.kernel.org/r/20240614190326.1349786-1-druth@chromium.org
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      e4eab91c
    • net/sched: act_api: rely on rcu in tcf_idr_check_alloc · 9816bf9e
      Pedro Tammela authored and Frieder Schrempf committed
      
      [ Upstream commit 4b55e867 ]
      
      Instead of relying only on the idrinfo->lock mutex for
      bind/alloc logic, rely on a combination of rcu + mutex + atomics
      to better scale the case where multiple rtnl-less filters are
      binding to the same action object.
      
      Action binding happens when an action index is specified explicitly and
      an action with that index exists. Example:
        tc actions add action drop index 1
        tc filter add ... matchall action drop index 1
        tc filter add ... matchall action drop index 1
        tc filter add ... matchall action drop index 1
        tc filter ls ...
           filter protocol all pref 49150 matchall chain 0 filter protocol all pref 49150 matchall chain 0 handle 0x1
           not_in_hw
                 action order 1: gact action drop
                  random type none pass val 0
                  index 1 ref 4 bind 3
      
         filter protocol all pref 49151 matchall chain 0 filter protocol all pref 49151 matchall chain 0 handle 0x1
           not_in_hw
                 action order 1: gact action drop
                  random type none pass val 0
                  index 1 ref 4 bind 3
      
         filter protocol all pref 49152 matchall chain 0 filter protocol all pref 49152 matchall chain 0 handle 0x1
           not_in_hw
                 action order 1: gact action drop
                  random type none pass val 0
                  index 1 ref 4 bind 3
      
      When no index is specified, as before, grab the mutex and allocate
      in the idr the next available id. In this version, as opposed to before,
      it's simplified to store the -EBUSY pointer instead of the previous
      alloc + replace combination.
      
      When an index is specified, rely on rcu to find if there's an object in
      such index. If there's none, fall back to the above, serializing on the
      mutex and reserving the specified id. If there's one, it is either an
      -EBUSY pointer, in which case we just try again until it becomes an
      action, or an action.
      Given the rcu guarantees, the action found could be dead and therefore
      we need to bump the refcount if it's not 0, handling the case it's
      in fact 0.
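
      The "bump the refcount if it's not 0" step can be sketched in userspace
      with C11 atomics. This is an illustration of the general technique, not
      the act_api code; ref_get_unless_zero() is a hypothetical helper (the
      kernel provides refcount_inc_not_zero() for this kind of check).

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/*
 * Take a reference only if the object is still alive.
 * Returns false when the count has already reached zero, in which case
 * the caller must treat the object as gone and retry or allocate anew.
 */
static bool ref_get_unless_zero(atomic_int *ref)
{
    int old = atomic_load(ref);

    assert(old >= 0);
    while (old != 0) {
        /* On failure, old is reloaded with the current value. */
        if (atomic_compare_exchange_weak(ref, &old, old + 1))
            return true;
    }
    return false;
}
```

      A lookup that runs without the mutex can call such a helper on whatever
      object it finds: a true result means the reference is held and the
      object cannot be freed underneath the caller, while false means the
      object was already on its way out.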
      
      As bind and the action refcount are already atomics, these increments can
      happen without the mutex protection while many tcf_idr_check_alloc race
      to bind to the same action instance.
      
      In case binding encounters a parallel delete or add, it will return
      -EAGAIN in order to try again. Both filter and action apis already
      have the retry machinery in-place. In case it's an unlocked filter it
      retries under the rtnl lock.
      
      Signed-off-by: Pedro Tammela <pctammela@mojatatu.com>
      Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
      Reviewed-by: Vlad Buslov <vladbu@nvidia.com>
      Link: https://lore.kernel.org/r/20231211181807.96028-2-pctammela@mojatatu.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      Stable-dep-of: d8643198 ("net/sched: act_api: fix possible infinite loop in tcf_idr_check_alloc()")
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      9816bf9e
    • net/sched: fix false lockdep warning on qdisc root lock · 329e222b
      Davide Caratti authored and Frieder Schrempf committed
      
      [ Upstream commit af0cb3fa ]
      
      Xiumei and Christoph reported the following lockdep splat, complaining of
      the qdisc root lock being taken twice:
      
       ============================================
       WARNING: possible recursive locking detected
       6.7.0-rc3+ #598 Not tainted
       --------------------------------------------
       swapper/2/0 is trying to acquire lock:
       ffff888177190110 (&sch->q.lock){+.-.}-{2:2}, at: __dev_queue_xmit+0x1560/0x2e70
      
       but task is already holding lock:
       ffff88811995a110 (&sch->q.lock){+.-.}-{2:2}, at: __dev_queue_xmit+0x1560/0x2e70
      
       other info that might help us debug this:
        Possible unsafe locking scenario:
      
              CPU0
              ----
         lock(&sch->q.lock);
         lock(&sch->q.lock);
      
        *** DEADLOCK ***
      
        May be due to missing lock nesting notation
      
       5 locks held by swapper/2/0:
        #0: ffff888135a09d98 ((&in_dev->mr_ifc_timer)){+.-.}-{0:0}, at: call_timer_fn+0x11a/0x510
        #1: ffffffffaaee5260 (rcu_read_lock){....}-{1:2}, at: ip_finish_output2+0x2c0/0x1ed0
        #2: ffffffffaaee5200 (rcu_read_lock_bh){....}-{1:2}, at: __dev_queue_xmit+0x209/0x2e70
        #3: ffff88811995a110 (&sch->q.lock){+.-.}-{2:2}, at: __dev_queue_xmit+0x1560/0x2e70
        #4: ffffffffaaee5200 (rcu_read_lock_bh){....}-{1:2}, at: __dev_queue_xmit+0x209/0x2e70
      
       stack backtrace:
       CPU: 2 PID: 0 Comm: swapper/2 Not tainted 6.7.0-rc3+ #598
       Hardware name: Red Hat KVM, BIOS 1.13.0-2.module+el8.3.0+7353+9de0a3cc 04/01/2014
       Call Trace:
        <IRQ>
        dump_stack_lvl+0x4a/0x80
        __lock_acquire+0xfdd/0x3150
        lock_acquire+0x1ca/0x540
        _raw_spin_lock+0x34/0x80
        __dev_queue_xmit+0x1560/0x2e70
        tcf_mirred_act+0x82e/0x1260 [act_mirred]
        tcf_action_exec+0x161/0x480
        tcf_classify+0x689/0x1170
        prio_enqueue+0x316/0x660 [sch_prio]
        dev_qdisc_enqueue+0x46/0x220
        __dev_queue_xmit+0x1615/0x2e70
        ip_finish_output2+0x1218/0x1ed0
        __ip_finish_output+0x8b3/0x1350
        ip_output+0x163/0x4e0
        igmp_ifc_timer_expire+0x44b/0x930
        call_timer_fn+0x1a2/0x510
        run_timer_softirq+0x54d/0x11a0
        __do_softirq+0x1b3/0x88f
        irq_exit_rcu+0x18f/0x1e0
        sysvec_apic_timer_interrupt+0x6f/0x90
        </IRQ>
      
      This happens when TC does a mirred egress redirect from the root qdisc of
      device A to the root qdisc of device B. As long as these two locks aren't
      protecting the same qdisc, they can be acquired in chain: add a per-qdisc
      lockdep key to silence false warnings.
      This dynamic key should safely replace the static key we have in sch_htb:
      it was added to allow enqueueing to the device "direct qdisc" while still
      holding the qdisc root lock.
      
      v2: don't use static keys anymore in HTB direct qdiscs (thanks Eric Dumazet)
      
      CC: Maxim Mikityanskiy <maxim@isovalent.com>
      CC: Xiumei Mu <xmu@redhat.com>
      Reported-by: Christoph Paasch <cpaasch@apple.com>
      Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/451
      Signed-off-by: Davide Caratti <dcaratti@redhat.com>
      Link: https://lore.kernel.org/r/7dc06d6158f72053cf877a82e2a7a5bd23692faa.1713448007.git.dcaratti@redhat.com
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      329e222b
    • net/sched: taprio: always validate TCA_TAPRIO_ATTR_PRIOMAP · d86961ff
      Eric Dumazet authored and Frieder Schrempf committed
      
      [ Upstream commit f921a58a ]
      
      If one TCA_TAPRIO_ATTR_PRIOMAP attribute has been provided,
      taprio_parse_mqprio_opt() must validate it, or userspace
      can inject arbitrary data to the kernel, the second time
      taprio_change() is called.
      
      First call (with valid attributes) sets dev->num_tc
      to a non zero value.
      
      Second call (with arbitrary mqprio attributes)
      returns early from taprio_parse_mqprio_opt()
      and bad things can happen.
      
      Fixes: a3d43c0d ("taprio: Add support adding an admin schedule")
      Reported-by: Noam Rathaus <noamr@ssd-disclosure.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
      Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Link: https://lore.kernel.org/r/20240604181511.769870-1-edumazet@google.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      d86961ff
    • net: sched: sch_multiq: fix possible OOB write in multiq_tune() · 5dede9f7
      Hangyu Hua authored and Frieder Schrempf committed
      
      [ Upstream commit affc18fd ]
      
      q->bands is assigned qopt->bands after the kmalloc, and the subsequent
      code logic runs with the new value, so the old q->bands should not be
      used in the kmalloc. Otherwise, an out-of-bounds write will occur.
      
      Fixes: c2999f7f ("net: sched: multiq: don't call qdisc_put() while holding tree lock")
      Signed-off-by: Hangyu Hua <hbh25y@gmail.com>
      Acked-by: Cong Wang <cong.wang@bytedance.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      5dede9f7
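
The sizing mistake described in the sch_multiq commit above can be sketched in user space. This is a simplified model, not the kernel code: `struct bands_state` and the three helper names are hypothetical, and the model assumes the worst case in which every band between the new count and `max_bands` produces one write into the temporary array. It compares an allocation sized from the old band count with one sized from the new count.

```c
#include <stddef.h>

/* Hypothetical stand-in for the qdisc's private state. */
struct bands_state {
    unsigned int bands;     /* bands currently in use (old value) */
    unsigned int max_bands; /* upper bound on the number of bands */
};

/* Array slots if the allocation is sized from the OLD band count. */
static size_t slots_from_old(const struct bands_state *q)
{
    return q->max_bands - q->bands;
}

/* Array slots if the allocation is sized from the NEW band count. */
static size_t slots_from_new(const struct bands_state *q, unsigned int new_bands)
{
    return q->max_bands - new_bands;
}

/* Worst-case number of writes once the band count has been updated:
 * one entry for each band index in [new_bands, max_bands). */
static size_t slots_written(const struct bands_state *q, unsigned int new_bands)
{
    size_t n = 0;

    for (unsigned int i = new_bands; i < q->max_bands; i++)
        n++;
    return n;
}
```

When the band count shrinks, for example from 8 to 2 with `max_bands` of 8, the old-count allocation has 0 slots while up to 6 entries may be written; sizing from the new count gives exactly 6.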
  7. Apr 11, 2024
    • net/sched: fix lockdep splat in qdisc_tree_reduce_backlog() · 09665dbf
      Eric Dumazet authored and Frieder Schrempf committed
      
      commit 7eb32236 upstream.
      
      qdisc_tree_reduce_backlog() is called with the qdisc lock held,
      not RTNL.
      
      We must use qdisc_lookup_rcu() instead of qdisc_lookup()
      
      syzbot reported:
      
      WARNING: suspicious RCU usage
      6.1.74-syzkaller #0 Not tainted
      -----------------------------
      net/sched/sch_api.c:305 suspicious rcu_dereference_protected() usage!
      
      other info that might help us debug this:
      
      rcu_scheduler_active = 2, debug_locks = 1
      3 locks held by udevd/1142:
        #0: ffffffff87c729a0 (rcu_read_lock){....}-{1:2}, at: rcu_lock_acquire include/linux/rcupdate.h:306 [inline]
        #0: ffffffff87c729a0 (rcu_read_lock){....}-{1:2}, at: rcu_read_lock include/linux/rcupdate.h:747 [inline]
        #0: ffffffff87c729a0 (rcu_read_lock){....}-{1:2}, at: net_tx_action+0x64a/0x970 net/core/dev.c:5282
        #1: ffff888171861108 (&sch->q.lock){+.-.}-{2:2}, at: spin_lock include/linux/spinlock.h:350 [inline]
        #1: ffff888171861108 (&sch->q.lock){+.-.}-{2:2}, at: net_tx_action+0x754/0x970 net/core/dev.c:5297
        #2: ffffffff87c729a0 (rcu_read_lock){....}-{1:2}, at: rcu_lock_acquire include/linux/rcupdate.h:306 [inline]
        #2: ffffffff87c729a0 (rcu_read_lock){....}-{1:2}, at: rcu_read_lock include/linux/rcupdate.h:747 [inline]
        #2: ffffffff87c729a0 (rcu_read_lock){....}-{1:2}, at: qdisc_tree_reduce_backlog+0x84/0x580 net/sched/sch_api.c:792
      
      stack backtrace:
      CPU: 1 PID: 1142 Comm: udevd Not tainted 6.1.74-syzkaller #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/25/2024
      Call Trace:
       <TASK>
        [<ffffffff85b85f14>] __dump_stack lib/dump_stack.c:88 [inline]
        [<ffffffff85b85f14>] dump_stack_lvl+0x1b1/0x28f lib/dump_stack.c:106
        [<ffffffff85b86007>] dump_stack+0x15/0x1e lib/dump_stack.c:113
        [<ffffffff81802299>] lockdep_rcu_suspicious+0x1b9/0x260 kernel/locking/lockdep.c:6592
        [<ffffffff84f0054c>] qdisc_lookup+0xac/0x6f0 net/sched/sch_api.c:305
        [<ffffffff84f037c3>] qdisc_tree_reduce_backlog+0x243/0x580 net/sched/sch_api.c:811
        [<ffffffff84f5b78c>] pfifo_tail_enqueue+0x32c/0x4b0 net/sched/sch_fifo.c:51
        [<ffffffff84fbcf63>] qdisc_enqueue include/net/sch_generic.h:833 [inline]
        [<ffffffff84fbcf63>] netem_dequeue+0xeb3/0x15d0 net/sched/sch_netem.c:723
        [<ffffffff84eecab9>] dequeue_skb net/sched/sch_generic.c:292 [inline]
        [<ffffffff84eecab9>] qdisc_restart net/sched/sch_generic.c:397 [inline]
        [<ffffffff84eecab9>] __qdisc_run+0x249/0x1e60 net/sched/sch_generic.c:415
        [<ffffffff84d7aa96>] qdisc_run+0xd6/0x260 include/net/pkt_sched.h:125
        [<ffffffff84d85d29>] net_tx_action+0x7c9/0x970 net/core/dev.c:5313
        [<ffffffff85e002bd>] __do_softirq+0x2bd/0x9bd kernel/softirq.c:616
        [<ffffffff81568bca>] invoke_softirq kernel/softirq.c:447 [inline]
        [<ffffffff81568bca>] __irq_exit_rcu+0xca/0x230 kernel/softirq.c:700
        [<ffffffff81568ae9>] irq_exit_rcu+0x9/0x20 kernel/softirq.c:712
        [<ffffffff85b89f52>] sysvec_apic_timer_interrupt+0x42/0x90 arch/x86/kernel/apic/apic.c:1107
        [<ffffffff85c00ccb>] asm_sysvec_apic_timer_interrupt+0x1b/0x20 arch/x86/include/asm/idtentry.h:656
      
      Fixes: d636fc5d ("net: sched: add rcu annotations around qdisc->qdisc_sleeping")
      Reported-by: syzbot <syzkaller@googlegroups.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reviewed-by: Jiri Pirko <jiri@nvidia.com>
      Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
      Link: https://lore.kernel.org/r/20240402134133.2352776-1-edumazet@google.com
      
      
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      09665dbf