  1. Mar 03, 2025
  2. Feb 03, 2025
    • openvswitch: fix lockup on tx to unregistering netdev with carrier · 49bf0ed9
      Ilya Maximets authored and Frieder Schrempf committed
      
      [ Upstream commit 47e55e4b410f7d552e43011baa5be1aab4093990 ]
      
      The commit in the Fixes tag attempted to fix the issue in the following
      sequence of calls:
      
          do_output
          -> ovs_vport_send
             -> dev_queue_xmit
                -> __dev_queue_xmit
                   -> netdev_core_pick_tx
                      -> skb_tx_hash
      
      When the device is unregistering, 'dev->real_num_tx_queues' goes to
      zero and the 'while (unlikely(hash >= qcount))' loop inside
      'skb_tx_hash' becomes infinite, locking up the core forever.
      
      But unfortunately, checking just the carrier status is not enough to
      fix the issue, because some devices may still be in unregistering
      state while reporting carrier status OK.
      
      One example of such a device is net/dummy.  It sets carrier ON
      on start, but it doesn't implement .ndo_stop to set the carrier off.
      And it makes sense, because dummy doesn't really have a carrier.
      Therefore, while this device is unregistering, it's still easy to hit
      the infinite loop in the skb_tx_hash() from the OVS datapath.  There
      might be other drivers that do the same, but dummy by itself is
      important for the OVS ecosystem, because it is frequently used as a
      packet sink for tcpdump while debugging OVS deployments.  And when the
      issue is hit, the only way to recover is to reboot.
      
      Fix that by also checking if the device is running.  The running
      state is handled by the net core during unregistering, so it covers
      the unregistering case better, and we don't really need to send packets
      to devices that are not running anyway.
      
      While only checking the running state might be enough, the carrier
      check is preserved.  The running and the carrier states seem disjoint
      throughout the code and different drivers.  And other core functions
      like __dev_direct_xmit() check both before attempting to transmit
      a packet.  So, it seems safer to check both flags in OVS as well.
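
A minimal user-space sketch of the combined check described above, written in Python rather than kernel C. `Dev`, its fields, and `should_transmit` are hypothetical names used only for illustration; they model the running and carrier states and are not kernel APIs.

```python
from dataclasses import dataclass


@dataclass
class Dev:
    """Hypothetical stand-in for the netdev state relevant to this fix."""
    running: bool      # models the running state handled by the net core
    carrier_ok: bool   # models the carrier state reported by the driver


def should_transmit(dev):
    """Transmit only if a device exists, is running, and has carrier."""
    return dev is not None and dev.running and dev.carrier_ok


# A dummy-like device that is unregistering: carrier still ON, not running.
unregistering_dummy = Dev(running=False, carrier_ok=True)
print(should_transmit(unregistering_dummy))                  # carrier alone is not enough
print(should_transmit(Dev(running=True, carrier_ok=True)))   # healthy device
```

Checking only `carrier_ok` would let the first device through, which is the case the message describes for net/dummy.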
      
      Fixes: 066b8678 ("net: openvswitch: fix race on port output")
      Reported-by: Friedrich Weber <f.weber@proxmox.com>
      Closes: https://mail.openvswitch.org/pipermail/ovs-discuss/2025-January/053423.html
      Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
      Tested-by: Friedrich Weber <f.weber@proxmox.com>
      Reviewed-by: Aaron Conole <aconole@redhat.com>
      Link: https://patch.msgid.link/20250109122225.4034688-1-i.maximets@ovn.org
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
  3. Jul 11, 2024
    • openvswitch: Set the skbuff pkt_type for proper pmtud support. · 2df2454d
      Aaron Conole authored and Frieder Schrempf committed
      
      [ Upstream commit 30a92c9e ]
      
      Open vSwitch was originally intended to switch at layer 2, only dealing with
      Ethernet frames.  With the introduction of L3 tunnel support, it crossed
      into the realm of needing to care a bit about some routing details when
      making forwarding decisions.  If an oversized packet would need to be
      fragmented during this forwarding decision, there is a chance for pmtu
      to get involved and generate a routing exception.  This is gated by the
      skbuff->pkt_type field.
      
      When a flow is already loaded into the openvswitch module, this field is
      set up and transitioned properly as a packet moves from one port to
      another.  When a packet execute is invoked after a flow is
      newly installed, this field is not properly initialized.  This causes the
      pmtud mechanism to omit sending the required exception messages across
      the tunnel boundary, and a second attempt needs to be made to make sure
      that the routing exception is properly set up.  To fix this, we set the
      outgoing packet's pkt_type to PACKET_OUTGOING, since it can only get
      to the openvswitch module via a port device or packet command.
      
      Even for bridge ports as users, the pkt_type needs to be reset when
      doing the transmit as the packet is truly outgoing and routing needs
      to get involved post packet transformations, in the case of
      VXLAN/GENEVE/udp-tunnel packets.  In general, the pkt_type on output
      gets ignored, since we go straight to the driver, but in the case of
      tunnel ports, packets go through the IP routing layer.
      
      This issue is periodically encountered in complex setups, such as large
      OpenShift deployments, where multiple sets of tunnel traversals occur.
      A way to recreate this is with the ovn-heater project, which can set up
      a networking environment that mimics such large deployments.  We need
      larger environments for this because we need to ensure that flow
      misses occur.  In these environments, without this patch, we can see:
      
        ./ovn_cluster.sh start
        podman exec ovn-chassis-1 ip r a 170.168.0.5/32 dev eth1 mtu 1200
        podman exec ovn-chassis-1 ip netns exec sw01p1 ip r flush cache
        podman exec ovn-chassis-1 ip netns exec sw01p1 \
               ping 21.0.0.3 -M do -s 1300 -c2
        PING 21.0.0.3 (21.0.0.3) 1300(1328) bytes of data.
        From 21.0.0.3 icmp_seq=2 Frag needed and DF set (mtu = 1142)
      
        --- 21.0.0.3 ping statistics ---
        ...
      
      Using tcpdump, we can also see the expected ICMP FRAG_NEEDED message is not
      sent into the server.
      
      With this patch, setting the pkt_type, we see the following:
      
        podman exec ovn-chassis-1 ip netns exec sw01p1 \
               ping 21.0.0.3 -M do -s 1300 -c2
        PING 21.0.0.3 (21.0.0.3) 1300(1328) bytes of data.
        From 21.0.0.3 icmp_seq=1 Frag needed and DF set (mtu = 1222)
        ping: local error: message too long, mtu=1222
      
        --- 21.0.0.3 ping statistics ---
        ...
      
      In this case, the first ping request receives the FRAG_NEEDED message and
      a local routing exception is created.
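
The sizes in the ping output can be checked with a short calculation. This is a sketch assuming an IPv4 header without options (20 bytes) and an ICMP echo header (8 bytes); the function names are made up for illustration.

```python
IPV4_HEADER = 20  # bytes, IPv4 header without options
ICMP_HEADER = 8   # bytes, ICMP echo header


def ipv4_ping_size(payload):
    """Total IPv4 packet size for an ICMP echo with the given payload."""
    return payload + ICMP_HEADER + IPV4_HEADER


def needs_fragmentation(payload, path_mtu):
    """True if the echo request cannot cross the path without fragmenting."""
    return ipv4_ping_size(payload) > path_mtu


print(ipv4_ping_size(1300))             # 1328, as in "1300(1328) bytes of data"
print(needs_fragmentation(1300, 1222))  # True: larger than the reported mtu
```

Because `-M do` sets the DF bit, a packet that needs fragmentation is answered with FRAG_NEEDED instead of being fragmented.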
      
      Tested-by: Jaime Caamano <jcaamano@redhat.com>
      Reported-at: https://issues.redhat.com/browse/FDP-164
      Fixes: 58264848 ("openvswitch: Add vxlan tunneling support.")
      Signed-off-by: Aaron Conole <aconole@redhat.com>
      Acked-by: Eelco Chaudron <echaudro@redhat.com>
      Link: https://lore.kernel.org/r/20240516200941.16152-1-aconole@redhat.com
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • net: openvswitch: fix overwriting ct original tuple for ICMPv6 · 74380c19
      Ilya Maximets authored and Frieder Schrempf committed
      
      [ Upstream commit 7c988176 ]
      
      OVS_PACKET_CMD_EXECUTE has 3 main attributes:
       - OVS_PACKET_ATTR_KEY - Packet metadata in a netlink format.
       - OVS_PACKET_ATTR_PACKET - Binary packet content.
       - OVS_PACKET_ATTR_ACTIONS - Actions to execute on the packet.
      
      OVS_PACKET_ATTR_KEY is parsed first to populate sw_flow_key structure
      with the metadata like conntrack state, input port, recirculation id,
      etc.  Then the packet itself gets parsed to populate the rest of the
      keys from the packet headers.
      
      Whenever the packet parsing code starts parsing the ICMPv6 header, it
      first zeroes out fields in the key corresponding to Neighbor Discovery
      information even if it is not an ND packet.
      
      This is the 'ipv6.nd' field.  However, the 'ipv6' is a union that shares
      the space between 'nd' and 'ct_orig' that holds the original tuple
      conntrack metadata parsed from the OVS_PACKET_ATTR_KEY.
      
      ND packets should not normally have conntrack state, so it's fine to
      share the space, but normal ICMPv6 Echo packets or maybe other types of
      ICMPv6 can have the state attached and it should not be overwritten.
      
      The issue results in all but the last 4 bytes of the destination
      address being wiped from the original conntrack tuple leading to
      incorrect packet matching and potentially executing wrong actions
      in case this packet recirculates within the datapath or goes back
      to userspace.
      
      ND fields should not be accessed in non-ND packets, so not clearing
      them should be fine.  Execute memset() only for actual ND packets to
      avoid the issue.

      Initializing the whole thing before parsing is needed because an ND packet
      may not contain all the options.
      
      The issue only affects the OVS_PACKET_CMD_EXECUTE path and doesn't
      affect packets entering OVS datapath from network interfaces, because
      in this case CT metadata is populated from skb after the packet is
      already parsed.
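
The overlap can be reproduced in user space with a `ctypes` union. This is only a sketch: the class names, the `fixed` switch, and the field sizes (a 16-byte target plus two 6-byte link-layer addresses for 'nd', two 16-byte addresses for 'ct_orig') are assumptions chosen to be consistent with the "all but the last 4 bytes" observation above, not a copy of the kernel structure.

```python
import ctypes


class CtOrig(ctypes.Structure):
    # Assumed layout: original-direction conntrack tuple addresses.
    _fields_ = [("src", ctypes.c_uint8 * 16), ("dst", ctypes.c_uint8 * 16)]


class Nd(ctypes.Structure):
    # Assumed layout: Neighbor Discovery target and link-layer addresses.
    _fields_ = [("target", ctypes.c_uint8 * 16),
                ("sll", ctypes.c_uint8 * 6),
                ("tll", ctypes.c_uint8 * 6)]


class Ipv6Union(ctypes.Union):
    _fields_ = [("ct_orig", CtOrig), ("nd", Nd)]


def make_key():
    """Key whose ct_orig was filled from metadata (bytes 1..32 as a marker)."""
    return Ipv6Union.from_buffer_copy(bytes(range(1, 33)))


def parse_icmpv6(key, is_nd, fixed):
    """Zero the 'nd' member: always (old behavior) or only for ND (fixed)."""
    if is_nd or not fixed:
        ctypes.memset(ctypes.addressof(key), 0, ctypes.sizeof(Nd))


old = make_key()
parse_icmpv6(old, is_nd=False, fixed=False)
print(bytes(old.ct_orig.dst).hex())   # only the last 4 bytes survive

new = make_key()
parse_icmpv6(new, is_nd=False, fixed=True)
print(bytes(new.ct_orig.dst).hex())   # untouched for a non-ND packet
```

With these sizes 'nd' covers 28 of the 32 bytes of 'ct_orig', so zeroing it wipes the source address and the first 12 bytes of the destination address.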
      
      Fixes: 9dd7f890 ("openvswitch: Add original direction conntrack tuple to sw_flow_key.")
      Reported-by: Antonin Bas <antonin.bas@broadcom.com>
      Closes: https://github.com/openvswitch/ovs-issues/issues/327
      Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
      Acked-by: Aaron Conole <aconole@redhat.com>
      Acked-by: Eelco Chaudron <echaudro@redhat.com>
      Link: https://lore.kernel.org/r/20240509094228.1035477-1-i.maximets@ovn.org
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
  4. May 13, 2024
  5. Mar 11, 2024
  6. Jan 23, 2024
  7. Sep 11, 2023
    • net: openvswitch: reject negative ifindex · 3b39a40c
      Jakub Kicinski authored and Frieder Schrempf committed
      
      [ Upstream commit a552bfa1 ]
      
      Recent changes in net-next (commit 759ab1ed ("net: store netdevs
      in an xarray")) refactored the handling of pre-assigned ifindexes
      and let syzbot surface a latent problem in ovs. ovs does not validate
      ifindex, making it possible to create netdev ports with negative
      ifindex values. It's easy to repro with YNL:
      
      $ ./cli.py --spec netlink/specs/ovs_datapath.yaml \
               --do new \
      	 --json '{"upcall-pid": 1, "name":"my-dp"}'
      $ ./cli.py --spec netlink/specs/ovs_vport.yaml \
      	 --do new \
      	 --json '{"upcall-pid": "00000001", "name": "some-port0", "dp-ifindex":3,"ifindex":4294901760,"type":2}'
      
      $ ip link show
      -65536: some-port0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
          link/ether 7a:48:21:ad:0b:fb brd ff:ff:ff:ff:ff:ff
      ...
      
      Validate the inputs. Now the second command correctly returns:
      
      $ ./cli.py --spec netlink/specs/ovs_vport.yaml \
      	 --do new \
      	 --json '{"upcall-pid": "00000001", "name": "some-port0", "dp-ifindex":3,"ifindex":4294901760,"type":2}'
      
      lib.ynl.NlError: Netlink error: Numerical result out of range
      nl_len = 108 (92) nl_flags = 0x300 nl_type = 2
      	error: -34	extack: {'msg': 'integer out of range', 'unknown': [[type:4 len:36] b'\x0c\x00\x02\x00\x00\x00\x00\x00\x00\x00\x00\x00\x0c\x00\x03\x00\xff\xff\xff\x7f\x00\x00\x00\x00\x08\x00\x01\x00\x08\x00\x00\x00'], 'bad-attr': '.ifindex'}
      
      Accept 0 since it used to be silently ignored.
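
The value in the reproducer maps to the negative ifindex shown by `ip link show` once it is read as a signed 32-bit integer. The sketch below illustrates that conversion and a range check that rejects negatives while accepting 0; `as_s32` and `validate_ifindex` are hypothetical helpers, not the kernel's netlink policy code.

```python
import errno
import struct


def as_s32(value):
    """Reinterpret an unsigned 32-bit value as a signed 32-bit integer."""
    return struct.unpack("<i", struct.pack("<I", value))[0]


def validate_ifindex(raw):
    """Return 0 if acceptable, else -ERANGE (illustrative model of the check)."""
    if as_s32(raw) < 0:
        return -errno.ERANGE
    return 0


print(as_s32(4294901760))            # -65536, the ifindex seen in the repro
print(validate_ifindex(4294901760))  # negative errno: rejected
print(validate_ifindex(0))           # 0: still accepted
```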
      
      Fixes: 54c4ef34 ("openvswitch: allow specifying ifindex of new interfaces")
      Reported-by: <syzbot+7456b5dcf65111553320@syzkaller.appspotmail.com>
      Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
      Reviewed-by: Aaron Conole <aconole@redhat.com>
      Link: https://lore.kernel.org/r/20230814203840.2908710-1-kuba@kernel.org
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
  8. Apr 20, 2023
    • net: openvswitch: fix race on port output · 644b3051
      Felix Huettner authored
      
      [ Upstream commit 066b8678 ]
      
      Assume the following setup on a single machine:
      1. An openvswitch instance with one bridge and default flows
      2. two network namespaces "server" and "client"
      3. two ovs interfaces "server" and "client" on the bridge
      4. for each ovs interface a veth pair with a matching name and 32 rx and
         tx queues
      5. move the ends of the veth pairs to the respective network namespaces
      6. assign ip addresses to each of the veth ends in the namespaces (needs
         to be the same subnet)
      7. start some http server on the server network namespace
      8. test if a client in the client namespace can reach the http server
      
      When following the actions below, the host has a chance of getting a CPU
      stuck in an infinite loop:
      1. send a large number of parallel requests to the http server (around
         3000 curls should work)
      2. in parallel delete the network namespace (do not delete interfaces or
         stop the server, just kill the namespace)

      There is a low chance that this will cause the below kernel cpu stuck
      message. If this does not happen, just retry.
      Below there is also the output of bpftrace for the functions mentioned
      in the output.
      
      The series of events happening here is:
      1. the network namespace is deleted calling
         `unregister_netdevice_many_notify` somewhere in the process
      2. this sets first `NETREG_UNREGISTERING` on both ends of the veth and
         then runs `synchronize_net`
      3. it then calls `call_netdevice_notifiers` with `NETDEV_UNREGISTER`
      4. this is then handled by `dp_device_event` which calls
         `ovs_netdev_detach_dev` (if a vport is found, which is the case for
         the veth interface attached to ovs)
      5. this removes the rx_handlers of the device but does not prevent
         packets from being sent to the device
      6. `dp_device_event` then queues the vport deletion to work in
         background as an ovs_lock is needed that we do not hold in the
         unregistration path
      7. `unregister_netdevice_many_notify` continues to call
         `netdev_unregister_kobject` which sets `real_num_tx_queues` to 0
      8. port deletion continues (but details are not relevant for this issue)
      9. at some future point the background task deletes the vport
      
      If, after step 7 but before step 9, a packet is sent to the ovs vport (which is
      not deleted at this point in time), the vport forwards it to the
      `dev_queue_xmit` flow even though the device is unregistering.
      In `skb_tx_hash` (which is called in the `dev_queue_xmit` path) there is
      a while loop (if the packet has a rx_queue recorded) that is infinite if
      `dev->real_num_tx_queues` is zero.
      
      To prevent this from happening we update `do_output` to handle devices
      without carrier the same as if the device is not found (which would
      be the code path after 9. is done).
      
      Additionally we now produce a warning in `skb_tx_hash` if we will hit
      the infinite loop.
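
Why the loop cannot terminate with zero queues can be seen in a small model. This is a user-space sketch, not the kernel function: `reduce_to_queue` is a hypothetical name, and the iteration cap exists only so the example returns instead of spinning.

```python
def reduce_to_queue(hash_value, qcount, max_iters=10_000):
    """Repeated subtraction as in the described loop, with an iteration cap.

    Returns the queue index, or None if the cap is hit. The real loop has no
    cap, so that case corresponds to a CPU spinning forever.
    """
    iters = 0
    while hash_value >= qcount:
        hash_value -= qcount   # subtracting 0 never makes progress
        iters += 1
        if iters >= max_iters:
            return None
    return hash_value


print(reduce_to_queue(5, 4))   # 1: normal case, recorded rx queue 5, 4 tx queues
print(reduce_to_queue(5, 0))   # None: real_num_tx_queues == 0 never terminates
```

Any non-negative recorded queue satisfies `hash_value >= 0`, so once the queue count drops to zero every packet with a recorded rx queue hits this case.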
      
      bpftrace (first word is function name):
      
      __dev_queue_xmit server: real_num_tx_queues: 1, cpu: 2, pid: 28024, tid: 28024, skb_addr: 0xffff9edb6f207000, reg_state: 1
      netdev_core_pick_tx server: addr: 0xffff9f0a46d4a000 real_num_tx_queues: 1, cpu: 2, pid: 28024, tid: 28024, skb_addr: 0xffff9edb6f207000, reg_state: 1
      dp_device_event server: real_num_tx_queues: 1 cpu 9, pid: 21024, tid: 21024, event 2, reg_state: 1
      synchronize_rcu_expedited: cpu 9, pid: 21024, tid: 21024
      synchronize_rcu_expedited: cpu 9, pid: 21024, tid: 21024
      synchronize_rcu_expedited: cpu 9, pid: 21024, tid: 21024
      synchronize_rcu_expedited: cpu 9, pid: 21024, tid: 21024
      dp_device_event server: real_num_tx_queues: 1 cpu 9, pid: 21024, tid: 21024, event 6, reg_state: 2
      ovs_netdev_detach_dev server: real_num_tx_queues: 1 cpu 9, pid: 21024, tid: 21024, reg_state: 2
      netdev_rx_handler_unregister server: real_num_tx_queues: 1, cpu: 9, pid: 21024, tid: 21024, reg_state: 2
      synchronize_rcu_expedited: cpu 9, pid: 21024, tid: 21024
      netdev_rx_handler_unregister ret server: real_num_tx_queues: 1, cpu: 9, pid: 21024, tid: 21024, reg_state: 2
      dp_device_event server: real_num_tx_queues: 1 cpu 9, pid: 21024, tid: 21024, event 27, reg_state: 2
      dp_device_event server: real_num_tx_queues: 1 cpu 9, pid: 21024, tid: 21024, event 22, reg_state: 2
      dp_device_event server: real_num_tx_queues: 1 cpu 9, pid: 21024, tid: 21024, event 18, reg_state: 2
      netdev_unregister_kobject: real_num_tx_queues: 1, cpu: 9, pid: 21024, tid: 21024
      synchronize_rcu_expedited: cpu 9, pid: 21024, tid: 21024
      ovs_vport_send server: real_num_tx_queues: 0, cpu: 2, pid: 28024, tid: 28024, skb_addr: 0xffff9edb6f207000, reg_state: 2
      __dev_queue_xmit server: real_num_tx_queues: 0, cpu: 2, pid: 28024, tid: 28024, skb_addr: 0xffff9edb6f207000, reg_state: 2
      netdev_core_pick_tx server: addr: 0xffff9f0a46d4a000 real_num_tx_queues: 0, cpu: 2, pid: 28024, tid: 28024, skb_addr: 0xffff9edb6f207000, reg_state: 2
      broken device server: real_num_tx_queues: 0, cpu: 2, pid: 28024, tid: 28024
      ovs_dp_detach_port server: real_num_tx_queues: 0 cpu 9, pid: 9124, tid: 9124, reg_state: 2
      synchronize_rcu_expedited: cpu 9, pid: 33604, tid: 33604
      
      stuck message:
      
      watchdog: BUG: soft lockup - CPU#5 stuck for 26s! [curl:1929279]
      Modules linked in: veth pktgen bridge stp llc ip_set_hash_net nft_counter xt_set nft_compat nf_tables ip_set_hash_ip ip_set nfnetlink_cttimeout nfnetlink openvswitch nsh nf_conncount nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 tls binfmt_misc nls_iso8859_1 input_leds joydev serio_raw dm_multipath scsi_dh_rdac scsi_dh_emc scsi_dh_alua sch_fq_codel drm efi_pstore virtio_rng ip_tables x_tables autofs4 btrfs blake2b_generic zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear hid_generic usbhid hid crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel virtio_net ahci net_failover crypto_simd cryptd psmouse libahci virtio_blk failover
      CPU: 5 PID: 1929279 Comm: curl Not tainted 5.15.0-67-generic #74-Ubuntu
      Hardware name: OpenStack Foundation OpenStack Nova, BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
      RIP: 0010:netdev_pick_tx+0xf1/0x320
      Code: 00 00 8d 48 ff 0f b7 c1 66 39 ca 0f 86 e9 01 00 00 45 0f b7 ff 41 39 c7 0f 87 5b 01 00 00 44 29 f8 41 39 c7 0f 87 4f 01 00 00 <eb> f2 0f 1f 44 00 00 49 8b 94 24 28 04 00 00 48 85 d2 0f 84 53 01
      RSP: 0018:ffffb78b40298820 EFLAGS: 00000246
      RAX: 0000000000000000 RBX: ffff9c8773adc2e0 RCX: 000000000000083f
      RDX: 0000000000000000 RSI: ffff9c8773adc2e0 RDI: ffff9c870a25e000
      RBP: ffffb78b40298858 R08: 0000000000000001 R09: 0000000000000000
      R10: 0000000000000000 R11: 0000000000000000 R12: ffff9c870a25e000
      R13: ffff9c870a25e000 R14: ffff9c87fe043480 R15: 0000000000000000
      FS:  00007f7b80008f00(0000) GS:ffff9c8e5f740000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 00007f7b80f6a0b0 CR3: 0000000329d66000 CR4: 0000000000350ee0
      Call Trace:
       <IRQ>
       netdev_core_pick_tx+0xa4/0xb0
       __dev_queue_xmit+0xf8/0x510
       ? __bpf_prog_exit+0x1e/0x30
       dev_queue_xmit+0x10/0x20
       ovs_vport_send+0xad/0x170 [openvswitch]
       do_output+0x59/0x180 [openvswitch]
       do_execute_actions+0xa80/0xaa0 [openvswitch]
       ? kfree+0x1/0x250
       ? kfree+0x1/0x250
       ? kprobe_perf_func+0x4f/0x2b0
       ? flow_lookup.constprop.0+0x5c/0x110 [openvswitch]
       ovs_execute_actions+0x4c/0x120 [openvswitch]
       ovs_dp_process_packet+0xa1/0x200 [openvswitch]
       ? ovs_ct_update_key.isra.0+0xa8/0x120 [openvswitch]
       ? ovs_ct_fill_key+0x1d/0x30 [openvswitch]
       ? ovs_flow_key_extract+0x2db/0x350 [openvswitch]
       ovs_vport_receive+0x77/0xd0 [openvswitch]
       ? __htab_map_lookup_elem+0x4e/0x60
       ? bpf_prog_680e8aff8547aec1_kfree+0x3b/0x714
       ? trace_call_bpf+0xc8/0x150
       ? kfree+0x1/0x250
       ? kfree+0x1/0x250
       ? kprobe_perf_func+0x4f/0x2b0
       ? kprobe_perf_func+0x4f/0x2b0
       ? __mod_memcg_lruvec_state+0x63/0xe0
       netdev_port_receive+0xc4/0x180 [openvswitch]
       ? netdev_port_receive+0x180/0x180 [openvswitch]
       netdev_frame_hook+0x1f/0x40 [openvswitch]
       __netif_receive_skb_core.constprop.0+0x23d/0xf00
       __netif_receive_skb_one_core+0x3f/0xa0
       __netif_receive_skb+0x15/0x60
       process_backlog+0x9e/0x170
       __napi_poll+0x33/0x180
       net_rx_action+0x126/0x280
       ? ttwu_do_activate+0x72/0xf0
       __do_softirq+0xd9/0x2e7
       ? rcu_report_exp_cpu_mult+0x1b0/0x1b0
       do_softirq+0x7d/0xb0
       </IRQ>
       <TASK>
       __local_bh_enable_ip+0x54/0x60
       ip_finish_output2+0x191/0x460
       __ip_finish_output+0xb7/0x180
       ip_finish_output+0x2e/0xc0
       ip_output+0x78/0x100
       ? __ip_finish_output+0x180/0x180
       ip_local_out+0x5e/0x70
       __ip_queue_xmit+0x184/0x440
       ? tcp_syn_options+0x1f9/0x300
       ip_queue_xmit+0x15/0x20
       __tcp_transmit_skb+0x910/0x9c0
       ? __mod_memcg_state+0x44/0xa0
       tcp_connect+0x437/0x4e0
       ? ktime_get_with_offset+0x60/0xf0
       tcp_v4_connect+0x436/0x530
       __inet_stream_connect+0xd4/0x3a0
       ? kprobe_perf_func+0x4f/0x2b0
       ? aa_sk_perm+0x43/0x1c0
       inet_stream_connect+0x3b/0x60
       __sys_connect_file+0x63/0x70
       __sys_connect+0xa6/0xd0
       ? setfl+0x108/0x170
       ? do_fcntl+0xe8/0x5a0
       __x64_sys_connect+0x18/0x20
       do_syscall_64+0x5c/0xc0
       ? __x64_sys_fcntl+0xa9/0xd0
       ? exit_to_user_mode_prepare+0x37/0xb0
       ? syscall_exit_to_user_mode+0x27/0x50
       ? do_syscall_64+0x69/0xc0
       ? __sys_setsockopt+0xea/0x1e0
       ? exit_to_user_mode_prepare+0x37/0xb0
       ? syscall_exit_to_user_mode+0x27/0x50
       ? __x64_sys_setsockopt+0x1f/0x30
       ? do_syscall_64+0x69/0xc0
       ? irqentry_exit+0x1d/0x30
       ? exc_page_fault+0x89/0x170
       entry_SYSCALL_64_after_hwframe+0x61/0xcb
      RIP: 0033:0x7f7b8101c6a7
      Code: 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 2a 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 18 89 54 24 0c 48 89 34 24 89
      RSP: 002b:00007ffffd6b2198 EFLAGS: 00000246 ORIG_RAX: 000000000000002a
      RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f7b8101c6a7
      RDX: 0000000000000010 RSI: 00007ffffd6b2360 RDI: 0000000000000005
      RBP: 0000561f1370d560 R08: 00002795ad21d1ac R09: 0030312e302e302e
      R10: 00007ffffd73f080 R11: 0000000000000246 R12: 0000561f1370c410
      R13: 0000000000000000 R14: 0000000000000005 R15: 0000000000000000
       </TASK>
      
      Fixes: 7f8a436e ("openvswitch: Add conntrack action")
      Co-developed-by: Luca Czesla <luca.czesla@mail.schwarz>
      Signed-off-by: Luca Czesla <luca.czesla@mail.schwarz>
      Signed-off-by: Felix Huettner <felix.huettner@mail.schwarz>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Reviewed-by: Simon Horman <simon.horman@corigine.com>
      Link: https://lore.kernel.org/r/ZC0pBXBAgh7c76CA@kernel-bug-kernel-bug
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
  9. Feb 22, 2023
  10. Feb 09, 2023
    • net: openvswitch: fix flow memory leak in ovs_flow_cmd_new · 70d40674
      Fedor Pchelkin authored
      
      [ Upstream commit 0c598aed ]
      
      Syzkaller reports a memory leak of new_flow in ovs_flow_cmd_new() as it is
      not freed when an allocation of a key fails.
      
      BUG: memory leak
      unreferenced object 0xffff888116668000 (size 632):
        comm "syz-executor231", pid 1090, jiffies 4294844701 (age 18.871s)
        hex dump (first 32 bytes):
          00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
          00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
        backtrace:
          [<00000000defa3494>] kmem_cache_zalloc include/linux/slab.h:654 [inline]
          [<00000000defa3494>] ovs_flow_alloc+0x19/0x180 net/openvswitch/flow_table.c:77
          [<00000000c67d8873>] ovs_flow_cmd_new+0x1de/0xd40 net/openvswitch/datapath.c:957
          [<0000000010a539a8>] genl_family_rcv_msg_doit+0x22d/0x330 net/netlink/genetlink.c:739
          [<00000000dff3302d>] genl_family_rcv_msg net/netlink/genetlink.c:783 [inline]
          [<00000000dff3302d>] genl_rcv_msg+0x328/0x590 net/netlink/genetlink.c:800
          [<000000000286dd87>] netlink_rcv_skb+0x153/0x430 net/netlink/af_netlink.c:2515
          [<0000000061fed410>] genl_rcv+0x24/0x40 net/netlink/genetlink.c:811
          [<000000009dc0f111>] netlink_unicast_kernel net/netlink/af_netlink.c:1313 [inline]
          [<000000009dc0f111>] netlink_unicast+0x545/0x7f0 net/netlink/af_netlink.c:1339
          [<000000004a5ee816>] netlink_sendmsg+0x8e7/0xde0 net/netlink/af_netlink.c:1934
          [<00000000482b476f>] sock_sendmsg_nosec net/socket.c:651 [inline]
          [<00000000482b476f>] sock_sendmsg+0x152/0x190 net/socket.c:671
          [<00000000698574ba>] ____sys_sendmsg+0x70a/0x870 net/socket.c:2356
          [<00000000d28d9e11>] ___sys_sendmsg+0xf3/0x170 net/socket.c:2410
          [<0000000083ba9120>] __sys_sendmsg+0xe5/0x1b0 net/socket.c:2439
          [<00000000c00628f8>] do_syscall_64+0x30/0x40 arch/x86/entry/common.c:46
          [<000000004abfdcf4>] entry_SYSCALL_64_after_hwframe+0x61/0xc6
      
      To fix this the patch rearranges the goto labels to reflect the order of
      object allocations and adds appropriate goto statements on the error
      paths.
      
      Found by Linux Verification Center (linuxtesting.org) with Syzkaller.
      
      Fixes: 68bb1010 ("openvswitch: Fix flow lookup to use unmasked key")
      Signed-off-by: Fedor Pchelkin <pchelkin@ispras.ru>
      Signed-off-by: Alexey Khoroshilov <khoroshilov@ispras.ru>
      Acked-by: Eelco Chaudron <echaudro@redhat.com>
      Reviewed-by: Simon Horman <simon.horman@corigine.com>
      Link: https://lore.kernel.org/r/20230201210218.361970-1-pchelkin@ispras.ru
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
  11. Dec 31, 2022
    • openvswitch: Use kmalloc_size_roundup() to match ksize() usage · bde272c8
      Kees Cook authored
      
      [ Upstream commit ab3f7828 ]
      
      Round up allocations with kmalloc_size_roundup() so that openvswitch's
      use of ksize() is always accurate and no special handling of the memory
      is needed by KASAN, UBSAN_BOUNDS, nor FORTIFY_SOURCE.
      
      Cc: Pravin B Shelar <pshelar@ovn.org>
      Cc: dev@openvswitch.org
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Link: https://lore.kernel.org/r/20221018090628.never.537-kees@kernel.org
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • openvswitch: Fix flow lookup to use unmasked key · 32d5fa5b
      Eelco Chaudron authored
      
      [ Upstream commit 68bb1010 ]
      
      The commit mentioned below causes the ovs_flow_tbl_lookup() function
      to be called with the masked key. However, it's supposed to be called
      with the unmasked key. This is due to the fact that the datapath supports
      installing wider flows, and OVS relies on this behavior. For example,
      if ipv4(src=1.1.1.1/192.0.0.0, dst=1.1.1.2/192.0.0.0) exists, a wider
      flow (smaller mask) of ipv4(src=192.1.1.1/128.0.0.0,
      dst=192.1.1.2/128.0.0.0) is allowed to be added.
      
      However, if we try to add a wildcard rule, the installation fails:
      
      $ ovs-appctl dpctl/add-flow system@myDP "in_port(1),eth_type(0x0800), \
        ipv4(src=1.1.1.1/192.0.0.0,dst=1.1.1.2/192.0.0.0,frag=no)" 2
      $ ovs-appctl dpctl/add-flow system@myDP "in_port(1),eth_type(0x0800), \
        ipv4(src=192.1.1.1/0.0.0.0,dst=49.1.1.2/0.0.0.0,frag=no)" 2
      ovs-vswitchd: updating flow table (File exists)
      
      The reason is that the key used to determine if the flow is already
      present in the system uses the original key ANDed with the mask.
      This results in the IP address not being part of the (miniflow) key,
      i.e., being substituted with an all-zero value. When doing the actual
      lookup, this results in the key wrongfully matching the first flow,
      and therefore the flow does not get installed.
      
      This change reverses the commit below, but rather than having the key
      on the stack, it's allocated.
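
The false match can be worked through with the addresses from the failing command. This is a simplified sketch: it compares a lookup key against a single installed flow under that flow's mask, which is an assumption about the essence of the lookup and leaves out the real table structure. `apply_mask` and `matches_existing` are hypothetical helpers.

```python
from ipaddress import IPv4Address


def apply_mask(addr, mask):
    """Bitwise AND of a dotted-quad address with a dotted-quad mask."""
    return str(IPv4Address(int(IPv4Address(addr)) & int(IPv4Address(mask))))


def matches_existing(key_src, key_dst, flow):
    """Compare a lookup key with one installed flow under that flow's mask."""
    mask = flow["mask"]
    return (apply_mask(key_src, mask) == apply_mask(flow["src"], mask)
            and apply_mask(key_dst, mask) == apply_mask(flow["dst"], mask))


existing = {"src": "1.1.1.1", "dst": "1.1.1.2", "mask": "192.0.0.0"}

# Buggy path: the new flow's key is first ANDed with its own mask (0.0.0.0).
masked_src = apply_mask("192.1.1.1", "0.0.0.0")
masked_dst = apply_mask("49.1.1.2", "0.0.0.0")
print(matches_existing(masked_src, masked_dst, existing))   # True: "File exists"

# Fixed path: the lookup uses the unmasked key.
print(matches_existing("192.1.1.1", "49.1.1.2", existing))  # False: flow can be added
```

The all-zero masked key collides with the first flow because 1.1.1.1 and 1.1.1.2 also reduce to 0.0.0.0 under 192.0.0.0, while the unmasked source 192.1.1.1 reduces to 192.0.0.0 and no longer matches.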
      
      Fixes: 190aa3e7 ("openvswitch: Fix Frame-size larger than 1024 bytes warning.")
      
      Signed-off-by: Eelco Chaudron <echaudro@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
  12. Nov 18, 2022
  13. Oct 29, 2022
  14. Oct 27, 2022
  15. Oct 13, 2022
  16. Oct 11, 2022
  17. Sep 27, 2022
  18. Sep 20, 2022
  19. Sep 09, 2022
  20. Aug 29, 2022
    • genetlink: start to validate reserved header bytes · 9c5d03d3
      Jakub Kicinski authored
      
      We had historically not checked that genlmsghdr.reserved
      is 0 on input which prevents us from using those precious
      bytes in the future.
      
      One use case would be to extend the cmd field, which is
      currently just 8 bits wide and 256 is not a lot of commands
      for some core families.
      
      To make sure that new families do the right thing by default,
      put the onus of opting out of validation on existing families.
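
A small sketch of what validating the reserved bytes means for a parser. The header layout (8-bit cmd, 8-bit version, 16-bit reserved) follows `struct genlmsghdr`; the `validate_reserved` switch is a hypothetical stand-in for a family opting out, not the mechanism the kernel uses.

```python
import struct

# cmd (u8), version (u8), reserved (u16), native byte order, no padding
GENL_HDR = struct.Struct("=BBH")


def parse_genl_header(data, validate_reserved=True):
    """Return (cmd, version); reject a non-zero reserved field when validating."""
    cmd, version, reserved = GENL_HDR.unpack_from(data)
    if validate_reserved and reserved != 0:
        raise ValueError("genlmsghdr.reserved must be 0")
    return cmd, version


print(parse_genl_header(GENL_HDR.pack(1, 1, 0)))                           # accepted
print(parse_genl_header(GENL_HDR.pack(1, 1, 7), validate_reserved=False))  # opted out
```

With validation on by default, a future extension can give the reserved bytes a meaning without colliding with senders that left garbage there.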
      
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      Acked-by: Paul Moore <paul@paul-moore.com> (NetLabel)
      Signed-off-by: David S. Miller <davem@davemloft.net>
  21. Aug 27, 2022
  22. Aug 23, 2022
  23. Aug 22, 2022
  24. Jun 23, 2022
    • net: openvswitch: fix parsing of nw_proto for IPv6 fragments · 12378a5a
      Rosemarie O'Riorden authored

      When a packet enters the OVS datapath and does not match any existing
      flows installed in the kernel flow cache, the packet will be sent to
      userspace to be parsed, and a new flow will be created. The kernel and
      OVS rely on each other to parse packet fields in the same way so that
      packets will be handled properly.
      
      As per the design document linked below, OVS expects all later IPv6
      fragments to have nw_proto=44 in the flow key, so they can be correctly
      matched on OpenFlow rules. OpenFlow controllers create pipelines based
      on this design.
      
      This behavior was changed by the commit in the Fixes tag so that
      nw_proto equals the next_header field of the last extension header.
      However, there is no counterpart for this change in OVS userspace,
      meaning that this field is parsed differently between OVS and the
      kernel. This is a problem because OVS creates actions based on what is
      parsed in userspace, but the kernel-provided flow key is used as the match
      criteria, as described in Documentation/networking/openvswitch.rst. This
      leads to issues such as packets incorrectly matching on a flow and thus
      the wrong list of actions being applied to the packet. Such changes in
      packet parsing cannot be implemented without breaking the userspace.
      
      The offending commit is partially reverted to restore the expected
      behavior.
      
      The change technically made sense and there is a good reason that it was
      implemented, but it does not comply with the original design of OVS.
      If in the future someone wants to implement such a change, then it must
      be user-configurable and disabled by default to preserve backwards
      compatibility with existing OVS versions.
      
      Cc: stable@vger.kernel.org
      Fixes: fa642f08 ("openvswitch: Derive IP protocol number for IPv6 later frags")
      Link: https://docs.openvswitch.org/en/latest/topics/design/#fragments
      Signed-off-by: Rosemarie O'Riorden <roriorden@redhat.com>
      Acked-by: Eelco Chaudron <echaudro@redhat.com>
      Link: https://lore.kernel.org/r/20220621204845.9721-1-roriorden@redhat.com
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      12378a5a
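The expected behavior for later IPv6 fragments can be sketched roughly as follows. This is a simplified userspace model, not the datapath code: `sketch_flow_key` and `sketch_set_nw_proto()` are hypothetical names, and the function assumes the caller has already located the Fragment header and converted its offset/flags field to host byte order.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_NEXTHDR_FRAGMENT 44 /* IPv6 Fragment header protocol number */

/* Hypothetical, heavily reduced flow key. */
struct sketch_flow_key {
	uint8_t nw_proto;
	bool later_frag;
};

/*
 * frag_off_host: 16-bit offset/flags field of the Fragment header in host
 * byte order; the upper 13 bits hold the fragment offset.
 * last_nexthdr: Next Header value of the last extension header.
 */
static void sketch_set_nw_proto(struct sketch_flow_key *key,
				uint16_t frag_off_host, uint8_t last_nexthdr)
{
	assert(key != NULL);

	/* A non-zero offset marks a later fragment. */
	key->later_frag = (frag_off_host & 0xfff8) != 0;

	/* Later fragments keep nw_proto=44, matching what userspace expects. */
	key->nw_proto = key->later_frag ? SKETCH_NEXTHDR_FRAGMENT : last_nexthdr;
}
```

With this rule, a later fragment of a UDP datagram is keyed with nw_proto=44 rather than 17, so the kernel flow key and the userspace parse agree.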
  25. Jun 10, 2022
  26. Jun 09, 2022
  27. Apr 15, 2022
    • openvswitch: fix OOB access in reserve_sfa_size() · cefa91b2
      Paolo Valerio authored
      
      Given a sufficiently large number of actions, while copying and
      reserving memory for a new action of a new flow, if next_offset is
      greater than MAX_ACTIONS_BUFSIZE, the function reserve_sfa_size() does
      not return -EMSGSIZE as expected, but it allocates MAX_ACTIONS_BUFSIZE
      bytes increasing actions_len by req_size. This can then lead to an OOB
      write access, especially when further actions need to be copied.
      
      Fix it by rearranging the flow action size check.
      
      KASAN splat below:
      
      ==================================================================
      BUG: KASAN: slab-out-of-bounds in reserve_sfa_size+0x1ba/0x380 [openvswitch]
      Write of size 65360 at addr ffff888147e4001c by task handler15/836
      
      CPU: 1 PID: 836 Comm: handler15 Not tainted 5.18.0-rc1+ #27
      ...
      Call Trace:
       <TASK>
       dump_stack_lvl+0x45/0x5a
       print_report.cold+0x5e/0x5db
       ? __lock_text_start+0x8/0x8
       ? reserve_sfa_size+0x1ba/0x380 [openvswitch]
       kasan_report+0xb5/0x130
       ? reserve_sfa_size+0x1ba/0x380 [openvswitch]
       kasan_check_range+0xf5/0x1d0
       memcpy+0x39/0x60
       reserve_sfa_size+0x1ba/0x380 [openvswitch]
       __add_action+0x24/0x120 [openvswitch]
       ovs_nla_add_action+0xe/0x20 [openvswitch]
       ovs_ct_copy_action+0x29d/0x1130 [openvswitch]
       ? __kernel_text_address+0xe/0x30
       ? unwind_get_return_address+0x56/0xa0
       ? create_prof_cpu_mask+0x20/0x20
       ? ovs_ct_verify+0xf0/0xf0 [openvswitch]
       ? prep_compound_page+0x198/0x2a0
       ? __kasan_check_byte+0x10/0x40
       ? kasan_unpoison+0x40/0x70
       ? ksize+0x44/0x60
       ? reserve_sfa_size+0x75/0x380 [openvswitch]
       __ovs_nla_copy_actions+0xc26/0x2070 [openvswitch]
       ? __zone_watermark_ok+0x420/0x420
       ? validate_set.constprop.0+0xc90/0xc90 [openvswitch]
       ? __alloc_pages+0x1a9/0x3e0
       ? __alloc_pages_slowpath.constprop.0+0x1da0/0x1da0
       ? unwind_next_frame+0x991/0x1e40
       ? __mod_node_page_state+0x99/0x120
       ? __mod_lruvec_page_state+0x2e3/0x470
       ? __kasan_kmalloc_large+0x90/0xe0
       ovs_nla_copy_actions+0x1b4/0x2c0 [openvswitch]
       ovs_flow_cmd_new+0x3cd/0xb10 [openvswitch]
       ...
      
      Cc: stable@vger.kernel.org
      Fixes: f28cd2af ("openvswitch: fix flow actions reallocation")
      Signed-off-by: Paolo Valerio <pvalerio@redhat.com>
      Acked-by: Eelco Chaudron <echaudro@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      cefa91b2
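One way to picture the rearranged size check is the sketch below. It is an assumption-laden userspace model rather than the actual patch: `sketch_grow_size()` and `SKETCH_MAX_BUFSIZE` are hypothetical, and the cap is an illustrative value, not the kernel's MAX_ACTIONS_BUFSIZE. The point it tries to show is that the request is rejected before the new size is clamped to the cap.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define SKETCH_MAX_BUFSIZE 1024u /* illustrative cap, not the kernel's value */

/*
 * Decide how large the action buffer should become so that `req_size`
 * more bytes fit after the `next_offset` bytes already in use.
 * On success stores the new size in *new_size and returns 0.
 */
static int sketch_grow_size(size_t cur_size, size_t next_offset,
			    size_t req_size, size_t *new_size)
{
	size_t needed, want;

	assert(new_size != NULL);

	/* Reject first, so the clamp below can never hide an overflow. */
	if (next_offset > SKETCH_MAX_BUFSIZE ||
	    req_size > SKETCH_MAX_BUFSIZE - next_offset)
		return -EMSGSIZE;

	/* Grow geometrically, but never beyond the cap. */
	needed = next_offset + req_size;
	want = cur_size * 2 > needed ? cur_size * 2 : needed;
	if (want > SKETCH_MAX_BUFSIZE)
		want = SKETCH_MAX_BUFSIZE;

	*new_size = want;
	return 0;
}
```

If the clamp ran before the rejection, a request that does not fit would still receive a capped buffer, which is the kind of mismatch the KASAN report above points at.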
  28. Apr 06, 2022
    • net: openvswitch: fix leak of nested actions · 1f30fb91
      Ilya Maximets authored
      While parsing user-provided actions, the openvswitch module may dynamically
      allocate memory and store pointers in the internal copy of the actions.
      This memory has to be freed when the actions are destroyed.
      
      Currently there are only two such actions: ct() and set().  However,
      there are many actions that can hold nested lists of actions and
      ovs_nla_free_flow_actions() just jumps over them leaking the memory.
      
      For example, removal of the flow with the following actions will lead
      to a leak of the memory allocated by nf_ct_tmpl_alloc():
      
        actions:clone(ct(commit),0)
      
      Non-freed set() action may also leak the 'dst' structure for the
      tunnel info including device references.
      
      Under certain conditions with a high rate of flow rotation, this may
      cause a significant memory leak (2MB per second in the reporter's
      case).  The problem is also hard to mitigate, because the user doesn't
      have direct control over the datapath flows generated by OVS.
      
      Fix that by iterating over all the nested actions and freeing
      everything that needs to be freed recursively.
      
      A new build-time assertion should protect us from this problem if new
      actions are added in the future.
      
      Unfortunately, the openvswitch module doesn't use NLA_F_NESTED, so all
      attributes have to be explicitly checked.  sample() and clone() actions
      are mixing extra attributes into the user-provided action list.  That
      prevents some code generalization too.
      
      Fixes: 34ae932a ("openvswitch: Make tunnel set action attach a metadata dst")
      Link: https://mail.openvswitch.org/pipermail/ovs-dev/2022-March/392922.html
      Reported-by: Stéphane Graber <stgraber@ubuntu.com>
      Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
      Acked-by: Aaron Conole <aconole@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      1f30fb91
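The recursive cleanup described in this commit can be modeled with a small tree of actions. The sketch below is a userspace illustration under assumptions: the real module walks netlink attributes, whereas here `sketch_action`, `sketch_new_action()`, `sketch_add_nested()` and `sketch_free_action()` are hypothetical stand-ins, and `owned` plays the role of memory such as a conntrack template.

```c
#include <assert.h>
#include <stdlib.h>

enum sketch_type { SKETCH_OUTPUT, SKETCH_CT, SKETCH_CLONE };

struct sketch_action {
	enum sketch_type type;
	void *owned;                   /* e.g. a template held by ct() */
	struct sketch_action **nested; /* child actions, e.g. inside clone() */
	size_t n_nested;
};

/* Allocates an action; owned_bytes > 0 attaches an owned resource. */
static struct sketch_action *sketch_new_action(enum sketch_type type,
						size_t owned_bytes)
{
	struct sketch_action *a = calloc(1, sizeof(*a));

	if (a == NULL)
		return NULL;
	a->type = type;
	if (owned_bytes > 0) {
		a->owned = malloc(owned_bytes);
		if (a->owned == NULL) {
			free(a);
			return NULL;
		}
	}
	return a;
}

/* Appends child to parent's nested list. Returns 0 on success, -1 on OOM. */
static int sketch_add_nested(struct sketch_action *parent,
			     struct sketch_action *child)
{
	struct sketch_action **grown;

	assert(parent != NULL && child != NULL);
	grown = realloc(parent->nested,
			(parent->n_nested + 1) * sizeof(*grown));
	if (grown == NULL)
		return -1;
	grown[parent->n_nested++] = child;
	parent->nested = grown;
	return 0;
}

/* Frees an action, its owned resource, and all nested actions.
 * Returns how many owned resources were released. */
static size_t sketch_free_action(struct sketch_action *a)
{
	size_t released = 0;

	if (a == NULL)
		return 0;

	/* Descend first: skipping nested lists is what leaks memory. */
	for (size_t i = 0; i < a->n_nested; i++)
		released += sketch_free_action(a->nested[i]);
	free(a->nested);

	if (a->owned != NULL) {
		free(a->owned);
		released++;
	}
	free(a);
	return released;
}
```

Building the equivalent of clone(ct(commit),0) and freeing the top-level clone releases the resource owned by the nested ct action, which is the case the commit message singles out.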
    • net: openvswitch: don't send internal clone attribute to the userspace. · 3f2a3050
      Ilya Maximets authored
      
      'OVS_CLONE_ATTR_EXEC' is an internal attribute that is used for
      performance optimization inside the kernel.  It's added by the kernel
      while parsing user-provided actions and should not be sent during the
      flow dump as it's not part of the uAPI.
      
      The issue doesn't cause any significant problems for the ovs-vswitchd
      process, because reported actions are not really used in the
      application lifecycle and are only supposed to be shown to a human via
      ovs-dpctl flow dump.  However, the action list is still incorrect
      and causes the following error if the user wants to look at the
      datapath flows:
      
        # ovs-dpctl add-dp system@ovs-system
        # ovs-dpctl add-flow "<flow match>" "clone(ct(commit),0)"
        # ovs-dpctl dump-flows
        <flow match>, packets:0, bytes:0, used:never,
          actions:clone(bad length 4, expected -1 for: action0(01 00 00 00),
                        ct(commit),0)
      
      With the fix:
      
        # ovs-dpctl dump-flows
        <flow match>, packets:0, bytes:0, used:never,
          actions:clone(ct(commit),0)
      
      Additionally fixed an incorrect attribute name in the comment.
      
      Fixes: b2335040 ("openvswitch: kernel datapath clone action")
      Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
      Acked-by: Aaron Conole <aconole@redhat.com>
      Link: https://lore.kernel.org/r/20220404104150.2865736-1-i.maximets@ovn.org
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      3f2a3050
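The idea of hiding an internal attribute during a dump can be sketched as a simple filter. This is a hypothetical userspace model, not the datapath's netlink code: the enum values and `sketch_dump_clone()` are invented for illustration, with `SKETCH_ATTR_INTERNAL_EXEC` standing in for the kernel-only attribute.

```c
#include <assert.h>
#include <stddef.h>

enum sketch_clone_attr {
	SKETCH_ATTR_INTERNAL_EXEC, /* kernel-only hint, never shown to users */
	SKETCH_ATTR_ACTION_CT,
	SKETCH_ATTR_ACTION_OUTPUT,
};

/* Copies attributes from `in` to `out`, skipping the internal one.
 * `out` must have room for n_in entries.
 * Returns the number of attributes written to `out`. */
static size_t sketch_dump_clone(const enum sketch_clone_attr *in, size_t n_in,
				enum sketch_clone_attr *out)
{
	size_t n_out = 0;

	assert(in != NULL && out != NULL);

	for (size_t i = 0; i < n_in; i++) {
		/* Internal bookkeeping is not part of the user-visible list. */
		if (in[i] == SKETCH_ATTR_INTERNAL_EXEC)
			continue;
		out[n_out++] = in[i];
	}
	return n_out;
}
```

In this model the stored list still carries the internal hint, while the dumped list contains only what the user originally supplied.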
  29. Mar 31, 2022
  30. Mar 29, 2022
  31. Mar 22, 2022
    • openvswitch: always update flow key after nat · 60b44ca6
      Aaron Conole authored
      
      During NAT, a tuple collision may occur.  When this happens, openvswitch
      will make a second pass through NAT which will perform additional packet
      modification.  This will update the skb data, but not the flow key that
      OVS uses.  This means that future flow lookups and packet matches will
      have incorrect data.  This has been supported since
      5d50aa83 ("openvswitch: support asymmetric conntrack").
      
      That commit failed to properly update the sw_flow_key attributes, since
      it called ovs_ct_nat_update_key() only once, rather than each time
      ovs_ct_nat_execute() was called.  As these two operations are linked, the
      ovs_ct_nat_execute() function should always make sure that the
      sw_flow_key is updated after a successful call through NAT infrastructure.
      
      Fixes: 5d50aa83 ("openvswitch: support asymmetric conntrack")
      Cc: Dumitru Ceara <dceara@redhat.com>
      Cc: Numan Siddique <nusiddiq@redhat.com>
      Signed-off-by: Aaron Conole <aconole@redhat.com>
      Acked-by: Eelco Chaudron <echaudro@redhat.com>
      Link: https://lore.kernel.org/r/20220318124319.3056455-1-aconole@redhat.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      60b44ca6
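The coupling between the NAT pass and the flow key refresh can be shown with a reduced model. Everything below is a hypothetical userspace sketch: `sketch_packet`, `sketch_flow_key`, `sketch_nat_update_key()` and `sketch_nat_execute()` only borrow the shape of the functions named in the commit message, and a single source port stands in for all rewritten fields.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct sketch_packet { uint16_t src_port; };
struct sketch_flow_key { uint16_t src_port; };

/* Copy the fields NAT may have rewritten from the packet into the key. */
static void sketch_nat_update_key(struct sketch_flow_key *key,
				  const struct sketch_packet *pkt)
{
	key->src_port = pkt->src_port;
}

/* One NAT pass: rewrite the packet and, on success, refresh the key.
 * Returns 0 on success, -1 if the (hypothetical) mapping is invalid. */
static int sketch_nat_execute(struct sketch_packet *pkt,
			      struct sketch_flow_key *key, uint16_t new_port)
{
	assert(pkt != NULL && key != NULL);

	if (new_port == 0)
		return -1;

	pkt->src_port = new_port;
	/* Updating the key here keeps it in sync on every pass. */
	sketch_nat_update_key(key, pkt);
	return 0;
}
```

Because the key refresh lives inside the execute step, a second pass triggered by a tuple collision cannot leave the key describing the packet as it looked after the first pass.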
  32. Mar 11, 2022