  1. Oct 18, 2023
    • io_uring: fix crash with IORING_SETUP_NO_MMAP and invalid SQ ring address · 8b51a395
      Jens Axboe authored
      
      If we specify a valid CQ ring address but an invalid SQ ring address,
      we'll correctly spot this, free the allocated pages, and clear the
      page array pointer to NULL. However, we don't clear the ring page
      count, and hence a later teardown will attempt to free the pages
      again. We've already cleared the page array pointer when freeing,
      but we don't check for that. This causes the following crash:
      
      Unable to handle kernel NULL pointer dereference at virtual address 0000000000000000
      Oops [#1]
      Modules linked in:
      CPU: 0 PID: 20 Comm: kworker/u2:1 Not tainted 6.6.0-rc5-dirty #56
      Hardware name: ucbbar,riscvemu-bare (DT)
      Workqueue: events_unbound io_ring_exit_work
      epc : io_pages_free+0x2a/0x58
       ra : io_rings_free+0x3a/0x50
       epc : ffffffff808811a2 ra : ffffffff80881406 sp : ffff8f80000c3cd0
       status: 0000000200000121 badaddr: 0000000000000000 cause: 000000000000000d
       [<ffffffff808811a2>] io_pages_free+0x2a/0x58
       [<ffffffff80881406>] io_rings_free+0x3a/0x50
       [<ffffffff80882176>] io_ring_exit_work+0x37e/0x424
       [<ffffffff80027234>] process_one_work+0x10c/0x1f4
       [<ffffffff8002756e>] worker_thread+0x252/0x31c
       [<ffffffff8002f5e4>] kthread+0xc4/0xe0
       [<ffffffff8000332a>] ret_from_fork+0xa/0x1c
      
      Check for a NULL array in io_pages_free(), but also clear the page
      counts when we free the pages, to be on the safe side.
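
      The fixed pattern can be sketched in plain user-space C. All names here (ring_pages, ring_pages_free) are illustrative stand-ins, not the kernel's io_pages_free():

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Illustrative sketch of the fix: check for an already-freed page array
 * and zero the page count, so that a second call becomes a no-op rather
 * than a NULL dereference. Not the kernel's actual code. */
struct ring_pages {
    void **pages;
    unsigned long npages;
};

static void ring_pages_free(struct ring_pages *rp)
{
    unsigned long i;

    if (!rp->pages)         /* array already freed and cleared: bail */
        return;
    for (i = 0; i < rp->npages; i++)
        free(rp->pages[i]);
    free(rp->pages);
    rp->pages = NULL;
    rp->npages = 0;         /* clear the count too, per the fix */
}
```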
      
      Reported-by: <rtm@csail.mit.edu>
      Fixes: 03d89a2d ("io_uring: support for user allocated memory for rings/sqes")
      Cc: stable@vger.kernel.org
      Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      8b51a395
  2. Oct 03, 2023
    • io_uring: don't allow IORING_SETUP_NO_MMAP rings on highmem pages · 223ef474
      Jens Axboe authored
      
      On at least arm32, but presumably on any arch with highmem, if the
      application passes in ring memory that resides in highmem, then we
      should fail that ring creation. We fail it with -EINVAL, which is
      what kernels that don't support IORING_SETUP_NO_MMAP return as well.
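
      Because the error code matches what pre-NO_MMAP kernels return, an application can treat -EINVAL uniformly and fall back to a kernel-mapped ring. A hypothetical sketch, with setup_ring() as a stubbed stand-in for a real io_uring_setup(2) wrapper:

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Stub stand-in for an io_uring_setup(2) wrapper: pretend the kernel
 * rejects IORING_SETUP_NO_MMAP (unsupported kernel, or user memory it
 * cannot accept, e.g. highmem pages on arm32). */
static int setup_ring(bool no_mmap)
{
    if (no_mmap)
        return -EINVAL;   /* stub: NO_MMAP refused */
    return 0;             /* stub: kernel-mapped setup succeeds */
}

/* Try a NO_MMAP ring first; on -EINVAL fall back to a kernel-mapped one. */
static int setup_ring_with_fallback(bool *used_no_mmap)
{
    int ret = setup_ring(true);

    if (ret == -EINVAL) {
        *used_no_mmap = false;
        return setup_ring(false);
    }
    *used_no_mmap = true;
    return ret;
}
```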
      
      Cc: stable@vger.kernel.org
      Fixes: 03d89a2d ("io_uring: support for user allocated memory for rings/sqes")
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      223ef474
  3. Sep 07, 2023
    • Revert "io_uring: fix IO hang in io_wq_put_and_exit from do_exit()" · 023464fe
      Jens Axboe authored
      
      This reverts commit b484a40d.
      
      This commit cancels all requests with io-wq, not just the ones from
      the originating task. That breaks use cases that have thread pools,
      or that simply have multiple tasks issuing requests on the same
      ring. The liburing regression test for this shows the problem:
      
      $ test/thread-exit.t
      cqe->res=-125, Expected 512
      
      where an IO thread gets its request canceled rather than complete
      successfully.
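
      The semantic the revert restores can be sketched as task-scoped cancellation: only requests owned by the exiting task are matched. The request list and integer task ids below are illustrative, not io-wq's data structures:

```c
#include <assert.h>
#include <stddef.h>

/* Each request records the task that issued it; cancellation on task
 * exit must match only that task's requests, leaving other tasks'
 * requests on the shared ring to complete normally. */
struct req {
    int owner;        /* id of the task that issued the request */
    int cancelled;
    struct req *next;
};

static int cancel_task_requests(struct req *head, int task)
{
    int n = 0;

    for (; head; head = head->next) {
        if (head->owner == task) {   /* match originating task only */
            head->cancelled = 1;
            n++;
        }
    }
    return n;
}
```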
      
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      023464fe
    • io_uring: fix unprotected iopoll overflow · 27122c07
      Pavel Begunkov authored
      
      [   71.490669] WARNING: CPU: 3 PID: 17070 at io_uring/io_uring.c:769
      io_cqring_event_overflow+0x47b/0x6b0
      [   71.498381] Call Trace:
      [   71.498590]  <TASK>
      [   71.501858]  io_req_cqe_overflow+0x105/0x1e0
      [   71.502194]  __io_submit_flush_completions+0x9f9/0x1090
      [   71.503537]  io_submit_sqes+0xebd/0x1f00
      [   71.503879]  __do_sys_io_uring_enter+0x8c5/0x2380
      [   71.507360]  do_syscall_64+0x39/0x80
      
      We decoupled CQ locking from ->task_complete but haven't fixed up
      the places that need to force locking for CQ overflows.
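
      The locking rule can be sketched as conditional acquisition: once CQ locking is no longer tied to single-task completion, an overflow post must take the completion lock whenever completions are not confined to one task. Field and function names below are illustrative:

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Simplified model: when completions are not driven by a single task
 * (task_complete == false), posting an overflow entry needs the
 * completion lock; otherwise the single task serialises access. */
struct cq_ctx {
    pthread_mutex_t completion_lock;
    bool task_complete;   /* completions confined to one task? */
    int overflowed;
};

static void post_overflow(struct cq_ctx *ctx)
{
    bool need_lock = !ctx->task_complete;

    if (need_lock)
        pthread_mutex_lock(&ctx->completion_lock);
    ctx->overflowed++;    /* stands in for appending an overflow CQE */
    if (need_lock)
        pthread_mutex_unlock(&ctx->completion_lock);
}
```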
      
      Fixes: ec26c225 ("io_uring: merge iopoll and normal completion paths")
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      27122c07
    • io_uring: break out of iowq iopoll on teardown · 45500dc4
      Pavel Begunkov authored
      
      io-wq will retry iopoll even when it failed with -EAGAIN. If that
      races with task exit, which sets TIF_NOTIFY_SIGNAL for all its
      workers, such workers can spin indefinitely, retrying iopoll again
      and again and failing each time on some allocation, wait, etc.
      Don't keep spinning if io-wq is dying.
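
      The shape of the fix is an escape hatch in the retry loop. A minimal user-space sketch, with a stub issue function that always fails with -EAGAIN (names are illustrative, not io-wq internals):

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

struct io_wq { bool exiting; };

static int issue_calls;

/* Stub: the request keeps failing, as in the described race. */
static int iopoll_issue(void)
{
    issue_calls++;
    return -EAGAIN;
}

static int iopoll_retry(struct io_wq *wq)
{
    int ret;

    do {
        ret = iopoll_issue();
        if (ret != -EAGAIN)
            break;
        if (wq->exiting)     /* don't keep spinning on a dying io-wq */
            break;
    } while (1);
    return ret;
}
```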
      
      Fixes: 561fb04a ("io_uring: replace workqueue usage with io-wq")
      Cc: stable@vger.kernel.org
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      45500dc4
  11. Aug 09, 2023
    • io_uring: cleanup 'ret' handling in io_iopoll_check() · 9e4bef2b
      Jens Axboe authored
      
      We return 0 for success, or -error when there's an error. Move the 'ret'
      variable into the loop where we are actually using it, to make it
      clearer that we don't carry this variable forward for return outside of
      the loop.
      
      While at it, also move the need_resched() break condition out of the
      while check itself, keeping it with the signal pending check.
      
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      9e4bef2b
    • io_uring: break iopolling on signal · dc314886
      Pavel Begunkov authored
      
      Don't keep spinning iopoll with a signal set. It'll eventually
      return, e.g. by virtue of need_resched(), but it's not a nice user
      experience.
      
      Cc: stable@vger.kernel.org
      Fixes: def596e9 ("io_uring: support for IO polling")
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Link: https://lore.kernel.org/r/eeba551e82cad12af30c3220125eb6cb244cc94c.1691594339.git.asml.silence@gmail.com
      
      
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      dc314886
    • io_uring: fix false positive KASAN warnings · 569f5308
      Pavel Begunkov authored
      
      io_req_local_work_add() peeks into the work list, which can be
      executed in the meantime. That's completely fine without KASAN, as
      we're in an RCU read section and the cache is SLAB_TYPESAFE_BY_RCU.
      With KASAN, though, it may trigger a false positive warning because
      internal io_uring caches are sanitised.
      
      Remove sanitisation from the io_uring request cache for now.
      
      Cc: stable@vger.kernel.org
      Fixes: 8751d154 ("io_uring: reduce scheduling due to tw")
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Link: https://lore.kernel.org/r/c6fbf7a82a341e66a0007c76eefd9d57f2d3ba51.1691541473.git.asml.silence@gmail.com
      
      
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      569f5308
    • io_uring: fix drain stalls by invalid SQE · cfdbaa3a
      Pavel Begunkov authored
      
      cq_extra is protected by ->completion_lock, which io_get_sqe()
      misses. The bug is harmless, as it doesn't happen in real life: it
      requires an invalid SQ index array and racing with submission, and
      it only messes up userspace, i.e. it stalls request execution,
      which is cleaned up on ring destruction anyway.
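
      The invariant being restored can be sketched generically: a counter like cq_extra must be updated under the same lock its readers hold, or racing paths can lose updates and skew drain accounting. Names below are illustrative, not io_uring internals:

```c
#include <assert.h>
#include <pthread.h>

/* Simplified model: cq_extra-style accounting updated only under the
 * completion lock, matching the locking its readers already use. */
struct drain_ctx {
    pthread_mutex_t completion_lock;
    long cq_extra;
};

static void account_dropped_sqe(struct drain_ctx *ctx)
{
    pthread_mutex_lock(&ctx->completion_lock);
    ctx->cq_extra--;      /* one fewer CQE will ever be posted */
    pthread_mutex_unlock(&ctx->completion_lock);
}
```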
      
      Fixes: 15641e42 ("io_uring: don't cache number of dropped SQEs")
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Link: https://lore.kernel.org/r/66096d54651b1a60534bb2023f2947f09f50ef73.1691538547.git.asml.silence@gmail.com
      
      
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      cfdbaa3a
    • io_uring: annotate the struct io_kiocb slab for appropriate user copy · b97f96e2
      Jens Axboe authored
      
      When compiling the kernel with clang and having HARDENED_USERCOPY
      enabled, the liburing openat2.t test case fails during request setup:
      
      usercopy: Kernel memory overwrite attempt detected to SLUB object 'io_kiocb' (offset 24, size 24)!
      ------------[ cut here ]------------
      kernel BUG at mm/usercopy.c:102!
      invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
      CPU: 3 PID: 413 Comm: openat2.t Tainted: G                 N 6.4.3-g6995e2de6891-dirty #19
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.1-0-g3208b098f51a-prebuilt.qemu.org 04/01/2014
      RIP: 0010:usercopy_abort+0x84/0x90
      Code: ce 49 89 ce 48 c7 c3 68 48 98 82 48 0f 44 de 48 c7 c7 56 c6 94 82 4c 89 de 48 89 c1 41 52 41 56 53 e8 e0 51 c5 00 48 83 c4 18 <0f> 0b 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 41 57 41 56
      RSP: 0018:ffffc900016b3da0 EFLAGS: 00010296
      RAX: 0000000000000062 RBX: ffffffff82984868 RCX: 4e9b661ac6275b00
      RDX: ffff8881b90ec580 RSI: ffffffff82949a64 RDI: 00000000ffffffff
      RBP: 0000000000000018 R08: 0000000000000000 R09: 0000000000000000
      R10: ffffc900016b3c88 R11: ffffc900016b3c30 R12: 00007ffe549659e0
      R13: ffff888119014000 R14: 0000000000000018 R15: 0000000000000018
      FS:  00007f862e3ca680(0000) GS:ffff8881b90c0000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 00005571483542a8 CR3: 0000000118c11000 CR4: 00000000003506e0
      Call Trace:
       <TASK>
       ? __die_body+0x63/0xb0
       ? die+0x9d/0xc0
       ? do_trap+0xa7/0x180
       ? usercopy_abort+0x84/0x90
       ? do_error_trap+0xc6/0x110
       ? usercopy_abort+0x84/0x90
       ? handle_invalid_op+0x2c/0x40
       ? usercopy_abort+0x84/0x90
       ? exc_invalid_op+0x2f/0x40
       ? asm_exc_invalid_op+0x16/0x20
       ? usercopy_abort+0x84/0x90
       __check_heap_object+0xe2/0x110
       __check_object_size+0x142/0x3d0
       io_openat2_prep+0x68/0x140
       io_submit_sqes+0x28a/0x680
       __se_sys_io_uring_enter+0x120/0x580
       do_syscall_64+0x3d/0x80
       entry_SYSCALL_64_after_hwframe+0x46/0xb0
      RIP: 0033:0x55714834de26
      Code: ca 01 0f b6 82 d0 00 00 00 8b ba cc 00 00 00 45 31 c0 31 d2 41 b9 08 00 00 00 83 e0 01 c1 e0 04 41 09 c2 b8 aa 01 00 00 0f 05 <c3> 66 0f 1f 84 00 00 00 00 00 89 30 eb 89 0f 1f 40 00 8b 00 a8 06
      RSP: 002b:00007ffe549659c8 EFLAGS: 00000246 ORIG_RAX: 00000000000001aa
      RAX: ffffffffffffffda RBX: 00007ffe54965a50 RCX: 000055714834de26
      RDX: 0000000000000000 RSI: 0000000000000001 RDI: 0000000000000003
      RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000008
      R10: 0000000000000000 R11: 0000000000000246 R12: 000055714834f057
      R13: 00007ffe54965a50 R14: 0000000000000001 R15: 0000557148351dd8
       </TASK>
      Modules linked in:
      ---[ end trace 0000000000000000 ]---
      
      when it tries to copy struct open_how from userspace into the per-command
      space in the io_kiocb. There's nothing wrong with the copy, but we're
      missing the appropriate annotations for allowing user copies to/from the
      io_kiocb slab.
      
      Allow copies in the per-command area, which spans from the 'file'
      pointer to where 'opcode' starts. We do have existing user copies
      there, but they are not all annotated like the one that
      io_openat2_prep() uses, copy_struct_from_user(). In practice,
      opcodes should be allowed to copy data into their per-command area
      in the io_kiocb.
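
      The hardened-usercopy rule at play can be modelled in user space: a copy into a slab object is permitted only when it falls entirely inside the cache's declared user-copy window (useroffset, usersize). This checker is a deliberate simplification of the kernel's logic, not the kernel code:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified model of the hardened-usercopy whitelist check: allow a
 * copy of `len` bytes at `offset` into a slab object only if it lies
 * within the cache's declared window [useroffset, useroffset+usersize).
 * Written to avoid integer overflow in the comparisons. */
static bool usercopy_allowed(size_t useroffset, size_t usersize,
                             size_t offset, size_t len)
{
    return offset >= useroffset &&
           len <= usersize &&
           offset - useroffset <= usersize - len;
}
```

The splat above reports "offset 24, size 24": with no window annotated, any copy is rejected; annotating the per-command area makes exactly that window legal.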
      
      Reported-by: Breno Leitao <leitao@debian.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      b97f96e2
  14. Jul 21, 2023
    • io_uring: Fix io_uring mmap() by using architecture-provided get_unmapped_area() · 32832a40
      Helge Deller authored
      
      The io_uring testcase is broken on IA-64 since commit d808459b
      ("io_uring: Adjust mapping wrt architecture aliasing requirements").
      
      The reason is that this commit introduced its own
      architecture-independent get_unmapped_area() search algorithm,
      which on IA-64 finds a memory region outside of the regular region
      used for shared userspace mappings, one that can't be used on that
      platform due to aliasing.
      
      To avoid similar problems on IA-64 and other platforms in the
      future, it's better to switch back to the architecture-provided
      get_unmapped_area() function and adjust the needed input parameters
      before the call. Besides fixing the issue, the function also
      becomes easier to understand and maintain.
      
      This patch has been successfully tested with the io_uring testcase
      on physical x86-64, ppc64le, IA-64 and PA-RISC machines. On PA-RISC
      the LTP mmap testcases did not report any regressions.
      
      Cc: stable@vger.kernel.org # 6.4
      Signed-off-by: Helge Deller <deller@gmx.de>
      Reported-by: matoro <matoro_mailinglist_kernel@matoro.tk>
      Fixes: d808459b ("io_uring: Adjust mapping wrt architecture aliasing requirements")
      Link: https://lore.kernel.org/r/20230721152432.196382-2-deller@gmx.de
      
      
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      32832a40
  17. Jul 07, 2023
    • io_uring: Use io_schedule* in cqring wait · 8a796565
      Andres Freund authored
      
      I observed poor performance of io_uring compared to synchronous IO. That
      turns out to be caused by deeper CPU idle states entered with io_uring,
      due to io_uring using plain schedule(), whereas synchronous IO uses
      io_schedule().
      
      The losses due to this are substantial. On my cascade lake workstation,
      t/io_uring from the fio repository e.g. yields regressions between 20%
      and 40% with the following command:
      ./t/io_uring -r 5 -X0 -d 1 -s 1 -c 1 -p 0 -S$use_sync -R 0 /mnt/t2/fio/write.0.0
      
      This is repeatable with different filesystems, using raw block devices
      and using different block devices.
      
      Use io_schedule_prepare() / io_schedule_finish() in
      io_cqring_wait_schedule() to address the difference.
      
      After that change, io_uring is on par with or surpasses synchronous
      IO (using registered files etc. makes it reliably win, but that is
      arguably a less fair comparison).
      
      There are other calls to schedule() in io_uring/, but none immediately
      jump out to be similarly situated, so I did not touch them. Similarly,
      it's possible that mutex_lock_io() should be used, but it's not clear if
      there are cases where that matters.
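
      The io_schedule_prepare()/io_schedule_finish() pattern can be simulated in user space: mark the task as blocked on IO before sleeping (so the wait is accounted as iowait and cpuidle can act accordingly), then restore the previous state. The fake task struct below is a stand-in, not kernel API:

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-in for the kernel's per-task state. */
struct fake_task { bool in_iowait; };

static struct fake_task current_task;

/* Mark the task as waiting on IO; return the previous state as a token. */
static int sim_io_schedule_prepare(void)
{
    int old = current_task.in_iowait;

    current_task.in_iowait = true;
    return old;
}

/* Restore the state saved by prepare(). */
static void sim_io_schedule_finish(int token)
{
    current_task.in_iowait = token;
}

/* Shape of the fixed wait path: bracket the schedule with prepare/finish. */
static void cqring_wait_schedule(void)
{
    int token = sim_io_schedule_prepare();
    /* schedule() / the actual sleep would happen here */
    sim_io_schedule_finish(token);
}
```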
      
      Cc: stable@vger.kernel.org # 5.10+
      Cc: Pavel Begunkov <asml.silence@gmail.com>
      Cc: io-uring@vger.kernel.org
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Andres Freund <andres@anarazel.de>
      Link: https://lore.kernel.org/r/20230707162007.194068-1-andres@anarazel.de
      
      
      [axboe: minor style fixup]
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      8a796565
  18. Jun 28, 2023
    • io_uring: flush offloaded and delayed task_work on exit · dfbe5561
      Jens Axboe authored
      
      io_uring offloads task_work for cancelation purposes when the task is
      exiting. This is conceptually fine, but we should be nicer and actually
      wait for that work to complete before returning.
      
      Add an argument to io_fallback_tw() telling it to flush the deferred
      work when it's all queued up, and have it flush a ctx behind whenever
      the ctx changes.
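
      The flush idea can be sketched in simplified form: on exit, every deferred item is run to completion before returning instead of being left queued. This sketch deliberately ignores the per-ctx batching described above; the structures are illustrative:

```c
#include <assert.h>
#include <stddef.h>

/* One deferred task_work item; ctx_id tags which ring it belongs to. */
struct tw_item {
    int ctx_id;
    int done;
    struct tw_item *next;
};

/* Run every queued deferred item synchronously and report how many
 * were flushed, so nothing is left pending when the task exits. */
static int fallback_tw_flush(struct tw_item *head)
{
    int flushed = 0;

    for (; head; head = head->next) {
        head->done = 1;      /* run the deferred work now */
        flushed++;
    }
    return flushed;
}
```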
      
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      dfbe5561