  1. Nov 06, 2019
    • configfs: calculate the depth of parent item · e2f238f7
      Honggang Li authored
      
      When creating a symbolic link, create_link should calculate the
      depth of the parent item. However, both the first and second
      parameters of configfs_get_target_path had been set to the target,
      so broken symbolic links were created.
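
      As a rough illustration of why the depth must be counted on the
      link's parent, here is a toy userspace sketch (invented names, not
      the configfs code): a relative symlink needs one "../" per directory
      level of the parent of the link, so counting levels of the target
      instead yields a dangling link.

        #include <stdio.h>
        #include <string.h>

        static int depth(const char *path)
        {
                int d = 0;

                while (*path)
                        if (*path++ == '/')
                                d++;
                return d;
        }

        static void build_target_path(char *buf, size_t len,
                                      const char *parent, const char *target)
        {
                int i, d = depth(parent); /* parent of the link, not the target */

                buf[0] = '\0';
                for (i = 0; i < d; i++)
                        strncat(buf, "../", len - strlen(buf) - 1);
                strncat(buf, target + 1, len - strlen(buf) - 1);
        }

        int main(void)
        {
                char buf[256];

                build_target_path(buf, sizeof(buf),
                                  "/targets/acls/lun0",        /* link's parent */
                                  "/backstores/fileio/vdev0"); /* link target */
                printf("%s\n", buf); /* ../../../backstores/fileio/vdev0 */
                return 0;
        }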
      
      $ targetcli ls /
      o- / ............................................................. [...]
        o- backstores .................................................. [...]
        | o- block ...................................... [Storage Objects: 0]
        | o- fileio ..................................... [Storage Objects: 2]
        | | o- vdev0 .......... [/dev/ramdisk1 (16.0MiB) write-thru activated]
        | | | o- alua ....................................... [ALUA Groups: 1]
        | | |   o- default_tg_pt_gp ........... [ALUA state: Active/optimized]
        | | o- vdev1 .......... [/dev/ramdisk2 (16.0MiB) write-thru activated]
        | |   o- alua ....................................... [ALUA Groups: 1]
        | |     o- default_tg_pt_gp ........... [ALUA state: Active/optimized]
        | o- pscsi ...................................... [Storage Objects: 0]
        | o- ramdisk .................................... [Storage Objects: 0]
        o- iscsi ................................................ [Targets: 0]
        o- loopback ............................................. [Targets: 0]
        o- srpt ................................................. [Targets: 2]
        | o- ib.e89a8f91cb3200000000000000000000 ............... [no-gen-acls]
        | | o- acls ................................................ [ACLs: 2]
        | | | o- ib.e89a8f91cb3200000000000000000000 ........ [Mapped LUNs: 2]
        | | | | o- mapped_lun0 ............................. [BROKEN LUN LINK]
        | | | | o- mapped_lun1 ............................. [BROKEN LUN LINK]
        | | | o- ib.e89a8f91cb3300000000000000000000 ........ [Mapped LUNs: 2]
        | | |   o- mapped_lun0 ............................. [BROKEN LUN LINK]
        | | |   o- mapped_lun1 ............................. [BROKEN LUN LINK]
        | | o- luns ................................................ [LUNs: 2]
        | |   o- lun0 ...... [fileio/vdev0 (/dev/ramdisk1) (default_tg_pt_gp)]
        | |   o- lun1 ...... [fileio/vdev1 (/dev/ramdisk2) (default_tg_pt_gp)]
        | o- ib.e89a8f91cb3300000000000000000000 ............... [no-gen-acls]
        |   o- acls ................................................ [ACLs: 0]
        |   o- luns ................................................ [LUNs: 0]
        o- vhost ................................................ [Targets: 0]
      
      Fixes: e9c03af2 ("configfs: calculate the symlink target only once")
      Signed-off-by: Honggang Li <honli@redhat.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
    • ocfs2: protect extent tree in ocfs2_prepare_inode_for_write() · e74540b2
      Shuning Zhang authored
      When the extent tree is modified, it should be protected by the
      inode cluster lock and ip_alloc_sem.
      
      The extent tree is accessed and modified in
      ocfs2_prepare_inode_for_write(), but isn't protected by ip_alloc_sem.
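
      As a minimal userspace analog of the required locking (a sketch with
      pthread_rwlock_t standing in for the ip_alloc_sem rw_semaphore, not
      the ocfs2 code):

        #include <pthread.h>

        static pthread_rwlock_t ip_alloc_sem = PTHREAD_RWLOCK_INITIALIZER;

        /* every walker of the extent tree must hold the lock for read */
        static void walk_extent_tree(void)
        {
                pthread_rwlock_rdlock(&ip_alloc_sem);
                /* ... look up extent records, as ocfs2_fiemap() does ... */
                pthread_rwlock_unlock(&ip_alloc_sem);
        }

        /* every modifier must hold it for write; the bug was a walker
         * (ocfs2_prepare_inode_for_write) taking neither */
        static void change_extent_tree(void)
        {
                pthread_rwlock_wrlock(&ip_alloc_sem);
                /* ... insert or remove extent records ... */
                pthread_rwlock_unlock(&ip_alloc_sem);
        }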
      
      The following is one such case: ocfs2_fiemap() accesses the extent
      tree while it is being modified at the same time.
      
        kernel BUG at fs/ocfs2/extent_map.c:475!
        invalid opcode: 0000 [#1] SMP
        Modules linked in: tun ocfs2 ocfs2_nodemanager configfs ocfs2_stackglue [...]
        CPU: 16 PID: 14047 Comm: o2info Not tainted 4.1.12-124.23.1.el6uek.x86_64 #2
        Hardware name: Oracle Corporation ORACLE SERVER X7-2L/ASM, MB MECH, X7-2L, BIOS 42040600 10/19/2018
        task: ffff88019487e200 ti: ffff88003daa4000 task.ti: ffff88003daa4000
        RIP: ocfs2_get_clusters_nocache.isra.11+0x390/0x550 [ocfs2]
        Call Trace:
          ocfs2_fiemap+0x1e3/0x430 [ocfs2]
          do_vfs_ioctl+0x155/0x510
          SyS_ioctl+0x81/0xa0
          system_call_fastpath+0x18/0xd8
        Code: 18 48 c7 c6 60 7f 65 a0 31 c0 bb e2 ff ff ff 48 8b 4a 40 48 8b 7a 28 48 c7 c2 78 2d 66 a0 e8 38 4f 05 00 e9 28 fe ff ff 0f 1f 00 <0f> 0b 66 0f 1f 44 00 00 bb 86 ff ff ff e9 13 fe ff ff 66 0f 1f
        RIP  ocfs2_get_clusters_nocache.isra.11+0x390/0x550 [ocfs2]
        ---[ end trace c8aa0c8180e869dc ]---
        Kernel panic - not syncing: Fatal exception
        Kernel Offset: disabled
      
      This issue can be reproduced every week in a production environment.
      
      This issue is related to the usage mode.  If others use ocfs2 in this
      mode, the kernel will panic frequently.
      
      [akpm@linux-foundation.org: coding style fixes]
      [Fix new warning due to unused function by removing said function - Linus ]
      Link: http://lkml.kernel.org/r/1568772175-2906-2-git-send-email-sunny.s.zhang@oracle.com
      
      
      Signed-off-by: Shuning Zhang <sunny.s.zhang@oracle.com>
      Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com>
      Reviewed-by: Gang He <ghe@suse.com>
      Cc: Mark Fasheh <mark@fasheh.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Joseph Qi <jiangqi903@gmail.com>
      Cc: Changwei Ge <gechangwei@live.cn>
      Cc: Jun Piao <piaojun@huawei.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  2. Nov 05, 2019
    • ceph: don't allow copy_file_range when stripe_count != 1 · a3a08193
      Luis Henriques authored
      
      copy_file_range tries to use the OSD 'copy-from' operation, which simply
      performs a full object copy.  Unfortunately, the implementation of this
      system call assumes that stripe_count is always set to 1 and doesn't take
      into account that the data may be striped across an object set.  If the
      file layout has stripe_count different from 1, then the destination file
      data will be corrupted.
      
      For example:
      
      Consider an 8 MiB file with 4 MiB object size, stripe_count of 2 and
      stripe_size of 2 MiB; the first half of the file will be filled with 'A's
      and the second half will be filled with 'B's:
      
                     0      4M     8M       Obj1     Obj2
                     +------+------+       +----+   +----+
              file:  | AAAA | BBBB |       | AA |   | AA |
                     +------+------+       |----|   |----|
                                           | BB |   | BB |
                                           +----+   +----+
      
      If we copy_file_range this file into a new file (which needs to have the
      same file layout!), then it will start by copying the object starting at
      file offset 0 (Obj1).  And then it will copy the object starting at file
      offset 4M -- which is Obj1 again.
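
      The mapping can be checked with a few lines of arithmetic. The sketch
      below follows the usual Ceph file-to-object mapping (the stripe_size
      above plays the role of the stripe unit); it is an illustration, not
      the kernel code:

        #include <stdint.h>
        #include <stdio.h>

        int main(void)
        {
                const uint64_t su = 2 << 20;      /* stripe unit: 2 MiB   */
                const uint64_t sc = 2;            /* stripe_count         */
                const uint64_t os = 4 << 20;      /* object size: 4 MiB   */
                const uint64_t per_obj = os / su; /* stripe units per obj */
                uint64_t off;

                for (off = 0; off < (8 << 20); off += 4 << 20) {
                        uint64_t blockno   = off / su;
                        uint64_t stripeno  = blockno / sc;
                        uint64_t stripepos = blockno % sc;
                        uint64_t objsetno  = stripeno / per_obj;
                        uint64_t objno     = objsetno * sc + stripepos;

                        printf("file offset %lluM -> object %llu\n",
                               (unsigned long long)(off >> 20),
                               (unsigned long long)objno);
                }
                /* both 0M and 4M print object 0: an object-granularity
                 * copy duplicates Obj1 and never reaches Obj2 */
                return 0;
        }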
      
      Unfortunately, the solution for this is to not allow remote object
      copies to be performed when the file layout stripe_count is not 1,
      and to simply fall back to the default (VFS) copy_file_range
      implementation.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Luis Henriques <lhenriques@suse.com>
      Reviewed-by: Jeff Layton <jlayton@kernel.org>
      Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
    • ceph: don't try to handle hashed dentries in non-O_CREAT atomic_open · 5bb5e6ee
      Jeff Layton authored
      
      If ceph_atomic_open is handed a !d_in_lookup dentry, then that means
      that it already passed d_revalidate so we *know* that it's negative (or
      at least was very recently). Just return -ENOENT in that case.
      
      This also addresses a subtle bug in dentry handling. Non-O_CREAT opens
      call atomic_open with the parent's i_rwsem shared, but calling
      d_splice_alias on a hashed dentry requires the exclusive lock.
      
      If ceph_atomic_open receives a hashed, negative dentry on a non-O_CREAT
      open, and another client were to race in and create the file before we
      issue our OPEN, ceph_fill_trace could end up calling d_splice_alias on
      the dentry with the new inode with insufficient locks.
      
      Cc: stable@vger.kernel.org
      Reported-by: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Jeff Layton <jlayton@kernel.org>
      Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
  3. Nov 04, 2019
    • btrfs: un-deprecate ioctls START_SYNC and WAIT_SYNC · a5009d3a
      David Sterba authored
      
      The two ioctls START_SYNC and WAIT_SYNC were mistakenly marked as
      deprecated and scheduled for removal, but we actually do use them for
      'btrfs subvolume delete -C/-c'. The deprecated thing in ebc87351
      should have been just the async flag for subvolume creation.
      
      The deprecation was added in this development cycle; remove it
      until it's time.
      
      Fixes: ebc87351 ("btrfs: Deprecate BTRFS_SUBVOL_CREATE_ASYNC flag")
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: save i_size to avoid double evaluation of i_size_read in compress_file_range · d98da499
      Josef Bacik authored
      
      We hit a regression while rolling out 5.2 internally, where we were
      hitting the following panic:
      
        kernel BUG at mm/page-writeback.c:2659!
        RIP: 0010:clear_page_dirty_for_io+0xe6/0x1f0
        Call Trace:
         __process_pages_contig+0x25a/0x350
         ? extent_clear_unlock_delalloc+0x43/0x70
         submit_compressed_extents+0x359/0x4d0
         normal_work_helper+0x15a/0x330
         process_one_work+0x1f5/0x3f0
         worker_thread+0x2d/0x3d0
         ? rescuer_thread+0x340/0x340
         kthread+0x111/0x130
         ? kthread_create_on_node+0x60/0x60
         ret_from_fork+0x1f/0x30
      
      This is happening because the page is not locked when doing
      clear_page_dirty_for_io.  Looking at the core dump it was because our
      async_extent had a ram_size of 24576 but our async_chunk range only
      spanned 20480, so we had a whole extra page in our ram_size for our
      async_extent.
      
      This happened because we try not to compress pages outside of our
      i_size; however, a cleanup patch changed us to do
      
      actual_end = min_t(u64, i_size_read(inode), end + 1);
      
      which is problematic because i_size_read() can evaluate to different
      values in between checking and assigning.  So either an expanding
      truncate or a fallocate could increase our i_size while we're doing
      writeout and actual_end would end up being past the range we have
      locked.
      
      I confirmed this was what was happening by installing a debug kernel
      that had
      
        actual_end = min_t(u64, i_size_read(inode), end + 1);
        if (actual_end > end + 1) {
                printk(KERN_ERR "KABOOM\n");
                actual_end = end + 1;
        }
      
      and installing it onto 500 boxes of the tier that had been seeing the
      problem regularly.  Last night I got my debug message and no panic,
      confirming what I expected.
      
      [ dsterba: the assembly confirms a tiny race window:
      
          mov    0x20(%rsp),%rax
          cmp    %rax,0x48(%r15)           # read
          movl   $0x0,0x18(%rsp)
          mov    %rax,%r12
          mov    %r14,%rax
          cmovbe 0x48(%r15),%r12           # eval
      
        Where r15 is inode and 0x48 is offset of i_size.
      
        The original fix was to revert 62b37622, which would do an
        intermediate assignment; this would also avoid the double
        evaluation, but is not future-proof should the compiler merge the
        stores and call i_size_read anyway.
      
        There's a patch adding READ_ONCE to i_size_read, but that's not
        being applied at the moment and we need to fix the bug. Instead,
        emulate READ_ONCE with two barrier()s, which is effectively what
        happens. The assembly confirms single evaluation:
      
          mov    0x48(%rbp),%rax          # read once
          mov    0x20(%rsp),%rcx
          mov    $0x20,%edx
          cmp    %rax,%rcx
          cmovbe %rcx,%rax
          mov    %rax,(%rsp)
          mov    %rax,%rcx
          mov    %r14,%rax
      
        Where 0x48(%rbp) is inode->i_size stored to %rax.
      ]
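
      A userspace rendition of that barrier-based pattern (a sketch of the
      idea only; a plain global stands in for inode->i_size):

        #include <stdint.h>
        #include <stdio.h>

        #define barrier() __asm__ __volatile__("" ::: "memory")

        static uint64_t i_size; /* stand-in for inode->i_size */

        static uint64_t clamp_actual_end(uint64_t end)
        {
                uint64_t isize;

                barrier();
                isize = i_size; /* forced single evaluation */
                barrier();
                return isize < end + 1 ? isize : end + 1;
        }

        int main(void)
        {
                i_size = 24576;
                printf("actual_end = %llu\n",
                       (unsigned long long)clamp_actual_end(20479));
                return 0;
        }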
      
      Fixes: 62b37622 ("btrfs: Remove isize local variable in compress_file_range")
      CC: stable@vger.kernel.org # v5.1+
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      [ changelog updated ]
      Signed-off-by: David Sterba <dsterba@suse.com>
  4. Oct 30, 2019
    • io_uring: ensure we clear io_kiocb->result before each issue · 6873e0bd
      Jens Axboe authored
      
      We use io_kiocb->result == -EAGAIN as a way to know if we need to
      re-submit a polled request, as -EAGAIN reporting happens out-of-line
      for IO submission failures. This field is cleared when we originally
      allocate the request, but it isn't reset when we retry the submission
      from async context. This can cause issues where we think something
      needs a re-issue, but we're really just reading stale data.
      
      Reset ->result whenever we re-prep a request for polled submission.
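
      In pattern form (made-up struct, not the io_uring types), the rule is
      that per-attempt state is cleared in the prep step every submission
      and resubmission passes through:

        struct req {
                int result; /* -EAGAIN here means "needs re-issue" */
        };

        static void prep_for_polled_issue(struct req *req)
        {
                req->result = 0; /* don't inherit the previous attempt's state */
                /* ... rest of the submission setup ... */
        }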
      
      Cc: stable@vger.kernel.org
      Fixes: 9e645e11 ("io_uring: add support for sqe links")
      Reported-by: Bijan Mottahedeh <bijan.mottahedeh@oracle.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • gfs2: Fix initialisation of args for remount · d5798141
      Andrew Price authored
      
      When gfs2 was converted to use fs_context, the initialisation of the
      mount args structure to the currently active args was lost with the
      removal of gfs2_remount_fs(), so the checks of the new args on remount
      became checks against the default values instead of the current ones.
      This caused unexpected remount behaviour and test failures (xfstests
      generic/294, generic/306 and generic/452).
      
      Reinstate the args initialisation, this time in gfs2_init_fs_context()
      and conditional upon fc->purpose, as that's the only time we get control
      before the mount args are parsed in the remount process.
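
      In simplified form (stub types; only the purpose test mirrors the
      real fs_context machinery), the shape of the fix is:

        #include <stdlib.h>

        enum fs_context_purpose { FS_CONTEXT_FOR_MOUNT, FS_CONTEXT_FOR_RECONFIGURE };

        struct gfs2_args { int commit_secs; int quota; };
        struct gfs2_sbd  { struct gfs2_args sd_args; }; /* active options */

        struct fs_context {
                enum fs_context_purpose purpose;
                struct gfs2_sbd *sb;          /* existing sb on remount */
                struct gfs2_args *fs_private; /* args built for this op */
        };

        static int init_fs_context(struct fs_context *fc)
        {
                struct gfs2_args *args = malloc(sizeof(*args));

                if (!args)
                        return -1;
                if (fc->purpose == FS_CONTEXT_FOR_RECONFIGURE)
                        *args = fc->sb->sd_args; /* seed from current args */
                else
                        *args = (struct gfs2_args){ .commit_secs = 30 }; /* defaults */
                fc->fs_private = args;
                return 0;
        }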
      
      Fixes: 1f52aa08 ("gfs2: Convert gfs2 to fs_context")
      Signed-off-by: Andrew Price <anprice@redhat.com>
      Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
  5. Oct 25, 2019
    • Btrfs: fix race leading to metadata space leak after task received signal · 0cab7acc
      Filipe Manana authored
      
      When a task that is allocating metadata needs to wait for the async
      reclaim job to process its ticket and gets a signal (because it was killed
      for example) before doing the wait, the task ends up erroring out but
      with space reserved for its ticket, which never gets released, resulting
      in a metadata space leak (more specifically a leak in the bytes_may_use
      counter of the metadata space_info object).
      
      Here's the sequence of steps leading to the space leak:
      
      1) A task tries to create a file for example, so it ends up trying to
         start a transaction at btrfs_create();
      
      2) The filesystem is currently in a state where there is not enough
         metadata free space to satisfy the transaction's needs. So at
         space-info.c:__reserve_metadata_bytes() we create a ticket and
         add it to the list of tickets of the space info object. Also,
         because the metadata async reclaim job is not running, we queue
         a job to run metadata reclaim;
      
      3) In the meantime, the task receives a signal (like SIGTERM from
         a kill command, for example);
      
      4) After queuing the async reclaim job, at __reserve_metadata_bytes(),
         we unlock the metadata space info and call handle_reserve_ticket();
      
      5) That last function calls wait_reserve_ticket(), which acquires the
         lock from the metadata space info. Then in the first iteration of
         its while loop, it calls prepare_to_wait_event(), which returns
         -ERESTARTSYS because the task has a pending signal. As a result,
         we set the error field of the ticket to -EINTR and exit the while
         loop without deleting the ticket from the list of tickets (in the
         space info object). After exiting the loop we unlock the space info;
      
      6) The async reclaim job is able to release enough metadata, acquires
         the metadata space info's lock and then reserves space for the ticket,
         since the ticket is still in the list of (non-priority) tickets. The
         space reservation happens at btrfs_try_granting_tickets(), called from
         maybe_fail_all_tickets(). This increments the bytes_may_use counter
         from the metadata space info object, sets the ticket's bytes field to
         zero (meaning success, that space was reserved) and removes it from
         the list of tickets;
      
      7) wait_reserve_ticket() returns, with the error field of the ticket
         set to -EINTR. Then handle_reserve_ticket() just propagates that error
         to the caller. Because an error was returned, the caller does not
         release the reserved space, since the expectation is that any error
         means no space was reserved.
      
      Fix this by removing the ticket from the list, while holding the space
      info lock, at wait_reserve_ticket() when prepare_to_wait_event() returns
      an error.
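
      In outline, the error path now looks like this (runnable userspace
      sketch with invented names, not the btrfs API; the key line is the
      unlink performed while the space-info lock is still held):

        #include <errno.h>
        #include <pthread.h>
        #include <stdio.h>

        struct ticket {
                struct ticket *prev, *next;
                unsigned long bytes;    /* > 0 until space is granted */
                int error;
        };

        struct space_info {
                pthread_mutex_t lock;
                struct ticket tickets;  /* circular list head */
        };

        static void ticket_del(struct ticket *t)
        {
                t->prev->next = t->next;
                t->next->prev = t->prev;
        }

        /* stub standing in for prepare_to_wait_event() seeing a signal */
        static int signal_pending(void) { return 1; }

        static int wait_reserve_ticket(struct space_info *si, struct ticket *t)
        {
                pthread_mutex_lock(&si->lock);
                while (t->bytes > 0) {
                        if (signal_pending()) {
                                t->error = -EINTR;
                                ticket_del(t); /* the fix: unlink under the lock,
                                                * so reclaim can't grant it later */
                                break;
                        }
                        /* otherwise sleep here until reclaim grants space */
                }
                pthread_mutex_unlock(&si->lock);
                return t->error;
        }

        int main(void)
        {
                struct space_info si = { PTHREAD_MUTEX_INITIALIZER };
                struct ticket t = { &si.tickets, &si.tickets, 4096, 0 };

                si.tickets.prev = si.tickets.next = &t;
                printf("wait returned %d\n", wait_reserve_ticket(&si, &t));
                return 0;
        }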
      
      Also add some comments and an assertion to guarantee we never end up with
      a ticket that has an error set and a bytes counter field set to zero, to
      more easily detect regressions in the future.
      
      This issue could be triggered sporadically by some test cases from fstests
      such as generic/269 for example, which tries to fill a filesystem and then
      kills fsstress processes running in the background.
      
      When this issue happens, we get a warning in syslog/dmesg when unmounting
      the filesystem, like the following:
      
        ------------[ cut here ]------------
        WARNING: CPU: 0 PID: 13240 at fs/btrfs/block-group.c:3186 btrfs_free_block_groups+0x314/0x470 [btrfs]
        (...)
        CPU: 0 PID: 13240 Comm: umount Tainted: G        W    L    5.3.0-rc8-btrfs-next-48+ #1
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-0-ga698c8995f-prebuilt.qemu.org 04/01/2014
        RIP: 0010:btrfs_free_block_groups+0x314/0x470 [btrfs]
        (...)
        RSP: 0018:ffff9910c14cfdb8 EFLAGS: 00010286
        RAX: 0000000000000024 RBX: ffff89cd8a4d55f0 RCX: 0000000000000000
        RDX: 0000000000000000 RSI: ffff89cdf6a178a8 RDI: ffff89cdf6a178a8
        RBP: ffff9910c14cfde8 R08: 0000000000000000 R09: 0000000000000001
        R10: ffff89cd4d618040 R11: 0000000000000000 R12: ffff89cd8a4d5508
        R13: ffff89cde7c4a600 R14: dead000000000122 R15: dead000000000100
        FS:  00007f42754432c0(0000) GS:ffff89cdf6a00000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 00007fd25a47f730 CR3: 000000021f8d6006 CR4: 00000000003606f0
        DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
        Call Trace:
         close_ctree+0x1ad/0x390 [btrfs]
         generic_shutdown_super+0x6c/0x110
         kill_anon_super+0xe/0x30
         btrfs_kill_super+0x12/0xa0 [btrfs]
         deactivate_locked_super+0x3a/0x70
         cleanup_mnt+0xb4/0x160
         task_work_run+0x7e/0xc0
         exit_to_usermode_loop+0xfa/0x100
         do_syscall_64+0x1cb/0x220
         entry_SYSCALL_64_after_hwframe+0x49/0xbe
        RIP: 0033:0x7f4274d2cb37
        (...)
        RSP: 002b:00007ffcff701d38 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
        RAX: 0000000000000000 RBX: 0000557ebde2f060 RCX: 00007f4274d2cb37
        RDX: 0000000000000001 RSI: 0000000000000000 RDI: 0000557ebde2f240
        RBP: 0000557ebde2f240 R08: 0000557ebde2f270 R09: 0000000000000015
        R10: 00000000000006b4 R11: 0000000000000246 R12: 00007f427522ee64
        R13: 0000000000000000 R14: 0000000000000000 R15: 00007ffcff701fc0
        irq event stamp: 0
        hardirqs last  enabled at (0): [<0000000000000000>] 0x0
        hardirqs last disabled at (0): [<ffffffffb12b561e>] copy_process+0x75e/0x1fd0
        softirqs last  enabled at (0): [<ffffffffb12b561e>] copy_process+0x75e/0x1fd0
        softirqs last disabled at (0): [<0000000000000000>] 0x0
        ---[ end trace bcf4b235461b26f6 ]---
        BTRFS info (device sdb): space_info 4 has 19116032 free, is full
        BTRFS info (device sdb): space_info total=33554432, used=14176256, pinned=0, reserved=0, may_use=196608, readonly=65536
        BTRFS info (device sdb): global_block_rsv: size 0 reserved 0
        BTRFS info (device sdb): trans_block_rsv: size 0 reserved 0
        BTRFS info (device sdb): chunk_block_rsv: size 0 reserved 0
        BTRFS info (device sdb): delayed_block_rsv: size 0 reserved 0
        BTRFS info (device sdb): delayed_refs_rsv: size 0 reserved 0
      
      Fixes: 374bf9c5 ("btrfs: unify error handling for ticket flushing")
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: tree-checker: Fix wrong check on max devid · 8bb177d1
      Qu Wenruo authored
      
      [BUG]
      The following script will cause a false alert on the devid check.
        #!/bin/bash
      
        dev1=/dev/test/test
        dev2=/dev/test/scratch1
        mnt=/mnt/btrfs
      
        umount $dev1 &> /dev/null
        umount $dev2 &> /dev/null
        umount $mnt &> /dev/null
      
        mkfs.btrfs -f $dev1
      
        mount $dev1 $mnt
      
        _fail()
        {
                echo "!!! FAILED !!!"
                exit 1
        }
      
        for ((i = 0; i < 4096; i++)); do
                btrfs dev add -f $dev2 $mnt || _fail
                btrfs dev del $dev1 $mnt || _fail
                dev_tmp=$dev1
                dev1=$dev2
                dev2=$dev_tmp
        done
      
      [CAUSE]
      Tree-checker uses BTRFS_MAX_DEVS() and BTRFS_MAX_DEVS_SYS_CHUNK() as
      the upper limit for devid.  But we can have devid holes just like in
      the above script.

      So the check for devid is incorrect and could cause false alerts.
      
      [FIX]
      Just remove the whole devid check.  We don't have any hard requirement
      for devid assignment.
      
      Furthermore, even if devid gets corrupted by a bitflip, we still have
      dev extent verification at mount time, so corrupted data won't sneak
      in.
      
      This fixes fstests btrfs/194.
      
      Reported-by: Anand Jain <anand.jain@oracle.com>
      Fixes: ab4ba2e1 ("btrfs: tree-checker: Verify dev item")
      CC: stable@vger.kernel.org # 5.2+
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: Consider system chunk array size for new SYSTEM chunks · c17add7a
      Qu Wenruo authored
      
      For SYSTEM chunks, despite the regular chunk item size limit, there is
      another limit due to the system chunk array size.
      
      The extra limit was removed in a refactoring, so add it back.
      
      Fixes: e3ecdb3f ("btrfs: factor out devs_max setting in __btrfs_alloc_chunk")
      CC: stable@vger.kernel.org # 5.3+
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • io_uring: fix bad inflight accounting for SETUP_IOPOLL|SETUP_SQTHREAD · 2b2ed975
      Jens Axboe authored
      
      We currently assume that submissions from the sqthread are successful,
      and if IO polling is enabled, we use the submission count to know how
      many completions to look for. But if we overflowed the CQ ring, or if
      some requests simply got errored and already completed, they won't be
      available for polling.
      
      For the case of IO polling and SQTHREAD usage, look at the pending
      poll list. If it ever hits empty then we know that we don't have
      any more pollable requests in flight. For that case, simply reset
      the inflight count to zero.
      
      Reported-by: Pavel Begunkov <asml.silence@gmail.com>
      Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: used cached copies of sq->dropped and cq->overflow · 498ccd9e
      Jens Axboe authored
      
      We currently use the ring values directly, but that can lead to issues
      if the application is malicious and changes these values out from
      under us. Create in-kernel cached versions of them, and just overwrite
      the user side when we update them. This is similar to how we treat
      the sq/cq ring tail/head updates.
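
      The shape of the change, as a sketch with invented names (the real
      ring layout differs):

        #include <stdint.h>

        struct shared_ring {    /* mapped into userspace: untrusted */
                uint32_t cq_overflow;
        };

        struct ring_ctx {
                struct shared_ring *ring;
                uint32_t cached_cq_overflow; /* kernel-private truth */
        };

        static void account_cq_overflow(struct ring_ctx *ctx)
        {
                /* bump the trusted copy, then publish; the shared field
                 * is never read back, so userspace edits can't steer us */
                ctx->ring->cq_overflow = ++ctx->cached_cq_overflow;
        }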
      
      Reported-by: Pavel Begunkov <asml.silence@gmail.com>
      Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: Fix race for sqes with userspace · 935d1e45
      Pavel Begunkov authored
      
      io_ring_submit() finalises with:
      1. io_commit_sqring(), which releases the sqes to userspace
      2. a call to io_queue_link_head(), which accesses the released head's sqe
      
      Reorder them.
      
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: Fix broken links with offloading · fb5ccc98
      Pavel Begunkov authored
      
      io_sq_thread() processes sqes by 8 without considering links. As a
      result, links will be randomly subdivided.
      
      The easiest way to fix it is to call io_get_sqring() inside
      io_submit_sqes(), as io_ring_submit() does.
      
      Downsides:
      1. This removes the optimisation of not grabbing mm_struct for fixed files.
      2. It submits all sqes in one go, without finer-grained scheduling
         with cq processing.
      
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: Fix corrupted user_data · 84d55dc5
      Pavel Begunkov authored
      
      There is a bug where failed linked requests are returned not with
      the specified @user_data, but with garbage from the kernel stack.

      The reason is that io_fail_links() uses req->user_data, which is
      uninitialised when called from io_queue_sqe() on the failure path.
      
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • cifs: Fix cifsInodeInfo lock_sem deadlock when reconnect occurs · d46b0da7
      Dave Wysochanski authored
      
      There's a deadlock that is possible and can easily be seen with
      a test where multiple readers open/read/close the same file
      and a disruption occurs causing reconnect.  The deadlock is due to
      a reader thread inside cifs_strict_readv calling down_read and
      obtaining lock_sem, and then, after reconnect, calling down_read a
      second time inside cifs_reopen_file.  If a down_write comes from
      another process in between the two down_read calls, deadlock occurs.
      
              CPU0                    CPU1
              ----                    ----
      cifs_strict_readv()
       down_read(&cifsi->lock_sem);
                                     _cifsFileInfo_put
                                        OR
                                     cifs_new_fileinfo
                                      down_write(&cifsi->lock_sem);
      cifs_reopen_file()
       down_read(&cifsi->lock_sem);
      
      Fix the above by changing all down_write(lock_sem) calls to a
      down_write_trylock(lock_sem)/msleep() loop, which in turn makes the
      second down_read call benign, since it will never block behind a
      writer while holding lock_sem.
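
      As a pthreads analog of the writer side (a sketch; lock_sem is
      really a kernel rw_semaphore):

        #include <pthread.h>
        #include <unistd.h>

        static pthread_rwlock_t lock_sem = PTHREAD_RWLOCK_INITIALIZER;

        /* The writer never queues on the semaphore: it try-locks and backs
         * off.  A reader that takes the lock twice (readv, then reopen
         * after reconnect) therefore can't get wedged behind a waiting
         * writer. */
        static void lock_sem_down_write(void)
        {
                while (pthread_rwlock_trywrlock(&lock_sem) != 0)
                        usleep(1000); /* stand-in for msleep() */
        }

        static void lock_sem_up_write(void)
        {
                pthread_rwlock_unlock(&lock_sem);
        }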
      
      Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
      Suggested-by: Ronnie Sahlberg <lsahlber@redhat.com>
      Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
      Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
    • CIFS: Fix use after free of file info structures · 1a67c415
      Pavel Shilovsky authored
      
      Currently the code assumes that if a file info entry belongs
      to the lists of open file handles of an inode and a tcon then
      it has a non-zero reference count. The recent changes broke that
      assumption when putting the last reference of the file info.
      There may be a situation where a file is being deleted but
      nothing prevents another thread from referencing it again
      and starting to use it. This happens because we do not hold
      the inode list lock while checking the number of references
      of the file info structure. Fix this by doing the proper
      locking when doing the check.
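
      In outline, the put path looks like this (userspace sketch with
      invented names, not the cifs structures):

        #include <pthread.h>

        struct file_info {
                int count;              /* reference count */
                struct file_info *next; /* inode's open-file list */
        };

        static pthread_mutex_t open_file_lock = PTHREAD_MUTEX_INITIALIZER;

        static int put_file_info(struct file_info **list, struct file_info *f)
        {
                int last;

                pthread_mutex_lock(&open_file_lock);
                last = (--f->count == 0);
                if (last) {
                        /* unlink while still holding the lock, so no other
                         * thread can find and re-reference the dying entry */
                        struct file_info **p = list;

                        while (*p != f)
                                p = &(*p)->next;
                        *p = f->next;
                }
                pthread_mutex_unlock(&open_file_lock);
                return last; /* caller frees only when this returns 1 */
        }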
      
      Fixes: 487317c9 ("cifs: add spinlock for the openFileList to cifsInodeInfo")
      Fixes: cb248819 ("cifs: use cifsInodeInfo->open_file_lock while iterating to avoid a panic")
      Cc: Stable <stable@vger.kernel.org>
      Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
      Signed-off-by: Pavel Shilovsky <pshilov@microsoft.com>
      Signed-off-by: Steve French <stfrench@microsoft.com>
    • CIFS: Fix retry mid list corruption on reconnects · abe57073
      Pavel Shilovsky authored
      
      When the client hits reconnect it iterates over the mid
      pending queue, marking entries for retry and moving them
      to a temporary list to issue callbacks later without holding
      GlobalMid_Lock. At the same time there is no guarantee that
      mids can't be removed from the temporary list or even
      freed completely by another thread. It may cause corruption
      of the temporary list:
      
      [  430.454897] list_del corruption. prev->next should be ffff98d3a8f316c0, but was 2e885cb266355469
      [  430.464668] ------------[ cut here ]------------
      [  430.466569] kernel BUG at lib/list_debug.c:51!
      [  430.468476] invalid opcode: 0000 [#1] SMP PTI
      [  430.470286] CPU: 0 PID: 13267 Comm: cifsd Kdump: loaded Not tainted 5.4.0-rc3+ #19
      [  430.473472] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
      [  430.475872] RIP: 0010:__list_del_entry_valid.cold+0x31/0x55
      ...
      [  430.510426] Call Trace:
      [  430.511500]  cifs_reconnect+0x25e/0x610 [cifs]
      [  430.513350]  cifs_readv_from_socket+0x220/0x250 [cifs]
      [  430.515464]  cifs_read_from_socket+0x4a/0x70 [cifs]
      [  430.517452]  ? try_to_wake_up+0x212/0x650
      [  430.519122]  ? cifs_small_buf_get+0x16/0x30 [cifs]
      [  430.521086]  ? allocate_buffers+0x66/0x120 [cifs]
      [  430.523019]  cifs_demultiplex_thread+0xdc/0xc30 [cifs]
      [  430.525116]  kthread+0xfb/0x130
      [  430.526421]  ? cifs_handle_standard+0x190/0x190 [cifs]
      [  430.528514]  ? kthread_park+0x90/0x90
      [  430.530019]  ret_from_fork+0x35/0x40
      
      Fix this by obtaining extra references for mids being retried
      and marking them as MID_DELETED, which indicates that such a mid
      has been dequeued from the pending list.

      Also move the mid cleanup logic from DeleteMidQEntry to
      _cifs_mid_q_entry_release, which is called when the last reference
      to a particular mid is put. This allows us to avoid any use-after-free
      of response buffers.
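
      In outline (simplified userspace sketch; the real code uses
      kref-style refcounting and list_head, and more states exist):

        #include <pthread.h>

        enum mid_state { MID_REQUEST_SUBMITTED, MID_RETRY_NEEDED, MID_DELETED };

        struct mid_entry {
                int refcount;
                enum mid_state state;
                struct mid_entry *next;
        };

        static pthread_mutex_t GlobalMid_Lock = PTHREAD_MUTEX_INITIALIZER;

        /* Move one mid to the private retry list: take an extra reference
         * so a concurrent release can't free it, and mark it MID_DELETED
         * so nobody unlinks it from the pending queue a second time. */
        static void mark_mid_for_retry(struct mid_entry *mid,
                                       struct mid_entry **retry_list)
        {
                pthread_mutex_lock(&GlobalMid_Lock);
                mid->refcount++;
                mid->state = MID_DELETED; /* already off the pending queue */
                mid->next = *retry_list;
                *retry_list = mid;
                pthread_mutex_unlock(&GlobalMid_Lock);
        }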
      
      The patch needs to be backported to stable kernels. A stable tag
      is not mentioned below because the patch doesn't apply cleanly
      to any actively maintained stable kernel.
      
      Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
      Reviewed-and-tested-by: David Wysochanski <dwysocha@redhat.com>
      Signed-off-by: Pavel Shilovsky <pshilov@microsoft.com>
      Signed-off-by: Steve French <stfrench@microsoft.com>
  6. Oct 21, 2019
    • virtiofs: Retry request submission from worker context · a9bfd9dd
      Vivek Goyal authored
      
      If the regular request queue gets full, we currently sleep for a bit
      and retry submission in the submitter's context. This assumes the
      submitter is not holding any spin lock. But this assumption is not
      true for background requests, where we are called with fc->bg_lock held.
      
      This can lead to a deadlock where one thread is attempting submission
      with fc->bg_lock held while the request completion thread has called
      fuse_request_end(), which tries to acquire fc->bg_lock and gets
      blocked. As the request completion thread is blocked, it makes no
      further progress, which means the queue never empties and the
      submitter can't submit more requests.
      
      To solve this issue, retry submission with the help of a worker,
      instead of retrying in the submitter's context. We already do this
      for hiprio/forget requests.
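
      A userspace analog of the change (invented names; the kernel side
      uses a worker item rather than a raw thread):

        #include <pthread.h>
        #include <stdio.h>
        #include <unistd.h>

        /* stub: pretend the queue is full for the first two attempts */
        static int try_submit(void)
        {
                static int attempts;

                return ++attempts < 3 ? -1 : 0;
        }

        static void *retry_worker(void *arg)
        {
                (void)arg;
                while (try_submit() != 0)
                        usleep(1000); /* worker holds no locks, may sleep */
                printf("submitted from worker\n");
                return NULL;
        }

        int main(void)
        {
                pthread_t worker;

                /* queue full: don't sleep here, the caller may hold
                 * fc->bg_lock; hand the retry to the worker instead */
                if (try_submit() != 0) {
                        pthread_create(&worker, NULL, retry_worker, NULL);
                        pthread_join(worker, NULL);
                }
                return 0;
        }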
      
      Reported-by: Chirantan Ekbote <chirantan@chromium.org>
      Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>