  Jun 12, 2024
    • iomap: write iomap validity checks · e47be631
      Dave Chinner authored and Frieder Schrempf committed
      [ Upstream commit d7b64041 ]
      
      A recent multithreaded write data corruption has been uncovered in
      the iomap write code. The core of the problem is partial folio
      writes can be flushed to disk while a new racing write can map it
      and fill the rest of the page:
      
      writeback			new write
      
      allocate blocks
        blocks are unwritten
      submit IO
      .....
      				map blocks
      				iomap indicates UNWRITTEN range
      				loop {
      				  lock folio
      				  copyin data
      .....
      IO completes
        runs unwritten extent conv
          blocks are marked written
      				  <iomap now stale>
      				  get next folio
      				}
      
      Now add memory pressure such that memory reclaim evicts the
      partially written folio that has already been written to disk.
      
      When the new write finally gets to the last partial page of the new
      write, it does not find it in cache, so it instantiates a new page,
      sees the iomap is unwritten, and zeros the part of the page that
      it does not have data from. This overwrites the data on disk that
      was originally written.
      
      The full description of the corruption mechanism can be found here:
      
      https://lore.kernel.org/linux-xfs/20220817093627.GZ3600936@dread.disaster.area/
      
      To solve this problem, we need to check whether the iomap is still
      valid after we lock each folio during the write. We have to do it
      after we lock the page so that we don't end up with state changes
      occurring while we wait for the folio to be locked.
      
      Hence we need a mechanism to be able to check that the cached iomap
      is still valid (similar to what we already do in buffered
      writeback), and we need a way for ->begin_write to back out and
      tell the high level iomap iterator that we need to remap the
      remaining write range.
      
      The iomap needs to grow some storage for the validity cookie that
      the filesystem provides to travel with the iomap. XFS, in
      particular, also needs to know some more information about what the
      iomap maps (attribute extents rather than file data extents) to for
      the validity cookie to cover all the types of iomaps we might need
      to validate.
      
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Leah Rumancik <leah.rumancik@gmail.com>
      Acked-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • iomap: buffered write failure should not truncate the page cache · c88e8f30
      Dave Chinner authored and Frieder Schrempf committed
      
      [ Upstream commit f43dc4dc ]
      
      iomap_file_buffered_write_punch_delalloc() currently invalidates the
      page cache over the unused range of the delalloc extent that was
      allocated. While the write allocated the delalloc extent, it does
      not own it exclusively as the write does not hold any locks that
      prevent either writeback or mmap page faults from changing the state
      of either the page cache or the extent state backing this range.
      
      Whilst xfs_bmap_punch_delalloc_range() already handles races in
      extent conversion - it will only punch out delalloc extents and it
      ignores any other type of extent - the page cache truncate does not
      discriminate between data written by this write or some other task.
      As a result, truncating the page cache can result in data corruption
      if the write races with mmap modifications to the file over the same
      range.
      
      generic/346 exercises this workload, and if we randomly fail writes
      (as will happen when iomap gets stale iomap detection later in the
      patchset), it will randomly corrupt the file data because it removes
      data written by mmap() in the same page as the write() that failed.
      
      Hence we do not want to punch out the page cache over the range of
      the extent we failed to write to - what we actually need to do is
      detect the ranges that have dirty data in cache over them and *not
      punch them out*.
      
      To do this, we have to walk the page cache over the range of the
      delalloc extent we want to remove. This is made complex by the fact
      we have to handle partially up-to-date folios correctly and this can
      happen even when the FSB size == PAGE_SIZE because we now support
      multi-page folios in the page cache.
      
      Because we are only interested in discovering the edges of data
      ranges in the page cache (i.e. hole-data boundaries) we can make use
      of mapping_seek_hole_data() to find those transitions in the page
      cache. As we hold the invalidate_lock, we know that the boundaries
      are not going to change while we walk the range. This interface is
      also byte-based and is sub-page block aware, so we can find the data
      ranges in the cache based on byte offsets rather than page, folio or
      fs block sized chunks. This greatly simplifies the logic of finding
      dirty cached ranges in the page cache.
      
      Once we've identified a range that contains cached data, we can then
      iterate the range folio by folio. This allows us to determine if the
      data is dirty and hence perform the correct delalloc extent punching
      operations. The seek interface we use to iterate data ranges will
      give us sub-folio start/end granularity, so we may end up looking up
      the same folio multiple times as the seek interface iterates across
      each discontiguous data region in the folio.
      
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Leah Rumancik <leah.rumancik@gmail.com>
      Acked-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • xfs,iomap: move delalloc punching to iomap · 72964d74
      Dave Chinner authored and Frieder Schrempf committed
      
      [ Upstream commit 9c7babf9 ]
      
      Because that's what Christoph wants for this error handling path
      only XFS uses.
      
      It requires a new iomap export for handling errors over delalloc
      ranges. This is basically the XFS code as is stands, but even though
      Christoph wants this as iomap functionality, we still have
      to call it from the filesystem specific ->iomap_end callback, and
      call into the iomap code with yet another filesystem specific
      callback to punch the delalloc extent within the defined ranges.
      
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Leah Rumancik <leah.rumancik@gmail.com>
      Acked-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
  Oct 02, 2022
    • iomap: add a tracepoint for mappings returned by map_blocks · adc9c2e5
      Darrick J. Wong authored
      
      Add a new tracepoint so we can see what mapping the filesystem returns
      to writeback a dirty page.
      
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
    • iomap: iomap: fix memory corruption when recording errors during writeback · 3d5f3ba1
      Darrick J. Wong authored
      
      Every now and then I see this crash on arm64:
      
      Unable to handle kernel NULL pointer dereference at virtual address 00000000000000f8
      Buffer I/O error on dev dm-0, logical block 8733687, async page read
      Mem abort info:
        ESR = 0x0000000096000006
        EC = 0x25: DABT (current EL), IL = 32 bits
        SET = 0, FnV = 0
        EA = 0, S1PTW = 0
        FSC = 0x06: level 2 translation fault
      Data abort info:
        ISV = 0, ISS = 0x00000006
        CM = 0, WnR = 0
      user pgtable: 64k pages, 42-bit VAs, pgdp=0000000139750000
      [00000000000000f8] pgd=0000000000000000, p4d=0000000000000000, pud=0000000000000000, pmd=0000000000000000
      Internal error: Oops: 96000006 [#1] PREEMPT SMP
      Buffer I/O error on dev dm-0, logical block 8733688, async page read
      Dumping ftrace buffer:
      Buffer I/O error on dev dm-0, logical block 8733689, async page read
         (ftrace buffer empty)
      XFS (dm-0): log I/O error -5
      Modules linked in: dm_thin_pool dm_persistent_data
      XFS (dm-0): Metadata I/O Error (0x1) detected at xfs_trans_read_buf_map+0x1ec/0x590 [xfs] (fs/xfs/xfs_trans_buf.c:296).
       dm_bio_prison
      XFS (dm-0): Please unmount the filesystem and rectify the problem(s)
      XFS (dm-0): xfs_imap_lookup: xfs_ialloc_read_agi() returned error -5, agno 0
       dm_bufio dm_log_writes xfs nft_chain_nat xt_REDIRECT nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip6t_REJECT
      potentially unexpected fatal signal 6.
       nf_reject_ipv6
      potentially unexpected fatal signal 6.
       ipt_REJECT nf_reject_ipv4
      CPU: 1 PID: 122166 Comm: fsstress Tainted: G        W          6.0.0-rc5-djwa #rc5 3004c9f1de887ebae86015f2677638ce51ee7
       rpcsec_gss_krb5 auth_rpcgss xt_tcpudp ip_set_hash_ip ip_set_hash_net xt_set nft_compat ip_set_hash_mac ip_set nf_tables
      Hardware name: QEMU KVM Virtual Machine, BIOS 1.5.1 06/16/2021
      pstate: 60001000 (nZCv daif -PAN -UAO -TCO -DIT +SSBS BTYPE=--)
       ip_tables
      pc : 000003fd6d7df200
       x_tables
      lr : 000003fd6d7df1ec
       overlay nfsv4
      CPU: 0 PID: 54031 Comm: u4:3 Tainted: G        W          6.0.0-rc5-djwa #rc5 3004c9f1de887ebae86015f2677638ce51ee7405
      Hardware name: QEMU KVM Virtual Machine, BIOS 1.5.1 06/16/2021
      Workqueue: writeback wb_workfn
      sp : 000003ffd9522fd0
       (flush-253:0)
      pstate: 60401005 (nZCv daif +PAN -UAO -TCO -DIT +SSBS BTYPE=--)
      pc : errseq_set+0x1c/0x100
      x29: 000003ffd9522fd0 x28: 0000000000000023 x27: 000002acefeb6780
      x26: 0000000000000005 x25: 0000000000000001 x24: 0000000000000000
      x23: 00000000ffffffff x22: 0000000000000005
      lr : __filemap_set_wb_err+0x24/0xe0
       x21: 0000000000000006
      sp : fffffe000f80f760
      x29: fffffe000f80f760 x28: 0000000000000003 x27: fffffe000f80f9f8
      x26: 0000000002523000 x25: 00000000fffffffb x24: fffffe000f80f868
      x23: fffffe000f80fbb0 x22: fffffc0180c26a78 x21: 0000000002530000
      x20: 0000000000000000 x19: 0000000000000000 x18: 0000000000000000
      
      x17: 0000000000000000 x16: 0000000000000000 x15: 0000000000000000
      x14: 0000000000000001 x13: 0000000000470af3 x12: fffffc0058f70000
      x11: 0000000000000040 x10: 0000000000001b20 x9 : fffffe000836b288
      x8 : fffffc00eb9fd480 x7 : 0000000000f83659 x6 : 0000000000000000
      x5 : 0000000000000869 x4 : 0000000000000005 x3 : 00000000000000f8
      x20: 000003fd6d740020 x19: 000000000001dd36 x18: 0000000000000001
      x17: 000003fd6d78704c x16: 0000000000000001 x15: 000002acfac87668
      x2 : 0000000000000ffa x1 : 00000000fffffffb x0 : 00000000000000f8
      Call trace:
       errseq_set+0x1c/0x100
       __filemap_set_wb_err+0x24/0xe0
       iomap_do_writepage+0x5e4/0xd5c
       write_cache_pages+0x208/0x674
       iomap_writepages+0x34/0x60
       xfs_vm_writepages+0x8c/0xcc [xfs 7a861f39c43631f15d3a5884246ba5035d4ca78b]
      x14: 0000000000000000 x13: 2064656e72757465 x12: 0000000000002180
      x11: 000003fd6d8a82d0 x10: 0000000000000000 x9 : 000003fd6d8ae288
      x8 : 0000000000000083 x7 : 00000000ffffffff x6 : 00000000ffffffee
      x5 : 00000000fbad2887 x4 : 000003fd6d9abb58 x3 : 000003fd6d740020
      x2 : 0000000000000006 x1 : 000000000001dd36 x0 : 0000000000000000
      CPU: 1 PID: 122167 Comm: fsstress Tainted: G        W          6.0.0-rc5-djwa #rc5 3004c9f1de887ebae86015f2677638ce51ee7
       do_writepages+0x90/0x1c4
       __writeback_single_inode+0x4c/0x4ac
      Hardware name: QEMU KVM Virtual Machine, BIOS 1.5.1 06/16/2021
       writeback_sb_inodes+0x214/0x4ac
       wb_writeback+0xf4/0x3b0
      pstate: 60001000 (nZCv daif -PAN -UAO -TCO -DIT +SSBS BTYPE=--)
       wb_workfn+0xfc/0x580
       process_one_work+0x1e8/0x480
      pc : 000003fd6d7df200
       worker_thread+0x78/0x430
      
      This crash is a result of iomap_writepage_map encountering some sort of
      error during writeback and wanting to set that error code in the file
      mapping so that fsync will report it.  Unfortunately, the code
      dereferences folio->mapping after unlocking the folio, which means that
      another thread could have removed the page from the page cache
      (writeback doesn't hold the invalidation lock) and give it to somebody
      else.
      
      At best we crash the system like above; at worst, we corrupt memory or
      set an error on some other unsuspecting file while failing to record the
      problems with *this* file.  Regardless, fix the problem by reporting the
      error to the inode mapping.
      
      NOTE: Commit 598ecfba lifted the XFS writeback code to iomap, so
      this fix should be backported to XFS in the 4.6-5.4 kernels in addition
      to iomap in the 5.5-5.19 kernels.
      
      Fixes: e735c007 ("iomap: Convert iomap_add_to_ioend() to take a folio") # 5.17 onward
      Fixes: 598ecfba ("iomap: lift the xfs writeback code to iomap") # 5.5-5.16, needs backporting
      Fixes: 150d5be0 ("xfs: remove xfs_cancel_ioend") # 4.6-5.4, needs backporting
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
  Aug 09, 2022
    • new iov_iter flavour - ITER_UBUF · fcb14cb1
      Al Viro authored
      
      Equivalent of single-segment iovec.  Initialized by iov_iter_ubuf(),
      checked for by iter_is_ubuf(), otherwise behaves like ITER_IOVEC
      ones.
      
      We are going to expose things like ->write_iter() et al. to those
      in subsequent commits.
      
      New predicate (user_backed_iter()) that is true for ITER_IOVEC and
      ITER_UBUF; places like direct-IO handling should use that for
      checking that pages we modify after getting them from iov_iter_get_pages()
      would need to be dirtied.
      
      DO NOT assume that replacing iter_is_iovec() with user_backed_iter()
      will solve all problems - there's code that uses iter_is_iovec() to
      decide how to poke around in iov_iter guts and for that the predicate
      replacement obviously won't suffice.
      
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
  Jun 30, 2022
    • iomap: set did_zero to true when zeroing successfully · 98eb8d95
      Kaixu Xia authored
      
      There is no need to check and set the did_zero value on every pass
      of the while() loop in iomap_zero_iter(); we can set did_zero to
      true once, after the whole range has been zeroed successfully.
      
      Signed-off-by: Kaixu Xia <kaixuxia@tencent.com>
      Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
    • iomap: skip pages past eof in iomap_do_writepage() · d58562ca
      Chris Mason authored
      
      iomap_do_writepage() sends pages past i_size through
      folio_redirty_for_writepage(), which normally isn't a problem because
      truncate and friends clean them very quickly.
      
      When the system has cgroups configured, we can end up in situations
      where one cgroup has almost no dirty pages at all, and other cgroups
      consume the entire background dirty limit.  This is especially common in
      our XFS workloads in production because they have cgroups using O_DIRECT
      for almost all of the IO mixed in with cgroups that do more traditional
      buffered IO work.
      
      We've hit storms where the redirty path hits millions of times in a few
      seconds, all on a single file that's only ~40 pages long.  This leads to
      long tail latencies for file writes because the pdflush workers are
      hogging the CPU from some kworkers bound to the same CPU.
      
      Reproducing this on 5.18 was tricky because 869ae85d ("xfs: flush new
      eof page on truncate...") ends up writing/waiting on most of these dirty
      before truncate gets a chance to wait on them.
      
      The actual repro looks like this:
      
      /*
       * run me in a cgroup all alone.  Start a second cgroup with dd
       * streaming IO into the block device.
       */
      #define _GNU_SOURCE	/* for sync_file_range() */
      #include <err.h>
      #include <errno.h>
      #include <fcntl.h>
      #include <stdio.h>
      #include <stdlib.h>
      #include <string.h>
      #include <unistd.h>

      /* BUFFER_SIZE was left undefined in the original posting; any
       * page-sized buffer works for the repro. */
      #define BUFFER_SIZE 4096

      int main(int ac, char **av) {
      	int fd;
      	int ret;
      	char buf[BUFFER_SIZE];
      	char *filename = av[1];
      
      	memset(buf, 0, BUFFER_SIZE);
      
      	if (ac != 2) {
      		fprintf(stderr, "usage: looper filename\n");
      		exit(1);
      	}
      	fd = open(filename, O_WRONLY | O_CREAT, 0600);
      	if (fd < 0) {
      		err(errno, "failed to open");
      	}
      	fprintf(stderr, "looping on %s\n", filename);
      	while(1) {
      		/*
      		 * skip past page 0 so truncate doesn't write and wait
      		 * on our extent before changing i_size
      		 */
      		ret = lseek(fd, 8192, SEEK_SET);
      		if (ret < 0)
      			err(errno, "lseek");
      		ret = write(fd, buf, BUFFER_SIZE);
      		if (ret != BUFFER_SIZE)
      			err(errno, "write failed");
      		/* start IO so truncate has to wait after i_size is 0 */
      		ret = sync_file_range(fd, 16384, 4095, SYNC_FILE_RANGE_WRITE);
      		if (ret < 0)
      			err(errno, "sync_file_range");
      		ret = ftruncate(fd, 0);
      		if (ret < 0)
      			err(errno, "truncate");
      		usleep(1000);
      	}
      }
      
      And this bpftrace script will show when you've hit a redirty storm:
      
      kretprobe:xfs_vm_writepages {
          delete(@dirty[pid]);
      }
      
      kprobe:xfs_vm_writepages {
          @dirty[pid] = 1;
      }
      
      kprobe:folio_redirty_for_writepage /@dirty[pid] > 0/ {
          $inode = ((struct folio *)arg1)->mapping->host->i_ino;
          @inodes[$inode] = count();
          @redirty++;
          if (@redirty > 90000) {
              printf("inode %d redirty was %d", $inode, @redirty);
              exit();
          }
      }
      
      This patch has the same number of failures on xfstests as unpatched 5.18:
      Failures: generic/648 xfs/019 xfs/050 xfs/168 xfs/299 xfs/348 xfs/506
      xfs/543
      
      I also ran it through a long stress of multiple fsx processes hammering.
      
      (Johannes Weiner did significant tracing and debugging on this as well)
      
      Signed-off-by: Chris Mason <clm@fb.com>
      Co-authored-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Johannes Weiner <hannes@cmpxchg.org>
      Reported-by: Domas Mituzas <domas@fb.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
  May 16, 2022
    • iomap: don't invalidate folios after writeback errors · e9c3a8e8
      Darrick J. Wong authored
      
      XFS has the unique behavior (as compared to the other Linux filesystems)
      that on writeback errors it will completely invalidate the affected
      folio and force the page cache to reread the contents from disk.  All
      other filesystems leave the page mapped and up to date.
      
      This is a rude awakening for user programs, since (in the case where
      write fails but reread doesn't) file contents will appear to revert to
      old disk contents with no notification other than an EIO on fsync.  This
      might have been annoying back in the days when iomap dealt with one page
      at a time, but with multipage folios, we can now throw away *megabytes*
      worth of data for a single write error.
      
      On *most* Linux filesystems, a program can respond to an EIO on write by
      redirtying the entire file and scheduling it for writeback.  This isn't
      foolproof, since the page that failed writeback is no longer dirty and
      could be evicted, but programs that want to recover properly *also*
      have to detect XFS and regenerate every write they've made to the file.
      
      When running xfs/314 on arm64, I noticed a UAF when xfs_discard_folio
      invalidates multipage folios that could be undergoing writeback.  If,
      say, we have a 256K folio caching a mix of written and unwritten
      extents, it's possible that we could start writeback of the first (say)
      64K of the folio and then hit a writeback error on the next 64K.  We
      then free the iop attached to the folio, which is really bad because
      writeback completion on the first 64k will trip over the "blocks per
      folio > 1 && !iop" assertion.
      
      This can't be fixed by only invalidating the folio if writeback fails at
      the start of the folio, since the folio is marked !uptodate, which trips
      other assertions elsewhere.  Get rid of the whole behavior entirely.
      
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Reviewed-by: Jeff Layton <jlayton@kernel.org>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
    • iomap: add per-iomap_iter private data · 786f847f
      Christoph Hellwig authored
      
      Allow the file system to keep state for all iterations.  For now only
      wire it up for direct I/O as there is an immediate need for it there.
      
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • iomap: allow the file system to provide a bio_set for direct I/O · 908c5490
      Christoph Hellwig authored
      
      Allow the file system to provide a specific bio_set for allocating
      direct I/O bios.  This will allow file systems that use the
      ->submit_io hook to stash away additional information for file system
      use.
      
      To make use of this additional space for information in the completion
      path, the file system needs to override the ->bi_end_io callback and
      then call back into iomap, so export iomap_dio_bio_end_io for that.
      
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>