Test for-next ARM64 64K (regular, SELF) #1630

Open
kdave wants to merge 10000 commits into ci-arm-kvm from for-next

@kdave kdave commented Apr 17, 2026

No description provided.

@adam900710 adam900710 force-pushed the for-next branch 2 times, most recently from ad252c6 to af81080 Compare April 18, 2026 04:42
@kdave kdave force-pushed the for-next branch 2 times, most recently from 30c6cb0 to 73d4bbd Compare April 22, 2026 19:47
@kdave kdave force-pushed the for-next branch 2 times, most recently from 26f5cfa to 2189fe7 Compare April 24, 2026 11:09
@kdave kdave force-pushed the for-next branch 2 times, most recently from 5280eae to 52d1b61 Compare April 27, 2026 14:33
@adam900710 adam900710 force-pushed the for-next branch 3 times, most recently from 40c2283 to 09752d4 Compare April 28, 2026 00:45
@kdave kdave force-pushed the for-next branch 2 times, most recently from 29451dd to dc188da Compare April 28, 2026 06:01
@adam900710 adam900710 force-pushed the for-next branch 2 times, most recently from 4a55cf6 to 436ac81 Compare May 3, 2026 08:53
@fdmanana fdmanana force-pushed the for-next branch 2 times, most recently from e32c6db to 49a0b34 Compare May 4, 2026 15:50
@kdave kdave force-pushed the for-next branch 4 times, most recently from 4137f02 to f2ac86e Compare May 12, 2026 15:03
@kdave kdave force-pushed the for-next branch 2 times, most recently from db2485b to 0c78978 Compare May 16, 2026 00:59
morbidrsa and others added 17 commits June 1, 2026 19:40
When a block device does not report a maximum number of open or active
zones, we currently assign BTRFS_DEFAULT_MAX_ACTIVE_ZONES (128) to
the internal limit, if the device has more than
BTRFS_DEFAULT_MAX_ACTIVE_ZONES zones.

But if the device has fewer than BTRFS_DEFAULT_MAX_ACTIVE_ZONES zones, the
internal max_active_zones limit will stay at 0, even if the device has
zone resource limits. Furthermore, if the device has a total number of
zones that is less than BTRFS_DEFAULT_MAX_ACTIVE_ZONES, max_active_zones
should be set to at most the number of zones.

Also move the max_active_zones calculation and setting into a dedicated
helper, to shrink btrfs_get_dev_zone_info().
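
One plausible reading of the limit selection described above, as a minimal
userspace sketch. The helper name and the convention that 0 means "device
reports no limit" are assumptions made for illustration; this is not the
kernel's helper.

```c
#include <assert.h>

/* Mirrors BTRFS_DEFAULT_MAX_ACTIVE_ZONES (128) as quoted in the text. */
#define DEMO_DEFAULT_MAX_ACTIVE_ZONES 128u

/*
 * Hypothetical helper, not the kernel's code.
 * reported_limit == 0 models a device that reports no open/active limit.
 */
static unsigned int demo_pick_max_active_zones(unsigned int reported_limit,
					       unsigned int nr_zones)
{
	unsigned int limit;

	assert(nr_zones > 0);

	/* Fall back to the default only when the device reports nothing. */
	limit = reported_limit ? reported_limit : DEMO_DEFAULT_MAX_ACTIVE_ZONES;

	/* Never allow more active zones than the device actually has. */
	return limit < nr_zones ? limit : nr_zones;
}
```

With this shape a small device (say 64 zones, no reported limit) ends up
with a limit of 64 instead of 0.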

Fixes: 04147d8 ("btrfs: zoned: limit active zones to max_open_zones")
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The qgroupid has a specific format, so add a common format specifier,
similar to what we have for checksums and keys.
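
As an illustration of the idea, a userspace sketch of a paired format and
value macro. A qgroupid keeps the level in the upper 16 bits and the id in
the lower 48 bits and is conventionally printed as level/id. The DEMO_
macro names are hypothetical and are not the names this patch adds.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

#define DEMO_QGROUP_LEVEL_SHIFT 48

/* Hypothetical names: one macro for the format, one for its arguments. */
#define DEMO_QGROUPID_FMT "%u/%llu"
#define DEMO_QGROUPID_FMT_VALUE(id) \
	(unsigned int)((id) >> DEMO_QGROUP_LEVEL_SHIFT), \
	(unsigned long long)((id) & ((1ULL << DEMO_QGROUP_LEVEL_SHIFT) - 1))

/* Format a qgroupid as "level/id"; returns what snprintf() returns. */
static int demo_format_qgroupid(char *buf, size_t len, uint64_t qgroupid)
{
	assert(buf && len > 0);
	return snprintf(buf, len, DEMO_QGROUPID_FMT,
			DEMO_QGROUPID_FMT_VALUE(qgroupid));
}
```

Keeping the format and the value extraction next to each other means every
message prints the id the same way.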

Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
The fs_path can use the auto freeing pattern and it's completely
contained in send. Define the freeing wrapper and add the cleanup
attributes.

Almost all conversions are straightforward, replacing goto with direct
return.
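
The auto freeing pattern can be tried in userspace with the GCC/Clang
cleanup attribute, which is what the kernel's DEFINE_FREE() and __free()
helpers build on. The sketch below is an assumption-laden stand-in: the
demo_path type and all demo_ names are hypothetical and are not the send
code.

```c
#include <assert.h>
#include <stdlib.h>

struct demo_path {
	char *buf;
};

/* Counts live allocations so the effect of the cleanup is observable. */
static int demo_live_paths;

static struct demo_path *demo_path_alloc(void)
{
	struct demo_path *p = calloc(1, sizeof(*p));

	if (p)
		demo_live_paths++;
	return p;
}

static void demo_path_free(struct demo_path *p)
{
	if (!p)
		return;
	free(p->buf);
	free(p);
	demo_live_paths--;
}

/* The cleanup attribute passes a pointer to the annotated variable. */
static void demo_path_cleanup(struct demo_path **pp)
{
	demo_path_free(*pp);
}

#define DEMO_AUTO_PATH(name) \
	struct demo_path *name __attribute__((cleanup(demo_path_cleanup))) = NULL

static int demo_use_path(int fail_early)
{
	DEMO_AUTO_PATH(p);

	p = demo_path_alloc();
	if (!p)
		return -1;
	assert(demo_live_paths > 0);
	if (fail_early)
		return -2;	/* direct return, no "goto out" label needed */
	return 0;
}
```

Every return path frees the path automatically, which is why the goto
based error handling can be replaced with direct returns.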

Signed-off-by: David Sterba <dsterba@suse.com>
should_nocow() reads inode->defrag_bytes without holding inode->lock,
while btrfs_set_delalloc_extent() and btrfs_clear_delalloc_extent()
update it under that spinlock.

This is a data race.  The read is a quick check used to decide whether
to fall back to COW for a NOCOW inode: if defrag_bytes is non-zero and
the range is tagged EXTENT_DEFRAG, we force COW so that defragmentation
can rewrite the extent.  Reading a stale value is harmless because:

  - A missed increment may skip COW once, but the defrag pass will
    redo the extent later.
  - A stale non-zero may force an unnecessary COW, which is a minor
    efficiency loss, not a correctness issue.

On 64-bit platforms an aligned u64 load is naturally atomic so tearing
cannot happen.  On 32-bit platforms u64 may tear, but we only test for
zero vs non-zero, so the heuristic stays correct regardless.  Use
data_race() annotation.

Fixes: 47059d9 ("Btrfs: make defragment work with nodatacow option")
Signed-off-by: Cen Zhang <zzzccc427@gmail.com>
[ Use data_race() instead of READ_ONCE() ]
Signed-off-by: David Sterba <dsterba@suse.com>
get_new_location() uses BUG_ON() to crash the kernel if the file extent
item it looks up has any of offset, compression, encryption, or
other_encoding set non-zero. The data reloc inode is only written by
relocation's own paths and the four fields are always 0 in what the
kernel writes:

  - insert_prealloc_file_extent() memsets the stack item to zero and
    only fills in type, disk_bytenr, disk_num_bytes and num_bytes, so
    offset/compression/encryption/other_encoding stay 0.
  - insert_ordered_extent_file_extent() copies oe->compress_type into
    the file extent's compression field, but the data reloc inode is
    created with BTRFS_INODE_NOCOMPRESS so compress_type is always 0;
    encryption and other_encoding are reserved-and-zero in btrfs.

A non-zero value here means the leaf decoded from disk does not match
what the kernel wrote, i.e. on-disk corruption. A malformed image
reaches this code via balance and panics the kernel.

A previous attempt to enforce all four constraints in tree-checker's
check_extent_data_item() was merged as commit 7d0ee95979e9 ("btrfs:
validate data reloc tree file extent item members in tree-checker")
and then reverted by commit 1c034697fcaa after btrfs/061 produced
false positives on arm64 with 64K pages. The reason: relocation
writeback legitimately produces REG file_extent_items with offset != 0
in the data reloc tree. When an ordered extent covers only the back
portion of an underlying PREALLOC (num_bytes < ram_bytes on the input
file_extent), insert_ordered_extent_file_extent() inserts a REG with

  offset    = oe->offset
  num_bytes = oe->num_bytes
  ram_bytes preserved from the original PREALLOC,

and this item can reach disk if a transaction commit fires while it
is present in the leaf.

The four fields belong in different layers:

  - compression, encryption and other_encoding are universal
    invariants for every item in the data reloc tree, regardless of
    cluster geometry. Enforce them in tree-checker's
    check_extent_data_item() so a corrupt leaf is rejected at read
    time.

  - offset is only an invariant at the cluster-boundary keys that
    get_new_location() searches (the key is computed as
    src_disk_bytenr - reloc_block_group_start). Partial-PREALLOC
    writebacks legitimately place REG items at non-boundary keys with
    offset != 0; tree-checker cannot reject these. The cluster-
    boundary item is always written by either
    insert_prealloc_file_extent() (offset=0 by memset) or by the
    front portion of a partial writeback (offset=0 by construction),
    so a non-zero offset there is corruption.

Enforce the universal invariants in check_extent_data_item() with a
file_extent_err() rejection. Convert the BUG_ON() in
get_new_location() to a -EUCLEAN return paired with btrfs_print_leaf()
and btrfs_err() so the offending leaf is logged. The caller in
replace_file_extents() already handles non-zero returns from
get_new_location() by breaking out of the loop without aborting the
transaction.
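
The split between the two layers can be sketched in userspace as follows.
The struct and function names are hypothetical and only model the checks
described above; the real code works on leaf items and also logs the leaf.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#ifndef EUCLEAN
#define EUCLEAN 117	/* common Linux value, only for hosts lacking it */
#endif

/* Hypothetical stand-in for the four file extent item fields. */
struct demo_file_extent {
	uint64_t offset;
	uint8_t compression;
	uint8_t encryption;
	uint16_t other_encoding;
};

/* Read-time check: holds for every item in the data reloc tree. */
static int demo_check_reloc_item(const struct demo_file_extent *fi)
{
	assert(fi);
	if (fi->compression || fi->encryption || fi->other_encoding)
		return -EUCLEAN;
	return 0;
}

/* Lookup-time check: only the cluster-boundary item must have offset 0. */
static int demo_check_boundary_item(const struct demo_file_extent *fi)
{
	int ret = demo_check_reloc_item(fi);

	if (ret)
		return ret;
	if (fi->offset != 0)
		return -EUCLEAN;	/* report corruption instead of BUG_ON() */
	return 0;
}
```

An item with a non-zero offset passes the read-time check (it may be a
legitimate partial writeback) and is only rejected when it is found at a
cluster-boundary key.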

Suggested-by: Qu Wenruo <wqu@suse.com>
Suggested-by: David Sterba <dsterba@suse.com>
Reported-by: syzbot+3e20d8f3d41bac5dc9a2@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=3e20d8f3d41bac5dc9a2
Signed-off-by: Teng Liu <27rabbitlt@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[MINOR PROBLEM]
When a running dev-replace hits some error for the target device (devid
0), there will be a DEV_STATS with error records created at the next
transaction commit.

Unfortunately that item will never be deleted.

This means at the next dev-replace, if the replace is interrupted, then
at the next mount, the target device will suddenly inherit the old error
records from that DEV_STATS item, which can give some false alerts on
that device.

This shouldn't affect end users that much, as it requires all the
following conditions to be met, which is pretty rare:

- The initial dev-replace hits some error on the target device
  E.g. write errors, but those errors themselves are already a big problem
  for a running replace.

  This is required to create the DEV_STATS item in the first place.

- The next replace is interrupted
  This is required to allow btrfs to read from the old records.

[CAUSE]
Btrfs just never deletes the DEV_STATS after a replace is finished.

[FIX]
Remove the DEV_STATS item for devid 0 after the replace is finished.

This is not going to completely fix the error, as we still have other
error paths, e.g. the fs somehow flips RO and cannot start a new
transaction for the DEV_STATS item removal.

But those corner cases will be addressed by later patches which provide
a more generic fix to DEV_STATS related problems.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[MINOR BUG]
The following script will cause a DEV_STATS item to be left behind after
the corresponding device is removed:

  # mkfs.btrfs -f $dev1
  # mount $dev1 $mnt
  # btrfs dev add $dev2 $mnt
  # umount $mnt

  ## Without real errors, only at mount time btrfs will update
  ## dev->dev_stats_ccnt, thus we need a mount cycle to create the
  ## DEV_STATS item for the new device.

  # mount $dev1 $mnt
  # touch $mnt/foobar
  # sync
  # btrfs dev remove $dev2 $mnt
  # umount $mnt

This will result in the DEV_STATS item for devid 2 still being left in the
device tree:

  device tree key (DEV_TREE ROOT_ITEM 0)
  leaf 31064064 items 7 free space 15788 generation 18 owner DEV_TREE
  leaf 31064064 flags 0x1(WRITTEN) backref revision 1
  fs uuid 4bd853ed-f6ef-45fd-bbf1-1c3a2d9987cb
  chunk uuid b496eab1-ec23-46b5-81c1-2f1b3503ca07
         item 0 key (DEV_STATS PERSISTENT_ITEM 1) itemoff 16243 itemsize 40
         	persistent item objectid DEV_STATS offset 1
         	device stats
         	write_errs 0 read_errs 0 flush_errs 0 corruption_errs 0 generation 0
         item 1 key (DEV_STATS PERSISTENT_ITEM 2) itemoff 16203 itemsize 40
         	persistent item objectid DEV_STATS offset 2
         	device stats
         	write_errs 0 read_errs 0 flush_errs 0 corruption_errs 0 generation 0

This is not a huge problem, but if the existing DEV_STATS contains
errors, and a new device is added into the fs taking the old devid, then
after a mount cycle, the new device will suddenly inherit old errors
which can give false alerts.

[CAUSE]
Btrfs has never had the ability to delete DEV_STATS items.

It either creates a new one through update_dev_stat_item(), or reads an
existing one through btrfs_device_init_dev_stats().

However update_dev_stat_item() is only called lazily: if a new device is
added and there is no new update to its dev stats, the update of the
on-disk item is skipped.

So if an old DEV_STATS item exists, a new device is added, and no
errors happen during the remaining operations, the old DEV_STATS item
will not be updated.

Then at the next mount cycle, btrfs_device_init_dev_stats() is called at
mount time, which will read out the old records, causing false alerts for
the newly added device.

[FIX]
Manually remove the DEV_STATS item during btrfs_rm_device().

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[MINOR PROBLEM]
When adding a new btrfs device, the corresponding DEV_STATS item creation
can only be triggered by a mount cycle if there is no other error
triggered:

  # mkfs.btrfs -f $dev1
  # mount $dev1 $mnt
  # btrfs dev add $dev2 $mnt
  # sync
  # btrfs ins dump-tree -t dev $dev1
  device tree key (DEV_TREE ROOT_ITEM 0)
  leaf 30588928 items 6 free space 15853 generation 9 owner DEV_TREE
         item 0 key (DEV_STATS PERSISTENT_ITEM 1) itemoff 16243 itemsize 40 <<<
         	persistent item objectid DEV_STATS offset 1
         	device stats
         	write_errs 0 read_errs 0 flush_errs 0 corruption_errs 0 generation 0
         item 1 key (1 DEV_EXTENT 13631488) itemoff 16195 itemsize 48

Only after a mount cycle and a new transaction does the DEV_STATS item for
devid 2 show up:

  # umount $mnt
  # mount $dev1 $mnt
  # touch $mnt
  # sync
  # btrfs ins dump-tree -t dev $dev1
  device tree key (DEV_TREE ROOT_ITEM 0)
  leaf 30605312 items 7 free space 15788 generation 10 owner DEV_TREE
         item 0 key (DEV_STATS PERSISTENT_ITEM 1) itemoff 16243 itemsize 40
         	persistent item objectid DEV_STATS offset 1
         	device stats
         	write_errs 0 read_errs 0 flush_errs 0 corruption_errs 0 generation 0
         item 1 key (DEV_STATS PERSISTENT_ITEM 2) itemoff 16203 itemsize 40
         	persistent item objectid DEV_STATS offset 2
         	device stats
         	write_errs 0 read_errs 0 flush_errs 0 corruption_errs 0 generation 0

[CAUSE]
Btrfs only updates the DEV_STATS item when the device->dev_stats_ccnt
counter is not 0.

This is to reduce COW for the device tree. However that dev_stats_ccnt is
only increased at the following call sites:

- btrfs_dev_stat_inc()
  This happens when an IO error occurs.

- btrfs_dev_stat_read_and_reset()
  This happens for GET_DEV_STATS ioctl with BTRFS_DEV_STATS_RESET flag.

- btrfs_dev_stat_set()
  This happens inside btrfs_device_init_dev_stats().

So when a new device is added, its dev_stats_ccnt is just initialized to
0, and btrfs will neither create nor update the corresponding DEV_STATS
item at all.

[ENHANCEMENT]
When a new device is added, also increase the dev_stats_ccnt by one.
This includes both device add ioctl and dev-replace.

This will force btrfs to create a new DEV_STATS item or update the
existing one with the correct values.

This not only creates the DEV_STATS item early, but also prevents
old DEV_STATS items left over by older kernels from causing false alerts
for the newly added device.
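
A toy userspace model of the lazy update, to show why bumping the counter
on device add matters. All demo_ names are hypothetical and the on-disk
item is reduced to a single integer.

```c
#include <assert.h>
#include <stdbool.h>

struct demo_device {
	int write_errs;		/* in-memory counter */
	int stats_ccnt;		/* "changed" counter, like dev_stats_ccnt */
	int disk_write_errs;	/* value stored in the on-disk item */
};

/* Model adding a new device, with or without the proposed bump. */
static void demo_add_device(struct demo_device *dev, bool bump_ccnt)
{
	assert(dev);
	dev->write_errs = 0;
	dev->stats_ccnt = bump_ccnt ? 1 : 0;
}

/* Model a transaction commit; returns true if the item was written. */
static bool demo_commit(struct demo_device *dev)
{
	if (dev->stats_ccnt == 0)
		return false;	/* nothing changed, skip the COW */
	dev->disk_write_errs = dev->write_errs;
	dev->stats_ccnt = 0;
	return true;
}
```

Starting from a stale item (disk_write_errs = 5), a commit after an add
without the bump leaves the stale value in place, while an add with the
bump overwrites it with the new device's zeroed counters.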

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[MINOR PROBLEM]
When mounting a filesystem with a valid DEV_STATS item, we will always
update the DEV_STATS again in the next transaction commit, even if there
is no change to the values.

[CAUSE]
During the mount, btrfs_device_init_dev_stats() will read out the
on-disk DEV_STATS item for each device.
Then it calls btrfs_dev_stat_set() to update the in-memory structure.

However btrfs_dev_stat_set() does not only set the dev stats value, but
also increases device->dev_stats_ccnt.

That member determines if we should update the dev stats item at the next
transaction commit. Since we have called btrfs_dev_stat_set() for each
dev stats member, dev_stats_ccnt will be non-zero and we will update
the dev stats item even if it doesn't change at all.

[FIX]
Instead of using btrfs_dev_stat_set() for valid on-disk DEV_STATS
values, directly call atomic_set() to set the in-memory values.

For other call sites, we still want to use btrfs_dev_stat_set() so that
we will force updating/creating the dev stats item.
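
The difference between the two setters can be sketched with C11 atomics in
userspace. The demo_ names are hypothetical; the kernel uses its own
atomic_t API, so this only models the behaviour described above.

```c
#include <assert.h>
#include <stdatomic.h>

#define DEMO_STAT_VALUES_MAX 5	/* five values appear in the dumps above */

struct demo_dev_stats {
	atomic_int values[DEMO_STAT_VALUES_MAX];
	atomic_int ccnt;	/* non-zero means "write the item at commit" */
};

/* Set a value and mark the stats as changed. */
static void demo_stat_set(struct demo_dev_stats *s, int index, int val)
{
	assert(index >= 0 && index < DEMO_STAT_VALUES_MAX);
	atomic_store(&s->values[index], val);
	atomic_fetch_add(&s->ccnt, 1);
}

/* Load a value that already matches the disk: do not mark as changed. */
static void demo_stat_load(struct demo_dev_stats *s, int index, int val)
{
	assert(index >= 0 && index < DEMO_STAT_VALUES_MAX);
	atomic_store(&s->values[index], val);
}
```

Loading valid on-disk values through the second helper leaves ccnt at 0,
so a mount alone no longer forces a rewrite of the item.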

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When mounting a cloned filesystem with a temporary fsuuid (temp_fsid),
layered modules like overlayfs require a persistent identifier.

While internal in-memory fs_devices->fsid must remain unique to
the kernel module, let s_uuid carry the original on-disk UUID.

Signed-off-by: Anand Jain <asj@kernel.org>
Signed-off-by: David Sterba <dsterba@suse.com>
The f_fsid was originally derived from fs_devices->fsid and the
subvolume root ID. However, when temp_fsid is active, fs_devices->fsid
is randomized, making the standard derivation inconsistent.

Since metadata_uuid is optional, it is not a reliable alternative.  This
patch instead retrieves the on-disk UUID from fs_info->super_copy->fsid.

To prevent f_fsid collisions between original and cloned filesystems,
this implementation hashes the dev_t for single-device btrfs filesystems
to ensure uniqueness. This is limited to single-device filesystems as
cloned mounts are currently only supported for that configuration. Note
that f_fsid will change if the device is replaced.

Additionally, since the kernel cannot distinguish between the original
and the cloned filesystem, this new f_fsid derivation is applied to
both.

Link: https://lore.kernel.org/linux-btrfs/cover.1772095546.git.asj@kernel.org/
Link: https://lore.kernel.org/linux-btrfs/cover.1774092915.git.asj@kernel.org/
Signed-off-by: Anand Jain <asj@kernel.org>
Signed-off-by: David Sterba <dsterba@suse.com>
On 64-bit kernels with 32-bit userspace, struct btrfs_ioctl_timespec is
laid out as 16 bytes (8B sec + 4B nsec + 4B trailing padding) instead of
the 12 bytes a 32-bit userspace expects, because the surrounding struct
is not packed. As a result, struct btrfs_ioctl_get_subvol_info_args has
a different size and layout in 32-bit userspace than in the 64-bit
kernel, and BTRFS_IOC_GET_SUBVOL_INFO returns garbage to 32-bit callers.

Mirror what was done for BTRFS_IOC_SET_RECEIVED_SUBVOL: add a packed
btrfs_ioctl_get_subvol_info_args_32 with btrfs_ioctl_timespec_32 fields,
define BTRFS_IOC_GET_SUBVOL_INFO_32 with that struct as the size
argument, factor the existing handler into a shared
_btrfs_ioctl_get_subvol_info() helper, and add
btrfs_ioctl_get_subvol_info_32() which fills the kernel struct and
translates field-by-field into the 32-bit struct before copy_to_user().
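
The layout difference is easy to reproduce in userspace. The demo_ structs
below are hypothetical stand-ins with the same field widths (8 byte
seconds, 4 byte nanoseconds); the unpacked size depends on the ABI and is
typically 16 on 64-bit targets.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Natural layout: usually padded to 16 bytes on 64-bit ABIs. */
struct demo_timespec {
	uint64_t sec;
	uint32_t nsec;
};

/* Packed layout: always 12 bytes, matching the 32-bit userspace view. */
struct demo_timespec_32 {
	uint64_t sec;
	uint32_t nsec;
} __attribute__((packed));

/* Translate field by field, never by copying the raw bytes. */
static void demo_timespec_to_32(struct demo_timespec_32 *dst,
				const struct demo_timespec *src)
{
	assert(dst && src);
	dst->sec = src->sec;
	dst->nsec = src->nsec;
}
```

Because the two structs can differ in size, the ioctl number (which encodes
the struct size) differs as well, hence the separate _32 ioctl definition.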

Signed-off-by: Daan De Meyer <daan@amutable.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Extent buffer pages allocated by alloc_extent_buffer() are attached to
btree_inode->i_mapping (the buffer_tree path), reach the LRU, and are
served by the btree_migrate_folio aops in fs/btrfs/disk-io.c. They are
migratable in practice once their owning extent buffer hits refs == 1,
which happens naturally. The buddy allocator classifies them by GFP,
however, and bare GFP_NOFS lands them in MIGRATE_UNMOVABLE pageblocks.

The result: every btree_inode page we read in pins an unmovable pageblock
from the page-superblock allocator's perspective, even though the page
itself can be moved.

Have each caller of btrfs_alloc_page_array, btrfs_alloc_folio_array,
and alloc_eb_folio_array pass in the full GFP mask directly, instead
of having the functions calculate it from boolean flags.

The alloc_extent_buffer call site passes GFP_NOFS | __GFP_NOFAIL |
__GFP_MOVABLE. All other call sites pass plain GFP_NOFS.

Three categories of caller stay on bare GFP_NOFS, deliberately:

  - alloc_dummy_extent_buffer / btrfs_clone_extent_buffer: the
    resulting eb is EXTENT_BUFFER_UNMAPPED, folio->mapping stays NULL,
    the folios never enter LRU, never get migrate_folio aops. Tagging
    them __GFP_MOVABLE would violate the page allocator's migrability
    contract and they would defeat compaction in MOVABLE pageblocks
    where isolate_migratepages_block skips non-LRU non-movable_ops
    pages outright.

  - btrfs_alloc_page_array callers in fs/btrfs/raid56.c (stripe
    pages), fs/btrfs/inode.c (encoded reads), fs/btrfs/ioctl.c (io_uring
    encoded reads), fs/btrfs/relocation.c (relocation buffers): same
    contract violation. raid56 stripe_pages additionally persist in
    the stripe cache (RBIO_CACHE_SIZE=1024) well beyond a single I/O,
    so they are not transient enough to hand-wave the contract.

  - btrfs_alloc_folio_array caller in fs/btrfs/scrub.c (stripe
    folios): same -- stripe->folios[] are private buffers freed via
    folio_put in release_scrub_stripe.

This change targets the dominant fragmentation source observed on the
page-superblock series: ~28 GB of btree_inode pages parked across
many tainted superpageblocks on a 250 GB test system with btrfs root,
preventing 1 GiB hugepage allocation from those regions. With the
movable hint, those pages now land in MOVABLE pageblocks where the
existing background defragger drains them through the standard
PB_has_movable gate, no LRU-sample fallback needed.

Assisted-by: Claude:claude-opus-4-6
Signed-off-by: Rik van Riel <riel@surriel.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In commit b48c980 ("btrfs: fix deadlock between reflink and
transaction commit when using flushoncommit") a deadlock was fixed
between reflinks and transaction commits when the fs is mounted with the
flushoncommit option. This happened when we had to copy an inline extent's
data to the destination file. However the issue was fixed only for the
case where the destination offset is 0; it missed the case where the offset
is greater than zero.

Fix this by ensuring that i_size is updated whenever we copy an inline
extent's data into the destination file.

Syzbot reported this with the following trace:

   INFO: task kworker/u8:3:57 blocked for more than 143 seconds.
         Not tainted syzkaller #0
   "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
   task:kworker/u8:3    state:D stack:21600 pid:57    tgid:57    ppid:2      task_flags:0x4208160 flags:0x00080000
   Workqueue: writeback wb_workfn (flush-btrfs-129)
   Call Trace:
    <TASK>
    context_switch kernel/sched/core.c:5402 [inline]
    __schedule+0x16f9/0x5500 kernel/sched/core.c:7204
    __schedule_loop kernel/sched/core.c:7283 [inline]
    schedule+0x164/0x360 kernel/sched/core.c:7298
    wait_extent_bit fs/btrfs/extent-io-tree.c:905 [inline]
    btrfs_lock_extent_bits+0x59c/0x700 fs/btrfs/extent-io-tree.c:2008
    btrfs_lock_extent fs/btrfs/extent-io-tree.h:152 [inline]
    btrfs_invalidate_folio+0x440/0xc00 fs/btrfs/inode.c:7718
    extent_writepage fs/btrfs/extent_io.c:1848 [inline]
    extent_write_cache_pages fs/btrfs/extent_io.c:2552 [inline]
    btrfs_writepages+0x12f3/0x2410 fs/btrfs/extent_io.c:2684
    do_writepages+0x32e/0x550 mm/page-writeback.c:2571
    __writeback_single_inode+0x133/0x10e0 fs/fs-writeback.c:1764
    writeback_sb_inodes+0x97f/0x1980 fs/fs-writeback.c:2056
    wb_writeback+0x445/0xb00 fs/fs-writeback.c:2241
    wb_do_writeback fs/fs-writeback.c:2388 [inline]
    wb_workfn+0x3fd/0xf20 fs/fs-writeback.c:2428
    process_one_work+0x98b/0x1630 kernel/workqueue.c:3318
    process_scheduled_works kernel/workqueue.c:3401 [inline]
    worker_thread+0xb49/0x1140 kernel/workqueue.c:3482
    kthread+0x388/0x470 kernel/kthread.c:436
    ret_from_fork+0x514/0xb70 arch/x86/kernel/process.c:158
    ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245
    </TASK>
   INFO: task syz.0.145:8523 blocked for more than 143 seconds.
         Not tainted syzkaller #0
   "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
   task:syz.0.145       state:D stack:22752 pid:8523  tgid:8522  ppid:5850   task_flags:0x400140 flags:0x00080002
   Call Trace:
    <TASK>
    context_switch kernel/sched/core.c:5402 [inline]
    __schedule+0x16f9/0x5500 kernel/sched/core.c:7204
    __schedule_loop kernel/sched/core.c:7283 [inline]
    schedule+0x164/0x360 kernel/sched/core.c:7298
    wb_wait_for_completion+0x3e8/0x790 fs/fs-writeback.c:227
    __writeback_inodes_sb_nr+0x24c/0x2d0 fs/fs-writeback.c:2847
    try_to_writeback_inodes_sb+0x9a/0xc0 fs/fs-writeback.c:2895
    btrfs_start_delalloc_flush fs/btrfs/transaction.c:2182 [inline]
    btrfs_commit_transaction+0x813/0x2fc0 fs/btrfs/transaction.c:2371
    btrfs_sync_file+0xdf4/0x1230 fs/btrfs/file.c:1822
    generic_write_sync include/linux/fs.h:2663 [inline]
    btrfs_do_write_iter+0x6a9/0x840 fs/btrfs/file.c:1473
    new_sync_write fs/read_write.c:595 [inline]
    vfs_write+0x629/0xba0 fs/read_write.c:688
    ksys_write+0x156/0x270 fs/read_write.c:740
    do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
    do_syscall_64+0x15f/0x560 arch/x86/entry/syscall_64.c:94
    entry_SYSCALL_64_after_hwframe+0x77/0x7f
   RIP: 0033:0x7f5a0bdece59
   RSP: 002b:00007f5a0b446028 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
   RAX: ffffffffffffffda RBX: 00007f5a0c065fa0 RCX: 00007f5a0bdece59
   RDX: 000000000000029f RSI: 0000200000000200 RDI: 0000000000000004
   RBP: 00007f5a0be82d6f R08: 0000000000000000 R09: 0000000000000000
   R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
   R13: 00007f5a0c066038 R14: 00007f5a0c065fa0 R15: 00007ffe149206b8
    </TASK>
   INFO: task syz.0.145:8539 blocked for more than 143 seconds.
         Not tainted syzkaller #0
   "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
   task:syz.0.145       state:D stack:23704 pid:8539  tgid:8522  ppid:5850   task_flags:0x400140 flags:0x00080002
   Call Trace:
    <TASK>
    context_switch kernel/sched/core.c:5402 [inline]
    __schedule+0x16f9/0x5500 kernel/sched/core.c:7204
    __schedule_loop kernel/sched/core.c:7283 [inline]
    schedule+0x164/0x360 kernel/sched/core.c:7298
    wait_current_trans+0x39f/0x590 fs/btrfs/transaction.c:536
    start_transaction+0xbd8/0x1820 fs/btrfs/transaction.c:716
    clone_copy_inline_extent fs/btrfs/reflink.c:299 [inline]
    btrfs_clone+0x1316/0x2540 fs/btrfs/reflink.c:574
    btrfs_clone_files+0x271/0x3f0 fs/btrfs/reflink.c:795
    btrfs_remap_file_range+0x76b/0x1320 fs/btrfs/reflink.c:948
    vfs_clone_file_range+0x435/0x7b0 fs/remap_range.c:403
    ioctl_file_clone fs/ioctl.c:239 [inline]
    ioctl_file_clone_range fs/ioctl.c:257 [inline]
    do_vfs_ioctl+0xe15/0x1540 fs/ioctl.c:544
    __do_sys_ioctl fs/ioctl.c:595 [inline]
    __se_sys_ioctl+0x82/0x170 fs/ioctl.c:583
    do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
    do_syscall_64+0x15f/0x560 arch/x86/entry/syscall_64.c:94
    entry_SYSCALL_64_after_hwframe+0x77/0x7f
   RIP: 0033:0x7f5a0bdece59
   RSP: 002b:00007f5a0b425028 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
   RAX: ffffffffffffffda RBX: 00007f5a0c066090 RCX: 00007f5a0bdece59
   RDX: 00002000000000c0 RSI: 000000004020940d RDI: 0000000000000004
   RBP: 00007f5a0be82d6f R08: 0000000000000000 R09: 0000000000000000
   R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
   R13: 00007f5a0c066128 R14: 00007f5a0c066090 R15: 00007ffe149206b8
    </TASK>

Reported-by: syzbot+c7443384724bb0f9e913@syzkaller.appspotmail.com
Link: https://lore.kernel.org/linux-btrfs/6a150a09.820a0220.e7972.0006.GAE@google.com/
Fixes: 05a5a76 ("Btrfs: implement full reflink support for inline extents")
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Convert more multiplications of sectorsize or nodesize to use
shifts. The remaining cases are multiplications by constants that
the compiler can optimize by itself, and in tests.
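
For a power-of-two block size the shift and the multiplication give the
same result, as this small userspace sketch shows. The helper names are
hypothetical; the kernel keeps the precomputed bit counts in fs_info.

```c
#include <assert.h>
#include <stdint.h>

/* log2 of a power-of-two size, e.g. 4096 -> 12. */
static unsigned int demo_size_to_bits(uint32_t size)
{
	unsigned int bits = 0;

	assert(size != 0 && (size & (size - 1)) == 0);
	while ((UINT32_C(1) << bits) < size)
		bits++;
	return bits;
}

/* Equivalent to nr_blocks * block_size when block_size == 1 << bits. */
static uint64_t demo_blocks_to_bytes(uint64_t nr_blocks, unsigned int bits)
{
	return nr_blocks << bits;
}
```

The shift form avoids a multiplication by a value the compiler cannot know
at build time, since sectorsize and nodesize are per-filesystem.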

Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
We're passing simple indicators as int, switch them to bool types.

Signed-off-by: David Sterba <dsterba@suse.com>
For all local indicator variables do a simple switch to bool, done on all
files.

Signed-off-by: David Sterba <dsterba@suse.com>
bmaurer and others added 3 commits June 1, 2026 19:54
Under heavy memcg-driven slab reclaim with many memcgs and CPUs,
shrink_slab_memcg() invokes the per-superblock count callback once per
(memcg, NUMA node) tuple. For btrfs that callback reaches
percpu_counter_sum_positive() on fs_info->evictable_extent_maps, which
takes the percpu_counter's raw spinlock with IRQs disabled and walks
every online CPU. With hundreds of memcgs driving reclaim on a host with
dozens of CPUs, this counter lock becomes a global serialization point:
profiles show CPU pinned in the spin_lock_irqsave acquire under
__percpu_counter_sum, with cross-CPU IPIs hitting csd_lock_wait_toolong
while waiting for spinning vCPUs.

The shrinker count is advisory -- super_cache_count() already notes
"counts can change between super_cache_count and super_cache_scan, so we
really don't need locks here." Use percpu_counter_read_positive(), which
is lockless. Worst-case skew is bounded by batch * num_online_cpus (a
few thousand), negligible compared to the millions of extent maps a busy
filesystem accumulates and well within the noise that the shrinker
already tolerates.
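
A single-threaded toy model of a batched per-CPU counter shows where the
skew bound comes from. The demo_ names, the CPU count and the batch size
are assumptions for illustration; the real percpu_counter also handles
locking and CPU hotplug.

```c
#include <assert.h>

#define DEMO_NR_CPUS 4
#define DEMO_BATCH 32

struct demo_percpu_counter {
	long count;			/* shared, approximate value */
	long local[DEMO_NR_CPUS];	/* per-CPU deltas not yet folded in */
};

/* Accumulate locally; fold into the shared count once a batch is full. */
static void demo_counter_add(struct demo_percpu_counter *c, int cpu, long amount)
{
	long v;

	assert(cpu >= 0 && cpu < DEMO_NR_CPUS);
	v = c->local[cpu] + amount;
	if (v >= DEMO_BATCH || v <= -DEMO_BATCH) {
		c->count += v;
		c->local[cpu] = 0;
	} else {
		c->local[cpu] = v;
	}
}

/* Cheap read: looks only at the shared count. */
static long demo_counter_read_positive(const struct demo_percpu_counter *c)
{
	return c->count > 0 ? c->count : 0;
}

/* Exact read: walks every CPU (the expensive, serialized path). */
static long demo_counter_sum_positive(const struct demo_percpu_counter *c)
{
	long sum = c->count;

	for (int cpu = 0; cpu < DEMO_NR_CPUS; cpu++)
		sum += c->local[cpu];
	return sum > 0 ? sum : 0;
}
```

Each CPU can hide strictly less than one batch, so the cheap read is off by
less than DEMO_BATCH * DEMO_NR_CPUS in this model.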

Tested-by: Boris Burkov <boris@bur.io>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev>
Signed-off-by: Ben Maurer <bmaurer@meta.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There's no need to exclusively lock the mapping, shared locking is enough
to protect from a concurrent set block size operation (BLKBSZSET ioctl).

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
…subvol()

If we fail to lookup the dir item, we are always returning -ENOENT but
that may not be the reason for the failure, as btrfs_lookup_dir_item() can
return many different errors, such as -EIO or -ENOMEM for example.
Fix this by returning the real error, and also fixup the silly error
message, including the id of the directory and the error.
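
The distinction between "lookup failed" and "not found" can be modelled in
userspace with an error-pointer convention similar to the kernel's
ERR_PTR()/IS_ERR(). All demo_ helpers are hypothetical re-creations for
illustration, not the kernel macros.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_MAX_ERRNO 4095

/* Encode a negative errno in a pointer value. */
static void *demo_err_ptr(intptr_t err)
{
	assert(err < 0 && err >= -DEMO_MAX_ERRNO);
	return (void *)err;
}

/* Error pointers occupy the top DEMO_MAX_ERRNO addresses. */
static int demo_is_err(const void *p)
{
	return (uintptr_t)p >= (uintptr_t)-DEMO_MAX_ERRNO;
}

/* Map a lookup result to a return code without losing the real error. */
static int demo_lookup_result(const void *di)
{
	if (demo_is_err(di))
		return (int)(intptr_t)di;	/* e.g. -EIO or -ENOMEM */
	if (!di)
		return -ENOENT;			/* genuinely not found */
	return 0;
}
```

Only a NULL result maps to -ENOENT; an I/O or allocation failure is passed
up unchanged so the caller and the log message see the real cause.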

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
There is a bug report that btrfs/242 can randomly fail with the
following NULL pointer dereference:

  run fstests btrfs/242 at 2026-06-01 10:25:08
  BTRFS: device fsid d4d7f234-487c-4787-88e4-47a8b68c9874 devid 1 transid 9 /dev/sdc (8:32) scanned by mount (122609)
  BTRFS info (device sdc): first mount of filesystem d4d7f234-487c-4787-88e4-47a8b68c9874
  BTRFS info (device sdc): using crc32c checksum algorithm
  BTRFS warning (device sdc): devid 2 uuid fbe72d72-3272-482d-80fb-ab88ed398192 is missing
  BTRFS warning (device sdc): devid 2 uuid fbe72d72-3272-482d-80fb-ab88ed398192 is missing
  BTRFS info (device sdc): allowing degraded mounts
  BTRFS info (device sdc): turning on async discard
  BTRFS info (device sdc): enabling free space tree
  Unable to handle kernel NULL pointer dereference at virtual address 0000000000000018
  user pgtable: 4k pages, 48-bit VAs, pgdp=000000013fd6b000
  CPU: 4 UID: 0 PID: 122625 Comm: fstrim Not tainted 7.0.10-2-default #1 PREEMPT(full) openSUSE Tumbleweed e9a5f6b24978fba3bf015a992f865837fdfff3dd
  Hardware name: QEMU KVM Virtual Machine, BIOS edk2-20250812-19.fc42 08/12/2025
  pstate: 01400005 (nzcv daif +PAN -UAO -TCO +DIT -SSBS BTYPE=--)
  pc : btrfs_trim_fs+0x34c/0xa00 [btrfs]
  lr : btrfs_trim_fs+0x1f0/0xa00 [btrfs]
  Call trace:
   btrfs_trim_fs+0x34c/0xa00 [btrfs f02c1d570ceea621c69d302ba75dd61868083840] (P)
   btrfs_ioctl_fitrim+0xe8/0x178 [btrfs f02c1d570ceea621c69d302ba75dd61868083840]
   btrfs_ioctl+0xdd4/0x2bd8 [btrfs f02c1d570ceea621c69d302ba75dd61868083840]
   __arm64_sys_ioctl+0xac/0x108
   invoke_syscall.constprop.0+0x5c/0xd0
   el0_svc_common.constprop.0+0x40/0xf0
   do_el0_svc+0x24/0x40
   el0_svc+0x40/0x1d0
   el0t_64_sync_handler+0xa0/0xe8
   el0t_64_sync+0x1b0/0x1b8
  Code: 17ffff83 f94017e0 f9002be0 f9402ea0 (f9400c00)
  ---[ end trace 0000000000000000  ]---

The reporter also kindly tested the following ASSERT() added to
btrfs_trim_free_extents_throttle():

	ASSERT(device->bdev,
	       "devid=%llu path=%s dev_state=0x%lx\n",
	       device->devid, btrfs_dev_name(device), device->dev_state);

And it shows the following output:

  assertion failed: device->bdev, in extent-tree.c:6630 (devid=2 path=/dev/sdd dev_state=0x82)

Which means the device->bdev is NULL, and the dev_state is
BTRFS_DEV_STATE_IN_FS_METADATA | BTRFS_DEV_STATE_ITEM_FOUND, without
BTRFS_DEV_STATE_WRITEABLE flag set.

[CAUSE]
The pc points to the following call chain:

  btrfs_trim_fs()
  |- btrfs_trim_free_extents()
     |- btrfs_trim_free_extents_throttle()
        |- bdev_max_discard_sectors(device->bdev)

So the NULL pointer dereference is caused by device->bdev being NULL.

This looks impossible at a quick glance, as just before calling
btrfs_trim_free_extents_throttle(), we have skipped any device that has
BTRFS_DEV_STATE_MISSING flag set.

However in this particular case, there is a window where the missing
device is later re-scanned, causing btrfs to remove the
BTRFS_DEV_STATE_MISSING flag:

  btrfs_control_ioctl()
  |- btrfs_scan_one_device()
     |- device_list_add()
        |- rcu_assign_pointer(device->name, name);
        |  This updates the missing device's path to the new good path.
        |
        |- clear_bit(BTRFS_DEV_STATE_MISSING, &device->dev_state)
           This removes the BTRFS_DEV_STATE_MISSING flag.

This allows the missing device to re-appear and clear the
BTRFS_DEV_STATE_MISSING flag.  However the device still does not have
the BTRFS_DEV_STATE_WRITEABLE flag set, nor is its bdev pointer updated.

The bdev pointer remains NULL, triggering the crash later.

[FIX]
This is a big de-synchronization between BTRFS_DEV_STATE_MISSING and
device->bdev pointer, and shows a gap in btrfs's re-appearing-device
handling.

The proper handling of re-appearing devices will need quite some extra
work, which is out of the scope of this small fix.

Thankfully the regular bbio submission path has already handled it well
by checking if the device->bdev is NULL before submitting.

So here we just fix the crash by checking if the device is writeable and
has a bdev pointer before calling bdev_max_discard_sectors().
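
The guard described above can be sketched in userspace as follows. This is
an illustration under stated assumptions, not the actual patch: the struct
and function names are hypothetical stand-ins for struct btrfs_device and
the trim path, the bit positions are assumed to match fs/btrfs/volumes.h,
and the final discard-limit check only models the existing
bdev_max_discard_sectors() call.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel structures; only the fields used here. */
struct mock_bdev {
	unsigned int max_discard_sectors;
};

struct mock_device {
	unsigned long dev_state;
	struct mock_bdev *bdev;
};

/* Assumed bit positions, see fs/btrfs/volumes.h in the real tree. */
#define MOCK_DEV_STATE_WRITEABLE	0
#define MOCK_DEV_STATE_MISSING		2

/* Return true only if it is safe and useful to trim @dev. */
bool mock_device_can_trim(const struct mock_device *dev)
{
	/* Pre-existing check: skip devices still flagged as missing. */
	if (dev->dev_state & (1UL << MOCK_DEV_STATE_MISSING))
		return false;
	/* New checks: a re-appeared device has neither WRITEABLE nor a bdev. */
	if (!(dev->dev_state & (1UL << MOCK_DEV_STATE_WRITEABLE)))
		return false;
	if (dev->bdev == NULL)
		return false;
	/* Only now is it safe to dereference the bdev for its discard limit. */
	return dev->bdev->max_discard_sectors > 0;
}
```

With the reported state (dev_state=0x82, bdev NULL) the helper returns
false before the bdev is ever dereferenced.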

Reported-by: Su Yue <glass.su@suse.com>
Link: https://lore.kernel.org/linux-btrfs/wlwir19t.fsf@damenly.org/
Fixes: 499f377 ("btrfs: iterate over unused chunk space in FITRIM")
CC: stable@vger.kernel.org # 5.10+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There is a bug report that fstrim crashed, and that crash was eventually
pinned down to a missing device which re-appeared and screwed up callers
that only check BTRFS_DEV_STATE_MISSING, but neither
BTRFS_DEV_STATE_WRITEABLE nor device->bdev.

A missing device re-appearing can be very tricky, as for now it will
result in a device without WRITEABLE or MISSING flag, and still no bdev
pointer.

As the first step to enhance handling of such re-appearing missing
devices, add a dmesg output when a missing device re-appears.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
While debugging a relocation issue I hit an assertion in backref.c, but it
was not very useful, since it could not tell what the unexpected value
was that triggered the assertion. The stack trace was this:

  [583246.338097] assertion failed: !cache->nr_nodes, in fs/btrfs/backref.c:3158
  [583246.339588] ------------[ cut here ]------------
  [583246.340573] kernel BUG at fs/btrfs/backref.c:3158!
  [583246.342075] Oops: invalid opcode: 0000 [#1] SMP PTI
  [583246.343294] CPU: 5 UID: 0 PID: 677957 Comm: btrfs Not tainted 7.1.0-rc4-btrfs-next-234+ #1 PREEMPT(full)
  [583246.345715] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.2-0-gea1b7a073390-prebuilt.qemu.org 04/01/2014
  [583246.348694] RIP: 0010:btrfs_backref_release_cache.cold+0x61/0x84 [btrfs]
  [583246.350759] Code: 90 d5 7c (...)
  [583246.354923] RSP: 0018:ffffd4fc88c93ad8 EFLAGS: 00010246
  [583246.355982] RAX: 000000000000003e RBX: ffff8dec90d97020 RCX: 0000000000000000
  [583246.357459] RDX: 0000000000000000 RSI: 0000000000000001 RDI: 00000000ffffffff
  [583246.359517] RBP: ffff8dec8eeb78c0 R08: 0000000000000000 R09: 3fffffffffefffff
  [583246.361180] R10: ffffd4fc88c93970 R11: 0000000000000003 R12: ffff8decd21f3470
  [583246.363184] R13: 00000000fffffffe R14: ffff8decd21f3000 R15: ffff8decd21f3000
  [583246.364666] FS:  00007f9a51751400(0000) GS:ffff8df3f4255000(0000) knlGS:0000000000000000
  [583246.366287] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  [583246.367443] CR2: 00007f9a518ed8f5 CR3: 00000004467c8002 CR4: 0000000000370ef0
  [583246.368969] Call Trace:
  [583246.369541]  <TASK>
  [583246.370040]  relocate_block_group+0xf2/0x520 [btrfs]
  [583246.371243]  btrfs_relocate_block_group+0x9a9/0x22e0 [btrfs]
  [583246.372443]  ? preempt_count_add+0x47/0xa0
  [583247.532978]  ? btrfs_tree_read_lock_nested+0x19/0x90 [btrfs]
  [583247.534520]  ? mutex_lock+0x1a/0x40
  [583247.602233]  ? btrfs_scrub_pause+0x2e/0x120 [btrfs]
  [583247.603543]  btrfs_relocate_chunk+0x3b/0x1a0 [btrfs]
  [583247.604893]  btrfs_balance+0x9d5/0x1920 [btrfs]
  [583247.606189]  ? preempt_count_add+0x69/0xa0
  [583247.607030]  btrfs_ioctl+0x260c/0x2a20 [btrfs]
  [583247.608015]  ? __memcg_slab_free_hook+0x156/0x1a0
  [583247.636971]  __x64_sys_ioctl+0x92/0xe0
  [583247.679247]  do_syscall_64+0x60/0xf20
  [583247.753297]  ? clear_bhb_loop+0x60/0xb0
  [583247.756321]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
  [583247.787018] RIP: 0033:0x7f9a5186a8db
  [583247.787787] Code: 00 48 89 (...)
  [583247.791410] RSP: 002b:00007fff2ffa6ac0 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
  [583247.792897] RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00007f9a5186a8db
  [583247.794319] RDX: 00007fff2ffa6bb0 RSI: 00000000c4009420 RDI: 0000000000000003
  [583247.795714] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
  [583247.797149] R10: 0000000000000000 R11: 0000000000000246 R12: 00007fff2ffa903f
  [583247.798685] R13: 00007fff2ffa6bb0 R14: 0000000000000002 R15: 0000000000000002
  [583247.800136]  </TASK>

So update all simple assertions in backref.c to print out the values when
they aren't testing simple boolean conditions.
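
To illustrate the difference, here is a minimal userspace sketch of how an
assertion message can carry the offending values. It only mimics the
message layout seen in the kernel log ("assertion failed: EXPR, in
FILE:LINE (details)"); format_assert_failure() is a hypothetical helper,
not a btrfs function, and the printed value in the usage note is made up.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

/*
 * Build an assertion failure message that includes caller supplied details.
 * Returns the snprintf() result for the final message.
 */
int format_assert_failure(char *buf, size_t size, const char *expr,
			  const char *file, int line, const char *fmt, ...)
{
	char detail[128];
	va_list args;

	/* Format the extra values first, truncating if they do not fit. */
	va_start(args, fmt);
	vsnprintf(detail, sizeof(detail), fmt, args);
	va_end(args);

	return snprintf(buf, size, "assertion failed: %s, in %s:%d (%s)",
			expr, file, line, detail);
}
```

For example, passing the expression "!cache->nr_nodes" with the format
"nr_nodes=%d" and a hypothetical value of 3 yields a message that shows
which value broke the assumption, instead of only the expression text.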

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
The test case generic/362 will fail with "nodatasum" mount option (*):

 MOUNT_OPTIONS -- -o nodatasum /dev/mapper/test-scratch1 /mnt/scratch

 generic/362  0s ... - output mismatch (see /home/adam/xfstests/results//generic/362.out.bad)
    --- tests/generic/362.out	2024-08-24 15:31:37.200000000 +0930
    +++ /home/adam/xfstests/results//generic/362.out.bad	2026-05-27 10:21:17.574771567 +0930
    @@ -1,2 +1,3 @@
     QA output created by 362
    +First write failed: Input/output error
     Silence is golden
    ...

*: If the test case has been executed before with default data checksum,
the failure will not reproduce. Need the following fix to make it
reliably reproducible:
https://lore.kernel.org/linux-btrfs/20260528111659.87113-1-wqu@suse.com/

[CAUSE]
Inside __iomap_dio_rw(), the -EFAULT/-ENOTBLK error is not directly returned.
Thus we never got an error pointer from __iomap_dio_rw().

The call chain looks like this:

 btrfs_direct_write()
 |- btrfs_dio_write()
 |-  __iomap_dio_rw()
 |  |- iomap_iter()
 |  |  |- btrfs_dio_iomap_begin()
 |  |     Now an ordered extent is allocated for the 4K write.
 |  |
 |  |- iomi.status = iomap_dio_iter()
 |  |  Where iomap_dio_iter() returned -EFAULT.
 |  |
 |  |- ret = iomap_iter()
 |  |  |- btrfs_dio_iomap_end()
 |  |  |  |- btrfs_finish_ordered_extent(uptodate = false)
 |  |  |  |  |- can_finish_ordered_extent()
 |  |  |  |     |- btrfs_mark_ordered_extent_error()
 |  |  |  |        |- mapping_set_error()
 |  |  |  |           Now the address space is marked error.
 |  |  |  | return -ENOTBLK
 |  |  |- return -ENOTBLK
 |  |- if (ret == -ENOTBLK) { ret = 0; }
 |     Now the return value is reset to 0.
 |     Thus no error pointer will be returned.
 |
 |- ret = iomap_dio_complete()
 |  Since no byte is submitted, @ret is 0.
 |
 |- Fallback to buffered IO
 |  And the buffered write finished without error
 |
 |- filemap_fdatawait_range()
    |- filemap_check_errors()
       The previous error is recorded, thus an error is returned

However, the buffered write is properly submitted and finished; the error
comes from the btrfs_finish_ordered_extent() call with @uptodate = false.

[FIX]
When a short dio write happens, btrfs_extract_ordered_extent() is called
for any range that is submitted, thus the submitted range will always
have an OE covering exactly the submitted range.

The remaining OE range is never submitted, thus it should be treated
as truncated, not as an error. That way we can properly reclaim it and
not insert an unnecessary file extent item, without marking the mapping
as error.

Extract a helper, btrfs_mark_ordered_extent_truncated(), and utilize
that helper to mark the direct IO ordered extent as truncated, so it
won't cause failure for the later buffered fallback.
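
The effect of the two approaches on the later error check can be modelled
with a small userspace sketch. Everything here is a hypothetical stand-in:
the mock structures hold only the state needed for the illustration, and
mock_check_errors() is loosely modelled on how a recorded mapping error is
reported once and then cleared.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-ins; the real structures carry far more state. */
struct mock_mapping {
	int wb_err;
};

struct mock_ordered_extent {
	bool ioerr;
	bool truncated;
};

/* Old behaviour: finishing with uptodate == false records a mapping error. */
void mock_finish_oe(struct mock_mapping *m, struct mock_ordered_extent *oe,
		    bool uptodate)
{
	if (!uptodate) {
		oe->ioerr = true;
		m->wb_err = -EIO;
	}
}

/* Behaviour described in [FIX]: the unsubmitted range is only truncated. */
void mock_mark_oe_truncated(struct mock_ordered_extent *oe)
{
	oe->truncated = true;
}

/* Report and clear a previously recorded error on the mapping. */
int mock_check_errors(struct mock_mapping *m)
{
	int err = m->wb_err;

	m->wb_err = 0;
	return err;
}
```

In this model the old path makes the check after the buffered fallback
return -EIO even though the buffered write itself succeeded, while the
truncated path leaves the mapping clean.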

[REASON FOR NO FIXES TAG]
The bug itself is pretty old, at commit f85781f ("btrfs: switch to
iomap for direct IO") we're already passing @uptodate=false finishing
the OE.
But at that time OE with IOERR won't call mapping_set_error(), so it's
not exposed.
Later commit d61bec0 ("btrfs: mark ordered extent and inode with
error if we fail to finish") finally exposed the bug, but that commit
is doing a correct job, not the root cause.

Anyway the bug is very old, dating back to 5.1x days, thus only CC to
stable.

Cc: stable@vger.kernel.org # 5.15+
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Qu Wenruo <wqu@suse.com>
[BUG]
With the previous bug of short direct writes fixed, test case
generic/362 (*) still fails with the following error with nodatasum
mount option:

 generic/362  0s ... - output mismatch (see /home/adam/xfstests/results//generic/362.out.bad)
    --- tests/generic/362.out	2024-08-24 15:31:37.200000000 +0930
    +++ /home/adam/xfstests/results//generic/362.out.bad	2026-05-27 10:13:09.072485767 +0930
    @@ -1,2 +1,3 @@
     QA output created by 362
    +Wrong file size after first write, got 8192 expected 4096
     Silence is golden
    ...

*: If the test case has been executed before with default data checksum,
the failure will not reproduce. Need the following fix to make it
reliably reproducible:
https://lore.kernel.org/linux-btrfs/20260528111659.87113-1-wqu@suse.com/

[CAUSE]
Inside btrfs_dio_iomap_begin() for a direct write, we increase the isize
if it's beyond the current isize.

But if the direct io finished short, we do not revert the isize to the
previous value nor to the short write end.

Then if we need to fall back to buffered writes, and the write has
IOCB_APPEND flag, then the buffered write will be positioned at the
incorrect isize.

The call chain looks like this:

 btrfs_direct_write(pos=0, length=4K)
 |- __iomap_dio_rw()
 |  |- iomap_iter()
 |  |  |- btrfs_dio_iomap_begin()
 |  |     |- btrfs_get_blocks_direct_write()
 |  |        |- i_size_write()
 |  |           Which updates the isize to the write end (4K).
 |  |
 |  |- iomap_dio_iter()
 |  |  Failed with -EFAULT on the first page.
 |  |
 |  |- iomap_iter()
 |  |  |- btrfs_dio_iomap_end()
 |  |     Detects a short write, return -ENOTBLK
 |  |- if (ret == -ENOTBLK) { ret = 0; }
 |     Which resets the return value.
 |
 |- ret = iomap_dio_complete()
 |  Which returns 0.
 |
 |- btrfs_buffered_write(iocb, from);
    |- generic_write_checks()
       |- iocb->ki_pos = i_size_read()
          Which is still the new size (4K), rather than the original
          isize 0.

[FIX]
Introduce the following btrfs_dio_data members:

- old_isize

- updated_isize
  If the direct write has enlarged the isize.

Then if we got a short write, and btrfs_dio_data::updated_isize is set,
revert to the correct isize based on old_isize and current file
position.
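
One plausible reading of "based on old_isize and current file position" is
sketched below in userspace C. This is an assumption about the rule, not
the patch itself: the struct is a hypothetical stand-in for the two new
btrfs_dio_data members, and the revert target is taken to be the larger of
the old isize and the end of the bytes actually written.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the two new btrfs_dio_data members. */
struct mock_dio_data {
	uint64_t old_isize;	/* isize before the direct write started */
	bool updated_isize;	/* true if the direct write enlarged isize */
};

/*
 * Return the isize to use after a short direct write.
 * @cur_isize:   the isize as currently recorded in the inode
 * @written_end: file offset just past the last byte actually written
 */
uint64_t isize_after_short_write(const struct mock_dio_data *d,
				 uint64_t cur_isize, uint64_t written_end)
{
	/* Nothing to revert if this write never enlarged the file. */
	if (!d->updated_isize)
		return cur_isize;
	/* Never shrink below the old size, never exceed what was written. */
	return written_end > d->old_isize ? written_end : d->old_isize;
}
```

For the generic/362 case above (old isize 0, isize bumped to 4K, zero
bytes written), this returns 0, so an append-mode buffered fallback starts
at offset 0 again.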

And here we call i_size_write() without holding an extent lock, which is
a very special case that we're safe to do:

 - Only a single writer can be enlarging isize
   Enlarging isize will take the exclusive inode lock.

 - Buffered readers need to wait for the OE we're holding
   Buffered readers will lock extent and wait for OE of the folio range.
   Sometimes we can skip the OE wait, but since all page cache is
   invalidated, the OE wait can not be skipped.

But I do not think this is the most elegant solution, nor covers all
cases. E.g. if the bio is submitted but IO failed, we are unable to do
the revert.

I believe the more elegant one would be to extend the EXTENT_DIO_LOCKED
lifespan for direct writes, so that we can update the isize when a
write beyond EOF finished successfully.

However that change is too large for a small bug fix.
So only implement the minimal partial fix for now.

[REASON FOR NO FIXES TAG]
The bug is again very old, before commit f85781f ("btrfs: switch to
iomap for direct IO") we are already increasing isize without a
proper rollback for short writes.

Thus only a CC to stable.

Cc: stable@vger.kernel.org # 5.15+
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Currently btrfs_direct_write() will not try to fault in the pages, but
directly fall back to buffered writes, if the first page of the buffer
can not be faulted in.

For example, during generic/362 with nodatasum mount option, there is a
write at file offset 0, length PAGE_SIZE, and the page is not faulted in.
Then we go through the following call chain and directly fall back to
buffered IO:

 btrfs_direct_write()
 |- btrfs_dio_write()
 |-  __iomap_dio_rw()
 |  |- iomap_iter()
 |  |  |- btrfs_dio_iomap_begin()
 |  |     Now an ordered extent is allocated for the 4K write.
 |  |
 |  |- iomi.status = iomap_dio_iter()
 |  |  Where iomap_dio_iter() returned -EFAULT.
 |  |
 |  |- ret = iomap_iter()
 |  |  |- btrfs_dio_iomap_end()
 |  |  |  | return -ENOTBLK
 |  |  |- return -ENOTBLK
 |  |- if (ret == -ENOTBLK) { ret = 0; }
 |     Now the return value is reset to 0.
 |
 |- ret = iomap_dio_complete()
 |  Since no byte is submitted, @ret is now zero.
 |
 |- if (iov_iter_count() > 0 && (ret == -EFAULT || ret > 0))
 |  @ret is zero, thus not meeting the above retry condition
 |
 |- Fallback to buffered

Just slightly loosen the condition to allow retry faulting in pages after
a zero sized short write.
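
The difference between the old and the loosened condition can be sketched
as two predicates. The old one is taken from the call chain above; the new
one assumes the loosening is simply "ret > 0" becoming "ret >= 0", which is
a guess at the shape of the change. The function names are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Old condition: a zero sized short write (ret == 0) never retries. */
bool should_retry_fault_in_old(size_t remaining, long ret)
{
	return remaining > 0 && (ret == -EFAULT || ret > 0);
}

/* Loosened condition (assumed form): ret == 0 now also retries. */
bool should_retry_fault_in_new(size_t remaining, long ret)
{
	return remaining > 0 && (ret == -EFAULT || ret >= 0);
}
```

With 4096 bytes remaining and ret == 0, only the loosened predicate asks
for another fault-in attempt; other negative errors still skip the retry
in both versions.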

Unlike the previous two bug fixes, this one does not really cause any real
bug; it only reduces the chance to do zero-copy direct IO.
Thus it doesn't really require a stable CC or a Fixes tag.

Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Qu Wenruo <wqu@suse.com>