Commit a11452a

fdmanana authored and kdave committed
btrfs: send: avoid unaligned encoded writes when attempting to clone range
When trying to see if we can clone a file range, there are cases where we
end up sending two write operations in case the inode from the source root
has an i_size that is not sector size aligned and the length from the
current offset to its i_size is less than the remaining length we are
trying to clone.

Issuing two write operations when we could instead issue a single write
operation is not incorrect. However it is not optimal, especially if the
extents are compressed and the flag BTRFS_SEND_FLAG_COMPRESSED was passed
to the send ioctl. In that case we can end up sending an encoded write
with an offset that is not sector size aligned, which makes the receiver
fall back to decompressing the data and writing it using regular buffered
IO (so re-compressing the data in case the fs is mounted with compression
enabled), because encoded writes fail with -EINVAL when an offset is not
sector size aligned.

The following example, which triggered a bug in the receiver code for the
fallback logic of decompressing + regular buffered IO and is fixed by the
patchset referred to in the Link tag at the bottom of this changelog, is
an example where we have the non-optimal behaviour due to an unaligned
encoded write:

   $ cat test.sh
   #!/bin/bash

   DEV=/dev/sdj
   MNT=/mnt/sdj

   mkfs.btrfs -f $DEV > /dev/null
   mount -o compress $DEV $MNT

   # File foo has a size of 33K, not aligned to the sector size.
   xfs_io -f -c "pwrite -S 0xab 0 33K" $MNT/foo

   xfs_io -f -c "pwrite -S 0xcd 0 64K" $MNT/bar

   # Now clone the first 32K of file bar into foo at offset 0.
   xfs_io -c "reflink $MNT/bar 0 0 32K" $MNT/foo

   # Snapshot the default subvolume and create a full send stream (v2).
   btrfs subvolume snapshot -r $MNT $MNT/snap

   btrfs send --compressed-data -f /tmp/test.send $MNT/snap

   echo -e "\nFile bar in the original filesystem:"
   od -A d -t x1 $MNT/snap/bar

   umount $MNT
   mkfs.btrfs -f $DEV > /dev/null
   mount $DEV $MNT

   echo -e "\nReceiving stream in a new filesystem..."
   btrfs receive -f /tmp/test.send $MNT

   echo -e "\nFile bar in the new filesystem:"
   od -A d -t x1 $MNT/snap/bar

   umount $MNT

Before this patch, the send stream included one regular write and one
encoded write for file 'bar', with the latter being not sector size
aligned and causing the receiver to fall back to decompression + buffered
writes. The output of the btrfs receive command in verbose mode (-vvv):

   (...)
   mkfile o258-7-0
   rename o258-7-0 -> bar
   utimes
   clone bar - source=foo source offset=0 offset=0 length=32768
   write bar - offset=32768 length=1024
   encoded_write bar - offset=33792, len=4096, unencoded_offset=33792, unencoded_file_len=31744, unencoded_len=65536, compression=1, encryption=0
   encoded_write bar - falling back to decompress and write due to errno 22 ("Invalid argument")
   (...)

This patch avoids the regular write followed by an unaligned encoded write
so that we end up sending a single encoded write that is aligned. So after
this patch the stream content is (output of btrfs receive -vvv):

   (...)
   mkfile o258-7-0
   rename o258-7-0 -> bar
   utimes
   clone bar - source=foo source offset=0 offset=0 length=32768
   encoded_write bar - offset=32768, len=4096, unencoded_offset=32768, unencoded_file_len=32768, unencoded_len=65536, compression=1, encryption=0
   (...)

So we get more optimal behaviour and avoid the silent data loss bug in
versions of btrfs-progs affected by the bug referred to by the Link tag
below (btrfs-progs v5.19, v5.19.1, v6.0 and v6.0.1).

Link: https://lore.kernel.org/linux-btrfs/cover.1668529099.git.fdmanana@suse.com/
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
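
The numbers in the two receive logs can be checked with a little arithmetic. The sketch below is only an illustration, not code from btrfs or btrfs-progs: the helper names are hypothetical, and the 4 KiB sector size is an assumption (the changelog does not state the sector size of the test filesystem). It shows why offset 33792 is rejected while offset 32768 is accepted, and how the logged unencoded_file_len values relate to unencoded_len and unencoded_offset in this particular example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumption: 4 KiB sectors; the changelog does not state the sector size. */
#define SECTOR_SIZE 4096ULL

/* Hypothetical helper: true if a file offset is a multiple of the sector size. */
static bool is_sector_aligned(uint64_t offset)
{
	return (offset % SECTOR_SIZE) == 0;
}

/*
 * Hypothetical helper: bytes of uncompressed data from unencoded_offset to
 * the end of the uncompressed extent. In the logs above this happens to
 * match unencoded_file_len; that is a property of this example, not a
 * general rule.
 */
static uint64_t remaining_unencoded_len(uint64_t unencoded_len,
					uint64_t unencoded_offset)
{
	return unencoded_len - unencoded_offset;
}
```

With these helpers, is_sector_aligned(33792) is false (33792 = 8 * 4096 + 1024), which corresponds to the -EINVAL fallback in the first log, while is_sector_aligned(32768) is true, which corresponds to the single aligned encoded write in the second log.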
1 parent c51f0e6 commit a11452a

1 file changed

Lines changed: 23 additions & 1 deletion

fs/btrfs/send.c

@@ -5702,6 +5702,7 @@ static int clone_range(struct send_ctx *sctx, struct btrfs_path *dst_path,
 		u64 ext_len;
 		u64 clone_len;
 		u64 clone_data_offset;
+		bool crossed_src_i_size = false;
 
 		if (slot >= btrfs_header_nritems(leaf)) {
 			ret = btrfs_next_leaf(clone_root->root, path);
@@ -5759,8 +5760,10 @@ static int clone_range(struct send_ctx *sctx, struct btrfs_path *dst_path,
 		if (key.offset >= clone_src_i_size)
 			break;
 
-		if (key.offset + ext_len > clone_src_i_size)
+		if (key.offset + ext_len > clone_src_i_size) {
 			ext_len = clone_src_i_size - key.offset;
+			crossed_src_i_size = true;
+		}
 
 		clone_data_offset = btrfs_file_extent_offset(leaf, ei);
 		if (btrfs_file_extent_disk_bytenr(leaf, ei) == disk_byte) {
@@ -5821,6 +5824,25 @@ static int clone_range(struct send_ctx *sctx, struct btrfs_path *dst_path,
 				ret = send_clone(sctx, offset, clone_len,
 						 clone_root);
 			}
+		} else if (crossed_src_i_size && clone_len < len) {
+			/*
+			 * If we are at i_size of the clone source inode and we
+			 * can not clone from it, terminate the loop. This is
+			 * to avoid sending two write operations, one with a
+			 * length matching clone_len and the final one after
+			 * this loop with a length of len - clone_len.
+			 *
+			 * When using encoded writes (BTRFS_SEND_FLAG_COMPRESSED
+			 * was passed to the send ioctl), this helps avoid
+			 * sending an encoded write for an offset that is not
+			 * sector size aligned, in case the i_size of the source
+			 * inode is not sector size aligned. That will make the
+			 * receiver fall back to decompression of the data and
+			 * writing it using regular buffered IO, therefore while
+			 * not incorrect, it's not optimal due to decompression
+			 * and possible re-compression at the receiver.
+			 */
+			break;
 		} else {
 			ret = send_extent_data(sctx, dst_path, offset,
 					       clone_len);
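
To make the effect of the new branch easier to follow, here is a heavily simplified user-space model of the loop. It is a sketch under stated assumptions, not the kernel implementation: `struct src_extent`, `struct op` and `plan_ops()` are hypothetical names, the source extents are assumed to be contiguous and to start at the same file offset as the range being sent, and the `shares_data` flag stands in for the disk_bytenr comparison. Whether an `OP_DATA` operation ends up as a regular or an encoded write is decided elsewhere in send.c and is not modelled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum op_type { OP_CLONE, OP_DATA };

struct op {
	enum op_type type;
	uint64_t offset;
	uint64_t len;
};

/* Hypothetical, simplified view of one file extent item of the clone source. */
struct src_extent {
	uint64_t offset;	/* file offset of the extent (key.offset) */
	uint64_t len;		/* length of the extent */
	bool shares_data;	/* true if it points at the data being sent */
};

/*
 * Plan the operations for the range [offset, offset + len).
 * 'ops' must have room for n + 1 entries. Returns the number of operations.
 * 'patched' selects whether the new early-break branch is active.
 */
static size_t plan_ops(const struct src_extent *ext, size_t n,
		       uint64_t src_i_size, uint64_t offset, uint64_t len,
		       bool patched, struct op *ops)
{
	size_t nops = 0;

	for (size_t i = 0; i < n && len > 0; i++) {
		uint64_t ext_len = ext[i].len;
		uint64_t clone_len;
		bool crossed_src_i_size = false;

		if (ext[i].offset >= src_i_size)
			break;
		/* Trim the extent to the source i_size and remember that we did. */
		if (ext[i].offset + ext_len > src_i_size) {
			ext_len = src_i_size - ext[i].offset;
			crossed_src_i_size = true;
		}
		clone_len = ext_len < len ? ext_len : len;

		if (ext[i].shares_data) {
			ops[nops++] = (struct op){ OP_CLONE, offset, clone_len };
		} else if (patched && crossed_src_i_size && clone_len < len) {
			/* Leave the whole remainder to the single write below. */
			break;
		} else {
			ops[nops++] = (struct op){ OP_DATA, offset, clone_len };
		}
		offset += clone_len;
		len -= clone_len;
	}

	/* Whatever is left after the loop is sent as one data operation. */
	if (len > 0)
		ops[nops++] = (struct op){ OP_DATA, offset, len };

	return nops;
}
```

Feeding the model an assumed layout for the changelog example (a shared 32K extent at offset 0 followed by an unshared 4K extent at offset 32768, a source i_size of 33792 and a 65536 byte range to send) yields clone + 1024 byte write + 31744 byte write at offset 33792 with `patched` set to false, and clone + one 32768 byte write at offset 32768 with `patched` set to true, which is the same shape as the two receive logs in the changelog.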
