
Redfs ubuntu resolute 7.0.0 14.14 #147

Open
hbirth wants to merge 46 commits into DDNStorage:redfs-ubuntu-resolute-7.0.0-14.14 from hbirth:redfs-ubuntu-resolute-7.0.0-14.14

Collaborator

@hbirth hbirth commented Apr 28, 2026

No description provided.

bsbernd added 5 commits April 27, 2026 10:02
Rename trace_fuse_request_send to trace_fuse_request_enqueue
Add trace_fuse_request_send
Add trace_fuse_request_bg_enqueue
Add trace_fuse_request_enqueue

This helps to track entire request time and time in different
queues.

Signed-off-by: Bernd Schubert <bschubert@ddn.com>

(imported from commit 4a7f142)
This is to allow copying into the buffer from the application
without the need to copy in ring context (and with that,
the need that the ring task is active in kernel space).

Signed-off-by: Bernd Schubert <bschubert@ddn.com>
(cherry picked from commit 43d1a63)

(imported from commit ea01f94)
If pinned pages are used the application can write into these
pages and io_uring_cmd_complete_in_task() is not needed.

Signed-off-by: Bernd Schubert <bschubert@ddn.com>

(imported from commit 5f0264c)
Readahead is currently limited to bdi->ra_pages. One can change
that after the mount with something like

minor=$(stat -c "%d" /path/to/fuse)
echo 1024 > /sys/class/bdi/0:${minor}/read_ahead_kb

The issue is that the fuse server cannot do that from its ->init method,
as it has to know the device minor, and getting it blocks until
init is complete.

Fuse already sets the bdi value, but the upper limit is the current
bdi value. For CAP_SYS_ADMIN we can allow higher values.

Signed-off-by: Bernd Schubert <bschubert@ddn.com>

(imported from commit 763c96d)
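
The limit handling described in this commit message can be pictured with a small userspace sketch. This is not the kernel code: `pick_ra_pages()` and its parameters are hypothetical names, and `privileged` is assumed to stand for "the fuse server has CAP_SYS_ADMIN".

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper (not a kernel function): choose the readahead
 * limit, in pages, for a fuse mount.
 *
 * requested  - limit the fuse server asked for at init time
 * current_ra - current bdi limit (bdi->ra_pages in the commit text)
 * privileged - assumed to mean the server has CAP_SYS_ADMIN
 */
static uint32_t pick_ra_pages(uint32_t requested, uint32_t current_ra,
                              bool privileged)
{
    /* A privileged server may go above the current bdi value. */
    if (privileged)
        return requested;

    /* Everyone else stays capped at the current bdi value. */
    return requested < current_ra ? requested : current_ra;
}
```

With a current limit of 32 pages, an unprivileged request for 1024 pages stays at 32 in this sketch, while a privileged one gets the full 1024.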
Due to user buffer misalignment we actually need one page more,
i.e. 1025 instead of 1024. This will be handled differently later.
For now we just bump up the max.

(imported from commit 3f71501)
@hbirth hbirth requested a review from bsbernd April 28, 2026 12:13
bsbernd and others added 24 commits April 29, 2026 09:28
When having writeback cache enabled it is beneficial for data consistency
to communicate to the FUSE server when the kernel prepares a page for caching.
This lets the FUSE server react and lock the page.

Additionally the kernel lets the FUSE server decide how much data it locks by the
same call and keeps the given information in the dlm lock management.

If the feature is not supported it will be disabled after first unsuccessful use.

- Add DLM_LOCK fuse opcode
- Add cache page lock caching for writeback cache functionality.
This means sending out a FUSE call whenever the kernel prepares a page
for writeback cache. The kernel will manage the cache so that it will keep
track of already acquired locks.
(except for the case that is documented in the code)
- Use rb-trees for the management of the already 'locked' page ranges
- Use rw_semaphore for synchronization in fuse_dlm_cache

(imported from commit 287c884)
Renumber the operation code to a high value to avoid conflicts with upstream.

(imported from commit 27a0e9e)
Add support to invalidate inode aliases when doing inode invalidation.
This is useful for distributed file systems, which use DLM for cache
coherency. So, when a client loses its inode lock, it should invalidate
its inode cache and dentry cache since the other client may delete
this file after getting inode lock.

Signed-off-by: Yong Ze Chen <yochen@ddn.com>

(imported from commit 49720b5)
Send a DLM_WB_LOCK request in the page_mkwrite handler to enable FUSE
filesystems to acquire a distributed lock manager (DLM) lock for
protecting upcoming dirty pages when a previously read-only mapped
page is about to be written.

Signed-off-by: Cheng Ding <cding@ddn.com>

(imported from commit ec36c45)
Allow read_folio to return an EAGAIN error and translate it to
AOP_TRUNCATED_PAGE to retry page fault and read operations.
This is used to prevent a deadlock from folio lock/DLM lock order reversal:
 - Fault or read operations acquire the folio lock first, then the DLM lock.
 - The FUSE daemon blocks new DLM lock acquisition while it is invalidating
   the page cache; invalidate_inode_pages2_range() acquires the folio lock.
To prevent the deadlock, the FUSE daemon will fail its DLM lock acquisition
with EAGAIN if it detects an in-flight page cache invalidation.

Signed-off-by: Cheng Ding <cding@ddn.com>

(imported from commit 8ecf118)
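
A minimal userspace sketch of the retry idea, under stated assumptions: `SKETCH_AOP_TRUNCATED_PAGE` is a local stand-in for the kernel's AOP_TRUNCATED_PAGE return code, and `translate_read_folio_result()` and `read_with_retry()` are hypothetical names. The array of replies simulates what the fuse server answers on each attempt.

```c
#include <errno.h>

/* Local stand-in for the kernel's "please retry" return code. */
#define SKETCH_AOP_TRUNCATED_PAGE 0x80001

/* Hypothetical: map the server's reply to what read_folio returns. */
static int translate_read_folio_result(int server_err)
{
    if (server_err == -EAGAIN)
        return SKETCH_AOP_TRUNCATED_PAGE; /* caller should retry */
    return server_err;                    /* success or a real error */
}

/*
 * Hypothetical caller: replies[] holds the simulated server answers.
 * Each retry stands for dropping the folio lock and starting over,
 * which gives a pending invalidation the chance to finish.
 */
static int read_with_retry(const int *replies, int n_replies, int *attempts)
{
    int ret = -EIO;
    int i;

    for (i = 0; i < n_replies; i++) {
        ret = translate_read_folio_result(replies[i]);
        if (ret != SKETCH_AOP_TRUNCATED_PAGE)
            break; /* done: success or a non-retryable error */
    }
    *attempts = i < n_replies ? i + 1 : n_replies;
    return ret;
}
```

The point of the translation is that the retry happens without the folio lock held, so the daemon's invalidation can take the folio lock and complete.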
generic/488 fails with fuse2fs in the following fashion:

generic/488       _check_generic_filesystem: filesystem on /dev/sdf is inconsistent
(see /var/tmp/fstests/generic/488.full for details)

This test opens a large number of files, unlinks them (which really just
renames them to fuse hidden files), closes the program, unmounts the
filesystem, and runs fsck to check that there aren't any inconsistencies
in the filesystem.

Unfortunately, the 488.full file shows that there are a lot of hidden
files left over in the filesystem, with incorrect link counts.  Tracing
fuse_request_* shows that there are a large number of FUSE_RELEASE
commands that are queued up on behalf of the unlinked files at the time
that fuse_conn_destroy calls fuse_abort_conn.  Had the connection not
aborted, the fuse server would have responded to the RELEASE commands by
removing the hidden files; instead they stick around.

Create a function to push all the background requests to the queue and
then wait for the number of pending events to hit zero, and call this
before fuse_abort_conn.  That way, all the pending events are processed
by the fuse server and we don't end up with a corrupt filesystem.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>

(imported from commit d4262f9)
This is a preparation to allow fuse-io-uring bg queue
flush from flush_bg_queue()

This does two function renames:
fuse_uring_flush_bg -> fuse_uring_flush_queue_bg
fuse_uring_abort_end_requests -> fuse_uring_flush_bg

And fuse_uring_abort_end_queue_requests() is moved to
fuse_uring_stop_queues().

Signed-off-by: Bernd Schubert <bschubert@ddn.com>

(imported from commit e70ef24)
This is useful to have a unique API to flush background requests,
for example when the bg queue gets flushed before
the remainder of fuse_conn_destroy().

Signed-off-by: Bernd Schubert <bschubert@ddn.com>

(imported from commit fc4120c)
When calling the fuse server with a dlm request and the fuse server
responds with some other error than ENOSYS most likely the lock size
will be set to zero. In that case the kernel will abort the fuse
connection. This is completely unnecessary.

Signed-off-by: Horst Birthelmer <hbirthelmer@ddn.com>

(imported from commit 0bc2f9c)
Check whether dlm is still enabled when interpreting the returned
error from fuse server.

Signed-off-by: Horst Birthelmer <hbirthelmer@ddn.com>

(imported from commit f6fbf7c)
- Increase the possible lock size to 64 bit.
- change semantics of DLM locks to request start and end
- change semantics of DLM request return to mark start
and end of the locked area
- better prepare dlm lock range cache rb-tree
for unaligned byte range locks which could return
any value as long as it is larger than the range
requested
- add the case where start and end are zero
to destroy the cache

Signed-off-by: Horst Birthelmer <hbirthelmer@ddn.com>

(imported from commit 87968c7)
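
The range semantics listed in this commit message could be sketched as follows. `struct dlm_range` and both helpers are hypothetical and only illustrate the two rules named above: a granted range may be larger than the requested one as long as it covers it, and a range with start and end both zero is the special "destroy the cache" case.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical 64-bit byte range with a start and an end offset. */
struct dlm_range {
    uint64_t start;
    uint64_t end;
};

/* start == end == 0 is treated as the "destroy the cache" case. */
static bool dlm_range_is_destroy(const struct dlm_range *r)
{
    return r->start == 0 && r->end == 0;
}

/* The granted range may be larger, but must cover the request. */
static bool dlm_grant_covers(const struct dlm_range *req,
                             const struct dlm_range *granted)
{
    return granted->start <= req->start && granted->end >= req->end;
}
```

A cache built on this would store the granted range, not the requested one, so later requests inside the larger area need no new FUSE call.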
Fix reference count leak of payload pages during fuse argument copies.

Signed-off-by: Cheng Ding <cding@ddn.com>

(imported from commit 8b75cf0)
This is another preparation and will be used to decide
which queue to add a request to.

Signed-off-by: Bernd Schubert <bschubert@ddn.com>
Reviewed-by: Joanne Koong <joannelkoong@gmail.com>

(imported from commit e4698fa)
This is preparation for follow up commits that allow to run with a
reduced number of queues.

Signed-off-by: Bernd Schubert <bschubert@ddn.com>

(imported from commit 2e27c33)
Add per-CPU and per-NUMA node bitmasks to track which
io-uring queues are registered.

Signed-off-by: Bernd Schubert <bschubert@ddn.com>

(imported from commit be6edce)
Queue selection (fuse_uring_get_queue) can handle a reduced number
of queues - using io-uring is now possible even with a single
queue and entry.

The FUSE_URING_REDUCED_Q flag is introduced to tell the fuse server that
reduced queues are possible, i.e. if the flag is set, the fuse server
is free to reduce the number of queues.

Signed-off-by: Bernd Schubert <bschubert@ddn.com>

(imported from commit f620f3d)
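
As a rough illustration of how registration bitmasks can drive queue selection when not every CPU has a queue, here is a userspace sketch. It is not the fuse_uring_get_queue() implementation: the struct, the 64-CPU limit and the fallback order (own CPU, then same NUMA node, then any registered queue) are assumptions made for this example.

```c
#include <stdint.h>

#define SKETCH_NR_CPUS 64

/* Hypothetical bookkeeping: bit n set means CPU n has a registered queue. */
struct queue_masks {
    uint64_t registered;      /* all registered queues */
    uint64_t numa_registered; /* registered queues on the caller's NUMA node */
};

/* Lowest set bit in mask, or -1 if the mask is empty. */
static int first_set_bit(uint64_t mask)
{
    for (int i = 0; i < SKETCH_NR_CPUS; i++)
        if (mask & (UINT64_C(1) << i))
            return i;
    return -1;
}

/* Pick a queue id for a request submitted on the given CPU. */
static int pick_queue(const struct queue_masks *m, int cpu)
{
    /* Preferred: the queue of the submitting CPU, if it exists. */
    if (cpu >= 0 && cpu < SKETCH_NR_CPUS &&
        (m->registered & (UINT64_C(1) << cpu)))
        return cpu;

    /* Next: any registered queue on the same NUMA node. */
    if (m->numa_registered)
        return first_set_bit(m->numa_registered);

    /* Last resort: any registered queue at all (-1 if none). */
    return first_set_bit(m->registered);
}
```

With a single registered queue every CPU ends up on that queue, which matches the "single queue and entry" case mentioned above.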
Running background IO on a different core makes quite a difference.

fio --directory=/tmp/dest --name=iops.\$jobnum --rw=randread \
--bs=4k --size=1G --numjobs=1 --iodepth=4 --time_based \
--runtime=30s --group_reporting --ioengine=io_uring \
--direct=1

unpatched
   READ: bw=272MiB/s (285MB/s) ...
patched
   READ: bw=650MiB/s (682MB/s)

Reason is easily visible, the fio process is migrating between CPUs
when requests are submitted on the queue for the same core.

With --iodepth=8

unpatched
   READ: bw=466MiB/s (489MB/s)
patched
   READ: bw=641MiB/s (672MB/s)

Without io-uring (--iodepth=8)
   READ: bw=729MiB/s (764MB/s)

Without fuse (--iodepth=8)
   READ: bw=2199MiB/s (2306MB/s)

(Tests were done with
<libfuse>/example/passthrough_hp -o allow_other --nopassthrough  \
[-o io_uring] /tmp/source /tmp/dest
)

Additional notes:

With FURING_NEXT_QUEUE_RETRIES=0 (--iodepth=8)
   READ: bw=903MiB/s (946MB/s)

With just a random qid (--iodepth=8)
   READ: bw=429MiB/s (450MB/s)

With --iodepth=1
unpatched
   READ: bw=195MiB/s (204MB/s)
patched
   READ: bw=232MiB/s (243MB/s)

With --iodepth=1 --numjobs=2
unpatched
   READ: bw=366MiB/s (384MB/s)
patched
   READ: bw=472MiB/s (495MB/s)

With --iodepth=1 --numjobs=8
unpatched
   READ: bw=1437MiB/s (1507MB/s)
patched
   READ: bw=1529MiB/s (1603MB/s)
fuse without io-uring
   READ: bw=1314MiB/s (1378MB/s), 1314MiB/s-1314MiB/s ...
no-fuse
   READ: bw=2566MiB/s (2690MB/s), 2566MiB/s-2566MiB/s ...

In summary, for async requests the core doing application IO is busy
sending requests, so processing IOs should be done on a different core.
Spreading the load over random cores is also not desirable, as the core
might be frequency-scaled down and/or in C1 sleep states. Not shown here,
but differences are much smaller when the system uses the performance governor
instead of schedutil (the Ubuntu default). That obviously comes at the cost of
higher system power consumption, which is not desirable either.

Results without io-uring (which uses fixed libfuse threads per queue)
heavily depend on the current number of active threads. Libfuse uses
a default of max 10 threads, but the actual max number of threads is a parameter.
Also, results without fuse-io-uring heavily depend on whether another
workload was already running before, as libfuse starts these threads
dynamically - i.e. the more threads are active, the worse the
performance.

Signed-off-by: Bernd Schubert <bschubert@ddn.com>

(imported from commit c6399ea)
This is to further improve performance.

fio --directory=/tmp/dest --name=iops.\$jobnum --rw=randread \
--bs=4k --size=1G --numjobs=1 --iodepth=4 --time_based \
--runtime=30s --group_reporting --ioengine=io_uring \
--direct=1

unpatched
   READ: bw=650MiB/s (682MB/s)
patched:
   READ: bw=995MiB/s (1043MB/s)

with --iodepth=8

unpatched
   READ: bw=641MiB/s (672MB/s)
patched
   READ: bw=966MiB/s (1012MB/s)

Reason is that with --iodepth=x (x > 1) fio submits multiple async
requests and a single queue might become CPU limited. I.e. spreading
the load helps.

(imported from commit 2e73b0b)
With the reduced queue feature io-uring is marked as ready after
receiving the 1st ring entry. At this time other queues might
still be in the process of registration, and then a race happens:

fuse_uring_queue_fuse_req -> no queue entry registered yet
    list_add_tail -> fuse request gets queued

So far fetching requests from the list only happened from
FUSE_IO_URING_CMD_COMMIT_AND_FETCH, but without new requests
on the same queue, it would actually never send requests
from that queue - the request was stuck.

(imported from commit 3bfb6cd)
fuse.h: add new opcode FUSE_COMPOUND

fuse_compound.c: add new functionality to pack multiple
fuse operations into one compound command

file.c: add an implementation of open+getattr

Signed-off-by: Horst Birthelmer <hbirthelmer@ddn.com>

(imported from commit d9e7351)
(imported from commit 1607a03)
(imported from commit 9df5e4c)
(imported from commit 9921bcd)
(imported from commit 09d6f59)
(imported from commit 41b40bd)
There was a race between fuse_uring_cancel() and
fuse_uring_register()/fuse_uring_next_fuse_req(),
which comes from the queue reduction feature.

Race was

core-A                         core-B
fuse_uring_register
    spin_lock(&queue->lock);
    fuse_uring_ent_avail()
    spin_unlock(&queue->lock);

                                fuse_uring_cancel()
                                    spin_lock(&queue->lock);
                                    ent->state = FRRS_USERSPACE;
                                    list_move()

    fuse_uring_next_fuse_req()
        spin_lock(&queue->lock);
        fuse_uring_ent_avail(ent, queue);
        fuse_uring_send_next_to_ring()
        spin_unlock(&queue->lock);
        fuse_uring_send_next_to_ring

I.e. fuse_uring_ent_avail() was called two times and the 2nd time
when the entry was actually already handled by fuse_uring_cancel().

Solution is to not call fuse_uring_ent_avail() from
fuse_uring_register. With that the entry is not in state
FRRS_AVAILABLE and fuse_uring_cancel() will not touch it.
fuse_uring_send_next_to_ring() will mark it as FRRS_AVAILABLE,
and then either assign a request to it and change state again
or will not touch it at all anymore - race fixed.

This will be folded into the upstream queue reduction patches
and therefore has the RED-34640 commit message.

Also entirely removed is fuse_uring_do_register() as remaining
work can be done by the caller.

Signed-off-by: Bernd Schubert <bschubert@ddn.com>

(imported from commit 932feba)
This is just to avoid code dup with an upcoming commit.

Signed-off-by: Bernd Schubert <bschubert@ddn.com>

(imported from commit ec3217f)
This issue could sometimes be observed during libfuse xfstests, with
dmesg printing something like "kernel: WARNING: CPU: 4 PID: 0 at
fs/fuse/dev_uring.c:204 fuse_uring_destruct+0x1f5/0x200 [fuse]".

The cause is that the fuse daemon has just submitted
FUSE_IO_URING_CMD_REGISTER SQEs when umount happens or the fuse daemon
quits at this very early stage. After all uring queues have stopped, one or
more unprocessed FUSE_IO_URING_CMD_REGISTER SQEs might still get processed.
Then new ring entries are created and added to ent_avail_queue, and
fuse_uring_cancel immediately moves them to ent_in_userspace after the SQEs
get canceled. These ring entries were not moved to ent_released, and
stayed in ent_in_userspace when fuse_uring_destruct was called.

One way to solve it would be to also free 'ent_in_userspace' in
fuse_uring_destruct(), but from code point of view it is hard to see why
it is needed. As suggested by Joanne, another solution is to avoid moving
entries in fuse_uring_cancel() to the 'ent_in_userspace' list and just
releasing them directly.

Fixes: b6236c8 ("fuse: {io-uring} Prevent mount point hang on fuse-server termination")
Cc: Joanne Koong <joannelkoong@gmail.com>
Cc: <stable@vger.kernel.org> # v6.14
Signed-off-by: Jian Huang Li <ali@ddn.com>
Signed-off-by: Bernd Schubert <bschubert@ddn.com>

(imported from commit 30d0473)
This fixes a memory leak.

(imported from commit f75b62f)
hbirth and others added 17 commits April 29, 2026 11:44
no functional changes

Signed-off-by: Horst Birthelmer <hbirthelmer@ddn.com>

(imported from commit f0bccb2)
Take actions on the PR merged event of this repo. Run
copy-from-linux-branch.sh and create a PR for redfs.

(cherry picked from commit f54872e)

(imported from commit 522fddf)
Switch to pull_request_target instead of pull_request to meet the GitHub
security requirement. Also limit the scope to protected PRs.

(cherry picked from commit b9980ad)

(imported from commit e504e4a)
Remove the pull_request_target as it doesn't work.

(cherry picked from commit 5328f66)

(imported from commit 5277386)
For now compounds are a module option and disabled by default

Signed-off-by: Bernd Schubert <bschubert@ddn.com>

(imported from commit f3b301d)
The use of bitmap_weight() didn't give the actual index,
but always returned the current cpu, which resulted
in a totally wrong mapping.

It now just increases a counter for every mapping, ignores
cores not in the given (numa) map, and then finds the index
for that.

Also added is a pr_debug(), which can be activated for example
with
echo "module redfs +p" >/proc/dynamic_debug/control
(Pity that upstream is not open for such debug messages).

(imported from commit bcbb684)
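
The counter-based mapping described above can be illustrated with a short userspace sketch. It is not the driver code: `nth_set_cpu()` and `map_next_cpu()` are hypothetical names, and a single 64-bit word stands in for the kernel's cpumask. The idea shown is that the weight of a mask only says how many cores are in it, so the index has to be found by walking the set bits and skipping cores that are not in the map.

```c
#include <stdint.h>

/* Number of set bits in v (what a bitmap weight gives you). */
static int popcount64(uint64_t v)
{
    int n = 0;

    while (v) {
        v &= v - 1; /* clear the lowest set bit */
        n++;
    }
    return n;
}

/* CPU number of the n-th (0-based) set bit in mask, or -1 if none. */
static int nth_set_cpu(uint64_t mask, int n)
{
    for (int cpu = 0; cpu < 64; cpu++) {
        if (!(mask & (UINT64_C(1) << cpu)))
            continue; /* core not in the (numa) map: ignore it */
        if (n-- == 0)
            return cpu;
    }
    return -1;
}

/* Hypothetical mapping step: bump a counter and wrap around the map. */
static int map_next_cpu(uint64_t mask, unsigned int *counter)
{
    int weight = popcount64(mask);

    if (weight == 0)
        return -1;
    return nth_set_cpu(mask, (int)((*counter)++ % (unsigned int)weight));
}
```

For a map containing cores 2, 3 and 5, successive calls in this sketch yield 2, 3, 5 and then wrap back to 2.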
Fix the include sequence which causes a compiling error on aarch64.

(imported from commit f5fed0e)
Mapping might point to a totally different core due to
random assignment. For performance, using the current
core might be beneficial.

Example (with core binding)

unpatched WRITE: bw=841MiB/s
patched   WRITE: bw=1363MiB/s

With
fio --name=test --ioengine=psync --direct=1 \
    --rw=write --bs=1M --iodepth=1 --numjobs=1 \
    --filename_format=/redfs/testfile.\$jobnum --size=100G \
    --thread --create_on_open=1 --runtime=30s --cpus_allowed=1

In order to get the good number `--cpus_allowed=1` is needed.
This could be improved by a future change that avoids
cpu migration in fuse_request_end() on wake_up() call.

(imported from commit 32e0073)
Add a module parameter to enable large folio support.

Signed-off-by: Horst Birthelmer <hbirthelmer@ddn.com>

(imported from commit 475371c)
Compilation failed due to the combination of extern and static. The extern
is actually not needed, static is enough.

(imported from commit b2af4bd)
(imported from commit 5df77fb)
This is a DDN patch only, as unlock_request()/lock_request() solve
a deadlock issue for specially designed file systems,
see Documentation/filesystems/fuse.rst, in the section

**Scenario 2 - Tricky deadlock**

This one needs a carefully crafted filesystem.  It's a variation on
the above, only the call back to the filesystem is not explicit,
but is caused by a pagefault. ::

 |  Kamikaze filesystem thread 1      |  Kamikaze filesystem thread 2

In redfsd we do our best to not cause any kind of user issues
and just want to be as fast as possible. Hence, we do not
need the per page unlock/lock checks.
Given that fuse is a generic file system, this can be a DDN
commit only for now, until we find a better generic solution.
The unlock_request/lock_request functions have been replaced
by check_req_aborted(), which is run once per copied argument.

(imported from commit dc7fa1c)
Fix a race between fuse_iget() and fuse_reverse_inval_inode() where
invalidation can arrive while an inode is being initialized, causing
the invalidation to be lost.

Add a waitqueue to make fuse_reverse_inval_inode() wait when it
encounters an inode with attr_version == 0 (still initializing).
When fuse_change_attributes_common() completes initialization, it
wakes waiting threads.

This ensures invalidations are properly serialized with inode
initialization, maintaining cache coherency.

Signed-off-by: Horst Birthelmer <hbirthelmer@ddn.com>

(imported from commit 03eacfd)
Signed-off-by: Horst Birthelmer <hbirthelmer@ddn.com>

(imported from commit ad21e5a)
Fix uninterruptible sleep (D state) hangs during FUSE filesystem
teardown when using io_uring. The issue manifests as processes stuck
waiting for requests that are never completed, particularly affecting
force requests like FUSE_FLUSH or when requests are created after
fuse_abort_conn() already finished.

If, on daemon exit, io_uring_try_cancel_requests() runs and calls
fuse_uring_cancel(), which tears down the entries by calling
fuse_uring_entry_teardown() before fuse_abort_conn(), then we end up in
fuse_uring_abort with queue_refs == 0 and the queues are never stopped.

If the queues were stopped, all new requests would be rejected, but
that does not happen, so all new calls are stuck.

Signed-off-by: Horst Birthelmer <hbirthelmer@ddn.com>

(imported from commit 9550b4d)
Fixes xfstests generic/451, similar to how commit b359af8 ("fuse:
Invalidate the page cache after FOPEN_DIRECT_IO write") fixes xfstests
generic/209.

Signed-off-by: Cheng Ding <cding@ddn.com>

(imported from commit 51e0799)
Add security_inode_invalidate_secctx() call to invalidate cached
security context when inode attributes change. This ensures that
SELinux security labels are properly refreshed and prevents stale
security context from being used after inode modifications.

Signed-off-by: Kevin Chen <kchen@ddn.com>

(imported from commit 6c9ec1d)
Signed-off-by: Horst Birthelmer <hbirthelmer@ddn.com>
@hbirth hbirth force-pushed the redfs-ubuntu-resolute-7.0.0-14.14 branch from 199f0a1 to b321885 Compare April 29, 2026 09:50
@hbirth hbirth requested review from cding-ddn, openunix and yongzech May 18, 2026 07:17
Collaborator Author

hbirth commented May 18, 2026

Current xfstests with passthrough_hp give us

Failures: generic/020 generic/062 generic/080 generic/120 generic/184 generic/215 generic/434 generic/531 generic/631 generic/633 generic/684
Failed 11 of 781 tests

Collaborator

bsbernd commented May 18, 2026

> generic/631 generic/633 generic/684
> Failed 11 of 781 tests

That is great, down to 11!

Collaborator Author

hbirth commented May 18, 2026

> > generic/631 generic/633 generic/684
> > Failed 11 of 781 tests
>
> That is great, down to 11!

This is Linux 7.0 ... there's a lot of work that was done since 6.8 ;-)

7 participants