Algorithm Harmonization #5.1 CKF, main branch (2026.02.12.) #1259

krasznaa wants to merge 14 commits into acts-project:main from
Conversation
stephenswat left a comment:
Looks okay, but far too verbose with the payload structs. Try to deduplicate those using the existing structs we have.
> // Here we could give control back to the caller, once our code allows
> // for it. (coroutines...)
Instead of copying the same unstructured comment six times across this file, prepend them with TODO: to make them findable.
TODO is flagged by SonarCloud. This is not.
I'm copying the same sentence to make it easily searchable in our code once we embark on such a code change.
> `TODO` is flagged by SonarCloud.
That's exactly the point: SonarCloud and other tools give you a list of code sections marked with TODO (and FIXME, etc.) comments so that you can easily find them. 😛
But it's not obvious that we will want to do anything here. Not to me. Not yet.
Force-pushed b9265c0 to e0e2ca5
stephenswat left a comment:

Starting to look a bit better. 👍
Performance summary

Here is a summary of the performance effects of this PR:

Important: All metrics in this report are given as reciprocal throughput, not as wallclock runtime.

Note: This is an automated message produced upon the explicit request of a human being.
This looks good!

But FYI: the physics CI currently fails with

😦 To be fixed then...
Force-pushed e0e2ca5 to d00f44e
I did not manage to reproduce a crash with:

I even tried 2 different CUDA versions. Could you re-check @stephenswat? If you still see a crash, I'll need to test on the same node. 🤔
Physics performance summary

Here is a summary of the physics performance effects of this PR. Command used:

Seeding performance: total number of seeds went from 298344 to 298340 (-0.0%)

Track finding performance: total number of found tracks went from 50221 to 50224 (+0.0%)

Track fitting performance

Seeding to track finding relative performance

Note: This is an automated message produced on the explicit request of a human being.
Performance summary

Here is a summary of the performance effects of this PR:

Important: All metrics in this report are given as reciprocal throughput, not as wallclock runtime.

Note: This is an automated message produced upon the explicit request of a human being.
Interestingly, testing on

At current main:

With this PR:

So that would rather be a 10% slowdown. Fascinating!
Ahh, never mind. When I actually add up all the time that is spent in
Throughputs on the A5000:
Regarding the d00f44e commit: what happens here is that the register usage changes (probably due to the kernel arguments), which increases occupancy but also increases register spilling. So the compiler is doing a poor job optimising there, and the CI benchmark is being tricked by the increased occupancy.
Indeed, I increased the block size in some cases. If that's the culprit, that would be a pretty clean issue to fix. There were comments here and there in the CUDA code for some of the block size choices, but not for all of them. I remember that one of them didn't seem to make sense to me, so I changed it on purpose. I'll do some tests of my own on an L40S a little later today and let you know what I find. Your findings are very useful, to be very clear about that.
This commit adds the `__grid_constant__` qualifier to the CUDA track finding kernel, allowing the compiler to make some additional optimisations. This should also help us better understand performance issues such as the ones in acts-project#1259.
Force-pushed d00f44e to 08f0655
With

I'll do some further work a bit later on. 🤔
On the A5000 there is still a very noticeable performance impact, with the throughput going from 151 Hz to 137 Hz. 🙁
One hope I have (and it would be lovely if it turned out to be true) is that once acts-project/vecmem#350 is collected into this project, that would get rid of a lot of this difference. Since the unified code does all of its synchronization through (CUDA) events, versus the current code doing a bunch of CUDA stream synchronizations. Let's see...
Unfortunately, the results I collect are with the event pooling already enabled, so I am afraid that this won't help.
Force-pushed c5fdf02 to 1eb7409
Force-pushed fa7320b to 4ff1183
Force-pushed 4ff1183 to 2bcdf88
I have some new insights on this PR. 🤔

For some not-yet-understood reason, this rewrite of the CKF algorithm behaves badly in a multi-threaded environment. I wonder how the attached PDF will show up on GitHub, but this shows the issue pretty nicely:

So now it's time to profile the job for its multi-threading efficiency...
I believe I have somewhat of an understanding of "the situation" now. 🤔

The "threading analysis" in VTune didn't reveal anything. 😦 When it comes to actual mutexes and the like, the main branch and this PR behave in pretty much identical ways. So it's not that this PR's code would be passively waiting for things at any point.

Rather, it's the "hotspots analysis" that's more insightful. Looking at the most expensive individual functions during a representative job, I see the following in the main branch:
And this is what I see with this PR:
This PR does in fact turn all

Let me tag @m-fila and @ericcano on this. 🤔 As I think this goes well in the direction that Eric has been working on for a while. The finding was that

I'll need to think a bit about what we could realistically implement out of that for the examples of this repository, and then for the examples that we'll write for Acts. For ATLAS offline I do have some ideas of what we'd do in the long term. 🤔
Thanks for tagging us. I don't remember us being unhappy with

I've done measurements on an Nvidia L40S for the clustering and seeding stages, with async handlers doing either

So I'd say it's something new to me.
…ion_kernel_payload. Modified device::apply_interaction_payload to not be templated, with device::apply_interaction receiving the detector view as a separate argument instead. And then updated all the clients of the common algorithm base class to implement their versions of apply_interaction_kernel accordingly.
…rnel_payload. Ended up putting the modified device::find_tracks_payload into its own header file, to avoid compilation issues arising from the Thrust code used in find_tracks.ipp.
…ext_surface_kernel_payload.
In a very specific build setup the compilation got upset about not encountering template specializations in the correct order. While I'm not sure why that is (the include setup, while not perfect, should technically work currently), the fix does make the code look a bit nicer. So it might as well go in.
Mostly due to the dropping of the apply_interactions step in the CKF, and the change in the edm::measurement definition.
Updated the tests for the edm::measurement changes. And changed the build such that it would only use "the Alpaka compiler" for the test(s) that need to build device code directly. Thereby sidestepping issues coming from NVCC being directly exposed to edm::<Foo>::host types.
Force-pushed 2bcdf88 to 4068491







































Following up on #1240, this is finally synchronizing the behaviour of the Alpaka CKF algorithm with the CUDA and SYCL ones. Using the same code re-write done in previous "harmonization PRs".
While at it, I also added unit tests for the Alpaka CKF algorithm. These unit tests will need to be re-designed a bit in a future PR to reduce the amount of code duplication. But I didn't want to bother with that in this PR.
Note that I cannot run one of the tests successfully with Alpaka on a CPU. 😕 The other test runs happily on a CPU, and I was also able to get reasonable outputs from various example binaries with the CKF. So I'm not sure what that particular test has against running on a CPU. (With Alpaka's CUDA backend it runs fine.) I just gave up on understanding that after about an hour of looking at it.
Once I added the latest kernels/device functions to Alpaka, GCC flagged a few things in that code. 🤔 The complaint about `candidate_link` variables not being initialized could be GCC just being overly cautious. But setting an initial value for `out_idx` was, I believe, a good find by the compiler. 🤔