Algorithm Harmonization #5.1 CKF, main branch (2026.02.12.) #1259

krasznaa wants to merge 14 commits into acts-project:main from
Conversation
stephenswat left a comment:
Looks okay, but far too verbose with the payload structs. Try to deduplicate those using the existing structs we have.
> // Here we could give control back to the caller, once our code allows
> // for it. (coroutines...)
Instead of copying the same unstructured comment six times across this file, prepend them with TODO: to make them findable.
TODO is flagged by SonarCloud. This is not.
I'm copying the same sentence to make it easily searchable in our code once we embark on such a code change.
> `TODO` is flagged by SonarCloud.
That's exactly the point: SonarCloud and other tools give you a list of code sections marked with TODO (and FIXME, etc.) comments so that you can easily find them. 😛
But it's not obvious that we will want to do anything here. Not to me. Not yet.
Force-pushed b9265c0 to e0e2ca5
stephenswat left a comment:

Starting to look a bit better. 👍
Performance summary

Here is a summary of the performance effects of this PR:

Important: All metrics in this report are given as reciprocal throughput, not as wallclock runtime.

Note: This is an automated message produced upon the explicit request of a human being.
This looks good!

But FYI: the physics CI currently fails with

😦 To be fixed then...
Force-pushed e0e2ca5 to d00f44e
I did not manage to reproduce a crash with:

I even tried 2 different CUDA versions. Could you re-check @stephenswat? If you still see a crash, I'll need to test on the same node. 🤔
Physics performance summary

Here is a summary of the physics performance effects of this PR. Command used:

Seeding performance: total number of seeds went from 298344 to 298340 (-0.0%)

Track finding performance: total number of found tracks went from 50221 to 50224 (+0.0%)

Track fitting performance

Seeding to track finding relative performance

Note: This is an automated message produced on the explicit request of a human being.
Performance summary

Here is a summary of the performance effects of this PR:

Important: All metrics in this report are given as reciprocal throughput, not as wallclock runtime.

Note: This is an automated message produced upon the explicit request of a human being.
Interestingly, testing on

At current main:

With this PR:

So that would rather be a 10% slowdown. Fascinating!
Ahh, never mind. When I actually add up all the time that is spent in
Throughputs on the A5000:
Regarding the d00f44e commit: what happens here is that the register usage changes (probably due to the kernel arguments), which increases occupancy but also increases register spilling. So the compiler is doing a poor job optimising there, and the CI benchmark is being tricked by the increased occupancy.
Indeed, I increased the block size in some cases. If that's the culprit, that would be a pretty clean issue to fix. There were comments here and there in the CUDA code for some of the block size choices, but not for all of them. I remember that one of them didn't seem to make sense to me, so I changed it on purpose. I'll do some tests of my own on an L40S a little later today and let you know what I find. Your findings are very useful, to be very clear about that.
This commit adds the `__grid_constant__` qualifier to the CUDA track finding kernel, allowing the compiler to make some additional optimisations. This should also help us better understand performance issues such as the ones in acts-project#1259.
Force-pushed d00f44e to 08f0655
With

I'll do some further work a bit later on. 🤔
On the A5000 there is still a very noticeable performance impact, with the throughput going from 151 Hz to 137 Hz. 🙁
One hope I have (and it would be lovely if it turned out to be true) is that once acts-project/vecmem#350 is collected into this project, that would get rid of a lot of this difference. Since the unified code does all of its synchronization through (CUDA) events, versus the current code doing a bunch of CUDA stream synchronizations. Let's see...
Unfortunately, the results I collect are with the event pooling already enabled, so I am afraid that this won't help.
Force-pushed c5fdf02 to 1eb7409
Force-pushed fa7320b to 4ff1183
Force-pushed 4ff1183 to 2bcdf88
I have some new insights on this PR. 🤔

For some not-yet-understood reason, this rewrite of the CKF algorithm behaves badly in a multi-threaded environment. I wonder how the attached PDF will show up on GitHub, but this shows the issue pretty nicely:

So now it's time to profile the job for its multi-threading efficiency...
I believe I have somewhat of an understanding of "the situation" now. 🤔

The "threading analysis" in VTune didn't reveal anything. 😦 When it comes to actual mutexes and the like, the main branch and this PR behave in pretty much identical ways. So it's not that this PR's code would be passively waiting for things at any point.

Rather, it's the "hotspots analysis" that's more insightful. Looking at the most expensive individual functions during a representative job, I see the following in the main branch:
And this is what I see with this PR:
This PR does in fact turn all

Let me tag @m-fila and @ericcano on this. 🤔 As I think this goes well in the direction that Eric has been working on for a while. The finding was that

I'll need to think a bit about what we could realistically implement out of that for the examples of this repository, and then for the examples that we'll write for Acts. For ATLAS offline I do have some ideas of what we'd do in the long term. 🤔
Thanks for tagging us. I don't remember us being unhappy with

I've done measurements on an Nvidia L40S for the clustering and seeding stages, with async handlers doing either

So I'd say it's something new to me.
…ion_kernel_payload. Modified device::apply_interaction_payload to not be templated, with device::apply_interaction receiving the detector view as a separate argument instead. And then updated all the clients of the common algorithm base class to implement their versions of apply_interaction_kernel accordingly.
…rnel_payload. Ended up putting the modified device::find_tracks_payload into its own header file, to avoid compilation issues arising from the Thrust code used in find_tracks.ipp.
…ext_surface_kernel_payload.
In a very specific build setup the compilation got upset about not encountering template specializations in the correct order. While I'm not sure why that is (the include setup, while not perfect, should technically work currently), the fix does make the code look a bit nicer. So it might as well go in.
Mostly due to the dropping of the apply_interactions step in the CKF, and the change in the edm::measurement definition.
Updated the tests for the edm::measurement changes. And changed the build such that it would only use "the Alpaka compiler" for the test(s) that need to build device code directly. Thereby sidestepping issues coming from NVCC being directly exposed to edm::<Foo>::host types.
Force-pushed 2bcdf88 to 4068491







































Following up on #1240, this is finally synchronizing the behaviour of the Alpaka CKF algorithm with the CUDA and SYCL ones. Using the same code re-write done in previous "harmonization PRs".
While at it, I also added unit tests for the Alpaka CKF algorithm. These unit tests will need to be re-designed a bit in a future PR to reduce the amount of code duplication. But I didn't want to bother with that in this PR.
Note that I cannot run one of the tests successfully with Alpaka on a CPU. 😕 The other test runs happily on a CPU, and I was also able to get reasonable outputs from various example binaries with the CKF. So I'm not sure what that particular test has against running on a CPU. (With Alpaka's CUDA backend it runs fine.) I just gave up on understanding that after about an hour of looking at it.
Once I added the latest kernels/device functions to Alpaka, GCC flagged a few things in that code. 🤔 The complaint about `candidate_link` variables not being initialized could be GCC just being overly cautious. But setting an initial value for `out_idx` was, I believe, a good find by the compiler. 🤔