Use RTLD_NODELETE when loading the render engine plugin #1280

Merged
iche033 merged 2 commits into gazebosim:main from taylorhoward92:fix/reload-engine-rtld-nodelete
Apr 29, 2026
Conversation

@taylorhoward92 taylorhoward92 commented Apr 25, 2026

🦟 Bug fix

Fixes #1265

Summary

Repeated unloadEngine / engine cycles dlopen and dlclose the engine plugin. A transitive dependency in the plugin's chain (libgomp on Ubuntu Noble + rotary nightly packages, via libgz-common-graphics → libassimp and friends) uses thread-local storage. Without RTLD_NODELETE, each reload allocates from glibc's static-TLS surplus, which is not reliably reclaimed on dlclose. After ~10 cycles the surplus is exhausted and the next dlopen fails with:

Error while loading the library [/usr/local/lib/libgz-rendering-ogre2.so]: /lib/x86_64-linux-gnu/libgomp.so.1: cannot allocate memory in static TLS block
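
To see which objects in a process carry a TLS segment at all, a diagnostic along the following lines can help. This is a minimal sketch and not part of this PR: it assumes Linux with glibc, and `TlsModule`, `CollectTls` and `ListTlsModules` are hypothetical names. It only reports the size of each PT_TLS segment; it does not tell whether a module needs static TLS.

```cpp
// Assumes glibc: dl_iterate_phdr is a GNU extension (g++ defines _GNU_SOURCE).
#include <link.h>   // dl_iterate_phdr, dl_phdr_info, ElfW
#include <elf.h>    // PT_TLS

#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical record: one loaded object that has a TLS segment.
struct TlsModule
{
  std::string name;   // empty for the main executable
  std::size_t bytes;  // p_memsz of the PT_TLS program header
};

// Callback invoked once per loaded object; appends objects with PT_TLS.
static int CollectTls(struct dl_phdr_info *info, std::size_t, void *data)
{
  auto *out = static_cast<std::vector<TlsModule> *>(data);
  for (int i = 0; i < info->dlpi_phnum; ++i)
  {
    const ElfW(Phdr) &ph = info->dlpi_phdr[i];
    if (ph.p_type == PT_TLS)
    {
      out->push_back({info->dlpi_name ? info->dlpi_name : "",
                      static_cast<std::size_t>(ph.p_memsz)});
    }
  }
  return 0;  // zero means "keep iterating"
}

// Returns every currently loaded object that declares thread-local storage.
std::vector<TlsModule> ListTlsModules()
{
  std::vector<TlsModule> mods;
  dl_iterate_phdr(CollectTls, &mods);
  return mods;
}
```

Comparing the result of `ListTlsModules()` before and after the first `engine()` call should show which libraries in the plugin's chain bring TLS with them.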

This regression appeared on main after #1246 switched CI to the rotary alias packages, which (via the gz-common rebuild that swapped FreeImage for vendored STB in gz-common#803) no longer transitively pull libfreeimage → libraw → libgomp into the test binary's startup DT_NEEDED. The static-TLS exhaustion was latent before that change: libfreeimage's dep chain was anchoring libgomp in the main binary's TLS region for free.

The fix passes _noDelete=true to gz::plugin::Loader::LoadLib, which gates the RTLD_NODELETE flag that's already supported by gz-plugin's loader. With this flag, dlclose keeps the library mapped, finalizers don't run, and TLS slots aren't released. TLS is allocated once on first load and reused on every subsequent reload, eliminating the surplus leak regardless of which library in the plugin's chain is the TLS hog.
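
For illustration, here is a minimal sketch of how a boolean like `_noDelete` could gate the flag in a plain `dlopen` wrapper. It is not gz-plugin's actual code: `DlopenFlags`, `OpenLibrary` and `IsResident` are hypothetical helpers, and the other flags (`RTLD_LAZY | RTLD_LOCAL`) are an assumption. It assumes Linux with glibc (older toolchains need `-ldl`).

```cpp
#include <dlfcn.h>  // dlopen, dlclose, RTLD_* flags

#include <cassert>
#include <string>

// Hypothetical helper: build the dlopen mode, adding RTLD_NODELETE on request.
int DlopenFlags(bool noDelete)
{
  int flags = RTLD_LAZY | RTLD_LOCAL;  // assumed base flags
  if (noDelete)
    flags |= RTLD_NODELETE;  // dlclose will not unmap the library
  return flags;
}

// Hypothetical helper: open a shared library; returns nullptr on failure.
void *OpenLibrary(const std::string &path, bool noDelete)
{
  return dlopen(path.c_str(), DlopenFlags(noDelete));
}

// True if the library is currently mapped in this process.
// RTLD_NOLOAD makes dlopen return a handle only if it is already loaded.
bool IsResident(const std::string &path)
{
  void *handle = dlopen(path.c_str(), RTLD_LAZY | RTLD_NOLOAD);
  if (handle == nullptr)
    return false;
  dlclose(handle);  // drop the reference that RTLD_NOLOAD just took
  return true;
}
```

With `noDelete` set, a library opened through `OpenLibrary` is expected to stay resident after `dlclose`, which `IsResident` can confirm.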

Reproduction

REGRESSION_reload_engine_ogre2_gl3plus reproduces the failure 100% of the time inside a fresh ubuntu:noble docker container with gzdev repository enable --project=rotary (matching what gazebo-tooling/action-gz-ci@noble does). Before this PR: 5 cases fail with the static-TLS error. After this PR: 8/8 cases pass.

Trade-off

The engine plugin and its transitive dependencies remain mapped for the lifetime of the process. For a rendering engine this is effectively the expected lifetime anyway. Ogre::Root is created and destroyed by Ogre2RenderEngine at the C++ level, separately from library load/unload, so calling unloadEngine followed by engine() still produces a fresh Ogre::Root. What changes:

  • Static constructors in the plugin (and its deps) run once per process, not once per load.
  • Static destructors run only at process exit.
  • Memory grows monotonically; once mapped, never unmapped (~50 MB one-time).
  • Hot-reload from disk (recompile + reload without restart) would no longer work for engine plugins.
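
The first two points can be pictured with an in-process stand-in. This is only a sketch: it involves no dlopen, and `Registrar`, `NextId`, `Engine` and `ReloadCycles` are hypothetical names, not gz-rendering code. It shows state that is rebuilt each cycle next to static state that lives for the whole process.

```cpp
#include <cassert>

// Stand-in for a static initializer in the plugin: with the library kept
// mapped, its constructor runs once per process.
struct Registrar
{
  Registrar() { ++constructed; }
  static int constructed;
};
int Registrar::constructed = 0;
static Registrar g_registrar;

// Stand-in for a function-local static counter (e.g. an id generator).
// It is never reset while the library stays mapped.
int NextId()
{
  static int id = 0;
  return id++;
}

// Stand-in for per-engine state that is created fresh on every cycle.
struct Engine
{
  int firstId = NextId();
};

// Simulates n unload/load cycles and returns the id seen by the last engine.
int ReloadCycles(int n)
{
  int last = -1;
  for (int i = 0; i < n; ++i)
  {
    Engine e;  // a new object each cycle, like a fresh Ogre::Root
    last = e.firstId;
  }
  return last;
}
```

Each `Engine` is new, but the counter keeps counting across cycles, which is the behavior to expect from any non-const static in the plugin once it is no longer unloaded.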

Alternatives considered

  • Linking -Wl,--no-as-needed -lgomp into the test executable (initial attempt): works, but it only patches the test, doesn't fix the underlying reload bug for downstream consumers, is GCC-specific, and only handles libgomp.
  • GLIBC_TUNABLES=glibc.rtld.optional_static_tls=N: runtime-only, delays exhaustion rather than fixing it, and requires CI-environment plumbing.
  • Trimming the plugin's transitive NEEDED chain: the chain is largely load-bearing (the plugin uses ~80 symbols from libgz-common-graphics); not a productive direction.

RTLD_NODELETE is the only option that fixes the bug for every gz::rendering::engine() consumer and not just this one test.

Checklist

  • Signed all commits for DCO
  • Added a screen capture or video to the PR description that demonstrates the fix (as needed)
  • Added tests
  • Updated documentation (as needed)
  • Updated migration guide (as needed)
  • Consider updating Python bindings (if the library has them)
  • codecheck passed (See contributing)
  • All tests passed (See test coverage): verified end-to-end on Ubuntu Noble x86_64 in a fresh ubuntu:noble docker container with gzdev rotary packages; the previously failing REGRESSION_reload_engine_ogre2_gl3plus now passes (5.18 s, 8/8 cases).
  • Updated Bazel files (if adding new files). Created an issue otherwise.
  • While waiting for a review on your PR, please help review another open pull request to support the maintainers
  • Was GenAI used to generate this PR? If so, make sure to add "Generated-by" to your commits. (See this policy for more info.)

Generated-by: Claude Code

Note to maintainers: Remember to use Squash-Merge and edit the commit message to match the pull request summary while retaining Signed-off-by and Generated-by messages.

Repeated unloadEngine/engine cycles dlopen and dlclose the engine
plugin.  Some transitive dependency (libgomp on Ubuntu Noble + rotary
nightly packages) uses thread-local storage; without RTLD_NODELETE,
each reload allocates from glibc's static-TLS surplus, which is not
reliably reclaimed on dlclose.  After ~10 cycles the surplus is
exhausted and the next dlopen fails with
  "cannot allocate memory in static TLS block"

Pass _noDelete=true to gz-plugin's Loader::LoadLib so dlclose keeps
the library mapped.  TLS is allocated once on first load and reused
on every subsequent reload, eliminating the leak.

Trade-off: the plugin and its transitive deps remain resident for
the lifetime of the process.  For a rendering engine this is the
expected lifetime anyway.

Fixes gazebosim#1265

Generated-by: Claude Opus 4.7
Signed-off-by: Taylor Howard <taylorhoward@me.com>
@taylorhoward92 taylorhoward92 marked this pull request as ready for review April 25, 2026 04:14
@taylorhoward92 taylorhoward92 requested a review from iche033 as a code owner April 25, 2026 04:14
iche033 commented Apr 27, 2026

thanks for tracking this down. Verified that the changes fix the issue in local testing.

iche033 commented Apr 27, 2026

would be good to get another pair of eyes on this to ensure this is safe, maybe @azeey

@azeey azeey left a comment

Thanks for investigating this! The explanation makes sense, but I don't know why the test passes on Jenkins (e.g. https://build.osrfoundation.org/view/gz-rotary/job/gz_rendering-ci-main-noble-amd64/198/) but fails on GitHub Actions.

I think we can live with the trade-offs. Other than tests, I don't think we have a common use case where we want the rendering engine to be repeatedly loaded/unloaded. We do have some variables with static storage. Checking with nm -SlC lib/libgz-rendering-ogre2.dylib | grep -F ' b ' | v, I see the following non-const variables that will not be reinitialized:

gz::rendering::v11::Ogre2RenderTarget::TargetFSAA(unsigned char)::ogre2FSAAWarn
gz::rendering::v11::Ogre2DynamicRenderable::CreateDynamicMesh()::dynamicRenderableId
gz::rendering::v11::Ogre2GaussianNoisePass::CreateRenderPass()::gaussianNodeCounter
gz::rendering::v11::Ogre2DepthGaussianNoisePass::CreateRenderPass()::gaussianDepthNodeCounter

and the following static initializers that affect singletons from

#define GZ_RENDERING_REGISTER_RENDER_PASS(classname, interface) \
:

global_Ogre2LensFlarePassFactory
global_Ogre2GaussianNoisePassFactory

@iche033 will these affect tests?

Overall, we've done this type of fix in other places (e.g. gazebosim/gz-sim#1649), so I think it's okay to merge even if there are still unanswered questions.

iche033 commented Apr 29, 2026

We do have some variables with static storage.

Good point. I don't think it'll affect the tests or typical usage of gz-rendering. But I ticketed an issue to track this: #1285

@iche033 iche033 merged commit fb9dd4c into gazebosim:main Apr 29, 2026
12 checks passed
@github-project-automation github-project-automation Bot moved this from In review to Done in Core development Apr 29, 2026
iche033 commented Apr 29, 2026

@Mergifyio backport gz-rendering10 gz-rendering9 gz-rendering8

mergify Bot commented Apr 29, 2026

backport gz-rendering10 gz-rendering9 gz-rendering8

✅ Backports have been created

@taylorhoward92 taylorhoward92 deleted the fix/reload-engine-rtld-nodelete branch April 30, 2026 11:46
iche033 pushed a commit that referenced this pull request Apr 30, 2026
Repeated unloadEngine/engine cycles dlopen and dlclose the engine
plugin.  Some transitive dependency (libgomp on Ubuntu Noble + rotary
nightly packages) uses thread-local storage; without RTLD_NODELETE,
each reload allocates from glibc's static-TLS surplus, which is not
reliably reclaimed on dlclose.  After ~10 cycles the surplus is
exhausted and the next dlopen fails with
  "cannot allocate memory in static TLS block"

Pass _noDelete=true to gz-plugin's Loader::LoadLib so dlclose keeps
the library mapped.  TLS is allocated once on first load and reused
on every subsequent reload, eliminating the leak.

Trade-off: the plugin and its transitive deps remain resident for
the lifetime of the process.  For a rendering engine this is the
expected lifetime anyway.

Fixes #1265

Generated-by: Claude Opus 4.7

Signed-off-by: Taylor Howard <taylorhoward@me.com>
(cherry picked from commit fb9dd4c)
iche033 pushed a commit that referenced this pull request Apr 30, 2026
(cherry picked from commit fb9dd4c)
iche033 pushed a commit that referenced this pull request Apr 30, 2026
(cherry picked from commit fb9dd4c)
Projects

Archived in project

Development

Successfully merging this pull request may close these issues.

REGRESSION_reload_engine_ogre2_gl3plus fails in rotary on github action

3 participants