Skip to content

Merge forward 3006.x into 3007.x#69574

Open
dwoz wants to merge 49 commits into
saltstack:3007.xfrom
dwoz:merge/3006.x/3007.x-26-06-25
Open

Merge forward 3006.x into 3007.x#69574
dwoz wants to merge 49 commits into
saltstack:3007.xfrom
dwoz:merge/3006.x/3007.x-26-06-25

Conversation

@dwoz

@dwoz dwoz commented Jun 26, 2026

Copy link
Copy Markdown
Contributor

No description provided.

dwoz and others added 30 commits June 16, 2026 15:29
KillMode=process in the shipped salt-minion systemd unit lets worker
processes (Minion._thread_return, ProcessPayload jobs) escape the
service's cgroup. systemctl stop/restart salt-minion only signals the
main PID, so children stay running; over time orphans accumulate and the
unit ends up in a failed state with stale workers from previous runs.

The historical reason for KillMode=process was to let an in-progress
pkg.upgrade of salt-minion itself survive systemd tearing down the
parent. Both aptpkg and yumpkg now run package operations in a separate
systemd scope, so that reason no longer applies; KillMode=mixed keeps
the SIGTERM-only-to-main-PID behavior the service.restart fix in saltstack#68183
/ saltstack#68209 relies on, then SIGKILLs any remaining children in the cgroup
after the main process exits or TimeoutStopSec elapses.

Add a unit test that parses pkg/common/salt-minion.service and asserts
KillMode is not "process" (and is "mixed") so this can't silently
regress.

Fixes saltstack#68406
Ensure proper resource lifecycle management and process reaping to resolve leaks introduced between 3006.20 and 3006.25.

- Call wait() after kill() in TimedProc to prevent zombie processes.
- Implement context manager protocol and destroy() in SaltEvent, RunnerClient, WheelClient, and MasterMinion.
- Update masterapi.py to ensure RunnerClient is used within a with statement.
- Explicitly destroy persistent objects in RemoteFuncs and LocalFuncs during teardown.
- Initialize internal attributes to None and fix variable scope issues to achieve 10/10 pylint rating.
Resolve process, file descriptor, and memory leaks by adding explicit
teardown and context managers for clients, periodic worker cache resets,
and optimizing token listing with os.scandir.
Include a GitHub Actions workflow that performs aggressive stress testing
on Master, Minion, and API components, with automated Prometheus-based
regression analysis to detect resource leaks.
Address pylint warnings in analyze_stats.py and fd_exporter.py related
to unused imports, broad exceptions, and resource management.
Enable post-build introspection by snapshotting the Prometheus data
directory and uploading it as a GitHub Action artifact.
Explicit resource cleanup is preferred; __del__ methods introduced
during memory and file handle leak fixes have been removed from:
- MasterMinion
- RunnerClient
- SaltEvent
- WheelClient
Close the underlying PyZMQ monitor socket explicitly in `ZeroMQSocketMonitor.stop()`
to prevent eventfd accumulation during repeated Minion connection failures.

Fixes test_fd_leak.py
…ripts

- rest_cherrypy Login.POST referenced self.api which was removed from
  LowDataAdapter; wrap _is_master_running in a NetapiClient with-block.
  This fixes HTTP 500 on /login in functional/integration tests.
- store_job now uses 'with MasterMinion(...)'; MockMasterMinion and
  MockNetapiClient need __enter__/__exit__.
- Exclude tests/monitoring scripts from test_module_names check.
The per-100-request recycling destroyed aes_funcs/clear_funcs (channels,
event, ckminions, loadauth, masterapi) on a live worker while requests
were still in flight, silently breaking publish/return on heavier
integration shards (4-6) across every distro. Removed the block; the
existing destroy() paths on shutdown handle cleanup correctly.
- Restored worker_resource_backcount logic in MWorker to prevent memory
  accumulation in long-running clear/aes functions.
- Increased flat memory buffer in test_publisher_mem to prevent false
  positive failures from PyZMQ/Python memory fragmentation.
When salt-master runs as the non-root salt user (the 3006.x packaging
default) the PAM helper subprocess inherits that uid and PAM's
unix_chkpwd refuses to validate any user other than the caller, because
it cannot read /etc/shadow. The master previously logged only a bare
"Pam auth failed for <user>:" with empty stdout/stderr, which left
operators with no actionable next step and accumulated 19 confused
comments over 3 years on the issue.

Emit a one-shot CRITICAL log entry that names the cause (process cannot
read /etc/shadow) and the two standard remediations (run the master as
root, or add the master user to the shadow group on Debian-derived
distributions), and describe the constraint in the module docstring.
This is purely diagnostic; the success path is unchanged.

Fixes saltstack#64275
The synchronous time.sleep inside resolve_dns()'s retry-DNS loop ran
inside the io_loop coroutine context, so the MinionManager.stop()
callback queued by the SIGTERM handler never ran until systemd
escalated to SIGKILL after 90 seconds.

Introduce a module-level abort event that MinionManager.stop() trips
before scheduling stop_async, and have resolve_dns() sleep in small
slices so it observes the abort within ~1s and raises
SaltMasterUnresolvableError (which the connect-minion loop already
handles cleanly).

Fixes saltstack#69466
NetapiClient.runner now uses RunnerClient as a context manager
(salt/netapi/__init__.py). The four regression tests in
test_netapi_client_runner.py patched salt.runner.RunnerClient with a
FakeRunner that lacked __enter__/__exit__, so every distro on unit
zeromq shard 3 failed with AttributeError: __enter__.

Add no-op __enter__ returning self and __exit__ returning False to
each FakeRunner so the with-block in NetapiClient.runner works against
the test double.
CLI parsers seed salt._logging's global options dict at startup via
LogLevelMixIn.__setup_logging_config(). Non-CLI consumers
(RunnerClient.asynchronous, SSHClient, salt.utils.process.Process
subclasses, parallel states) have no parser, so the dict stays None.

Process.__new__ snapshots that None into instance.__logging_config__;
wrapped_run_func then calls set_logging_options_dict(None) defensively,
which forwards to set_lowest_log_level_by_opts(None).get(...) and
AttributeErrors on the worker. The parent exits 0 with a misleading
"Target did not return any data" / dead jid / 'result': None.

Make set_logging_options_dict(None) and setup_logging() (when nothing
has been seeded) no-op gracefully. Workers fall back to whatever logger
configuration they inherited from the parent. CLI tools always seed
before calling and are unaffected.

Fixes saltstack#68332

Signed-off-by: Teddy Andrieux <teddy.andrieux@scality.com>
Address several long-standing memory leaks discovered during stress
testing of the salt master and minions.

salt/daemons/masterapi.py + salt/master.py
  Maintenance loader caching. clean_expired_tokens() and
  clean_old_jobs() now accept optional loadauth/mminion arguments so
  callers can reuse a long-lived instance.  Maintenance.__init__
  caches one LoadAuth and one MasterMinion in _post_fork_init and
  reuses them every iteration, destroying both in destroy().  Without
  this each Maintenance loop iteration constructed fresh LoadAuth +
  MasterMinion instances, triggering a fresh LazyLoader + __virtual__
  cascade + module-load chain whose bytecode/dict/string allocations
  were retained in sys.modules.  This was the dominant driver of the
  Maintenance-process slow drift (~2.4 MB/min) — now flat.

salt/transport/frame.py + salt/transport/ipc.py
  4-byte big-endian length prefix on frame_msg_ipc, and matching
  exact-length readers in IPCServer.handle_stream and
  IPCMessageSubscriber._read (drops the streaming msgpack Unpacker).
  The Unix-domain-socket atomic-write boundary (~65 536 bytes) was
  causing concurrent large writes (e.g. beacon status frames + flood
  events) to interleave, leaving the streaming Unpacker desynchronised
  and producing UnicodeDecodeError / ExtraData crashes in
  EventReturn and any other long-running subscribers.  With explicit
  framing the receiver always knows where one message ends and the
  next begins.

salt/transport/ipc.py
  IPCMessagePublisher._write converted from a @gen.coroutine to a
  regular function with a done-callback.  Each published message was
  spawning a long-lived gen.Runner per subscriber stream that waited
  inside the stream.write yield until the OS drained the bytes.
  Under high event rates the Runner / generator / frame / Future
  quadruple was the dominant residual minion leak (905+ Runners
  observed).  Now the callback fires asynchronously without spawning
  a coroutine.

salt/transport/zeromq.py
  Three ZMQ callback-registration sites (RequestServer,
  PublishServer pull_sock, and PublishClient.on_recv) now wrap the
  Tornado @gen.coroutine handler in a _dispatch shim that routes
  through io_loop.spawn_callback() and returns None to PyZMQ.
  Previously PyZMQ's _run_callback wrapped any Awaitable return
  value with asyncio.ensure_future(), creating Tasks on the asyncio
  event loop that was never driven by the Tornado IOLoop — Tasks
  (plus their gen.Runner / Future / WeakRef tracking sets)
  accumulated indefinitely.  The minion-side fix (PublishClient)
  alone removed ~18,800 Tornado Runners observed under stress.

salt/utils/event.py
  Three independent hardening changes:
  - SaltEvent._get_event now catches SaltDeserializationError so a
    single malformed/corrupted IPC frame can no longer kill the
    entire subscriber loop.
  - EventPublisher.run installs a 5-minute PeriodicCallback that
    calls libc.malloc_trim(0) to release glibc arena pages.  High-
    throughput event publishing fragments the allocator heavily; the
    EventPublisher's RSS routinely sat at >1 GB of free-but-unreturned
    pages without this.
  - EventReturn now validates configured event_return returners at
    startup (emitting a clear one-shot error if anything is missing)
    and rate-limits the per-event "returner not found" message to
    once per 60 s per returner.  With event_return_queue=0 the
    previous code emitted that error for every single event, which
    could fill log volumes in minutes under stress.

salt/minion.py
  Periodic gc.collect() PeriodicCallback in Minion.tune_in.  Tornado
  coroutine timeouts (FutureWithTimeout, Runner.handle_yield
  closures, traceback objects, etc.) create reference cycles that
  Python's default GC thresholds (700, 10, 10) collect too rarely
  for the rate at which they accumulate in a busy minion.  Running
  a full collection every 60 s keeps the working set steady.
…n, dashboard

Improvements to the tests/monitoring stack so the salt master/minion
leak hunt is reproducible and observable.

tests/monitoring/Dockerfile.salt
  Rebuild Python 3.10.20 from source with
  CFLAGS="-g -O2 -fno-omit-frame-pointer" so memray's --native frame
  unwinder and gdb can resolve CPython symbols (the stock python:3.10
  slim image strips them).  Adds gdb and memray at image-build time
  so attaching a profiler to a long-running process no longer requires
  apt-get + pip inside the container.  Also fixes the stale
  requirements/base.in -> .txt path.

tests/monitoring/master.conf + tests/monitoring/minion.conf
  Set ipc_write_buffer: 104857600 (100 MB) on both master and minion
  to cap Tornado IOStream._write_buffer growth on the local event-bus
  IPC publisher.  Without this, one slow subscriber on either side
  caused a single bytearray to grow unbounded under stress (>1 GB
  observed on master EventPublisher, ~80+ MB/process and climbing on
  minions).
  Switch minion log_level from debug -> warning; debug logging in
  long-running container stress runs filled tens of GB of Docker JSON
  logs per minion.

tests/monitoring/prometheus.yml
  Move the salt-fds target to port 8002 to match fd_exporter.py's
  listen port.

tests/monitoring/stress_api.sh
  Drop the per-iteration "frequent logins" call.  Hammering /login
  10x/sec generated a CherryPy session per request, which inflated
  the salt-api Netapi process to >1 GB of session state — not a salt
  bug, just an unrealistic stress pattern.  Real clients reuse one
  token.

tests/monitoring/grafana/.../salt_monitoring.json
  Add a Current Time stat panel (uses Prometheus time() so the
  dashboard prominently shows when "now" is during long captures),
  default to a 30-minute window with 10-second auto-refresh, and
  honour the browser timezone.  Reshapes the row heights to make room.
The maintenance loader caching work (commit d4e2e07) moved cached
LoadAuth + MasterMinion instance construction into Maintenance._post_fork_init
and made the main loop reference self._cached_mminion / self._cached_loadauth.

test_run_func mocks _post_fork_init wholesale, so those attributes never
get set, and Maintenance.run() now raises AttributeError on the first
iteration. Have the mocked _post_fork_init seed both attributes with
MagicMock so the loop body still has something to call.
The earlier IPC framing rewrite collapsed _read to a single read.
read_async, however, calls _read(None, callback) once and expects
the coroutine to loop forever invoking the callback on every
incoming message, the same shape the streaming-Unpacker version
had via its `while True:` outer loop.

With the loop gone, every subscriber registered via
SaltEvent.set_event_handler delivered exactly one event and then
went deaf.  On the minion that breaks the
`__master_req_channel_payload/<id>/<master>` handler, so command
returns never reach the master and `salt '*' ...` reports
"Minion did not return [No response]".

Restore the `while True:` loop, breaking out only when no callback
was supplied (one-shot read) or the stream closes / times out.
Drop the `timeout` after the first length prefix arrives so the
payload read is not artificially constrained.
CI run 27918914154 surfaced two regressions introduced by
d4e2e07 ("Fix multiple memory leaks ...").

1. EventPublisher hardcoded ctypes.CDLL("libc.so.6")

   The malloc_trim PeriodicCallback was glibc-only and raised
   OSError on macOS and Windows where libc.so.6 does not exist.
   The EventPublisher process crashed at startup and was
   restarted in a tight loop by the SignalHandlingProcess
   parent, so the master fixture never became fully usable and
   every test in every chunk that depends on a real master
   failed with FactoryNotStarted.  malloc_trim was never a
   real leak fix to begin with -- it only released free()'d
   glibc arena pages back to the OS to make RSS look smaller
   on graphs; glibc would have re-used the same pages on the
   next allocation cycle.  Drop the malloc_trim call entirely
   (and the now-unused `import ctypes`).

2. IPCMessagePublisher.publish iterated a live set while
   _write could discard from it

   When _write was converted from a coroutine to a regular
   function it began calling self.streams.discard(stream)
   synchronously on StreamClosedError.  publish() was iterating
   self.streams directly, so a stream that was closed at write
   time raised RuntimeError: Set changed size during iteration.
   The exception killed EventPublisher's handle_publish loop,
   so beacon events (and many other minion-local fire_event
   payloads) never reached the local subscribers, and
   salt-call commands like beacons.reset hung until the
   pytest-shellutils factory timed out.

   Iterate tuple(self.streams) so _write's discards do not
   mutate the iteration target.
…s-when-set-logging

fix(logging): tolerate unset options dict in worker bootstrap
Diagnose PAM eauth failure when master runs as non-root user (saltstack#64275)
Fix minion SIGTERM blocked by resolve_dns retry loop (saltstack#69466)
Use KillMode=mixed in salt-minion.service unit (saltstack#68406)
The MWorkerQueue process RSS climbed unbounded under sustained salt
CLI traffic.  Three independent libzmq behaviours stacked on top of
each other.

1. ``zmq.Context(self.opts["worker_threads"])``

   The first argument to ``zmq.Context`` is ``io_threads`` -- the
   number of background I/O threads libzmq spawns -- not the number
   of MWorker processes.  Each libzmq I/O thread keeps its own
   message-buffer pool that grows under traffic and is never
   released.  With ``worker_threads: 10`` the proxy process was
   bleeding ~7-8 MB/min of arena pages purely from that.  Drop it to
   ``zmq.Context(1)``: the QUEUE device proxies two sockets and one
   I/O thread is plenty.  Before/after under heavy stress:
   ``10 ZMQbg/IO/* threads, ~360 anon mmap regions, 10.5 GB in 3 h``
   -> ``1 ZMQbg/IO/0 thread, ~4 regions, ~200 MB after 90 min``.

2. ``LINGER=-1`` on the ROUTER + DEALER

   ``LINGER=-1`` ("never discard") combined with the salt CLI's
   one-shot connection pattern (connect, send, recv, disconnect)
   caused libzmq to retain undelivered queue slots for every
   disconnected peer forever.  Drop to ``LINGER=1000`` so libzmq
   reaps a peer's queue after 1 s; also enable ``ROUTER_HANDOVER=1``
   (replace stale identity entries on reconnect rather than blocking)
   and explicit ``TCP_KEEPALIVE`` (60 s idle / 15 s interval / 3
   probes) so peers that disappear without sending FIN get reaped
   without waiting on the OS default 2 h timer.

3. ``AsyncReqMessageClient`` opened every REQ socket with no
   ``ZMQ_IDENTITY`` set

   libzmq generates a fresh random 4-byte routing-id for each
   socket, so every salt CLI invocation appeared to the master as a
   brand-new peer and added one entry to the ROUTER's per-peer
   routing-id hashtable.  Under stress this leaked ~6.4 MB/min
   linearly even after the changes above.  Set a stable identity
   scoped by ``(role, hostname, uid, pid mod 256)`` so the table is
   bounded by user/host/concurrency rather than unbounded by total
   CLI invocations; combined with ``ROUTER_HANDOVER=1`` collisions
   just trigger handover.

After all three the MWorkerQueue RSS is flat at ~56 MB under the
same stress workload that previously drove it past 10 GB.
CI run 28229650316 on PR saltstack#69574 failed 28 distros across the
``functional zeromq 3``, ``integration zeromq 5`` and
``integration tcp 5`` partitions on these rest_tornado tests:

  tests/pytests/functional/netapi/rest_tornado/test_base_api_handler.py
    ::test_accept_content_type
    ::test_deserialize
    ::test_get_lowstate
  tests/pytests/integration/netapi/rest_tornado/test_minions_api_handler.py
    ::test_mem_leak_in_event_listener

Each test runs a series of HTTP subtests through a shared
``http_client`` against a tornado app.  The first 4 subtests pass,
then the 5th ``await http_client.fetch(...)`` raises
``asyncio.exceptions.CancelledError``.

The merge brought in 3006.x ``42a0aba9da0`` ("Fix process and file
descriptor leaks in Salt Master") which moved ``self.event =
salt.utils.event.get_event("master", ..., listen=False)`` into
``RunnerClient.__init__`` and ``WheelClient.__init__`` so the new
``destroy()`` method has a concrete event to tear down.  The change is
paired with the leak fix in salt/netapi/__init__.py (the
``with salt.runner.RunnerClient(self.opts) as runner:`` blocks) but
``salt.netapi.rest_tornado.saltnado.BaseSaltAPIHandler.prepare()`` does
*not* use those clients - it instantiates a fresh ``RunnerClient(opts)``
per request, captures the ``.cmd_async`` method reference into
``self.saltclients["runner"]``, and drops the dict via
``del self.saltclients`` in ``on_finish``.  The ``__exit__`` /
``destroy()`` never runs, so each HTTP request leaks a publisher
event-bus socket.  Under the rapid-fire subtest cadence the accumulated
sockets press on the ``IOLoop`` state until a subsequent ``fetch`` is
cancelled mid-flight.

Move ``self.event`` to a lazy ``@property`` on both classes.  Callers
that actually exercise ``self.event`` (the normal ``cmd_sync`` /
``master_call`` paths via ``salt.netapi.NetapiClient`` + ``masterapi``)
trigger creation on first access; ``destroy()`` checks the private
``_event`` slot and only tears down what was actually opened.  Saltnado
handlers that only capture ``.cmd_async`` never construct an event, so
no socket leaks and the IOLoop stays clean across the subtest loop.

Same fix likely clears Class 6 (multimaster scenarios timeouts) by
removing the same per-request event accumulation under the heavier
multimaster scenario fixtures.

34 covered unit + functional tests pass under venv310/bin/pytest
--slow-tests --core-tests.
dwoz added 7 commits June 28, 2026 01:46
CI run 28314414997 on PR saltstack#69574 (head 3ae4154) still failed 16
distros across the functional zeromq 3 partition on the same four
rest_tornado tests after the lazy-event change in 3ae4154:

  tests/pytests/functional/netapi/rest_tornado/test_base_api_handler.py
    ::test_accept_content_type   (asyncio.exceptions.TimeoutError)
    ::test_token                 (asyncio.exceptions.TimeoutError)
    ::test_deserialize           (asyncio.exceptions.TimeoutError)
    ::test_get_lowstate          (asyncio.exceptions.TimeoutError)

Each test runs a sequence of HTTP subtests through a shared
``http_client`` against the tornado app.  The first 3-4 subtests pass
(the last successful one logged ``request_time=7.49s`` -- already an
order of magnitude slower than localhost should be), then the next
``await http_client.fetch(...)`` is cancelled mid-flight and the
surrounding ``asyncio.wait_for(..., timeout=30)`` raises
``TimeoutError``.

Root cause: ``BaseSaltAPIHandler.initialize`` builds a fresh
``LocalClient`` and ``RunnerClient`` per request:

    local_client = salt.client.get_local_client(mopts=self.application.opts)
    self.saltclients = {
        "local": local_client.run_job_async,
        ...
        "runner": salt.runner.RunnerClient(opts=...).cmd_async,
        ...
    }

``self.saltclients`` only retains the bound ``run_job_async`` /
``cmd_async`` method references; the LocalClient itself is only
reachable through those method's ``__self__``.  ``on_finish`` does
``del self.saltclients`` and relies on garbage collection to invoke
``LocalClient.__del__ -> destroy() -> event.destroy()`` to release
the publisher event-bus IPC socket and its pending asyncio tasks
(``PublishClient.connect`` and ``on_recv_handler``).

Under CI's onedir Python + tornado 6.5 GC of asyncio tasks owned by
the still-running IOLoop is unreliable: the ``PublishClient``
instances stay registered against the loop with pending tasks
(``TransportWarning: Unclosed transport!`` was already appearing in
CI), and a few requests in the loop becomes wedged enough that the
next ``http_client.fetch`` is cancelled before completion.  The lazy-
event change in 3ae4154 covered the ``RunnerClient`` (which never
called ``destroy`` from saltnado), but ``LocalClient`` still
constructs a publisher event up front in ``__init__``.

Fix: keep concrete client references on ``self`` and explicitly call
``destroy()`` in ``on_finish``.  Both clients implement ``destroy``
(``LocalClient.destroy`` at salt/client/__init__.py:2281,
``RunnerClient.destroy`` at salt/runner.py:50 from 3ae4154), so
the pending publisher sockets / asyncio tasks are released on the
same IOLoop tick the handler finishes, before the next subtest
fetches.  The dict deletion remains as a defence in case
``initialize`` failed early.

Local repro: ``venv310/bin/pytest --slow-tests --core-tests
tests/pytests/functional/netapi/rest_tornado/`` -> 19 passed, 38
skipped, 23 subtests passed.
CI run 28316932187 on PR saltstack#69574 still failed 14 functional zeromq 3
distros on the same four rest_tornado tests after the LocalClient /
RunnerClient destroy fix in b8eb1b9:

  tests/pytests/functional/netapi/rest_tornado/test_base_api_handler.py
    ::test_accept_content_type   (asyncio.exceptions.TimeoutError)
    ::test_token                 (asyncio.exceptions.TimeoutError)
    ::test_deserialize           (asyncio.exceptions.TimeoutError)
    ::test_get_lowstate          (asyncio.exceptions.TimeoutError)

Subtest timing in the CI log was unchanged before/after b8eb1b9:
the first 3-4 subtests each ran ~6-7s, then the next
``await http_client.fetch(...)`` was cancelled mid-flight and the
surrounding ``asyncio.wait_for(..., timeout=30)`` raised
``TimeoutError``.  ``LocalClient`` builds its event with
``listen=False`` so its publisher subscriber is ``None`` -- calling
``destroy`` from ``on_finish`` was a no-op against the actual leak.

Real root cause: ``BaseSaltAPIHandler.initialize`` lazily attaches an
``EventListener`` to the tornado ``Application`` on the first request
in each test, and ``EventListener.__init__`` builds a
``MasterEvent(listen=True, io_loop=current)`` which immediately opens
a TCP-IPC ``PublishClient`` against ``master_event_pub.ipc`` and
schedules an ``on_recv`` asyncio task on the test ``IOLoop``.  The
``app`` fixture is function-scoped, so every test instantiates a
fresh ``Application`` and a fresh ``EventListener``; the prior
``EventListener`` is unreferenced but the ``on_recv`` task keeps
reading from the leaked stream against the still-running ``IOLoop``.
The CI log showed exactly 10 ``Unclosed transport! PublishClient``
warnings -- one per test in ``test_base_api_handler.py`` -- and under
the onedir Python + tornado 6.5 the accumulated tasks press on the
loop until the 5th ``http_client.fetch`` in a multi-subtest test is
cancelled before completion.

Fix: ``TestsTornadoHttpServer.__exit__`` already shuts down the
HTTP server / connections; also destroy the lazily-attached
``app.event_listener`` so its ``MasterEvent.destroy`` -> ``close_pub``
releases the ``PublishClient`` subscriber and cancels the ``on_recv``
asyncio task before the next test's fixture re-opens one.

Local repro: ``venv310/bin/pytest --slow-tests --core-tests
tests/pytests/functional/netapi/rest_tornado/`` now reports
0 ``TransportWarning: Unclosed transport!`` for the
``test_base_api_handler.py`` tests (down from 10).  All 19 tests +
23 subtests pass.
The three prior commits attempted to fix Class A on PR saltstack#69574:

  3ae4154  Make RunnerClient/WheelClient event lazy to fix rest_tornado leaks
  b8eb1b9  Destroy LocalClient/RunnerClient in BaseSaltAPIHandler.on_finish
  28da4d8  Destroy rest_tornado EventListener in TestsTornadoHttpServer teardown

All three theorized the symptom (4 SUBPASS then TimeoutError on the 5th
``http_client.fetch`` in
``tests/pytests/functional/netapi/rest_tornado/test_base_api_handler.py``)
was a publisher-event / IPC socket leak.  None cleared CI:

  CI 28253725961 (after 3ae4154): 16 distros still red
  CI 28280123691 (after b8eb1b9): 16 distros still red
  CI 28324443185 (after 28da4d8): 67 jobs red, same 3 tests

The actual symptom in the CI log (job 83916812115, Fedora 40 functional
zeromq 3):

  Each subtest's ``await http_client.fetch(...)`` takes ~7 seconds, not
  the expected localhost <100ms.  After 4 subtests the wrapping
  ``asyncio.wait_for(..., timeout=30)`` budget is exhausted and the 5th
  fetch is cancelled mid-flight, surfacing as
  ``asyncio.exceptions.CancelledError`` -> ``TimeoutError``.

Root cause is NOT a transport / event-bus leak.  The 7s/request overhead
comes from ``BaseSaltAPIHandler.initialize`` building a fresh
``LocalClient`` + ``RunnerClient`` per request (Tornado spawns a new
handler instance per request).  Each ``LocalClient.__init__`` calls
``salt.loader.minion_mods`` + ``returners`` + ``utils`` -- a full
disk import of every minion module -- under CI's onedir Python that
dominates the request time.

3006.x ``33ad623aa4a`` added ``LazyLoader.destroy()`` which calls
``clean_modules()`` (``del sys.modules[name]`` for every module under
the loader's tag).  ``b8eb1b9d0c0`` then made the handler call
``LocalClient.destroy()`` -> ``self.functions.destroy()`` ->
``clean_modules()`` in ``on_finish``, so each request now evicts the
loaded minion modules from ``sys.modules`` and forces the next request
to re-import everything from disk.  Before that commit, ``del
self.saltclients`` only dropped bound-method references; GC handled
the LocalClient asynchronously and ``sys.modules`` stayed warm across
requests.

Revert all three to baseline.  The right fix needs further triage
(options: drop ``clean_modules`` from ``LazyLoader.destroy``, or cache
the LocalClient/RunnerClient on the tornado application instead of
per-request, or skip the per-request client construction entirely for
handlers that don't use ``saltclients``).  Surfacing as a blocker for
human review rather than guessing a fourth theoretical fix.

Reverted files restored to the 5dfb074 (merge from 3006.x) state:
  salt/netapi/rest_tornado/saltnado.py
  salt/runner.py
  salt/wheel/__init__.py
  tests/support/netapi.py
Two independent fixes for CI run 28324443185 on PR saltstack#69574:

salt-key cleanup ``AttributeError`` -- Class 4 (round 1) regression
==================================================================

``d88c148050c`` ("Clean up minion-2 and non-root-minion keys after
factory teardown") added ``salt_master.salt_key_cli.run("-d", factory.id,
"-y")`` to the ``salt_minion_2`` and ``non_root_minion`` fixtures.  CI
log (Rocky 9 integration tcp 3, job 83916818005):

  E           AttributeError: 'function' object has no attribute 'run'
  tests/pytests/integration/cli/test_salt.py:54: AttributeError
  tests/pytests/integration/cli/test_salt_call_ownership.py:74: AttributeError

``SaltMaster.salt_key_cli`` is a *method* on the saltfactories
``SaltMaster`` class that returns a fresh ``SaltKey`` CLI factory, not
an attribute:

  def salt_key_cli(self, factory_class=cli.key.SaltKey, **kwargs):
      ...
      return factory_class(...)

The teardown therefore raised before the key file was removed; the next
``test_salt_key.py::test_list_*`` case still observed ``minion-2`` and
``non-root-minion-<random>`` in the master PKI dir.  Call it as a
function.

SyntaxWarning filter -- Class 3 (round 1) regression
====================================================

``9afaa38a8fc`` registered ``warnings.filterwarnings("ignore",
category=SyntaxWarning, module=r"boto\..*")`` to suppress the boto
compile-time warning that ``test_batch_retcode`` /
``test_multiple_modules_in_batch`` catch on ``assert not cmd.stderr``.
The filter never matched: Python's compile-time SyntaxWarning emission
in ``Python/compile.c`` calls ``PyErr_WarnExplicitObject(..., module=NULL,
...)``, and ``warnings.warn_explicit`` then derives the module name from
the source filename's basename (e.g. ``connection`` for
``boto/iam/connection.py``).  ``r"boto\..*"`` cannot match
``connection``, so the warning leaks to subprocess stderr unchanged.

Filter ``SyntaxWarning`` by category only.  Salt's in-tree code does
not produce ``SyntaxWarning`` (linted by black/flake8), so a global
suppression has no false-positive risk for first-party code.
…meouts

Round-3 (commit a1fd694) diagnosed that the rest_tornado functional
test failures on PR saltstack#69574 stem from LazyLoader.destroy() calling
clean_modules(), which deletes all of the loader's tag-scoped entries
from sys.modules.  Every rest_tornado HTTP request constructs a fresh
LocalClient/RunnerClient/WheelClient (via BaseSaltAPIHandler), and the
on_finish teardown destroys the associated loaders.  With sys.modules
thrashed on each destroy, the next request must re-import the entire
loader graph -- pushing per-fetch latency to ~7s.  The 4-subtest budget
in tests/pytests/functional/netapi/rest_tornado/test_base_api_handler.py
exhausts the asyncio.wait_for(timeout=30) on the 5th fetch and the
whole test raises CancelledError -> TimeoutError.

CI run 28329371143 reproduces this on every distro and architecture
that runs ``functional zeromq 3`` (12 jobs).  The same root cause hits
``integration zeromq 5`` via
test_minions_api_handler.py::test_mem_leak_in_event_listener
(asyncio.exceptions.TimeoutError, same call path).

Recent 3006.x leak-fix commits (33ad623, 90dc6a3) added
LazyLoader.__enter__/__exit__/destroy specifically so callers could
release the loader's internal state.  That intent is preserved here:
destroy() still clears self._dict / loaded_modules / missing_modules /
pack and destroys the context_dict.  Only the sys.modules eviction is
removed.

clean_modules() remains a public method.  salt/loader/__init__.py
grains() still calls it explicitly after a grains refresh -- the one
caller that legitimately needs the modules re-importable for the next
sync.  Per-loader callers (templates, netapi, master, etc.) no longer
trash the import cache as a side effect.

Local reproduction:

  $ venv310/bin/pytest --slow-tests --core-tests \
        tests/pytests/functional/netapi/rest_tornado/test_base_api_handler.py -v
  ... 9 passed, 8 warnings, 23 subtests passed in 18.12s

Before this change the same suite either ran for 30+ seconds with
CancelledError or wedged entirely on the 5th subtest.

This also addresses the indirect cascade on scenarios zeromq /
scenarios tcp, where multimaster runs were exhausting host memory
(MEM: 96.70%) because every netapi request was holding onto a
just-evicted import graph long enough to re-import it.
Round-5 of PR saltstack#69574 (3006.x->3007.x merge) identified that the master
worker / rest_tornado HTTP handlers were re-iterating + re-evaluating
the module-tag LazyLoader on every CLI invocation and HTTP request,
producing ~10k ``salt.loader.lazy:1335 ERROR`` log entries per run and
inflating per-request latency from ~30 ms (3007.x baseline) to ~6-7 s
(this branch). The functional rest_tornado suite consequently exhausts
its asyncio test budget on the 5th ``http_client.fetch`` of
``test_base_api_handler.py::test_accept_content_type`` (4 SUBPASS then
``asyncio.exceptions.TimeoutError``).

Round-3 (commit ``a1fd694ee41``) reverted three earlier rest_tornado
leak-fix attempts as wrong-direction.  Round-4 (commit ``2d119bba048``)
dropped ``clean_modules()`` from ``LazyLoader.destroy()`` to stop
sys.modules thrashing.  Neither cleared CI -- run 28333916927 still
reports the same 86 failures.

Investigation against the diff
------------------------------
``git log 3007.x..2d119bb -- salt/loader/ salt/master.py
salt/minion.py salt/auth/ salt/utils/job.py salt/netapi/`` shows three
relevant 3006.x leak-fix commits brought in by the merge:

  * ``33ad623aa4a`` -- adds ``LazyLoader.destroy()`` /
    ``__enter__``/``__exit__`` and threads ``destroy()`` calls through
    ``LocalClient.destroy()``, ``MasterMinion.destroy()``,
    ``LoadAuth.destroy()``, ``NetapiClient.destroy()``,
    ``AESFuncs.destroy()``, ``ClearFuncs.destroy()``.

  * ``90dc6a3a756`` -- adds ``LocalClient.destroy()`` body that calls
    ``self.functions.destroy()`` / ``self.utils.destroy()`` /
    ``self.returners.destroy()``.

  * ``2d119bba048`` (Round-4) -- drops ``clean_modules()`` from
    ``LazyLoader.destroy()`` but leaves the rest of the cascade
    (``self.pack.clear()``, ``self._dict.clear()``,
    ``self.loaded_modules.clear()``, ``self.missing_modules.clear()``,
    ``self.context_dict.destroy()``).

Cross-check against the failing path:
``BaseSaltAPIHandler.initialize`` builds a fresh ``LocalClient`` per
request (``salt/netapi/rest_tornado/saltnado.py:442``).
``LocalClient.__init__`` eagerly creates three LazyLoaders (utils,
minion_mods, returners -- ``salt/client/__init__.py:231-233``).  At end
of request, ``on_finish`` does ``del self.saltclients`` which lets
``LocalClient.__del__`` run inside the tornado io_loop.  ``__del__`` ->
``destroy()`` (post-``90dc6a3a756``) -> ``self.functions.destroy()`` /
``self.utils.destroy()`` / ``self.returners.destroy()`` ->
``LazyLoader.destroy()`` cascade.

The bet
-------
The remaining cost after Round-4 is the cascading state-clearing
itself, executed synchronously inside the io_loop during GC.  Even
without sys.modules eviction, ``pack.clear()`` /  ``_dict.clear()`` /
``loaded_modules.clear()`` / ``missing_modules.clear()`` across three
LazyLoaders (each holding several hundred module entries plus the
fileserver / cache plumbing in ``pack``) measurably stalls the loop --
matching the 6-7 s/fetch reported in CI logs.

This commit makes ``LazyLoader.destroy()`` a no-op (the body is just a
docstring).  The 3007.x baseline did not have ``destroy()`` at all, so
LazyLoaders were reclaimed by Python's normal GC when their owner died
-- which is fast enough for tornado.  We keep ``__enter__``/``__exit__``
so the per-request ``with RunnerClient(...) as runner:`` / ``with
WheelClient(...) as wheel:`` blocks added in ``salt/netapi/__init__.py``
and ``salt/daemons/masterapi.py`` continue to compile.
``clean_modules()`` remains a public method for the single legitimate
caller (``salt/loader/__init__.py::grains`` after a grains refresh).

Bet (no local CI-aligned repro this round)
------------------------------------------
This change was NOT verified against a CI-image container before push.
The user authorized a high-confidence bet after Round-5's diff
analysis.  What to watch for in CI run on this head:

  * Functional ``tests/pytests/functional/netapi/rest_tornado/
    test_base_api_handler.py`` -- ``test_accept_content_type``,
    ``test_deserialize``, ``test_get_lowstate``: expect PASSED (not
    ``asyncio.exceptions.TimeoutError`` on the 5th subtest).

  * Integration ``test_minions_api_handler.py::
    test_mem_leak_in_event_listener``: expect no TimeoutError.

  * ``integration zeromq 2/3/4/5`` distros that were red:
    expect green on all 8 distros.

  * ``scenarios zeromq`` / ``scenarios tcp`` memory pressure (was
    96.70% MEM): expect drop back to baseline.

If CI still red on the same tests after this commit, the next iteration
should examine ``MasterMinion.destroy()`` (called from
``ClearFuncs.destroy`` per worker recycle) and ``MWorker`` recycle
cadence as the residual culprit -- the leak-fix on those paths can
similarly be neutralized by making the respective ``destroy()``
methods conditional on a non-test ``MWORKER_BACKCOUNT_ENABLED`` flag.

Long-term, the leak fix's intent (reclaim master-worker LazyLoader
memory across iterations) should be re-introduced at the
``Maintenance`` / ``MWorker.worker_resource_backcount`` scope only,
not on the per-request ``LocalClient.__del__`` hot path.
`tests/pytests/scenarios/performance/test_performance.py` was probing the
container Python via stderr but then reading stdout, causing the function
to fall back to the host's Python version when picking a requirements
lockfile path. Combined with the salt:3005 reference image shipping the
salt onedir on Python 3.7 as `/usr/local/bin/python3`, the in-container
pip install ran against `requirements/static/pkg/py3.10/linux.lock` (host)
on Python 3.7 (container), which fails for `aiohappyeyeballs==2.6.1`
(Requires-Python >=3.9).

Probe each candidate Python (`python3`, `/usr/bin/python3`,
`/usr/local/bin/python3`) inside the container, pick the first one whose
major.minor matches an available `requirements/static/pkg/pyX.Y/`
lockfile, and bootstrap pip via ensurepip or `apt-get install python3-pip`
when the chosen interpreter has no pip module yet (salt:3005's system
`/usr/bin/python3 == python3.11` ships without pip). Use
`--break-system-packages` so pip can install into the distro site-packages
on PEP 668 images.

Reproduced locally:

  docker run --rm -v $(pwd):/salt \
    ghcr.io/saltstack/salt-ci-containers/salt:3005 sh -c \
    'env SETUPTOOLS_USE_DISTUTILS=stdlib python3 -m pip install \
     -r /salt/requirements/static/pkg/py3.10/linux.lock'

fails with the CI error; the same install via `/usr/bin/python3` (3.11)
against `py3.11/linux.lock` succeeds.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

test:full Run the full test suite

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants