Skip to content

Development/plugin server service state machine#2162

Draft
bramoosterhuis wants to merge 29 commits into
masterfrom
development/plugin-server-service-state-machine
Draft

Development/plugin server service state machine#2162
bramoosterhuis wants to merge 29 commits into
masterfrom
development/plugin-server-service-state-machine

Conversation

@bramoosterhuis

Copy link
Copy Markdown
Contributor

No description provided.

- Introduced a StateMachine class to manage the lifecycle states of plugins, including DEACTIVATED, PRECONDITION, ACTIVATED, HIBERNATED, and UNAVAILABLE.
- Defined stable and transient states with clear guarantees for callbacks during state transitions.
- Added detailed documentation for state transitions, callback guarantees, reentrancy, and crash handling.
- Refactored existing plugin state handling to utilize the new StateMachine, improving clarity and maintainability.
- Updated the Service class to integrate the StateMachine and handle state change notifications.
- Removed redundant locks and streamlined the Evaluate method to leverage the StateMachine's re-evaluation capabilities.
…sitionSuspender

Replaced the manual lock management inside `TransitionYield` with a private,
nested RAII class `TransitionSuspender` within `StateMachine`.

- Eliminates potential Undefined Behavior (UB) in Release builds where `Unlock()`
  could accidentally be called on an unheld mutex when assertions are disabled.
- Guarantees proper scope-based lock re-acquisition and thread-ID tracking upon
  leaving `Evaluate()`, ensuring state machine consistency.
- Enhances encapsulation by hiding internal transition-locking mechanics from
  the outside world.
Move Suspend and Resume handling completely into the StateMachine to ensure
proper lifecycle tracking and robust thread synchronization.

- Eliminate fragile cross-state delegation from PreconditionState by
  introducing stateless worker methods RequestSuspend() and RequestResume()
  on the Service class.
- Fix potential distributed deadlocks and threadpool starvation by ensuring
  Service::Lock() is released BEFORE invoking the blocking out-of-process
  COM/RPC _stateControl->Request() calls in both suspend and resume flows.
- Protect _stateControl lifecycle during RPC calls by capturing a local
  pointer copy under a short-lived Service::Lock() within the transition thread.
- Prevent ERROR_ILLEGAL_STATE regressions by explicitly overriding Resume
  and Suspend in PreconditionState, ensuring symmetric lifecycle state transitions.
- Resolve reentrancy/ASSERT risks in DeactivatedState::Resume by invoking
  the internal _Activate() trigger directly, keeping the operation safe within
  the existing transition thread context without re-acquiring _transitionLock.
- Enforce architectural invariants by adding ASSERT(IsTransitionThread())
  guards to the new worker methods.
Introduce a 'DestroyedState' tombstone to the StateMachine to resolve
the lifecycle divergence where services could be destroyed while in
an active state.

- Implement StateMachine::Tombstone() to perform a lock-safe, atomic
  transition to the DestroyedState.
- Ensure that Destroy() properly awaits in-flight transitions by
  acquiring the _transitionLock before tombstoning.
- Guarantee consistent state synchronization between the internal
  state machine and the base class enum.
- Ensure all service entry points (QueryInterface, Activate, etc.)
  return safe error codes immediately after destruction.
Enforce consistency by upgrading the atomic load of '_current' from
relaxed to acquire before dispatching the Resume trigger.

- Eliminate an architectural inconsistency that was misleading to readers,
  bringing Resume inline with Suspend, Reevaluate, and QueryInterface.
- Note: The relaxed ordering was technically thread-safe because this load
  always follows a release store on the exact same transition thread, but
  explicit acquire semantics improve code readability and maintainability.
Destroy() is a purely administrative operation — it removes the service
from the map without deactivating it. The caller is responsible for
deactivating the plugin first. Make that precondition explicit so
violations are caught immediately in debug builds.
- Add DESTROYED to the STATES section including tombstone precondition
- Add Resume and Suspend to the public triggers list in REENTRANCY
- Add ServiceMap::_adminLock as level 0 in the lock ordering, above
  _transitionLock, documenting the hold during Tombstone()
- Remove stale reference to TransitionYield, replaced by TransitionSuspender
Both were left over from the original monolithic Activate/Deactivate
methods. Wakeup() is now only reachable via HibernatedState, making
the HIBERNATED guard dead code. State(ACTIVATED) was redundant since
both callers already call SetState(_stateActivated) after Wakeup()
returns. Add ASSERT guards consistent with other work methods.
The re-entrancy guard used a thread id compared against 0 as a "no
thread" sentinel. pthread_t has no portable null value, so this relied
on a glibc/musl assumption. It also lived behind ASSERT, leaving release
builds with no protection at all, while _transitionLock is recursive and
does not block same-thread re-entry on its own.

Split the guard into an active flag plus an owner id so the id never
needs a sentinel, and turn re-entry detection into real control flow:
public triggers now return ERROR_ILLEGAL_STATE on same-thread re-entry
in every build. Wrap ownership in a TransitionScope RAII helper so the
flag is cleared even if a state method throws.
The state checks after HibernateProcess() guarded against a concurrent
Wakeup() or Deactivate() changing state mid-hibernation. Under the
StateMachine model _transitionLock is held for the whole transition, so
no concurrent transition can run and these checks can no longer fire.

In Hibernate() the check was a standalone guard, now an ASSERT so the
invariant stays verifiable in debug without a dead branch or an empty
lock/unlock pair in release. In HibernateChildren() the loop break is
kept as a visible abort path but no longer takes Service::Lock() to read
State(), consistent with the other lock-free State() reads in this file.
The state classes and their base were public but are never named outside
the StateMachine. Make the whole hierarchy private — it is an
implementation detail. SetState() is only called by the state classes, so
move it to private as well. Drop Current(), which had no callers; state is
already reachable through Service::State().

IsTransitionThread() and the triggers stay public: the work methods in the
cpp assert on the former and Service drives the latter. No behaviour
change.
Configure kernel core dumps to /tmp/cores, extract backtraces with gdb,
and upload as artifacts. Use always() condition to capture crashes even
when test runner tolerates failures with || true.
When the Controller deactivates during shutdown, the DEACTIVATED callback
fires a statechange notification after UnloadPlugin() has already released
_handler. Server::Notification then asserted on ClassType<Controller>()
being non-null, which is null at that point, aborting the process.

The function body already guards on controller != nullptr and treats it as
a no-op, so a null handler here is a legal shutdown state, not an error.
Drop that half of the assert and keep only the real invariant
(_controller.IsValid()). StateControlStateChange is left unchanged: it
dereferences controller unconditionally and its callback is unregistered in
Detach() before teardown, so a null there would be a genuine bug.
The || true was added to let the backtrace step run while the test was
crashing during teardown. With the crash fixed and the backtrace step on
always(), it is no longer needed and would mask any future abort by
keeping the job green. Drop it so the job fails honestly on a crash.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant