
Add scheduled repair concurrency scenario for DatacenterAware multi-agent tests#1401

Open
0rlych1kk4 wants to merge 16 commits into Ericsson:agent/master from 0rlych1kk4:feature/scheduled-repair

Conversation

@0rlych1kk4
Contributor

@0rlych1kk4 0rlych1kk4 commented Feb 21, 2026

Summary

This PR is a focused follow-up to #1398 and adds the scheduled-repair concurrency scenario for DatacenterAware multi-agent tests.

Key Changes

  • Adds a scheduled repair concurrency scenario for DatacenterAware multi-agent tests.
  • Introduces opt-in schedule overrides in ecc_config.py so existing tests remain unaffected.
  • Configures fast schedules explicitly for this scenario.
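The opt-in override pattern described above can be sketched as follows. This is an illustrative model, not the actual `ecc_config.py` API: the function and key names here are hypothetical stand-ins. The point is that tests which pass no overrides get exactly the defaults, so existing scenarios keep their previous behavior, while the concurrency scenario opts in to a fast schedule explicitly.

```python
# Hypothetical sketch of opt-in schedule overrides; names are illustrative,
# not the real ecc_config.py interface.
DEFAULT_SCHEDULE = {
    "initial_delay_seconds": 300,
    "interval_seconds": 86400,
}

def build_schedule_config(overrides=None):
    """Return the default schedule, applying overrides only when given.

    Tests that pass no overrides get exactly the defaults, so existing
    scenarios are unaffected.
    """
    config = dict(DEFAULT_SCHEDULE)
    if overrides:
        unknown = set(overrides) - set(DEFAULT_SCHEDULE)
        if unknown:
            raise ValueError(f"Unknown schedule keys: {sorted(unknown)}")
        config.update(overrides)
    return config

# Only the concurrency scenario asks for a fast schedule.
fast = build_schedule_config({"initial_delay_seconds": 1, "interval_seconds": 5})
```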

Test Stability Adjustments

The small updates in TestConfigRefresher and TestScheduleManager are included only to make the scheduling/concurrency path deterministic and avoid race-based failures during test execution.

These adjustments do not change production logic and only affect test reliability.

Scope

The changes are intentionally narrow and extend the DatacenterAware multi-agent testing introduced in #1398.

@0rlych1kk4 0rlych1kk4 requested a review from a team as a code owner February 21, 2026 07:09
@0rlych1kk4 0rlych1kk4 force-pushed the feature/scheduled-repair branch 6 times, most recently from 842060c to 41443cc on February 21, 2026 12:43
@0rlych1kk4
Contributor Author

Hi @VictorCavichioli,
This PR adds the scheduled-repair variant of the DatacenterAware multi-agent lock concurrency scenario discussed in #1382 .
It mirrors the on-demand test but runs scheduled repairs with different intervals per instance (dc1 / dc2) to exercise lock behavior under schedule-driven execution.
Schedule configuration is scoped to the test instances so existing scenarios remain unaffected.
Build, standalone integration tests, and license checks pass locally.
Please let me know if you’d like additional timing combinations or edge cases covered.
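The scenario described above can be modeled in a few lines: two instances (dc1/dc2) run scheduled repairs on different intervals but must serialize on a shared repair lock. The names and timings below are made up for the sketch; the real test drives ecChronos instances, not Python threads.

```python
# Toy model of schedule-driven lock contention: two "instances" repair on
# different intervals, and the shared lock must keep repairs serialized.
import threading
import time

repair_lock = threading.Lock()
active = 0       # repairs currently holding the lock
max_active = 0   # high-water mark; must never exceed 1
state_guard = threading.Lock()

def scheduled_repair(instance, interval, runs):
    global active, max_active
    for _ in range(runs):
        with repair_lock:            # only one repair may run cluster-wide
            with state_guard:
                active += 1
                max_active = max(max_active, active)
            time.sleep(0.01)         # simulate repair work
            with state_guard:
                active -= 1
        time.sleep(interval)         # schedule-driven delay before next run

t1 = threading.Thread(target=scheduled_repair, args=("dc1", 0.01, 3))
t2 = threading.Thread(target=scheduled_repair, args=("dc2", 0.03, 3))
t1.start(); t2.start()
t1.join(); t2.join()
```

With correct locking, `max_active` stays at 1 no matter how the two schedules interleave, which is exactly the invariant the integration scenario checks at cluster level.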

@0rlych1kk4
Contributor Author

0rlych1kk4 commented Feb 22, 2026

Follow-up pushed to make TestScheduleManager idle-state validation deterministic by removing the scheduler thread startup in testGetCurrentJobStatusNoRunning().
The test now verifies the idle contract directly and avoids a timing race.
Verified locally:
`mvn -pl core.impl test -Dtest=TestScheduleManager` (all green).
Changes are test-only and focused on stability (no production impact).
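The determinism fix above boils down to this: instead of starting the scheduler thread and racing against it, the test inspects the idle state directly. The class, method, and status string below are hypothetical stand-ins for the real Java `TestScheduleManager` code, shown only to illustrate the pattern.

```python
# Sketch of testing an idle contract without thread startup; all names here
# are illustrative, not the real ScheduleManager API.
class ScheduleManager:
    def __init__(self):
        self._current_job = None
        self._thread_started = False

    def start(self):
        # In the real manager this spawns the scheduler thread;
        # the deterministic test simply never calls it.
        self._thread_started = True

    def get_current_job_status(self):
        if self._current_job is None:
            return "No job running currently"
        return self._current_job

manager = ScheduleManager()
# Idle contract verified directly: no thread, no timing race.
status = manager.get_current_job_status()
```

Because no background thread is ever started, there is no window in which the scheduler could pick up a job between construction and the assertion, which is the race the original test was exposed to.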

@0rlych1kk4 0rlych1kk4 force-pushed the feature/scheduled-repair branch from cadda4e to 684efe5 Compare February 26, 2026 03:09
@VictorCavichioli
Collaborator

Hi @0rlych1kk4 , the main context of this test was merged, so feel free to rebase your change with agent/master branch so we can merge yours as well

@0rlych1kk4
Contributor Author

Thanks, @VictorCavichioli . I’ll rebase this branch onto agent/master, resolve the remaining conflict, rerun the affected checks, and push an updated version.

@0rlych1kk4 0rlych1kk4 force-pushed the feature/scheduled-repair branch from 4f461fa to 9022097 on March 14, 2026 03:53
@0rlych1kk4
Contributor Author

Hi @VictorCavichioli , rebased on top of agent/master and updated the branch. Please let me know if you’d like any further adjustments.

@codecov-commenter

codecov-commenter commented Mar 17, 2026

⚠️ Please install the Codecov GitHub app to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

✅ All modified and coverable lines are covered by tests.
⚠️ Please upload report for BASE (agent/master@c6467f4). Learn more about missing BASE report.
❗ Your organization needs to install the Codecov GitHub app to enable full functionality.

Additional details and impacted files
@@               Coverage Diff               @@
##             agent/master    #1401   +/-   ##
===============================================
  Coverage                ?   54.40%           
  Complexity              ?     1418           
===============================================
  Files                   ?      194           
  Lines                   ?     8814           
  Branches                ?      859           
===============================================
  Hits                    ?     4795           
  Misses                  ?     3742           
  Partials                ?      277           

☔ View full report in Codecov by Sentry.

@0rlych1kk4 0rlych1kk4 force-pushed the feature/scheduled-repair branch from 8a7fcae to d282961 on March 17, 2026 15:24
@jwaeab jwaeab previously approved these changes Mar 23, 2026
@0rlych1kk4
Contributor Author

Thanks for the review and approval. From investigation, the failing checks appear related to Cassandra startup in the Python integration environment (DC/rack not found in snitch properties), which happens before the new scheduled-repair scenario is exercised.
Happy to help investigate further or adjust the tests if needed.

@tommystendahl
Collaborator

There seems to be some failure in the Python integration tests; some of the Behave tests fail. I'm not sure why.
https://github.com/Ericsson/ecchronos/actions/runs/23437904346/job/68185727432#step:12:1437

@0rlych1kk4
Contributor Author

Thanks for sharing the logs. From what I’ve seen so far, the failures seem to occur during Cassandra startup in the Python integration environment, before the scheduled-repair scenario is exercised. In particular, there are indications that DC/rack configuration is not being picked up correctly by the snitch, which could explain why some Behave tests are failing.
I was also able to reproduce similar issues locally when environment variables (e.g., CASSANDRA_VERSION, CERTIFICATE_DIRECTORY) are not set, leading to invalid Docker Compose setup and Cassandra not initializing properly.

I’ll continue digging into the CI logs and try to isolate the exact failure point, but sharing this in case it helps narrow down the issue.
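The local reproduction above (failures when environment variables are unset) suggests a simple pre-flight check before bringing up Docker Compose. A minimal sketch, assuming the variable names mentioned in the comment; treat the list and the helper name as illustrative:

```python
# Fail fast with a clear message when the compose environment is incomplete,
# instead of letting Cassandra initialize with an invalid setup.
import os

REQUIRED = ["CASSANDRA_VERSION", "CERTIFICATE_DIRECTORY"]

def check_compose_env(env=None):
    """Raise RuntimeError naming every required variable that is unset or empty."""
    if env is None:
        env = os.environ
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise RuntimeError(
            "Missing environment variables: " + ", ".join(missing)
        )
```

Running such a check at the top of the integration-test setup would turn a confusing startup failure into an immediate, actionable error.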

@0rlych1kk4
Contributor Author

I dug further into the failing jobs (both Java 17 and 21), and it looks like the cluster and ecChronos start successfully. The failures are coming from Behave assertions rather than Cassandra startup.
Specifically:

  • The disabled-table repair scenario now returns HTTP 200 instead of failing
  • test2.table1 appears as VNODE instead of INCREMENTAL in the schedules output

I’ll continue investigating whether this is a behavior change introduced by my changes or if the test expectations need adjustment.

@tommystendahl
Collaborator

@0rlych1kk4 Sorry for being so slow to respond on this but would you mind rebasing this, there has been a lot of changes lately.

@0rlych1kk4 0rlych1kk4 force-pushed the feature/scheduled-repair branch from 2625d5f to a6d925f on May 7, 2026 12:30
@0rlych1kk4
Contributor Author

Hi @tommystendahl , I’ve rebased the branch on top of the latest agent/master and resolved the conflict in test_multi_agent_datacenter_aware.py. The branch is updated now.

I can see the PR now requires a fresh approval because of the rebase/force-push.

@tommystendahl
Collaborator

Thanks @0rlych1kk4 , I will do my best to look at this tomorrow.

@tommystendahl
Collaborator

Hi @0rlych1kk4, I have looked through the test results and there are a few failing tests, you can see them here https://github.com/Ericsson/ecchronos/actions/runs/25495882051/job/74961177644#step:12:1652 and here https://github.com/Ericsson/ecchronos/actions/runs/25495882051/job/74961177650
I could also reproduce this locally on your branch with `mvn verify -P local-python-integration-tests -DskipUTs` and `mvn verify -P local-topology-integration-tests -DskipUTs`

@0rlych1kk4 0rlych1kk4 closed this May 8, 2026
@0rlych1kk4 0rlych1kk4 deleted the feature/scheduled-repair branch May 8, 2026 16:43
@0rlych1kk4 0rlych1kk4 restored the feature/scheduled-repair branch May 8, 2026 16:53
@0rlych1kk4 0rlych1kk4 reopened this May 8, 2026
@0rlych1kk4
Contributor Author

Hi @tommystendahl ,

I accidentally closed the PR while checking the branch status, but I restored the branch and reopened it immediately.

The PR is active again. I’ll wait for the current checks to complete and will investigate the remaining failing jobs if they persist.

@0rlych1kk4
Contributor Author

> Hi @0rlych1kk4, I have looked through the test results and there are a few failing tests, you can see them here https://github.com/Ericsson/ecchronos/actions/runs/25495882051/job/74961177644#step:12:1652 and here https://github.com/Ericsson/ecchronos/actions/runs/25495882051/job/74961177650 I could also reproduce this locally on your branch with `mvn verify -P local-python-integration-tests -DskipUTs` and `mvn verify -P local-topology-integration-tests -DskipUTs`

Hi @tommystendahl , thanks for checking and for the repro commands.

I reviewed the two Java 17 failures. They no longer look like Cassandra startup failures.

From the logs:

  • The Python integration failure is a Behave assertion where the disabled-table repair scenario expects a failed request, but the API returns HTTP 200.
  • The topology integration failure happens after Cassandra and ecChronos start successfully; test_install_ecchronos expects 42 schedules but now sees 78.

I’ll focus on the test expectations/configuration around schedule generation and the disabled-table repair behavior, then push a focused fix.

@0rlych1kk4 0rlych1kk4 force-pushed the feature/scheduled-repair branch from a6d925f to 70eaa5a on May 8, 2026 17:20
@0rlych1kk4
Contributor Author

Hi @tommystendahl , I checked the Jolokia Python Integration Java 17 - PEM failure.

This one does not reach the scheduled-repair scenario or the Behave repair tests. It fails during Cassandra test setup while waiting for the ecchronos keyspace:

TimeoutError: Keyspace ecchronos not available after 30 attempts

So this looks like a Cassandra/test-environment readiness timeout rather than a failure caused by the scheduled-repair change. I can look into hardening the shared setup retry/wait logic if you want that included in this PR, but I wanted to confirm first since it would broaden the scope.
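The hardening suggested above could look like the following: exponential backoff with a total deadline instead of a fixed number of short attempts. This is a sketch under stated assumptions, not the project's actual setup code; the probe callback and all names are illustrative.

```python
# Harden the "wait for ecchronos keyspace" step: poll with exponential
# backoff until a deadline, rather than N fixed-interval attempts.
import time

def wait_for_keyspace(probe, keyspace, timeout=120.0,
                      initial_delay=0.5, max_delay=10.0):
    """Poll probe(keyspace) until it returns True or timeout elapses.

    Returns the number of attempts on success; raises TimeoutError otherwise.
    """
    deadline = time.monotonic() + timeout
    delay = initial_delay
    attempts = 0
    while time.monotonic() < deadline:
        attempts += 1
        if probe(keyspace):
            return attempts
        # Never sleep past the deadline.
        time.sleep(min(delay, max(0.0, deadline - time.monotonic())))
        delay = min(delay * 2, max_delay)
    raise TimeoutError(
        f"Keyspace {keyspace} not available after {attempts} attempts"
    )
```

Injecting the probe keeps the helper unit-testable, and a time-based deadline makes the wait robust to slow CI runners where 30 short attempts can expire before Cassandra is ready.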

