
Add scheduled repair concurrency scenario for DatacenterAware multi-agent tests#1401

Open
0rlych1kk4 wants to merge 16 commits into Ericsson:agent/master from 0rlych1kk4:feature/scheduled-repair

Conversation

@0rlych1kk4
Contributor

@0rlych1kk4 0rlych1kk4 commented Feb 21, 2026

Summary

This PR is a focused follow-up to #1398 and adds the scheduled-repair concurrency scenario for DatacenterAware multi-agent tests.

Key Changes

  • Adds a scheduled repair concurrency scenario for DatacenterAware multi-agent tests.
  • Introduces opt-in schedule overrides in ecc_config.py so existing tests remain unaffected.
  • Configures fast schedules explicitly for this scenario.
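The opt-in override pattern described above can be sketched as follows. This is an illustrative model, not the actual `ecc_config.py` API: the function and key names here are hypothetical stand-ins. The point is that tests which pass no overrides get exactly the defaults, so existing scenarios keep their previous behavior, while the concurrency scenario opts in to a fast schedule explicitly.

```python
# Hypothetical sketch of opt-in schedule overrides; names are illustrative,
# not the real ecc_config.py interface.
DEFAULT_SCHEDULE = {
    "initial_delay_seconds": 300,
    "interval_seconds": 86400,
}

def build_schedule_config(overrides=None):
    """Return the default schedule, applying overrides only when given.

    Tests that pass no overrides get exactly the defaults, so existing
    scenarios are unaffected.
    """
    config = dict(DEFAULT_SCHEDULE)
    if overrides:
        unknown = set(overrides) - set(DEFAULT_SCHEDULE)
        if unknown:
            raise ValueError(f"Unknown schedule keys: {sorted(unknown)}")
        config.update(overrides)
    return config

# Only the concurrency scenario asks for a fast schedule.
fast = build_schedule_config({"initial_delay_seconds": 1, "interval_seconds": 5})
```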

Test Stability Adjustments

The small updates in TestConfigRefresher and TestScheduleManager are included only to make the scheduling/concurrency path deterministic and avoid race-based failures during test execution.

These adjustments do not change production logic and only affect test reliability.

Scope

The changes are intentionally narrow and extend the DatacenterAware multi-agent testing introduced in #1398.

@0rlych1kk4 0rlych1kk4 requested a review from a team as a code owner February 21, 2026 07:09
@0rlych1kk4 0rlych1kk4 force-pushed the feature/scheduled-repair branch 6 times, most recently from 842060c to 41443cc on February 21, 2026 12:43
@0rlych1kk4
Contributor Author

Hi @VictorCavichioli,
This PR adds the scheduled-repair variant of the DatacenterAware multi-agent lock concurrency scenario discussed in #1382 .
It mirrors the on-demand test but runs scheduled repairs with different intervals per instance (dc1 / dc2) to exercise lock behavior under schedule-driven execution.
Schedule configuration is scoped to the test instances so existing scenarios remain unaffected.
Build, standalone integration tests, and license checks pass locally.
Please let me know if you’d like additional timing combinations or edge cases covered.
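The scenario described above can be modeled in a few lines: two instances (dc1/dc2) run scheduled repairs on different intervals but must serialize on a shared repair lock. The names and timings below are made up for the sketch; the real test drives ecChronos instances, not Python threads.

```python
# Toy model of schedule-driven lock contention: two "instances" repair on
# different intervals, and the shared lock must keep repairs serialized.
import threading
import time

repair_lock = threading.Lock()
active = 0       # repairs currently holding the lock
max_active = 0   # high-water mark; must never exceed 1
state_guard = threading.Lock()

def scheduled_repair(instance, interval, runs):
    global active, max_active
    for _ in range(runs):
        with repair_lock:            # only one repair may run cluster-wide
            with state_guard:
                active += 1
                max_active = max(max_active, active)
            time.sleep(0.01)         # simulate repair work
            with state_guard:
                active -= 1
        time.sleep(interval)         # schedule-driven delay before next run

t1 = threading.Thread(target=scheduled_repair, args=("dc1", 0.01, 3))
t2 = threading.Thread(target=scheduled_repair, args=("dc2", 0.03, 3))
t1.start(); t2.start()
t1.join(); t2.join()
```

With correct locking, `max_active` stays at 1 no matter how the two schedules interleave, which is exactly the invariant the integration scenario checks at cluster level.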

@0rlych1kk4
Contributor Author

0rlych1kk4 commented Feb 22, 2026

Follow-up pushed to make TestScheduleManager idle-state validation deterministic by removing the scheduler thread startup in testGetCurrentJobStatusNoRunning().
The test now verifies the idle contract directly and avoids a timing race.
Verified locally:
`mvn -pl core.impl test -Dtest=TestScheduleManager` (all green).
Changes are test-only and focused on stability (no production impact).
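The determinism fix above boils down to this: instead of starting the scheduler thread and racing against it, the test inspects the idle state directly. The class, method, and status string below are hypothetical stand-ins for the real Java `TestScheduleManager` code, shown only to illustrate the pattern.

```python
# Sketch of testing an idle contract without thread startup; all names here
# are illustrative, not the real ScheduleManager API.
class ScheduleManager:
    def __init__(self):
        self._current_job = None
        self._thread_started = False

    def start(self):
        # In the real manager this spawns the scheduler thread;
        # the deterministic test simply never calls it.
        self._thread_started = True

    def get_current_job_status(self):
        if self._current_job is None:
            return "No job running currently"
        return self._current_job

manager = ScheduleManager()
# Idle contract verified directly: no thread, no timing race.
status = manager.get_current_job_status()
```

Because no background thread is ever started, there is no window in which the scheduler could pick up a job between construction and the assertion, which is the race the original test was exposed to.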

@0rlych1kk4 0rlych1kk4 force-pushed the feature/scheduled-repair branch from cadda4e to 684efe5 Compare February 26, 2026 03:09
@VictorCavichioli
Collaborator

Hi @0rlych1kk4 , the main context of this test was merged, so feel free to rebase your change with agent/master branch so we can merge yours as well

@0rlych1kk4
Contributor Author

Thanks, @VictorCavichioli . I’ll rebase this branch onto agent/master, resolve the remaining conflict, rerun the affected checks, and push an updated version.

@0rlych1kk4 0rlych1kk4 force-pushed the feature/scheduled-repair branch from 4f461fa to 9022097 on March 14, 2026 03:53
@0rlych1kk4
Contributor Author

Hi @VictorCavichioli , rebased on top of agent/master and updated the branch. Please let me know if you’d like any further adjustments.

@codecov-commenter

codecov-commenter commented Mar 17, 2026

⚠️ Please install the Codecov GitHub app to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

✅ All modified and coverable lines are covered by tests.
⚠️ Please upload report for BASE (agent/master@c6467f4). Learn more about missing BASE report.
❗ Your organization needs to install the Codecov GitHub app to enable full functionality.

Additional details and impacted files
@@               Coverage Diff               @@
##             agent/master    #1401   +/-   ##
===============================================
  Coverage                ?   54.40%           
  Complexity              ?     1418           
===============================================
  Files                   ?      194           
  Lines                   ?     8814           
  Branches                ?      859           
===============================================
  Hits                    ?     4795           
  Misses                  ?     3742           
  Partials                ?      277           

☔ View full report in Codecov by Sentry.

@0rlych1kk4 0rlych1kk4 force-pushed the feature/scheduled-repair branch from 8a7fcae to d282961 on March 17, 2026 15:24
@jwaeab jwaeab previously approved these changes Mar 23, 2026
@0rlych1kk4
Contributor Author

Thanks for the review and approval. From investigation, the failing checks appear related to Cassandra startup in the Python integration environment (DC/rack not found in snitch properties), which happens before the new scheduled-repair scenario is exercised.
Happy to help investigate further or adjust the tests if needed.

@tommystendahl
Collaborator

There seems to be some failure in the Python integration tests; some of the Behave tests fail. I'm not sure why.
https://github.com/Ericsson/ecchronos/actions/runs/23437904346/job/68185727432#step:12:1437

@0rlych1kk4
Contributor Author

Thanks for sharing the logs. From what I’ve seen so far, the failures seem to occur during Cassandra startup in the Python integration environment, before the scheduled-repair scenario is exercised. In particular, there are indications that DC/rack configuration is not being picked up correctly by the snitch, which could explain why some Behave tests are failing.
I was also able to reproduce similar issues locally when environment variables (e.g., CASSANDRA_VERSION, CERTIFICATE_DIRECTORY) are not set, leading to invalid Docker Compose setup and Cassandra not initializing properly.

I’ll continue digging into the CI logs and try to isolate the exact failure point, but sharing this in case it helps narrow down the issue.
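The local reproduction above (failures when environment variables are unset) suggests a simple pre-flight check before bringing up Docker Compose. A minimal sketch, assuming the variable names mentioned in the comment; treat the list and the helper name as illustrative:

```python
# Fail fast with a clear message when the compose environment is incomplete,
# instead of letting Cassandra initialize with an invalid setup.
import os

REQUIRED = ["CASSANDRA_VERSION", "CERTIFICATE_DIRECTORY"]

def check_compose_env(env=None):
    """Raise RuntimeError naming every required variable that is unset or empty."""
    if env is None:
        env = os.environ
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise RuntimeError(
            "Missing environment variables: " + ", ".join(missing)
        )
```

Running such a check at the top of the integration-test setup would turn a confusing startup failure into an immediate, actionable error.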

@0rlych1kk4
Contributor Author

I dug further into the failing jobs (both Java 17 and 21), and it looks like the cluster and ecChronos start successfully. The failures are coming from Behave assertions rather than Cassandra startup.
Specifically:

  • The disabled-table repair scenario now returns HTTP 200 instead of failing
  • test2.table1 appears as VNODE instead of INCREMENTAL in the schedules output

I’ll continue investigating whether this is a behavior change introduced by my changes or if the test expectations need adjustment.

@tommystendahl
Collaborator

@0rlych1kk4 Sorry for being so slow to respond on this but would you mind rebasing this, there has been a lot of changes lately.

@0rlych1kk4 0rlych1kk4 force-pushed the feature/scheduled-repair branch from 2625d5f to a6d925f on May 7, 2026 12:30
@0rlych1kk4
Contributor Author

Hi @tommystendahl , I’ve rebased the branch on top of the latest agent/master and resolved the conflict in test_multi_agent_datacenter_aware.py. The branch is updated now.

I can see the PR now requires a fresh approval because of the rebase/force-push.

@tommystendahl
Collaborator

Thanks @0rlych1kk4 , I will do my best to look at this tomorrow.

@tommystendahl
Collaborator

Hi @0rlych1kk4, I have looked through the test results and there are a few failing tests, you can see them here https://github.com/Ericsson/ecchronos/actions/runs/25495882051/job/74961177644#step:12:1652 and here https://github.com/Ericsson/ecchronos/actions/runs/25495882051/job/74961177650
I could also reproduce this locally on your branch with `mvn verify -P local-python-integration-tests -DskipUTs` and `mvn verify -P local-topology-integration-tests -DskipUTs`

@0rlych1kk4 0rlych1kk4 closed this May 8, 2026
@0rlych1kk4 0rlych1kk4 deleted the feature/scheduled-repair branch May 8, 2026 16:43
@0rlych1kk4 0rlych1kk4 restored the feature/scheduled-repair branch May 8, 2026 16:53
@0rlych1kk4 0rlych1kk4 reopened this May 8, 2026
@0rlych1kk4
Contributor Author

Hi @tommystendahl ,

I accidentally closed the PR while checking the branch status, but I restored the branch and reopened it immediately.

The PR is active again. I’ll wait for the current checks to complete and will investigate the remaining failing jobs if they persist.

@0rlych1kk4
Contributor Author

> Hi @0rlych1kk4, I have looked through the test results and there are a few failing tests, you can see them here https://github.com/Ericsson/ecchronos/actions/runs/25495882051/job/74961177644#step:12:1652 and here https://github.com/Ericsson/ecchronos/actions/runs/25495882051/job/74961177650 I could also reproduce this locally on your branch with `mvn verify -P local-python-integration-tests -DskipUTs` and `mvn verify -P local-topology-integration-tests -DskipUTs`

Hi @tommystendahl , thanks for checking and for the repro commands.

I reviewed the two Java 17 failures. They no longer look like Cassandra startup failures.

From the logs:

  • The Python integration failure is a Behave assertion where the disabled-table repair scenario expects a failed request, but the API returns HTTP 200.
  • The topology integration failure happens after Cassandra and ecChronos start successfully; test_install_ecchronos expects 42 schedules but now sees 78.

I’ll focus on the test expectations/configuration around schedule generation and the disabled-table repair behavior, then push a focused fix.

@0rlych1kk4 0rlych1kk4 force-pushed the feature/scheduled-repair branch from a6d925f to 70eaa5a on May 8, 2026 17:20
@0rlych1kk4
Contributor Author

Hi @tommystendahl , I checked the Jolokia Python Integration Java 17 - PEM failure.

This one does not reach the scheduled-repair scenario or the Behave repair tests. It fails during Cassandra test setup while waiting for the ecchronos keyspace:

TimeoutError: Keyspace ecchronos not available after 30 attempts

So this looks like a Cassandra/test-environment readiness timeout rather than a failure caused by the scheduled-repair change. I can look into hardening the shared setup retry/wait logic if you want that included in this PR, but I wanted to confirm first since it would broaden the scope.
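The hardening suggested above could look like the following: exponential backoff with a total deadline instead of a fixed number of short attempts. This is a sketch under stated assumptions, not the project's actual setup code; the probe callback and all names are illustrative.

```python
# Harden the "wait for ecchronos keyspace" step: poll with exponential
# backoff until a deadline, rather than N fixed-interval attempts.
import time

def wait_for_keyspace(probe, keyspace, timeout=120.0,
                      initial_delay=0.5, max_delay=10.0):
    """Poll probe(keyspace) until it returns True or timeout elapses.

    Returns the number of attempts on success; raises TimeoutError otherwise.
    """
    deadline = time.monotonic() + timeout
    delay = initial_delay
    attempts = 0
    while time.monotonic() < deadline:
        attempts += 1
        if probe(keyspace):
            return attempts
        # Never sleep past the deadline.
        time.sleep(min(delay, max(0.0, deadline - time.monotonic())))
        delay = min(delay * 2, max_delay)
    raise TimeoutError(
        f"Keyspace {keyspace} not available after {attempts} attempts"
    )
```

Injecting the probe keeps the helper unit-testable, and a time-based deadline makes the wait robust to slow CI runners where 30 short attempts can expire before Cassandra is ready.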

