ci: CI online test #596

Merged: voltjia merged 102 commits into master from ci/ci-online on May 19, 2026

@zhangyue207 (Collaborator) commented May 12, 2026

Summary

  • Replace the in-tree .ci tooling with the external InfiniTensor/ci submodule and pin the reusable CI workflows to a concrete CI tooling commit.
  • Add .github/ci_config.yml as the InfiniOps-owned platform CI configuration for NVIDIA, Iluvatar, MetaX, Moore, Cambricon, and Ascend.
  • Add the legacy CI workflow and the new CI v2 Shadow workflow with manual platform dispatch support.
  • Add CI v2 local-agent execution, queued-job watchdogs, best-effort runner availability preflight, junit/exit-code result checks, and all-platform shadow validation.
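The junit/exit-code result check named in the last bullet can be pictured roughly as follows. This is a minimal sketch, not the InfiniTensor/ci implementation: the function name `check_result` is hypothetical, and so is the rule that a zero exit code still fails when the JUnit report is missing, empty, or records failures.

```python
import xml.etree.ElementTree as ET
from typing import Optional, Tuple


def check_result(exit_code: int, junit_xml: Optional[str]) -> Tuple[bool, str]:
    """Return (passed, reason) from an exit code and JUnit XML text.

    Hypothetical helper; the real CI tooling may apply different rules.
    """
    if exit_code != 0:
        return False, f"test process exited with code {exit_code}"

    if not junit_xml:
        return False, "JUnit report is missing"

    root = ET.fromstring(junit_xml)
    # The report root is either <testsuites> or a single <testsuite>.
    suites = [root] if root.tag == "testsuite" else root.findall("testsuite")
    tests = sum(int(s.get("tests", 0)) for s in suites)
    bad = sum(int(s.get("failures", 0)) + int(s.get("errors", 0)) for s in suites)

    if tests == 0:
        return False, "JUnit report contains no tests"

    if bad > 0:
        return False, f"{bad} of {tests} tests failed or errored"

    return True, f"{tests} tests ran without failures or errors"
```

Checking both signals guards against a run that exits cleanly without having executed any tests, and against a report that looks clean although the process crashed afterwards.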

Motivation

The existing CI path was hard to diagnose across multiple self-hosted accelerator platforms. This PR makes platform selection, runner labels, queue timeout behavior, CI v2 shadow execution, and failure reporting explicit so PR CI can be validated across all supported hardware backends.
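To make the platform selection and runner labels concrete, here is a minimal sketch of turning a per-platform configuration into a GitHub Actions job matrix. The dictionary layout, the `runs_on` key, the label values, and the `build_matrix` function are all hypothetical; this description does not show the actual schema of .github/ci_config.yml. Only the `{"include": [...]}` shape is the standard form a workflow can consume through `fromJSON`.

```python
import json
from typing import Dict, List, Optional

# Hypothetical stand-in for the parsed contents of .github/ci_config.yml.
CI_CONFIG = {
    "nvidia": {"runs_on": ["self-hosted", "nvidia"]},
    "iluvatar": {"runs_on": ["self-hosted", "iluvatar"]},
    "metax": {"runs_on": ["self-hosted", "metax"]},
    "moore": {"runs_on": ["self-hosted", "moore"]},
    "cambricon": {"runs_on": ["self-hosted", "cambricon"]},
    "ascend": {"runs_on": ["self-hosted", "ascend"]},
}


def build_matrix(config: Dict[str, dict], selected: Optional[List[str]] = None) -> dict:
    """Build a job matrix, optionally limited to manually chosen platforms."""
    platforms = list(selected) if selected else list(config)
    unknown = [name for name in platforms if name not in config]

    if unknown:
        raise ValueError(f"unknown platforms: {', '.join(unknown)}")

    return {
        "include": [
            {"platform": name, "runs_on": config[name]["runs_on"]}
            for name in platforms
        ]
    }


# A manual dispatch restricted to two platforms.
print(json.dumps(build_matrix(CI_CONFIG, ["nvidia", "ascend"])))
```

Rejecting unknown platform names up front makes a mistyped dispatch input fail in the matrix step instead of producing a job that waits for a runner that does not exist.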

Closes N/A.

Type of Change

  • feat — new feature / new operator / new platform
  • fix — bug fix
  • perf — performance improvement (no behavioral change)
  • refactor — code restructuring without behavior change
  • test — adding or fixing tests only
  • docs — documentation only
  • build / ci — build system or CI configuration
  • chore — tooling, formatting, or other non-code changes
  • Breaking change (requires a ! in the Conventional Commits prefix or a BREAKING CHANGE: footer)

Platforms Affected

  • CPU (WITH_CPU)
  • NVIDIA (WITH_NVIDIA)
  • Iluvatar (WITH_ILUVATAR)
  • MetaX (WITH_METAX)
  • Cambricon (WITH_CAMBRICON)
  • Moore (WITH_MOORE)
  • Ascend (WITH_ASCEND)
  • PyTorch C++ bindings (WITH_TORCH)
  • Build system / CMake / CI
  • Python bindings / user-facing API

Test Results on Supported Platforms

| Platform | Built | pytest Result | Notes / Hardware |
|---|---|---|---|
| NVIDIA | Yes | Pass | ci / unit / nvidia and ci-v2-shadow / nvidia passed. |
| Iluvatar | Yes | Pass | ci / unit / iluvatar and ci-v2-shadow / iluvatar passed. |
| MetaX | Yes | Pass | ci / unit / metax and ci-v2-shadow / metax passed. |
| Cambricon | Yes | Pass | ci / unit / cambricon and ci-v2-shadow / cambricon passed. |
| Moore | Yes | Pass | ci / unit / moore and ci-v2-shadow / moore passed. |
| Ascend | Yes | Pass | ci / unit / ascend and ci-v2-shadow / ascend passed. |

Full `pytest` output (optional)
PR checks for https://github.com/InfiniTensor/InfiniOps/pull/596 at commit 2793d9a:

ci / Generate matrix from config: pass
ci / unit / nvidia: pass
ci / unit / iluvatar: pass
ci / unit / metax: pass
ci / unit / moore: pass
ci / unit / cambricon: pass
ci / unit / ascend: pass
ci / Fail queued CI jobs after 10 minutes: pass

ci-v2-shadow / Generate CI v2 shadow matrix: pass
ci-v2-shadow / ci-v2-shadow / nvidia: pass
ci-v2-shadow / ci-v2-shadow / iluvatar: pass
ci-v2-shadow / ci-v2-shadow / metax: pass
ci-v2-shadow / ci-v2-shadow / moore: pass
ci-v2-shadow / ci-v2-shadow / cambricon: pass
ci-v2-shadow / ci-v2-shadow / ascend: pass
ci-v2-shadow / Fail queued CI v2 jobs after 10 minutes: pass

ruff: pass
clang-format: pass

Legacy CI run: https://github.com/InfiniTensor/InfiniOps/actions/runs/26022246657
CI v2 Shadow run: https://github.com/InfiniTensor/InfiniOps/actions/runs/26022246610

Benchmark / Performance Impact

N/A. This PR changes CI configuration and CI execution behavior only; it does not change operator kernels or runtime performance paths.

Notes for Reviewers

  • CI v2 Shadow is intentionally added as a shadow workflow. The legacy CI workflow remains available for PR validation while the v2 path is exercised across all platforms.
  • Runner availability preflight is best effort. If CI_RUNNER_STATUS_TOKEN is configured, prepare can fail before matrix jobs are created when no runner is online. Without that token, queued-job watchdogs remain the fallback.
  • The queued-job watchdog exits early once all expected platform jobs complete, and fails clearly when jobs remain queued past the configured timeout.
  • The .ci implementation lives in InfiniTensor/ci; this PR pins the reusable workflow and ci_ref to a concrete commit instead of following a moving branch head.
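The preflight and watchdog behavior described in the notes above can be sketched as two small decision functions. This is an illustration under stated assumptions, not the code in InfiniTensor/ci: the names `preflight` and `watchdog_verdict`, the returned strings, and the idea of passing the online-runner count as `None` when CI_RUNNER_STATUS_TOKEN is not configured are all hypothetical. The 600-second default mirrors the "after 10 minutes" wording of the watchdog check names.

```python
from typing import Dict, Optional


def preflight(online_runners: Optional[int]) -> str:
    """Decide the preflight outcome.

    `online_runners` is None when runner status cannot be queried,
    for example because no status token is configured.
    """
    if online_runners is None:
        return "skip"  # Best effort: rely on the queued-job watchdog instead.

    return "ok" if online_runners > 0 else "fail"


def watchdog_verdict(
    job_status: Dict[str, str], waited_s: float, timeout_s: float = 600.0
) -> str:
    """Return "done", "wait", or "timeout" for one polling round.

    `job_status` maps each expected platform job to "queued",
    "in_progress", or "completed".
    """
    if all(status == "completed" for status in job_status.values()):
        return "done"  # Exit early once every expected job has finished.

    still_queued = [name for name, s in job_status.items() if s == "queued"]

    if still_queued and waited_s >= timeout_s:
        return "timeout"  # A job never got a runner within the limit.

    return "wait"
```

In this sketch a job that is already running never triggers the timeout, because the watchdog is only meant to catch jobs that no runner picked up.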

Checklist

Title, Branch, and Commits

  • PR title follows Conventional Commits (e.g. feat(nvidia): …, fix(cuda/gemm): …).
  • Branch name follows <type>/xxx-yyyy-zzzz where <type> matches the PR title's Conventional Commits type and words are joined with hyphens (see CONTRIBUTING.md §Branches).
  • Each commit message follows Conventional Commits. This PR is intended to be squash-merged as one ci: change.
  • Small PR is a single squashable commit; or, for a large PR, every commit is meaningful, well-formed, and independently reviewable (see CONTRIBUTING.md §Pull Requests).
  • No stray merge commits from master — the branch is rebased cleanly on top of the current master.
  • No fixup! / squash! / wip commits remain.

Scope and Design

  • Changes are minimal — nothing unrelated to the stated motivation was added (CONTRIBUTING.md §Code/General).
  • No dead code, commented-out blocks, debug prints, printf/std::cout/print(...) left behind, or TODO without an owner and issue link.
  • No unrelated formatting churn that would obscure the diff.
  • Public API changes (if any) are intentional, documented, and reflected in affected callers/tests.

General Code Hygiene (applies to all languages)

  • The code is self-explanatory; comments were added only where the why is non-obvious (CONTRIBUTING.md §Code/General).
  • Every modified or added file ends with a single trailing newline (CONTRIBUTING.md §Code/General).
  • No trailing whitespace, tab/space mixing, or stray BOMs.
  • Identifiers in comments and error messages are wrapped in backticks (e.g. the `seqlens_k` tensor) (CONTRIBUTING.md §Code/General).
  • All comments and error messages are in English (CONTRIBUTING.md §Code/General).
  • Comments and error messages are complete sentences — capitalized first letter, terminal punctuation — unless the language/framework convention says otherwise (CONTRIBUTING.md §Code/General; §Python).

C++ Specific (if C++ files changed)

N/A. No C++ source files are changed by this PR.

Python Specific (if Python files changed)

  • Code is PEP 8 compliant; ruff check passes cleanly on CI (see .github/workflows/ruff.yml).
  • ruff format --check passes cleanly — if not, run ruff format and commit the result.
  • Comments are complete English sentences, starting with a capital letter and ending with punctuation; Markdown backticks are used for code references (CONTRIBUTING.md §Python).
  • Framework-specific conventions (e.g. lowercase pytest.skip messages without terminal period) are honored where applicable (CONTRIBUTING.md §Python).
  • No blank line between the function signature and the body when there is no docstring or comment (CONTRIBUTING.md §Python).
  • A blank line is present before and after if, for, and similar control-flow statements (CONTRIBUTING.md §Python).
  • A blank line appears before each return, except when it directly follows a control-flow statement (CONTRIBUTING.md §Python).
  • Docstrings (if any) follow PEP 257 (CONTRIBUTING.md §Python).
  • Type hints are added / kept consistent with the surrounding code.
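As a small illustration of the blank-line rules in this list, the function below shows one reading of them; CONTRIBUTING.md remains the authority, and the function `clamp_timeout` is a made-up example rather than code from this PR.

```python
def clamp_timeout(minutes: int, limit: int = 10) -> int:
    capped = min(minutes, limit)

    if capped < 0:
        return 0

    return capped
```

The body starts directly under the signature because there is no docstring, the `if` statement has a blank line on both sides, and only the final `return` gets a preceding blank line, since the first one directly follows a control-flow statement.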

Testing

  • pytest was run locally on every supported platform that this PR can affect, and the results are recorded in the "Test Results" table above (CONTRIBUTING.md §Pull Requests).
  • For any platform that could not be tested, an explicit reason is given in the table and a reviewer with access has been tagged.
  • New functionality has matching tests under tests/ following tests/test_add.py / tests/test_gemm.py patterns (CONTRIBUTING.md §Adding an Operator).
  • Tests use pytest.mark.parametrize correctly: dependent parameters share one decorator (e.g. @pytest.mark.parametrize("dtype, rtol, atol", …)), independent parameters use separate decorators ordered by parameter declaration.
  • Where appropriate, pytest.mark.auto_act_and_assert is used and the test returns a Payload whose func and ref share the same calling convention.
  • Default dtype / device parameterization is relied on, or overridden with an explicit pytest.mark.parametrize when necessary.
  • Any new test that is flaky under parallelism is marked so, or documented to require pytest -n 1.
  • For bug fixes: a regression test has been added that fails on master and passes with this PR.

Build, CI, and Tooling

  • The project builds cleanly from a fresh directory with pip install .[dev] on at least one affected platform.
  • compile_commands.json still regenerates (CMake option CMAKE_EXPORT_COMPILE_COMMANDS=ON in pyproject.toml — required by the code-lint skill and clang-tidy -p).
  • New backends / devices have been added to auto-detection in CMakeLists.txt under if(AUTO_DETECT_DEVICES) and to if(AUTO_DETECT_BACKENDS) if applicable.
  • Only one CUDA-like GPU backend is selectable at a time — the existing mutual-exclusion check in CMakeLists.txt is not broken.
  • Both CI workflows (clang-format.yml, ruff.yml) are green locally (or expected to be green on CI).
  • No new runtime dependency was added without updating pyproject.toml's [project.optional-dependencies] (or justified in the PR description).

Documentation

  • README.md, CONTRIBUTING.md, or inline docs updated when behavior, build flags, or developer workflow changed.
  • New operators, new dispatch helpers, or new public utilities are documented (docstring, header comment, or an addition to CONTRIBUTING.md §Some Code Explanations).
  • Any user-visible breaking change is called out explicitly under "Motivation" and in the commit/PR title with a ! or BREAKING CHANGE: footer.

Security and Safety

  • No secrets, access tokens, internal URLs, customer data, or personal hardware identifiers have been committed.
  • Third-party code is license-compatible and attributed.
  • No unsafe pointer arithmetic, uninitialized reads, or missing bounds checks were introduced.

Comment thread .github/ci_config.yml Outdated
Comment thread .github/ci_config.yml Outdated
Comment thread .github/ci_config.yml Outdated
Comment thread third_party/flashinfer Outdated
Comment thread tests/test_gemm.py Outdated
Comment thread .github/ci_config.yml Outdated
@zhangyue207 zhangyue207 marked this pull request as ready for review May 13, 2026 16:47
@zhangyue207 zhangyue207 requested a review from a team May 13, 2026 16:47
@zhangyue207 zhangyue207 force-pushed the ci/ci-online branch 7 times, most recently from 18c4125 to 4053480 Compare May 13, 2026 19:57
@zhangyue207 zhangyue207 changed the title CI online test ci: CI online test May 14, 2026
Comment thread src/operator.h
bitzyz previously approved these changes May 15, 2026

@bitzyz (Contributor) commented May 15, 2026

Please run CI through again later.

@zhangyue207 zhangyue207 reopened this May 18, 2026
Comment thread .github/workflows/ci_v2_shadow.yml
@zhangyue207 zhangyue207 requested a review from bitzyz May 19, 2026 01:30
@zhangyue207 (Collaborator, Author) replied

> Please run CI through again later.

Done, it has passed.

@voltjia voltjia self-requested a review May 19, 2026 02:00
@voltjia voltjia merged commit 41812a1 into master May 19, 2026
20 checks passed
@voltjia voltjia deleted the ci/ci-online branch May 19, 2026 02:03