CI: tidy nightly test-matrix + bump torch to 2.12.1 #2272
Open
leofang wants to merge 5 commits into
- ci/test-matrix.yml: move per-row MODE/TORCH_VER/TORCH_CUDA into the
ENV map (rides the existing matrix-env injection step). Add a
nightly-standard arm64 gh200 row. Bump latest-PyTorch rows from
2.11.0 to 2.12.1; 2.9.1 rows untouched.
- .github/workflows/ci-nightly.yml: matrix_filter selectors now key on
.ENV.MODE.
- .github/workflows/test-wheel-{linux,windows}.yml: job-name format
strings read TORCH_VER/MODE from matrix.ENV; TORCH_CUDA also rendered
in the name (e.g. ", 2.12.1+cu126"). Drop the now-redundant
TORCH_VER/TORCH_CUDA lines from the pytorch step's env block.
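The selection and naming behaviour described in this commit message can be modelled outside the workflow files. The following is a minimal Python sketch, not the actual workflow expressions: the function names `select_by_mode` and `job_name_suffix` are hypothetical, the second sample row is invented for illustration, and the assumption is that the job-name suffix is only rendered when both TORCH_VER and TORCH_CUDA are present in a row's ENV map.

```python
# Hypothetical model of matrix rows after this change: per-row settings
# (MODE, TORCH_VER, TORCH_CUDA) live under the ENV map.
rows = [
    # Shape taken from a row quoted in this PR.
    {"ARCH": "arm64", "PY_VER": "3.12", "CUDA_VER": "12.9.1", "GPU": "l4",
     "ENV": {"MODE": "nightly-numba-cuda"}},
    # Invented row for illustration only; not copied from ci/test-matrix.yml.
    {"ARCH": "amd64", "PY_VER": "3.12", "CUDA_VER": "12.9.1", "GPU": "l4",
     "ENV": {"MODE": "nightly-pytorch", "TORCH_VER": "2.12.1",
             "TORCH_CUDA": "cu126"}},
]


def select_by_mode(rows, mode):
    """Keep rows whose ENV.MODE equals `mode`; rows without ENV are skipped."""
    return [row for row in rows if row.get("ENV", {}).get("MODE") == mode]


def job_name_suffix(row):
    """Return ', <TORCH_VER>+<TORCH_CUDA>' when both are set, else ''."""
    env = row.get("ENV", {})
    ver, cuda = env.get("TORCH_VER"), env.get("TORCH_CUDA")
    if ver and cuda:
        return f", {ver}+{cuda}"
    return ""


pytorch_rows = select_by_mode(rows, "nightly-pytorch")
print(len(pytorch_rows), job_name_suffix(pytorch_rows[0]))
```

Keying the selector on the nested ENV entry means a row carries everything it needs in one place, so adding a new per-row setting would not require a new top-level matrix key.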
Contributor
Pad PY_VER and GPU columns in the nightly section to match the widths used by the pull-request rows above (17-char PY_VER, 19-char GPU). Purely cosmetic; YAML parse and matrix expansion unchanged.
Remove before merging.
Member
Author
/ok to test d29bc34
leofang commented Jun 27, 2026
- { ARCH: 'arm64', PY_VER: '3.12', CUDA_VER: '12.9.1', LOCAL_CTK: '0', GPU: 'l4', GPU_COUNT: '1', DRIVER: 'latest', ENV: { MODE: 'nightly-numba-cuda' } }
- { ARCH: 'arm64', PY_VER: '3.12', CUDA_VER: '13.3.0', LOCAL_CTK: '0', GPU: 'l4', GPU_COUNT: '1', DRIVER: 'latest', ENV: { MODE: 'nightly-numba-cuda' } }
# nightly-standard (arm64 nightly-only runners — per runner team request)
- { ARCH: 'arm64', PY_VER: '3.14', CUDA_VER: '13.3.0', LOCAL_CTK: '1', GPU: 'gh200', GPU_COUNT: '1', DRIVER: 'latest', ENV: { MODE: 'nightly-standard' } }
Member
Author
Adding an experimental G+H pipeline here (cc @kkraus14 for vis).
Member
Author
Killed the hanging G+H pipeline: @bdice also saw the same issue in RMM: https://github.com/rapidsai/rmm/actions/runs/28270219058/job/83767744891?pr=2457. Will revisit later...
The gh200 runner currently hangs on stream-ordered memory allocator calls (cudaMallocAsync). Disabling until the runner-side issue is resolved.
This reverts commit d29bc34.
Member
Author
/ok to test 7e002a6
Summary
- Move MODE/TORCH_VER/TORCH_CUDA of nightly entries into the ENV: map so they ride the existing matrix-env injection step in test-wheel-{linux,windows}.yml. Workflow selectors (ci-nightly.yml) and job-name strings updated accordingly.
- Bump latest-PyTorch rows from 2.11.0 → 2.12.1; 2.9.1 rows unchanged.
- Render TORCH_CUDA in job names, e.g. ", 2.12.1+cu126".
- Add a nightly-standard arm64 gh200 row, but comment it out for now: the gh200 runner currently hangs on stream-ordered memory allocator (cudaMallocAsync) calls. The row is left in place (with a TODO) so it can be re-enabled once the runner-side issue is resolved.

Test plan
- Exercise the nightly modes (nightly-pytorch, nightly-numba-cuda, nightly-standard) via a workflow run.
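Before triggering a workflow run, a quick local check can confirm that every nightly mode named in the test plan is still selected by at least one active row, which matters once a row such as the gh200 one is commented out. This is a minimal sketch under stated assumptions: `mode_counts` and `missing_modes` are hypothetical helpers, and `sample_rows` is an invented stand-in rather than the contents of ci/test-matrix.yml.

```python
from collections import Counter

# Modes named in the test plan.
EXPECTED_MODES = {"nightly-pytorch", "nightly-numba-cuda", "nightly-standard"}


def mode_counts(rows):
    """Count active rows per ENV.MODE; rows without ENV.MODE are ignored."""
    modes = (row.get("ENV", {}).get("MODE") for row in rows)
    return Counter(mode for mode in modes if mode is not None)


def missing_modes(rows, expected=EXPECTED_MODES):
    """Return the expected modes that no active row selects, sorted by name."""
    return sorted(expected - set(mode_counts(rows)))


# Invented sample for illustration; a commented-out YAML row simply would
# not appear in the parsed list.
sample_rows = [
    {"GPU": "l4", "ENV": {"MODE": "nightly-numba-cuda"}},
    {"GPU": "l4", "ENV": {"MODE": "nightly-numba-cuda"}},
    {"GPU": "l4", "ENV": {"MODE": "nightly-pytorch", "TORCH_VER": "2.12.1"}},
    {"GPU": "l4"},  # assumed shape of a row with no nightly MODE
]

print(mode_counts(sample_rows))
print(missing_modes(sample_rows))
```

In this invented sample no row selects nightly-standard, so the check would flag it; whether the real matrix has other nightly-standard rows is not stated in this PR.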