
CI: tidy nightly test-matrix + bump torch to 2.12.1#2272

Open
leofang wants to merge 5 commits into NVIDIA:main from leofang:leofang/ci-test-matrix-env-refactor

leofang commented Jun 27, 2026
Member

Summary

  • Collapse per-row MODE / TORCH_VER / TORCH_CUDA of nightly entries into the ENV: map so they ride the existing matrix-env injection step in test-wheel-{linux,windows}.yml. The workflow selectors in ci-nightly.yml and the job-name strings are updated accordingly.
  • Bump latest-PyTorch rows from 2.11.0 to 2.12.1; 2.9.1 rows unchanged.
  • Job names now also show the torch CUDA suffix, e.g. ", 2.12.1+cu126".
  • Align nightly section columns with the pull-request rows for readability.
  • Add a nightly-standard arm64 gh200 row, but comment it out for now: the gh200 runner currently hangs on stream-ordered memory allocator (cudaMallocAsync) calls. The row is left in place (with a TODO) so it can be re-enabled once the runner-side issue is resolved.
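To picture the first bullet, here is a minimal Python sketch of how rows that carry their per-row settings inside an ENV map could be selected by mode and turned into environment variables. It is an illustration under assumptions, not the workflow code: the real selection and injection happen in the GitHub workflow files, the helper names `select_by_mode` and `env_for_row` are hypothetical, the nightly-pytorch row is invented, and the numba-cuda row is a trimmed copy of an entry from ci/test-matrix.yml.

```python
# Illustrative only: rows shaped like the ci/test-matrix.yml entries, with
# per-row settings (MODE, TORCH_VER, TORCH_CUDA) nested under ENV.
# The nightly-pytorch row is hypothetical; the numba-cuda row is a trimmed
# copy of an entry in this PR's matrix (LOCAL_CTK/GPU_COUNT/DRIVER dropped).
ROWS = [
    {"ARCH": "amd64", "PY_VER": "3.12", "CUDA_VER": "12.9.1", "GPU": "l4",
     "ENV": {"MODE": "nightly-pytorch", "TORCH_VER": "2.12.1",
             "TORCH_CUDA": "cu126"}},
    {"ARCH": "arm64", "PY_VER": "3.12", "CUDA_VER": "13.3.0", "GPU": "l4",
     "ENV": {"MODE": "nightly-numba-cuda"}},
]


def select_by_mode(rows, mode):
    """Keep rows whose ENV.MODE equals `mode` (the idea behind a .ENV.MODE selector)."""
    return [r for r in rows if r.get("ENV", {}).get("MODE") == mode]


def env_for_row(row, base_env=None):
    """Return a copy of base_env with every ENV key of the row added as a
    string variable, as a generic matrix-env injection step would."""
    env = dict(base_env or {})
    env.update({key: str(value) for key, value in row.get("ENV", {}).items()})
    return env


if __name__ == "__main__":
    for row in select_by_mode(ROWS, "nightly-pytorch"):
        print(env_for_row(row, {"CI": "true"}))
```

With this layout the selector and the injection step only need to know about ENV; the individual keys pass through unchanged.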

Test plan

  • Verify nightly matrix expansion across modes (nightly-pytorch, nightly-numba-cuda, nightly-standard) via a workflow run.
  • PyTorch 2.12.1 wheels (cu126 / cu130) install cleanly.
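The expansion check in the first test-plan item relies on a real workflow run. As a cheaper local complement, a sanity check over the parsed matrix could look like the Python sketch below. It assumes the rows have already been loaded into a list of dicts (for example with a YAML parser, not shown); `check_matrix`, `ALLOWED_TORCH` and the sample rows are hypothetical, and which modes to expect is left as a parameter because the set of active nightly rows may change (e.g. while the gh200 row is commented out).

```python
from collections import Counter

# Latest-PyTorch rows are expected on 2.12.1; the older rows stay on 2.9.1.
ALLOWED_TORCH = {"2.12.1", "2.9.1"}


def mode_counts(rows):
    """Count matrix rows per ENV.MODE value."""
    return Counter(row.get("ENV", {}).get("MODE") for row in rows)


def check_matrix(rows, expected_modes):
    """Return a list of problems; an empty list means every check passed."""
    problems = []
    counts = mode_counts(rows)
    for mode in expected_modes:
        if counts[mode] == 0:
            problems.append(f"no rows for mode {mode!r}")
    for i, row in enumerate(rows):
        env = row.get("ENV", {})
        if env.get("MODE") != "nightly-pytorch":
            continue
        version = env.get("TORCH_VER")
        if version not in ALLOWED_TORCH:
            problems.append(f"row {i}: unexpected TORCH_VER {version!r}")
        if not env.get("TORCH_CUDA"):
            problems.append(f"row {i}: missing TORCH_CUDA")
    return problems


# Hypothetical rows for demonstration only.
SAMPLE = [
    {"ENV": {"MODE": "nightly-pytorch", "TORCH_VER": "2.12.1", "TORCH_CUDA": "cu126"}},
    {"ENV": {"MODE": "nightly-pytorch", "TORCH_VER": "2.12.1", "TORCH_CUDA": "cu130"}},
    {"ENV": {"MODE": "nightly-numba-cuda"}},
]

if __name__ == "__main__":
    print(check_matrix(SAMPLE, ["nightly-pytorch", "nightly-numba-cuda"]))
```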

- ci/test-matrix.yml: move per-row MODE/TORCH_VER/TORCH_CUDA into the
  ENV map (rides the existing matrix-env injection step). Add a
  nightly-standard arm64 gh200 row. Bump latest-PyTorch rows from
  2.11.0 to 2.12.1; 2.9.1 rows untouched.
- .github/workflows/ci-nightly.yml: matrix_filter selectors now key on
  .ENV.MODE.
- .github/workflows/test-wheel-{linux,windows}.yml: job-name format
  strings read TORCH_VER/MODE from matrix.ENV; TORCH_CUDA also rendered
  in the name (e.g. ", 2.12.1+cu126"). Drop the now-redundant
  TORCH_VER/TORCH_CUDA lines from the pytorch step's env block.
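The job-name change can be pictured with a short Python sketch. In the workflows the name is built with a GitHub Actions format string; this sketch only mirrors the string logic. The overall name layout and the `job_name`/`torch_suffix` helpers are hypothetical; only the ", 2.12.1+cu126" suffix shape is taken from this PR.

```python
def torch_suffix(env):
    """Return ', <TORCH_VER>+<TORCH_CUDA>' for rows that pin a torch build, else ''."""
    version = env.get("TORCH_VER")
    if not version:
        return ""
    cuda = env.get("TORCH_CUDA")
    return f", {version}+{cuda}" if cuda else f", {version}"


def job_name(row):
    """Hypothetical job-name layout; only the torch suffix format comes from the PR."""
    env = row.get("ENV", {})
    parts = [row["ARCH"], f"py{row['PY_VER']}", row["CUDA_VER"], row["GPU"]]
    if env.get("MODE"):
        parts.append(env["MODE"])
    return "Test (" + ", ".join(parts) + torch_suffix(env) + ")"


if __name__ == "__main__":
    row = {"ARCH": "amd64", "PY_VER": "3.12", "CUDA_VER": "12.9.1", "GPU": "l4",
           "ENV": {"MODE": "nightly-pytorch", "TORCH_VER": "2.12.1",
                   "TORCH_CUDA": "cu126"}}
    # prints: Test (amd64, py3.12, 12.9.1, l4, nightly-pytorch, 2.12.1+cu126)
    print(job_name(row))
```

Rows without TORCH_VER (for example nightly-numba-cuda) get no suffix, so their names are unaffected.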
copy-pr-bot (bot) commented Jun 27, 2026
Contributor

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

github-actions (bot) added the CI/CD (CI/CD infrastructure) label Jun 27, 2026
leofang added 2 commits June 27, 2026 04:18
Pad PY_VER and GPU columns in the nightly section to match the widths
used by the pull-request rows above (17-char PY_VER, 19-char GPU).
Purely cosmetic; YAML parse and matrix expansion unchanged.
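A small Python sketch of this kind of column alignment follows. It assumes the stated widths cover the whole `KEY: 'value', ` segment including its trailing separator; that is an interpretation, not something the commit confirms. The helpers `pad_field` and `same_tokens` are hypothetical, and the second one checks that a padded row differs from the original only in whitespace.

```python
import re


def pad_field(key: str, value: str, width: int) -> str:
    """Render "KEY: 'value', " left-justified to `width` characters."""
    field = f"{key}: '{value}', "
    if len(field) > width:
        raise ValueError(f"{field!r} is wider than {width} characters")
    return field.ljust(width)


def same_tokens(a: str, b: str) -> bool:
    """True if a and b differ only in runs of whitespace."""
    return re.split(r"\s+", a.strip()) == re.split(r"\s+", b.strip())


if __name__ == "__main__":
    original = "{ PY_VER: '3.12', GPU: 'l4', }"
    padded = "{ " + pad_field("PY_VER", "3.12", 17) + pad_field("GPU", "l4", 19) + "}"
    print(padded)
    print(same_tokens(original, padded))
```

Since YAML flow mappings ignore extra whitespace between entries, a whitespace-only change like this should leave parsing unchanged, consistent with the commit's "purely cosmetic" note.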
leofang commented Jun 27, 2026
Member Author

/ok to test d29bc34

Review comment on ci/test-matrix.yml (outdated)
- { ARCH: 'arm64', PY_VER: '3.12', CUDA_VER: '12.9.1', LOCAL_CTK: '0', GPU: 'l4', GPU_COUNT: '1', DRIVER: 'latest', ENV: { MODE: 'nightly-numba-cuda' } }
- { ARCH: 'arm64', PY_VER: '3.12', CUDA_VER: '13.3.0', LOCAL_CTK: '0', GPU: 'l4', GPU_COUNT: '1', DRIVER: 'latest', ENV: { MODE: 'nightly-numba-cuda' } }
# nightly-standard (arm64 nightly-only runners — per runner team request)
- { ARCH: 'arm64', PY_VER: '3.14', CUDA_VER: '13.3.0', LOCAL_CTK: '1', GPU: 'gh200', GPU_COUNT: '1', DRIVER: 'latest', ENV: { MODE: 'nightly-standard' } }

Member Author

Adding an experimental Grace Hopper (G+H) pipeline here (cc @kkraus14 for visibility).


leofang commented Jun 27, 2026
Member Author

leofang added 2 commits June 28, 2026 02:55
The gh200 runner currently hangs on stream-ordered memory allocator
calls (cudaMallocAsync). Disabling until the runner-side issue is
resolved.
leofang changed the title from "CI: tidy nightly test-matrix + add arm64 gh200 + bump torch 2.12.1" to "CI: tidy nightly test-matrix + bump torch to 2.12.1" Jun 28, 2026
leofang marked this pull request as ready for review June 28, 2026 03:02
leofang commented Jun 28, 2026
Member Author

/ok to test 7e002a6

leofang self-assigned this Jun 28, 2026
leofang added this to the cuda.core next milestone Jun 28, 2026
leofang added the enhancement (Any code-related improvements) and P1 (Medium priority - Should do) labels Jun 28, 2026