feat(observability+resilience): MCPP_VERBOSE in all CI + cold index-update retry #182

Merged
Sunrisepeak merged 5 commits into main from feat/ci-observability-index-resilience
Jun 29, 2026

@Sunrisepeak

Why

75_index_status_offline.sh has failed three times in a row (twice on PR #181, once on main after the merge) with the same opaque cause: a cold mcpp self env in a fresh MCPP_HOME fails its index/sandbox bootstrap quickly (~9s), the indexes then report missing, and the test's index status assertion fails. The underlying xlings update / patchelf-install error is swallowed in non-verbose logs.

Confirmed NOT related to the feature/capability work (#181): it fails identically on main after that merge, and the diff there doesn't touch the bootstrap/index code path. main was green through 0.0.68 (2026-06-27), so this is a recently-surfaced environment/bootstrap fragility.

What

  1. MCPP_VERBOSE env override (src/cli.cppm): MCPP_VERBOSE=<non-empty,≠0> turns on verbose logging for every mcpp invocation, including those nested inside e2e scripts that call $MCPP without flags, where the workflow YAML cannot add a flag. Set MCPP_VERBOSE: "1" in all CI workflows. This complements the existing MCPP_LOG_LEVEL, which only sets the file log level, not stderr, and is therefore invisible in CI logs. An explicit --quiet still wins.

  2. Cold index-update retry (src/xlings.cppm) — update_index runs a network git op once; a single transient blip fails cold init outright. Bounded retry with linear backoff (3 attempts, 2s/4s). Success returns on the first attempt — zero added latency in steady state; only a genuine failure pays the backoff. Retry notices go through mcpp::log::verbose.
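
The env override rule in item 1 can be sketched as a small pure function. This is a minimal illustration, not the code in src/cli.cppm: the names `env_value_enables_verbose`, `resolve_verbosity`, and `Verbosity` are hypothetical. The source states only that `--quiet` wins, so ranking `--verbose` alongside the env var is an assumption.

```cpp
#include <cstdlib>
#include <string_view>

// Hypothetical helper: true when the value is set, non-empty, and not "0".
bool env_value_enables_verbose(const char* value) {
    if (value == nullptr) return false;
    std::string_view v{value};
    return !v.empty() && v != "0";
}

enum class Verbosity { Quiet, Normal, Verbose };

// Assumed resolution order: an explicit --quiet wins (per the PR text);
// otherwise --verbose or MCPP_VERBOSE enables verbose output.
Verbosity resolve_verbosity(bool quiet_flag, bool verbose_flag,
                            const char* env_value) {
    if (quiet_flag) return Verbosity::Quiet;
    if (verbose_flag || env_value_enables_verbose(env_value)) {
        return Verbosity::Verbose;
    }
    return Verbosity::Normal;
}

// Convenience wrapper that reads the real process environment.
Verbosity resolve_verbosity_from_env(bool quiet_flag, bool verbose_flag) {
    return resolve_verbosity(quiet_flag, verbose_flag,
                             std::getenv("MCPP_VERBOSE"));
}
```

Taking the env value as a parameter keeps the rule unit-testable without mutating the process environment.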

Self-validating

This PR's own e2e CI run carries MCPP_VERBOSE=1, so it will surface the real cause of the 75 bootstrap failure (currently hidden). The retry covers the transient case; if the verbose output reveals a persistent fast-fail (e.g. a moved mirror / broken URL), a targeted follow-up will address that.

No version bump (CI/diagnostics + internal resilience; mcpp.toml/MCPP_VERSION stay 0.0.69, consistent).

…retry

Two changes prompted by an opaque, repeatedly-reproducing CI failure in
75_index_status_offline.sh (a cold `mcpp self env` in a fresh MCPP_HOME whose
index/sandbox bootstrap fails fast — root cause swallowed in non-verbose logs;
verified NOT related to the feature/capability work, it fails identically on
main):

1. MCPP_VERBOSE env override (src/cli.cppm): MCPP_VERBOSE=<non-empty,!=0> turns
   on verbose logging for EVERY mcpp invocation, including those nested inside
   e2e test scripts that call $MCPP without flags. An explicit --quiet still
   wins. Set MCPP_VERBOSE: "1" in all CI workflows (ci-linux, ci-linux-e2e,
   ci-macos, ci-windows, cross-build-test, ci-fresh-install,
   ci-aarch64-fresh-install) so failures carry full diagnostics. Complements the
   existing MCPP_LOG_LEVEL (which only sets the FILE log level, not stderr).

2. update_index retry (src/xlings.cppm): the index sync is a network git op;
   a single transient blip otherwise fails cold init outright. Bounded retry
   with linear backoff (3 attempts, 2s/4s). Success returns on the first
   attempt — zero added latency in steady state; only a real failure pays the
   backoff. Retry notices go through mcpp::log::verbose (file always, stderr
   under MCPP_VERBOSE).
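
The bounded retry in item 2 might look roughly like the sketch below. It is an illustration under stated assumptions, not the code in src/xlings.cppm: `retry_with_linear_backoff` and its parameters are hypothetical names, and the injectable `sleep_fn` exists only so the delays can be observed in a test.

```cpp
#include <chrono>
#include <functional>
#include <thread>

// Hypothetical helper: calls `op` up to `max_attempts` times. After the
// n-th failed attempt it waits n * base_delay, so the defaults give
// waits of 2s and then 4s across 3 attempts.
bool retry_with_linear_backoff(
    const std::function<bool()>& op,
    int max_attempts = 3,
    std::chrono::milliseconds base_delay = std::chrono::seconds(2),
    const std::function<void(std::chrono::milliseconds)>& sleep_fn =
        [](std::chrono::milliseconds d) { std::this_thread::sleep_for(d); }) {
    for (int attempt = 1; attempt <= max_attempts; ++attempt) {
        if (op()) return true;               // success: no delay is paid
        if (attempt == max_attempts) break;  // no wait after the last failure
        sleep_fn(base_delay * attempt);      // linear backoff: 2s, then 4s
    }
    return false;
}
```

Because the wait happens only after a failed attempt, a first-attempt success returns immediately, which matches the "zero added latency in steady state" claim.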

This PR's own e2e CI run carries MCPP_VERBOSE=1, so it surfaces the real cause
of the 75 bootstrap failure for a targeted follow-up.
…al stability fix)

Implements TODO(mirror-default) option (b). The first-init seed used a hardcoded
"CN" mirror, which strands overseas users and GitHub-hosted CI behind a
slow/unreachable gitcode mirror — the actual root cause behind the repeated
75_index_status_offline.sh cold-bootstrap failures (a US runner seeding CN can't
reach gitcode, so the index clone fails fast and the index reports 'missing').

detect_best_mirror() runs a short, tight-timeout HEAD probe to github.com
(GLOBAL) and gitcode.com (CN), and pins the lower-latency reachable one into
.xlings.json. Priority matches the intended design:
  explicit --mirror (config)  >  lower-latency auto-probe  >  GLOBAL fallback
An explicit `mcpp self config --mirror CN|GLOBAL` always wins; the probe only
runs on a fresh init with no explicit choice. Falls back to GLOBAL (reachable
nearly everywhere) if neither host answers.
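
The priority order above can be sketched as a pure selection function. This is a hypothetical illustration, not the body of detect_best_mirror(): `choose_mirror` is an invented name, an empty latency stands for a host that did not answer within the timeout, and breaking ties toward GLOBAL is an assumption.

```cpp
#include <optional>
#include <string>

// Hypothetical selection rule:
//   explicit --mirror  >  lower-latency reachable host  >  GLOBAL fallback
// A std::nullopt latency means the probe got no answer in time.
std::string choose_mirror(const std::optional<std::string>& explicit_mirror,
                          std::optional<int> github_ms,
                          std::optional<int> gitcode_ms) {
    if (explicit_mirror) return *explicit_mirror;  // explicit config wins
    if (github_ms && gitcode_ms) {
        // Both reachable: pick the faster one (ties go to GLOBAL, assumed).
        return *gitcode_ms < *github_ms ? "CN" : "GLOBAL";
    }
    if (gitcode_ms) return "CN";  // only gitcode answered
    return "GLOBAL";              // only github answered, or neither did
}
```

With the latencies quoted in this PR, `choose_mirror(std::nullopt, 380, 150)` selects CN and `choose_mirror(std::nullopt, 70, 1060)` selects GLOBAL.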

Verified locally: on a CN host the probe picks CN (gitcode 150ms < github 380ms)
and logs 'mirror: probe github=380ms gitcode=150ms -> CN' under MCPP_VERBOSE; a
US CI runner will symmetrically pick GLOBAL. e2e 75/80/81 green.
… curl probe)

Replaces the curl-based mcpp-side probe (per review: don't reinvent in mcpp;
mirror selection is xlings' job, and it already does it via tinyhttps).

Root cause of the repeated 75_index_status_offline.sh CI failures, now proven:
mcpp seeded a hardcoded "CN" into a fresh .xlings.json. xlings' normalize_mirror_
accepts only "GLOBAL"/"CN" as valid, so "CN" was used DIRECTLY (gitcode) —
bypassing xlings' own detect_install_mirror_(), which probes github vs gitcode
latency (tinyhttps::probe_latency) and picks the reachable/faster region. On a
GitHub-hosted (US) runner gitcode is slow/unreachable, so the cold index/sandbox
bootstrap failed and the index reported 'missing'.

Fix: seed "auto". normalize_mirror_("auto") -> nullopt (invalid) -> xlings
treats the mirror as unset -> runs detect_install_mirror_() -> picks GLOBAL on a
US runner, CN on a China link. mcpp no longer overrides xlings' region choice.
The existing-config guard (config.cppm: only seed when .xlings.json is absent)
already means an explicit `mcpp self config --mirror CN|GLOBAL` is never clobbered.
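
The reason "auto" defers to detection can be modelled in a few lines. This is a stand-in for the behaviour described above, not xlings' actual normalize_mirror_ or detect_install_mirror_(): the function names, signatures, and exact-match comparison are all assumptions made for illustration.

```cpp
#include <optional>
#include <string>
#include <string_view>

// Hypothetical model of the described rule: only "GLOBAL" and "CN" are
// valid pinned mirrors; anything else (including "auto") is treated as unset.
std::optional<std::string> normalize_mirror_sketch(std::string_view raw) {
    if (raw == "GLOBAL" || raw == "CN") return std::string(raw);
    return std::nullopt;
}

// `detected` stands in for the result of the region probe. It is used
// only when the seeded value does not normalize to a valid mirror.
std::string effective_mirror(std::string_view seeded,
                             std::string_view detected) {
    auto pinned = normalize_mirror_sketch(seeded);
    return pinned ? *pinned : std::string(detected);
}
```

Under this model a seeded "CN" is used directly even when detection would say GLOBAL, while a seeded "auto" falls through to the detected region.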

Empirically confirmed via the curl-probe build's verbose CI run: on the US runner
'probe github=70ms gitcode=1060ms -> GLOBAL' then patchelf/ninja bootstrap
succeeded and 75 PASSED — proving the mirror region is the cause. This commit
hands that region choice back to xlings instead of doing it in mcpp.
…uite jobs

Follow-up to the mirror=auto fix and MCPP_VERBOSE rollout, fixing 3 e2e
regressions surfaced by the previous CI run (75 itself now PASSES):

- 38_self_config_mirror.sh: the default seed is now "auto" (defer to xlings'
  region detection), not "CN". Updated the assertion; explicit --mirror
  GLOBAL/cn/BAD checks unchanged.
- 48_build_error_output.sh / 53_namespaced_cache_label.sh assert mcpp's DEFAULT
  (quiet) output. Forcing MCPP_VERBOSE=1 in CI broke them. Removed MCPP_VERBOSE
  from the workflows that run the e2e suite (ci-linux-e2e, ci-macos, ci-windows);
  kept it where it's diagnostic and safe (ci-fresh-install,
  ci-aarch64-fresh-install — the cold bootstrap path — plus ci-linux unit tests
  and cross-build). A test that needs verbose passes --verbose itself.

Verified locally: 38/48/53 green.
Bump mcpp.toml + MCPP_VERSION to 0.0.70. CHANGELOG: seed "auto" so xlings'
region detection runs (fixes overseas/CI cold bootstrap), MCPP_VERBOSE env,
update_index retry.
@Sunrisepeak Sunrisepeak merged commit dbcbe7a into main Jun 29, 2026
@Sunrisepeak Sunrisepeak deleted the feat/ci-observability-index-resilience branch June 29, 2026 03:45