feat(observability+resilience): MCPP_VERBOSE in all CI + cold index-update retry #182
Merged
…retry

Two changes prompted by an opaque, repeatedly-reproducing CI failure in
75_index_status_offline.sh (a cold `mcpp self env` in a fresh MCPP_HOME whose
index/sandbox bootstrap fails fast, with the root cause swallowed in non-verbose
logs; verified NOT related to the feature/capability work, since it fails
identically on main):

1. MCPP_VERBOSE env override (src/cli.cppm): MCPP_VERBOSE=<non-empty,!=0> turns
   on verbose logging for EVERY mcpp invocation, including those nested inside
   e2e test scripts that call $MCPP without flags. An explicit --quiet still
   wins. Set MCPP_VERBOSE: "1" in all CI workflows (ci-linux, ci-linux-e2e,
   ci-macos, ci-windows, cross-build-test, ci-fresh-install,
   ci-aarch64-fresh-install) so failures carry full diagnostics. Complements the
   existing MCPP_LOG_LEVEL (which only sets the FILE log level, not stderr).

2. update_index retry (src/xlings.cppm): the index sync is a network git op; a
   single transient blip otherwise fails cold init outright. Bounded retry with
   linear backoff (3 attempts, 2s/4s). Success returns on the first attempt, so
   there is zero added latency in steady state; only a real failure pays the
   backoff. Retry notices go through mcpp::log::verbose (file always, stderr
   under MCPP_VERBOSE).

This PR's own e2e CI run carries MCPP_VERBOSE=1, so it surfaces the real cause
of the 75 bootstrap failure for a targeted follow-up.
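A minimal sketch of the verbosity precedence described in item 1, for illustration only. The names `Verbosity`, `env_enables_verbose`, `resolve_verbosity` and `verbosity_from_environment` are hypothetical and are not taken from src/cli.cppm; the sketch also assumes that "!=0" means the value is not exactly the string "0", and that an explicit --verbose flag behaves the same as the env override.

```cpp
#include <cstdlib>
#include <string_view>

// Hypothetical names; the real logic in src/cli.cppm may be structured differently.
enum class Verbosity { Quiet, Normal, Verbose };

// Assumption: the override is "on" when the value is non-empty and not exactly "0".
bool env_enables_verbose(const char* value) {
    if (value == nullptr) return false;  // variable not set
    std::string_view v{value};
    return !v.empty() && v != "0";
}

// Precedence as stated in the commit message: an explicit --quiet wins,
// then --verbose or MCPP_VERBOSE, otherwise the default level.
Verbosity resolve_verbosity(bool quiet_flag, bool verbose_flag, const char* env_value) {
    if (quiet_flag) return Verbosity::Quiet;
    if (verbose_flag || env_enables_verbose(env_value)) return Verbosity::Verbose;
    return Verbosity::Normal;
}

// Convenience wrapper that reads the real environment variable.
Verbosity verbosity_from_environment(bool quiet_flag, bool verbose_flag) {
    return resolve_verbosity(quiet_flag, verbose_flag, std::getenv("MCPP_VERBOSE"));
}
```

Keeping the decision in a pure function that takes the env value as a parameter makes the precedence easy to unit-test without mutating the process environment.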
…al stability fix)

Implements TODO(mirror-default) option (b). The first-init seed used a hardcoded
"CN" mirror, which strands overseas users and GitHub-hosted CI behind a
slow/unreachable gitcode mirror. That is the actual root cause behind the
repeated 75_index_status_offline.sh cold-bootstrap failures (a US runner seeding
CN can't reach gitcode, so the index clone fails fast and the index reports
'missing').

detect_best_mirror() runs a short, tight-timeout HEAD probe to github.com
(GLOBAL) and gitcode.com (CN), and pins the lower-latency reachable one into
.xlings.json. Priority matches the intended design:

    explicit --mirror (config) > lower-latency auto-probe > GLOBAL fallback

An explicit `mcpp self config --mirror CN|GLOBAL` always wins; the probe only
runs on a fresh init with no explicit choice. Falls back to GLOBAL (reachable
nearly everywhere) if neither host answers.

Verified locally: on a CN host the probe picks CN (gitcode 150ms < github 380ms)
and logs 'mirror: probe github=380ms gitcode=150ms -> CN' under MCPP_VERBOSE; a
US CI runner will symmetrically pick GLOBAL. e2e 75/80/81 green.
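The priority order in this commit message can be sketched as a small pure function. This is an illustration under assumptions, not the body of detect_best_mirror(): `choose_mirror` is a hypothetical name, latencies are passed in as already-measured milliseconds with an empty optional standing for "host did not answer", and the tie-break toward GLOBAL is an assumption.

```cpp
#include <optional>
#include <string>

// Hypothetical helper illustrating:
//   explicit --mirror (config) > lower-latency auto-probe > GLOBAL fallback
// Latencies are in milliseconds; std::nullopt means the host was unreachable.
std::string choose_mirror(const std::optional<std::string>& explicit_mirror,
                          std::optional<int> github_ms,
                          std::optional<int> gitcode_ms) {
    // 1. An explicit user choice always wins.
    if (explicit_mirror) return *explicit_mirror;

    // 2. Both hosts answered: pick the lower latency (assumed tie-break: GLOBAL).
    if (github_ms && gitcode_ms) {
        return (*gitcode_ms < *github_ms) ? "CN" : "GLOBAL";
    }

    // 3. Only gitcode answered: CN is the only reachable region.
    if (gitcode_ms) return "CN";

    // 4. Only github answered, or neither did: fall back to GLOBAL.
    return "GLOBAL";
}
```

With the figures quoted in the commit messages, `choose_mirror(std::nullopt, 380, 150)` yields "CN" and `choose_mirror(std::nullopt, 70, 1060)` yields "GLOBAL".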
… curl probe)
Replaces the curl-based mcpp-side probe (per review: don't reinvent in mcpp;
mirror selection is xlings' job, and it already does it via tinyhttps).
Root cause of the repeated 75_index_status_offline.sh CI failures, now proven:
mcpp seeded a hardcoded "CN" into a fresh .xlings.json. xlings' normalize_mirror_
accepts only "GLOBAL"/"CN" as valid, so "CN" was used DIRECTLY (gitcode) —
bypassing xlings' own detect_install_mirror_(), which probes github vs gitcode
latency (tinyhttps::probe_latency) and picks the reachable/faster region. On a
GitHub-hosted (US) runner gitcode is slow/unreachable, so the cold index/sandbox
bootstrap failed and the index reported 'missing'.
Fix: seed "auto". normalize_mirror_("auto") -> nullopt (invalid) -> xlings
treats the mirror as unset -> runs detect_install_mirror_() -> picks GLOBAL on a
US runner, CN on a China link. mcpp no longer overrides xlings' region choice.
The existing-config guard (config.cppm: only seed when .xlings.json is absent)
already means an explicit `mcpp self config --mirror CN|GLOBAL` is never clobbered.
Empirically confirmed via the curl-probe build's verbose CI run: on the US runner
'probe github=70ms gitcode=1060ms -> GLOBAL' then patchelf/ninja bootstrap
succeeded and 75 PASSED — proving the mirror region is the cause. This commit
hands that region choice back to xlings instead of doing it in mcpp.
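The "auto" seed relies on the normalize-then-detect flow described above. A rough sketch of that flow follows; it is a stand-in, not xlings' code. `normalize_mirror_sketch` and `resolve_mirror_sketch` are hypothetical names, the exact-match comparison is an assumption, and the `detect` callback merely represents whatever detect_install_mirror_() does.

```cpp
#include <optional>
#include <string>
#include <string_view>

// Illustrative stand-in for normalize_mirror_: only "GLOBAL" and "CN" are
// treated as valid; anything else (including "auto") is reported as unset.
std::optional<std::string> normalize_mirror_sketch(std::string_view raw) {
    if (raw == "GLOBAL" || raw == "CN") return std::string{raw};
    return std::nullopt;
}

// Illustrative resolution: a valid configured mirror is used directly,
// otherwise the region-detection callback decides.
template <typename Detect>
std::string resolve_mirror_sketch(std::string_view configured, Detect detect) {
    if (auto pinned = normalize_mirror_sketch(configured)) return *pinned;
    return detect();  // stands in for detect_install_mirror_()
}
```

Under these assumptions a seeded "CN" short-circuits detection entirely, while a seeded "auto" falls through to the callback, which is the behavior change the commit describes.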
…uite jobs

Follow-up to the mirror=auto fix and MCPP_VERBOSE rollout, fixing 3 e2e
regressions surfaced by the previous CI run (75 itself now PASSES):

- 38_self_config_mirror.sh: the default seed is now "auto" (defer to xlings'
  region detection), not "CN". Updated the assertion; explicit --mirror
  GLOBAL/cn/BAD checks unchanged.
- 48_build_error_output.sh / 53_namespaced_cache_label.sh assert mcpp's DEFAULT
  (quiet) output. Forcing MCPP_VERBOSE=1 in CI broke them. Removed MCPP_VERBOSE
  from the workflows that run the e2e suite (ci-linux-e2e, ci-macos,
  ci-windows); kept it where it's diagnostic and safe (ci-fresh-install and
  ci-aarch64-fresh-install, i.e. the cold bootstrap path, plus ci-linux unit
  tests and cross-build). A test that needs verbose passes --verbose itself.

Verified locally: 38/48/53 green.
Bump mcpp.toml + MCPP_VERSION to 0.0.70. CHANGELOG: seed "auto" so xlings' region detection runs (fixes overseas/CI cold bootstrap), MCPP_VERBOSE env, update_index retry.
Why

`75_index_status_offline.sh` has failed 3× in a row (PR #181 twice + main post-merge) with an identical, opaque cause: a cold `mcpp self env` in a fresh `MCPP_HOME` whose index/sandbox bootstrap fails fast (~9s) → indexes report `missing` → the test's `index status` assertion fails. The real `xlings update` / patchelf-install error is swallowed in non-verbose logs.

Confirmed NOT related to the feature/capability work (#181): it fails identically on `main` after that merge, and the diff there doesn't touch the bootstrap/index code path. `main` was green through 0.0.68 (2026-06-27), so this is a recently-surfaced environment/bootstrap fragility.

What

1. `MCPP_VERBOSE` env override (`src/cli.cppm`): `MCPP_VERBOSE=<non-empty,≠0>` turns on verbose logging for every mcpp invocation, including those nested inside e2e scripts that call `$MCPP` without flags (YAML can't reach those). Set `MCPP_VERBOSE: "1"` in all CI workflows. Complements the existing `MCPP_LOG_LEVEL` (which only sets the file log level, not stderr, so it is invisible in CI logs). An explicit `--quiet` still wins.
2. Cold index-update retry (`src/xlings.cppm`): `update_index` runs a network git op once; a single transient blip fails cold init outright. Bounded retry with linear backoff (3 attempts, 2s/4s). Success returns on the first attempt, so there is zero added latency in steady state; only a genuine failure pays the backoff. Retry notices go through `mcpp::log::verbose`.

Self-validating

This PR's own e2e CI run carries `MCPP_VERBOSE=1`, so it will surface the real cause of the 75 bootstrap failure (currently hidden). The retry covers the transient case; if the verbose output reveals a persistent fast-fail (e.g. a moved mirror / broken URL), a targeted follow-up will address that.

No version bump (CI/diagnostics + internal resilience; `mcpp.toml`/`MCPP_VERSION` stay 0.0.69, consistent).
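For reference, the bounded retry with linear backoff described under "What" could look roughly like the sketch below. It is not the code in src/xlings.cppm: `retry_with_linear_backoff` and its parameters are hypothetical, and the injectable `sleep_for` callback is an assumption added so the schedule can be exercised without real waiting.

```cpp
#include <chrono>
#include <functional>
#include <thread>

// Hypothetical helper: runs `attempt` up to `max_attempts` times and waits
// step * n after the n-th failure (2s, then 4s with the defaults).
// Returns true as soon as an attempt succeeds, false if all attempts fail.
bool retry_with_linear_backoff(
    const std::function<bool()>& attempt,
    int max_attempts = 3,
    std::chrono::seconds step = std::chrono::seconds{2},
    const std::function<void(std::chrono::seconds)>& sleep_for =
        [](std::chrono::seconds d) { std::this_thread::sleep_for(d); }) {
    for (int i = 1; i <= max_attempts; ++i) {
        if (attempt()) return true;  // first-try success pays no delay
        if (i < max_attempts) {
            sleep_for(step * i);     // linear backoff; no sleep after the last failure
        }
    }
    return false;
}
```

A call site would pass a lambda that performs the index sync and returns whether it succeeded; with the defaults, a persistent failure costs three attempts and 6 seconds of waiting in total.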