Skip to content

Harden the dual-era stream loop's era-lock and rejection semantics#3040

Open
maxisbey wants to merge 3 commits into
mainfrom
dual-era-loop-hardening
Open

Harden the dual-era stream loop's era-lock and rejection semantics#3040
maxisbey wants to merge 3 commits into
mainfrom
dual-era-loop-hardening

Conversation

@maxisbey

@maxisbey maxisbey commented Jun 30, 2026

Copy link
Copy Markdown
Contributor

Post-merge review follow-up to #3038. An adversarial review pass over the dual-era driver found a family of cases where a request that failed could still lock the connection's era — and on stdio a connection is a whole subprocess, so a locked-then-stranded connection has no recovery short of a restart.

The era-lock fixes

The invariant the driver documents is: released auto-negotiating clients fall back on any error code except -32022, so a failed probe must leave the legacy handshake available. Three openers violated it:

  1. Well-formed envelope, malformed content (e.g. clientInfo: 42): classification checks key presence only, so the era locked modern before pydantic rejected the content. The request failed with a fallback-eligible -32602, but the fallback then hit -32022 forever. The lock now commits only after the request succeeds, mirroring the legacy side's existing "lock only on success" rule.
  2. initialize stamped with the envelope triple: routed modern (the triple won), failed METHOD_NOT_FOUND (initialize doesn't exist at 2026-07-28), and locked the era anyway. initialize is legacy-distinctive by definition, so it now always takes the handshake path regardless of decoration.
  3. subscriptions/listen as the opener: the not-served-on-this-transport rejection used to land after the lock; now, like every failed request, it locks nothing.

No conforming client — released SDK or otherwise — produces any of these frames; the fixes are robustness against buggy third-party modern clients.

Compatibility restoration

A bare server/discover on a legacy-locked connection now falls through to the loop runner's per-version validation, restoring the byte-identical METHOD_NOT_FOUND (-32601, same message and data) a handshake-only server produced — #3038 had it rejecting with -32600. Only envelope-bearing frames get the cross-era INVALID_REQUEST; a conforming legacy client can never send the reserved triple, so legacy traffic is byte-identical again in full.

Note for anyone bisecting client behavior: v2.0.0b1 ships the -32600 variant for this one frame shape (it carries #3038 but not this fix), so a legacy-locked bare server/discover answers -32600 on b1 and -32601 before #3038 and from this PR onward.

Smaller fixes

  • classify_inbound_request rejects a present-but-non-string protocol version with a descriptive INVALID_PARAMS instead of raising ValidationError out of its own -32022 payload construction (requested is a str field). The dispatcher's exception ladder happened to mask that as a generic -32602 — same code, now deliberate, with a message that names the defect. Unreachable over HTTP (the header rung fires first), so the wire delta is stdio-only.
  • The modern dispatch wrapper masks transport.can_send_request so per-message transport metadata agrees with its denial, matching the modern HTTP entry's context — closing a latent inconsistency before anything consumes the field.
  • NotifyOnlyOutbound inherits the no-channel refusal instead of duplicating it; the has_standalone_channel/Raises contracts now document that a modern connection refuses requests on a real channel.
  • Docs: migration guide + protocol-versions page note the probe is transport-independent; the subscriptions page documents the streamable-HTTP-only limitation of subscriptions/listen; a stale serve_loop docstring claim is corrected. Two pyright suppressions in the stdio test doubles are replaced with typeshed-compatible Buffer signatures.

Validation

Every fix has a pinning test (rejection codes, lock neutrality of each failing opener, the byte-identical legacy fall-through, the classifier guard, the transport mask). Beyond the suite (100% coverage), each formerly-stranding opener was driven against a live stdio subprocess and shown to leave the legacy handshake working on the same pipe, with the full original probe matrix (auto/legacy/pinned negotiation, pipelined probe ordering, -32022 round-trip) passing unchanged.

Round 2 — review follow-ups

Review found that the round-1 lock-on-success rule still approximated "client-visible success" with "the handler returned". Two windows violated the invariant, plus two cross-transport divergences on the modern path:

  1. Era overwrite (confirmed live): the era commits ran unconditionally after their own request, so a pipelined envelope-bearing request completing after an inline initialize overwrote the committed legacy lock — bricking the freshly-handshaked client — and the mirror direction (a legacy commit clobbering a modern lock) was possible too. The era cell is now monotone: one era_settles settlement predicate with site-local atomic commits, first success wins, and a straggler from the other era can never move a committed lock. Its response still stands — it was already earned.
  2. Cancelled-away success: with the dispatcher's interrupt cancel mode, a peer cancel landing during the handler's shielded teardown is delivered at the response-write checkpoint — the client sees "Request cancelled", but the era had already locked. The settlement predicate now skips the lock when the request's cancel event is set, tying "success" to what the client actually observes. (A cancel racing a genuinely blocked response write can still slip through; that residue belongs to the dispatcher's existing respond-after-cancel divergence and dissolves with it.)
  3. Exception-boundary parity: an unmapped handler exception on a modern stream request fell through to the dispatcher's code-0 catch-all — str(e) on the wire — while the identical request over modern HTTP produced the generic -32603 Internal server error. The code-0 pin exists for v1 compat, which the modern era doesn't have. Both modern entries now share one modern_error_data boundary; debug raise_exceptions still re-raises.
  4. Envelope-coercion parity: the HTTP entry degraded mis-shaped clientInfo/clientCapabilities to not-supplied while the stream path rejected them at Connection.from_envelope construction. For spec methods both transports already rejected identically (the kernel's per-version params surface types the reserved _meta keys strictly); custom methods diverged. from_envelope now takes the classifier's raw values and owns the coercion, so every modern entry — HTTP, stream loop, and modern_on_request — builds connection identity identically.
  5. Classifier guard placement: the non-string protocol-version guard moved below the header rung, and (review follow-up) the rung now checks the version header's presence explicitly — a null body version would otherwise slip the None == None equality and make the absent-header outcome depend on the body value. An absent or disagreeing version header is uniformly HEADER_MISMATCH, the HTTP wire is untouched wherever it was well-defined, and the string guard is reachable only on header-less transports, making the round-1 "stdio-only delta" claim above exactly true. (Through the shipped HTTP manager the era routing is header-only, so these cells harden the modern entry's own contract.) The public Server.run docstring also picked up the first-success wording.

Each behavioral fix carries a regression test that failed on the round-1 code (the overwrite and cancelled-away tests reproduce the races deterministically), and the scenarios were re-driven against a live stdio subprocess: the pipelined-race brick, the sanitized internal error, the failed-opener fallback, and the cancel-never-locks path all behave as documented, with auto/legacy negotiation unchanged.

AI Disclaimer

Post-merge review follow-up to #3038, fixing the cases where a failed
request could still lock the connection's era:

- The modern era now locks only when a request SUCCEEDS, mirroring the
  legacy side's lock-on-success rule. Previously the lock committed at
  classification time, so a first request with a well-formed envelope
  but malformed content - or subscriptions/listen, or an unknown
  method - locked the connection modern after failing; the client's
  initialize fallback then got -32022, the one code auto-negotiating
  clients do not fall back from, stranding the connection for its
  lifetime (a whole subprocess on stdio).
- initialize is legacy-distinctive by definition (the method does not
  exist at modern versions), so it always takes the handshake path now,
  even when a confused client stamps the envelope triple on it.
  Previously it routed modern, failed METHOD_NOT_FOUND, and locked the
  era modern - bricking the handshake the same way.
- A bare server/discover on a legacy-locked connection falls through to
  the loop runner's per-version surface validation again, restoring the
  byte-identical METHOD_NOT_FOUND a handshake-only server produced;
  only envelope-bearing frames get the INVALID_REQUEST rejection (a
  conforming legacy client can never send the reserved triple).
- classify_inbound_request rejects a present-but-non-string protocol
  version as a descriptive INVALID_PARAMS instead of raising
  ValidationError out of its own -32022 payload construction
  (`requested` is a str field); the dispatcher's exception ladder
  happened to mask that as a generic -32602.
- _NoServerRequestsDispatchContext masks transport.can_send_request so
  the per-message transport metadata agrees with the wrapper's denial,
  matching the modern HTTP entry's context.
- NotifyOnlyOutbound inherits _NoChannelOutbound's refusal instead of
  duplicating it, and the has_standalone_channel / Raises contracts now
  document that a modern connection refuses requests on a real channel.
- Docs: the migration guide and protocol-versions page note the
  discover probe is transport-independent; the subscriptions page
  documents the streamable-HTTP-only limitation of subscriptions/listen;
  serve_loop's stale shared-recipe claim is corrected.

Also replaces two pyright suppressions in the stdio test doubles with
typeshed-compatible Buffer signatures.
@github-actions

github-actions Bot commented Jun 30, 2026

Copy link
Copy Markdown
Contributor

📚 Documentation preview

Preview https://pr-3040.mcp-python-docs.pages.dev
Deployment https://3900e83a.mcp-python-docs.pages.dev
Commit 482d7cd
Triggered by @maxisbey
Updated 2026-07-01 14:20:47 UTC

@maxisbey maxisbey marked this pull request as ready for review July 1, 2026 09:08

@cubic-dev-ai cubic-dev-ai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1 issue found across 9 files

Reply with feedback, questions, or to request a fix.

Fix all with cubic | Re-trigger cubic

Comment thread src/mcp/server/runner.py Outdated
Comment thread src/mcp/shared/inbound.py Outdated
Comment thread src/mcp/server/runner.py Outdated
The dual-era loop's lock-on-success rule approximated "client-visible
success" with "the handler returned", leaving windows where a request the
client saw fail still moved the era: a pipelined envelope-bearing request
completing after an inline initialize overwrote the committed legacy lock
(stranding the freshly-handshaked client), and a peer cancel landing
during the handler's shielded teardown locked the era while the client
received "Request cancelled". Era commits now run through a single
era_settles predicate: monotone (the first success wins, a straggler from
the other era can never move a committed lock) and skipped when a pending
cancel is about to replace the response.

Two modern-path divergences between the stream loop and the HTTP entry
close by sharing one pipeline: Connection.from_envelope takes the
classifier's raw envelope values and owns the tolerant coercion (custom
methods now degrade mis-shaped clientInfo identically everywhere), and a
shared modern_error_data boundary maps unmapped handler exceptions to a
generic INTERNAL_ERROR instead of the legacy code-0 catch-all putting
str(e) on the wire.

The classifier's non-string protocol-version guard moves below the header
rung, so the HTTP wire is untouched whenever the version header is
present, while the guard still backstops the header-absent null-version
corner that previously escaped as a ValidationError.

@cubic-dev-ai cubic-dev-ai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1 issue found across 6 files (changes from recent commits).

Reply with feedback, questions, or to request a fix.

Fix all with cubic | Re-trigger cubic

Comment thread src/mcp/shared/inbound.py
Comment thread src/mcp/server/runner.py
A null body protocolVersion slipped the header equality check (None ==
None), so the absent-header outcome depended on the body value: an int
gave HEADER_MISMATCH while null fell through to the string guard's
INVALID_PARAMS. The rung now checks presence explicitly, making an
absent version header uniformly HEADER_MISMATCH and the string guard
reachable only on header-less transports. Also updates the public
Server.run docstring to the first-success era-lock wording.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant