test(tutorials): fix flaky streaming test that broke on first terminal event #453

Open: max-parke-scale wants to merge 1 commit into next from mparke/priceless-mahavira-abb0ee

max-parke-scale commented Jun 29, 2026

Problem

test_send_event_and_stream_with_reasoning in 010_agent_chat intermittently fails CI on unrelated PRs (for example, it passes on one run, then fails minutes later on a sibling PR whose diff touches only version strings and changelogs).

Root cause — test-side early break

A single turn emits several task messages, each terminated by exactly one stream event:

| Message | Terminal event |
| --- | --- |
| user echo (adk.messages.create) | full |
| reasoning (gpt-5, summary="detailed") | full (then close() is a no-op) |
| agent text reply (streamed) | done |

The stream loop broke unconditionally on the first done. When a reasoning message's terminal event reached the consumer before the agent text's done, the loop exited with agent_response_found still False.

The failure signature confirms it is an early break, not latency: the run fails with AssertionError: Agent response not found in stream at tests/test_agent.py:287, not a TimeoutError at the await stream_task line. A latency/timeout problem would surface as the latter. (It also fails at the agent assertion, not the user one, confirming the user echo reliably arrives.)
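
The distinction between the two failure signatures can be reproduced with a small simulation. This is only a sketch: `fake_stream`, `buggy_consumer`, `never_responds`, and `failure_signature` are hypothetical names, not code from the test file, and the event order assumes the reasoning message's terminal event is a done that arrives before the agent's, as in the T-Rex before-run log.

```python
import asyncio

# Hypothetical (event_type, message_kind) pairs standing in for the real stream.
EVENTS = [("full", "user"), ("done", "reasoning"), ("done", "agent")]


async def fake_stream(events):
    for event in events:
        await asyncio.sleep(0)  # yield control, as a real stream would
        yield event


async def buggy_consumer(events):
    """Mimics the old loop: stop on the first `done`, whatever message it ends."""
    seen = []
    async for event_type, kind in fake_stream(events):
        seen.append(kind)
        if event_type == "done":
            break
    return seen


async def never_responds():
    """A stalled stream: what a genuine latency problem would look like."""
    await asyncio.sleep(3600)
    return []


def failure_signature(coro_factory, timeout=0.05):
    """Return the name of the exception a test built this way would fail with."""

    async def run():
        seen = await asyncio.wait_for(coro_factory(), timeout=timeout)
        if "agent" not in seen:
            raise AssertionError("Agent response not found in stream")

    try:
        asyncio.run(run())
    except (AssertionError, asyncio.TimeoutError) as exc:
        return type(exc).__name__
    return "ok"


early_break = failure_signature(lambda: buggy_consumer(EVENTS))
stalled = failure_signature(never_responds)
print(early_break, stalled)
```

In this simulation the early-break consumer returns promptly with only the user and reasoning messages, so the failure is an assertion error, whereas the stalled stream trips the timeout instead.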

Fix

Consume terminal events (full and done) until both the user echo and the agent's text reply are observed, keying off the retrieved message's content rather than stopping at the first terminal signal. The 90s timeout remains the backstop if the agent genuinely never responds.

As a bonus, the unified handler now catches the agent text whether it arrives as a full or a done, so it's robust to emission changes.
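
A minimal sketch of the consumption pattern described above, under stated assumptions: `MESSAGES` is a hypothetical in-memory stand-in for retrieving a message by id, and the `author` and `type` fields are illustrative rather than the real message schema. It is not the code in the test file.

```python
import asyncio

TERMINAL_EVENTS = {"full", "done"}

# Hypothetical message store keyed by id; stands in for a retrieval call.
MESSAGES = {
    "m1": {"author": "user", "type": "text"},
    "m2": {"author": "agent", "type": "reasoning"},
    "m3": {"author": "agent", "type": "text"},
}


async def fake_stream(events):
    for event in events:
        await asyncio.sleep(0)  # yield control, as a real stream would
        yield event


async def consume_until_both_seen(events, messages):
    """Keep reading until the user echo and the agent text reply are both seen."""
    user_found = agent_found = False
    async for event_type, message_id in fake_stream(events):
        if event_type not in TERMINAL_EVENTS:
            continue  # non-terminal events (e.g. deltas) never end the loop
        message = messages[message_id]
        if message["type"] != "text":
            continue  # a reasoning message finishing is not the agent's reply
        if message["author"] == "user":
            user_found = True
        elif message["author"] == "agent":
            agent_found = True
        if user_found and agent_found:
            break  # stop on content observed, not on the first terminal event
    return user_found, agent_found


# Agent text terminated by `done`, then the same turn terminated by `full`.
agent_done = [("full", "m1"), ("done", "m2"), ("delta", "m3"), ("done", "m3")]
agent_full = [("full", "m1"), ("done", "m2"), ("full", "m3")]

result_done = asyncio.run(consume_until_both_seen(agent_done, MESSAGES))
result_full = asyncio.run(consume_until_both_seen(agent_full, MESSAGES))
print(result_done, result_full)
```

Because the exit condition depends on which messages were observed, the same loop handles the agent text arriving as either terminal event type; an outer timeout would still be needed for the case where the agent never replies.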

Why test-side, not streaming.py

The producer correctly emits one terminal event per message, and real consumers key by message id (they don't break on the first done). That code path was just repaired in #449 for a duplicate-publish symptom; making reasoning also emit a done risks reintroducing it. The correct and lower-risk layer is the test.
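
To illustrate what keying by message id means for a consumer, here is a hedged sketch. The `(event_type, message_id, payload)` tuple shape, the `delta` event name, and `reduce_stream` are assumptions made for the example and are not taken from streaming.py or any real consumer.

```python
from collections import defaultdict


def reduce_stream(events):
    """Fold (event_type, message_id, payload) tuples into finished messages.

    Each message is tracked under its own id, so one message's terminal
    event never ends processing for the others.
    """
    buffers = defaultdict(list)  # partial text per in-flight message
    finished = {}
    for event_type, message_id, payload in events:
        if event_type == "delta":
            buffers[message_id].append(payload)
        elif event_type == "done":
            # `done` closes a streamed message: join what was buffered.
            finished[message_id] = "".join(buffers.pop(message_id, []))
        elif event_type == "full":
            # `full` carries the whole message in a single event.
            buffers.pop(message_id, None)
            finished[message_id] = payload
    return finished


events = [
    ("full", "m1", "hi"),
    ("delta", "m3", "Hel"),
    ("full", "m2", "thinking"),  # terminal event for m2 arrives mid-stream of m3
    ("delta", "m3", "lo"),
    ("done", "m3", None),
]
result = reduce_stream(events)
print(result)
```

Under these assumptions, the terminal event for `m2` lands between two deltas of `m3` without affecting it, which is the property the test-side loop lacked.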

Testing

Full repro requires the scale-agentex server image + Redis + Temporal + real gpt-5 calls, which isn't reproducible in this environment (no OpenAI key). Verified locally: file compiles and ruff check passes. The fix is validated by the failure-signature analysis above rather than a mock-based regression test, which for a tutorial integration test would add more coupling than coverage.

🤖 — posted via Claude Code

Greptile Summary

This PR updates the streaming tutorial test to avoid stopping on the first terminal event. The main changes are:

  • Handles both full and done terminal events through one path.
  • Retrieves completed messages before deciding which expected response was seen.
  • Keeps consuming the stream until both the user echo and agent text reply are found.

Confidence Score: 5/5

Safe to merge; the change is isolated to a tutorial integration test and aligns the stream consumer with the documented multi-message terminal-event behavior.

The update narrows the stopping condition without touching production streaming code, preserving the timeout backstop while making the assertion depend on observed message content.

T-Rex Logs

What T-Rex did

  • The async harness trex-artifacts/stream-loop-early-terminal-harness.py was executed to run both the before and after scenarios.
  • The before-run log trex-artifacts/stream-loop-early-terminal-01-before.log captured the base loop, showing it consumed EVENT full user and EVENT done reasoning and retrieved only ['user', 'reasoning'], with agent_response_found=False.
  • The after-run log trex-artifacts/stream-loop-early-terminal-02-after.log captured the head loop, showing it consumed the subsequent agent terminal event, retrieved ['user', 'reasoning', 'agent'], and ended with user_message_found=True agent_response_found=True reasoning_found=True for agent done and full variants.

Ran code and verified through T-Rex

Reviews (2): Last reviewed commit: "test(tutorials): fix flaky streaming tes..."

Comment thread examples/tutorials/run_agent_test.sh
…l event

test_send_event_and_stream_with_reasoning broke out of the stream loop on
the first `done` event. A single turn emits several messages — user echo,
reasoning, agent text — each ending in a `full` or `done`. When a reasoning
message's terminal event arrived before the agent's text `done`, the loop
exited with agent_response_found still False, failing at the assertion
("Agent response not found in stream") rather than timing out.

The failure signature confirms this: an AssertionError (loop broke early),
not a TimeoutError (latency). Consume terminal events until both the user
echo and the agent's text reply are seen, keying off message content rather
than the first terminal signal.

Test-side only: the producer (streaming.py) correctly emits one terminal
event per message, and real consumers key by message id; it was just
repaired in #449, so the fix stays in the test.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
max-parke-scale force-pushed the mparke/priceless-mahavira-abb0ee branch from dba8df3 to 00a4351 on June 29, 2026 19:42