test(tutorials): fix flaky streaming test that broke on first terminal event by max-parke-scale · Pull Request #453 · scaleapi/scale-agentex-python

max-parke-scale · 2026-06-29T19:36:49Z

Problem

test_send_event_and_stream_with_reasoning in 010_agent_chat intermittently fails CI on unrelated PRs (e.g. it passes on one run, then fails minutes later on a sibling PR whose diff contains only version strings and changelogs).

Root cause — test-side early break

A single turn emits several task messages, each terminated by exactly one stream event:

| Message | Terminal event |
| --- | --- |
| user echo (`adk.messages.create`) | `full` |
| reasoning (gpt-5, `summary="detailed"`) | `full` (then `close()` is a no-op) |
| agent text reply (streamed) | `done` |

The stream loop broke unconditionally on the first done. When a reasoning message's terminal event reached the consumer before the agent text's done, the loop exited with agent_response_found still False.
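The early break can be illustrated with a minimal, self-contained sketch. This is not the tutorial's test code: the `(event_type, message_kind)` tuples, `fake_stream`, and `buggy_consume` are hypothetical stand-ins, and the event order follows the before-run harness log quoted in the T-Rex section, where the reasoning message's terminal event is seen before the agent's.

```python
import asyncio

# Hypothetical event shape: (event_type, message_kind).
# This is an assumption for illustration, not the real agentex event model.
EVENTS = [
    ("full", "user"),       # user echo
    ("done", "reasoning"),  # reasoning terminal event arrives first
    ("done", "agent"),      # agent text reply, never reached by the buggy loop
]


async def fake_stream(events):
    """Yield events one at a time, standing in for the real stream."""
    for event in events:
        await asyncio.sleep(0)
        yield event


async def buggy_consume(events):
    """Stop at the first `done`, regardless of which message it belongs to."""
    seen = []
    async for event_type, kind in fake_stream(events):
        seen.append(kind)
        if event_type == "done":
            break  # unconditional break: the agent reply may not have arrived
    return seen


seen = asyncio.run(buggy_consume(EVENTS))
agent_response_found = "agent" in seen
print(seen, agent_response_found)  # ['user', 'reasoning'] False
```

With this ordering the loop returns before the agent reply is observed, so a later assertion on `agent_response_found` fails even though nothing timed out.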

The failure signature confirms it is an early break, not latency: the run fails with AssertionError: Agent response not found in stream at tests/test_agent.py:287, not a TimeoutError at the await stream_task line. A latency/timeout problem would surface as the latter. (It also fails at the agent assertion, not the user one, confirming the user echo reliably arrives.)
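The distinction between the two failure signatures can be sketched as below. It assumes the test bounds the stream task with something like `asyncio.wait_for`, which the description does not state; `never_finishes`, `returns_early`, and `classify` are hypothetical names used only for this illustration.

```python
import asyncio


async def never_finishes():
    """Stand-in for a stream that stays open because the agent is slow."""
    await asyncio.sleep(3600)
    return True


async def returns_early():
    """Stand-in for a stream loop that exits before seeing the agent reply."""
    return False  # plays the role of agent_response_found


async def classify(coro, timeout):
    """Report which failure signature a given stream task produces."""
    try:
        found = await asyncio.wait_for(coro, timeout=timeout)
        assert found, "Agent response not found in stream"
    except asyncio.TimeoutError:
        return "TimeoutError"
    except AssertionError:
        return "AssertionError"
    return "ok"


latency_case = asyncio.run(classify(never_finishes(), timeout=0.01))
early_break_case = asyncio.run(classify(returns_early(), timeout=0.01))
print(latency_case, early_break_case)  # TimeoutError AssertionError
```

A loop that returns early never reaches the timeout, so the only place it can fail is the assertion that follows.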

Fix

Consume terminal events (full and done) until both the user echo and the agent's text reply are observed, keying off the retrieved message's content rather than stopping at the first terminal signal. The 90s timeout remains the backstop if the agent genuinely never responds.

As a bonus, the unified handler now catches the agent text whether it arrives as a full or a done, so it's robust to emission changes.
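A minimal sketch of the fixed loop shape, under stated assumptions: the `MESSAGES` store, the `(event_type, message_id)` tuples, and the `author`/`type` fields are hypothetical, and the real test retrieves messages through the agentex client rather than a dictionary. The timeout here is shortened from the 90s used in the test.

```python
import asyncio

# Hypothetical message store keyed by message id (illustration only).
MESSAGES = {
    "m1": {"author": "user", "type": "text", "content": "hello"},
    "m2": {"author": "agent", "type": "reasoning", "content": "thinking"},
    "m3": {"author": "agent", "type": "text", "content": "hi there"},
}


async def fake_stream(events):
    """Yield (event_type, message_id) pairs, standing in for the real stream."""
    for event in events:
        await asyncio.sleep(0)
        yield event


async def consume(events):
    """Handle `full` and `done` alike; stop only once both replies are seen."""
    user_found = agent_found = False
    async for event_type, message_id in fake_stream(events):
        if event_type not in ("full", "done"):
            continue  # ignore non-terminal events such as deltas
        message = MESSAGES[message_id]  # decide from the retrieved content
        if message["type"] == "text":
            if message["author"] == "user":
                user_found = True
            elif message["author"] == "agent":
                agent_found = True
        if user_found and agent_found:
            break
    return user_found, agent_found


async def run(events, timeout=1.0):
    # The timeout stays as the backstop if the agent never responds.
    return await asyncio.wait_for(consume(events), timeout=timeout)


# Agent text ends with `done`, reasoning terminal event arrives first.
done_variant = asyncio.run(
    run([("full", "m1"), ("delta", "m3"), ("done", "m2"), ("done", "m3")])
)
# Agent text arrives as `full` instead.
full_variant = asyncio.run(run([("full", "m1"), ("full", "m2"), ("full", "m3")]))
print(done_variant, full_variant)  # (True, True) (True, True)
```

Because the stopping condition depends on what was retrieved rather than on the event type, the reasoning message's terminal event no longer ends the loop.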

Why test-side, not `streaming.py`

The producer correctly emits one terminal event per message, and real consumers key by message id (they don't break on the first done). That code path was just repaired in #449 for a duplicate-publish symptom; making reasoning also emit a done risks reintroducing it. The correct and lower-risk layer is the test.

Testing

Full repro requires the scale-agentex server image + Redis + Temporal + real gpt-5 calls, which aren't available in this environment (no OpenAI key). Verified locally: file compiles and ruff check passes. The fix is validated by the failure-signature analysis above rather than a mock-based regression test, which for a tutorial integration test would add more coupling than coverage.

🤖 — posted via Claude Code

Greptile Summary

This PR updates the streaming tutorial test to avoid stopping on the first terminal event. The main changes are:

- Handles both full and done terminal events through one path.
- Retrieves completed messages before deciding which expected response was seen.
- Keeps consuming the stream until both the user echo and agent text reply are found.

Confidence Score: 5/5

Safe to merge; the change is isolated to a tutorial integration test and aligns the stream consumer with the documented multi-message terminal-event behavior.

The update narrows the stopping condition without touching production streaming code, preserving the timeout backstop while making the assertion depend on observed message content.

T-Rex Logs

What T-Rex did

- The async harness trex-artifacts/stream-loop-early-terminal-harness.py was executed to run both the before and after scenarios.
- The before-run log trex-artifacts/stream-loop-early-terminal-01-before.log captured the base loop, showing it consumed EVENT full user and EVENT done reasoning and retrieved only ['user', 'reasoning'], with agent_response_found=False.
- The after-run log trex-artifacts/stream-loop-early-terminal-02-after.log captured the head loop, showing it consumed the subsequent agent terminal event, retrieved ['user', 'reasoning', 'agent'], and ended with user_message_found=True agent_response_found=True reasoning_found=True for agent done and full variants.

Ran code and verified through T-Rex

Reviews (2): Last reviewed commit: "test(tutorials): fix flaky streaming tes..."

…l event

test_send_event_and_stream_with_reasoning broke out of the stream loop on the first `done` event. A single turn emits several messages (user echo, reasoning, agent text), each ending in a `full` or `done`. When a reasoning message's terminal event arrived before the agent's text `done`, the loop exited with agent_response_found still False, failing at the assertion ("Agent response not found in stream") rather than timing out. The failure signature confirms this: an AssertionError (loop broke early), not a TimeoutError (latency).

Consume terminal events until both the user echo and the agent's text reply are seen, keying off message content rather than the first terminal signal.

Test-side only: the producer (streaming.py) correctly emits a terminal event per message, and real consumers key by message id; it was just repaired in #449, so the fix stays in the test.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

greptile-apps Bot reviewed Jun 29, 2026


Comment thread examples/tutorials/run_agent_test.sh

max-parke-scale force-pushed the mparke/priceless-mahavira-abb0ee branch from dba8df3 to 00a4351 Compare June 29, 2026 19:42

max-parke-scale wants to merge 1 commit into next from mparke/priceless-mahavira-abb0ee
