
feat(webapp,run-engine): queue metrics and health dashboard#4131

Open
ericallam wants to merge 32 commits into main from feat/queue-metrics-and-health

Conversation

@ericallam

Member

Summary

Adds per-queue observability to the Queues page: depth (backlog), throughput (enqueued, started, completed), concurrency, whether a queue is throttled, and the scheduling delay (how long runs wait between becoming eligible and actually starting). Each queue shows health at a glance in the list, plus a per-queue detail page with charts, so you can answer "does this queue have enough concurrency to keep up?".

Both the data collection and the dashboard are off by default and gated independently: metric emission is a global switch, and the dashboard is turned on per organization. With both off, the Queues page is unchanged.

Design

Queue operations emit two kinds of signal. Gauges (depth, running, limit, throttled) are read inside the same Redis script that performs the enqueue or dequeue, so each reading is atomic, and they are returned in the script's reply for the app to forward. Counters (enqueued, started, completed) are cumulative odometers, so a dropped reading self-heals: the next one restates the running total. Both land on one Redis stream on a dedicated metrics instance (falling back to the run queue's Redis when self-hosting), drain through a consumer into ClickHouse (raw, a 10-second-bucket materialized view, and a 30-day aggregate), and the dashboards read the aggregate. The run queue's own Redis carries no metrics stream.
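To illustrate why a cumulative counter tolerates a dropped reading, here is a minimal sketch. The `OdometerReading` shape and the `eventsInWindow` helper are hypothetical names for illustration, not code from this PR; the real aggregation happens in ClickHouse.

```typescript
// Hypothetical shape of one cumulative counter reading (field names are illustrative).
interface OdometerReading {
  timestampMs: number;
  total: number; // running total since the counter started
}

// Rebuild the number of events in a window from whichever readings arrived.
function eventsInWindow(readings: OdometerReading[]): number {
  const sorted = [...readings].sort((a, b) => a.timestampMs - b.timestampMs);
  let sum = 0;
  let previous: OdometerReading | undefined;
  for (const reading of sorted) {
    if (previous !== undefined) {
      const delta = reading.total - previous.total;
      // A negative delta would indicate a counter reset; this sketch skips it.
      if (delta > 0) sum += delta;
    }
    previous = reading;
  }
  return sum;
}

const complete: OdometerReading[] = [
  { timestampMs: 1000, total: 10 },
  { timestampMs: 2000, total: 14 },
  { timestampMs: 3000, total: 19 },
  { timestampMs: 4000, total: 25 },
];

// Simulate losing the reading taken at t=2000.
const withDroppedReading = complete.filter((r) => r.timestampMs !== 2000);

const fromComplete = eventsInWindow(complete); // 15
const fromLossy = eventsInWindow(withDroppedReading); // 15, the gap is bridged by the next total
```

Losing a reading only widens the interval between two neighbouring totals. Had each reading carried an increment rather than a total, the dropped increment would be gone for good.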

The one change that is live the moment this deploys, independent of both flags, is the enqueue/dequeue script reply shape: those scripts now return a 2-tuple so the gauge reading can ride back to the app. That path is exercised on every queue op, so it is the part of run-engine worth the closest review.
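For reviewers thinking about that reply path, the following is a rough sketch of how a caller might unpack a 2-tuple reply. Everything here is assumed for illustration: the `splitReply` and `parseGauge` names, and the idea that the second element is a flat list of alternating gauge names and string values. The actual reply layout in run-engine may differ.

```typescript
// Parse one gauge value. Number("") is 0 in JavaScript, so an empty
// string is mapped to NaN explicitly to keep "missing" distinct from zero.
function parseGauge(value: string): number {
  return value.trim() === "" ? Number.NaN : Number(value);
}

// Assumed reply layout: [result, [name1, value1, name2, value2, ...]].
function splitReply(raw: unknown): { result: unknown; gauges: Map<string, number> } {
  const gauges = new Map<string, number>();
  if (!Array.isArray(raw) || raw.length !== 2 || !Array.isArray(raw[1])) {
    // Unexpected shape: in this sketch, pass the reply through with no gauges.
    return { result: raw, gauges };
  }
  const fields: unknown[] = raw[1];
  for (let i = 0; i + 1 < fields.length; i += 2) {
    gauges.set(String(fields[i]), parseGauge(String(fields[i + 1])));
  }
  return { result: raw[0], gauges };
}

const parsed = splitReply(["msg-1", ["depth", "12", "running", "", "limit", "5"]]);
// parsed.result is "msg-1"; depth is 12, running is NaN, limit is 5.
```

Keeping the gauge parsing separate from the operation result means a malformed gauge cannot change what the enqueue or dequeue itself returns.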

@changeset-bot

changeset-bot Bot commented Jul 3, 2026


⚠️ No Changeset found

Latest commit: 3c67a0c

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

This PR includes no changesets

When changesets are added to this PR, you'll see the packages that this PR includes changesets for and the associated semver types


@coderabbitai

coderabbitai Bot commented Jul 3, 2026

Contributor

Review Change Stack

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.


Walkthrough

This PR adds queue-metrics ingestion, storage, query, and UI support. It introduces a Redis/ClickHouse metrics pipeline package, ClickHouse queue-metrics tables and query helpers, run-queue emission hooks, gap-filling support in TSQL, and new webapp admin, dashboard, list, and detail routes. It also adds environment and feature-flag gating, seed tooling, and tests across the pipeline and query layers.

Related PRs: None found.

Suggested labels: enhancement, area: webapp, area: run-engine, area: internal-packages

Suggested reviewers: ericallam, matt-aitken

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Description check | ⚠️ Warning | The description omits required template sections, including issue link, checklist, testing steps, changelog, and screenshots. | Add the issue reference and fill in the required Checklist, Testing, Changelog, and Screenshots sections per the template. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title matches the main change: queue metrics observability/dashboard work in webapp and run-engine. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

github-advanced-security[bot]

This comment was marked as resolved.

github-advanced-security[bot]

This comment was marked as resolved.

coderabbitai[bot]

This comment was marked as resolved.

coderabbitai[bot]

This comment was marked as resolved.

@ericallam ericallam marked this pull request as ready for review July 3, 2026 10:26

@devin-ai-integration devin-ai-integration Bot left a comment

Contributor

Devin Review found 1 potential issue.


Comment thread internal-packages/clickhouse/schema/035_create_queue_metrics_v1.sql
coderabbitai[bot]

This comment was marked as resolved.

@ericallam ericallam force-pushed the feat/queue-metrics-and-health branch from a892684 to 9412bf5 July 4, 2026 08:30
@pkg-pr-new

pkg-pr-new Bot commented Jul 4, 2026


@trigger.dev/build

npm i https://pkg.pr.new/@trigger.dev/build@3c67a0c

trigger.dev

npm i https://pkg.pr.new/trigger.dev@3c67a0c

@trigger.dev/core

npm i https://pkg.pr.new/@trigger.dev/core@3c67a0c

@trigger.dev/python

npm i https://pkg.pr.new/@trigger.dev/python@3c67a0c

@trigger.dev/react-hooks

npm i https://pkg.pr.new/@trigger.dev/react-hooks@3c67a0c

@trigger.dev/redis-worker

npm i https://pkg.pr.new/@trigger.dev/redis-worker@3c67a0c

@trigger.dev/rsc

npm i https://pkg.pr.new/@trigger.dev/rsc@3c67a0c

@trigger.dev/schema-to-json

npm i https://pkg.pr.new/@trigger.dev/schema-to-json@3c67a0c

@trigger.dev/sdk

npm i https://pkg.pr.new/@trigger.dev/sdk@3c67a0c

commit: 3c67a0c

@devin-ai-integration devin-ai-integration Bot left a comment

Contributor

Devin Review found 1 new potential issue.


Comment on lines +87 to 88
queueMetrics: env.QUEUE_METRICS_EMIT_ENABLED === "1" ? getQueueMetricsEmitter() : undefined,
processWorkerQueueDebounceMs: env.RUN_ENGINE_PROCESS_WORKER_QUEUE_DEBOUNCE_MS,

Contributor

🚩 Metrics emitter Redis failure could surface during RunQueue construction

The getQueueMetricsEmitter() at queueMetrics.server.ts:136 creates a Redis client eagerly (via createRedisClient in the MetricsStreamEmitter constructor). This emitter is injected into the RunQueue at runEngine.server.ts:87 during the engine singleton construction. If the metrics Redis (which may be a separate instance per QUEUE_METRICS_REDIS_HOST) is unreachable at boot, the createRedisClient call will emit errors via the onError handler but won't throw synchronously (ioredis reconnects). The emitter's emit() calls are fire-and-forget with .catch(), so a persistently-down metrics Redis won't block the run queue. This is the intended design but worth verifying in a failure scenario.
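The fire-and-forget pattern described in this comment can be sketched as follows. The `MetricsEmitter` interface and `emitFireAndForget` helper are hypothetical stand-ins, not the PR's `MetricsStreamEmitter`; the point is only that the caller never awaits the emit and every failure path ends in an error handler.

```typescript
// Hypothetical emitter interface; the real emitter in the PR may look different.
interface MetricsEmitter {
  emit(event: Record<string, string>): Promise<void>;
}

// Start the emit without awaiting it, and route both synchronous throws
// and asynchronous rejections to onError so the caller is never blocked.
function emitFireAndForget(
  emitter: MetricsEmitter | undefined,
  event: Record<string, string>,
  onError: (error: unknown) => void
): void {
  if (emitter === undefined) return; // emission disabled
  try {
    emitter.emit(event).catch(onError);
  } catch (error) {
    onError(error);
  }
}

// An emitter whose backing store is down: every emit rejects.
const downEmitter: MetricsEmitter = {
  emit: () => Promise.reject(new Error("metrics Redis unreachable")),
};

const emitErrors: unknown[] = [];
// Returns immediately; the rejection is delivered to the handler later.
emitFireAndForget(downEmitter, { queue: "emails", depth: "3" }, (e) => emitErrors.push(e));
```

The `try`/`catch` covers an emitter that throws before returning a promise, which a bare `.catch()` on the returned promise would not intercept.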


@devin-ai-integration devin-ai-integration Bot left a comment

Contributor

Devin Review found 1 new potential issue.


Comment on lines +277 to +278
tailQueues = await this._replica.taskQueue.findMany({
where: { ...where, name: { notIn: excludedNames } },

@devin-ai-integration devin-ai-integration Bot Jul 4, 2026

Contributor

🔴 Queue search filter is silently dropped for non-ranked queues when sorting by activity

The search filter on queue names is overwritten by the exclusion list ({ ...where, name: { notIn: excludedNames } } at QueueListPresenter.server.ts:285), so the tail portion of a sorted page shows queues that don't match the user's search.

Impact: Users who search for a queue name while sorting by "Busiest" or "Backlog" see unrelated queues at the bottom of the page.

Prisma where-clause overwrite mechanism

When a user types a search query (e.g. "email"), buildQueueListWhere at apps/webapp/app/presenters/v3/QueueListPresenter.server.ts:73-91 sets name: { contains: "email", mode: "insensitive" } in the where object. The getRankedQueues method passes this where to the tail query at lines 284-292:

tailQueues = await this._replica.taskQueue.findMany({
  where: { ...where, name: { notIn: excludedNames } },
  ...
});

The object spread replaces the name: { contains: ... } filter with name: { notIn: ... }, so the Prisma query no longer filters by the user's search text. The same overwrite pattern exists in findQueuesByNames at line 317 ({ ...where, name: { in: names } }), though there the ClickHouse ranking already applies nameContains so the practical impact is smaller.

The tail query is reached when the current page extends past the ranked (ClickHouse-known) queues into the alphabetical tail of queues with no recent metrics. Those tail queues are fetched from Postgres without the search filter, so they can be any queue in the environment.
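The overwrite can be reproduced with plain objects, without Prisma. The types below are simplified stand-ins for Prisma's filter types, and `runtimeEnvironmentId` is a hypothetical field used only to show that unrelated filters survive. Combining the two conditions under `AND` (which Prisma `where` inputs support) is one possible way to keep both; it is not necessarily the fix this PR adopts.

```typescript
// Simplified stand-ins for Prisma's string filter and where input.
type NameFilter = {
  contains?: string;
  mode?: "insensitive";
  notIn?: string[];
};
type QueueWhere = {
  runtimeEnvironmentId?: string; // hypothetical field for illustration
  name?: NameFilter;
  AND?: QueueWhere[];
};

const searchWhere: QueueWhere = {
  runtimeEnvironmentId: "env_123",
  name: { contains: "email", mode: "insensitive" },
};
const excludedNames = ["email-digest"];

// Object spread copies keys shallowly, so the later `name` key replaces
// the earlier one and the search filter is lost.
const overwritten: QueueWhere = { ...searchWhere, name: { notIn: excludedNames } };

// Keeping both conditions: each `name` filter lives in its own AND branch.
const combined: QueueWhere = {
  AND: [searchWhere, { name: { notIn: excludedNames } }],
};
```

An alternative with the same effect is to merge into a single filter, `name: { ...searchWhere.name, notIn: excludedNames }`, since a string filter can carry `contains` and `notIn` together.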


ericallam added 17 commits July 4, 2026 23:14
…signals

Gauges are read inside the enqueue/dequeue Lua and returned on the script reply
as a 2-tuple; counters are cumulative odometers. The run-queue Redis carries no
metrics stream of its own.
…counters

entryOrderKey returns a string built with BigInt math so ordering stays correct at real epoch magnitudes. Odometer keys are namespaced by definition name. The consumer reports null lag for a missing consumer group instead of 0, and empty gauge values parse as NaN rather than 0.
…ng order keys

The wait-time quantile materialized view now excludes wait_ms = 0 rows so it matches the count aggregation. order_key accepts a string or a number. Migration comments no longer contain semicolons that split the migration into invalid statements.
…rride

The queues list tolerates a metrics query failure by rendering without metrics and logging a warning. UsageSparkline renders its total override even when every bucket is zero. The queue detail page returns 404 and its loader skips the metrics query when the feature flag is off. The seed script validates bucket size and only writes ClickHouse against a local host.
ericallam added 15 commits July 4, 2026 23:14
A bucket-led ORDER BY DESC combined with fillGaps emitted an ascending WITH FILL (positive step, ascending bounds), which produces invalid or empty fills. Skip the gap-fill rewrite for descending orders and let the plain descending query stand. Adds a DESC fillGaps test.
Packs the stream sequence with a 1e6 factor (was 1e5) so up to 1M entries per millisecond per shard fit before a seq could spill into the next millisecond's range, far above what a single Redis stream can produce. ms*1e6 stays within UInt64. Also fixes the webapp mapping test that still expected a numeric order_key after the switch to a BigInt-derived string.
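The packing described in this commit can be sketched as below. This assumes the key is derived from a Redis stream entry ID of the form `<ms>-<seq>`; the `packOrderKey` name is hypothetical and is not the PR's `entryOrderKey` implementation.

```typescript
// Factor from the commit message: room for 1,000,000 sequence numbers per millisecond.
const SEQ_FACTOR = BigInt(1000000);
const UINT64_MAX = BigInt("18446744073709551615");

// Pack "<ms>-<seq>" into ms * 1e6 + seq and return it as a decimal string.
function packOrderKey(streamId: string): string {
  const parts = streamId.split("-");
  const ms = BigInt(parts[0] ?? "0");
  const seq = BigInt(parts[1] ?? "0");
  if (seq >= SEQ_FACTOR) {
    // A larger seq would spill into the next millisecond's range.
    throw new RangeError("sequence does not fit below the packing factor");
  }
  return (ms * SEQ_FACTOR + seq).toString();
}

const packed = packOrderKey("1751668200000-7"); // "1751668200000000007"

// The same arithmetic with plain numbers exceeds Number.MAX_SAFE_INTEGER
// (about 9.007e15), so the low digits carrying the sequence are lost.
const lossy = String(1751668200000 * 1e6 + 7);

// The packed value still fits in an unsigned 64-bit integer.
const fitsUInt64 = BigInt(packed) <= UINT64_MAX;
```

Returning a string keeps the value exact across JSON and driver boundaries that would otherwise coerce it to a floating-point number.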
The queues list and queue detail pages now use the shared TimeFilter (any preset period or a custom date range) and everything on the page follows it: header tiles, per queue metric columns, charts, and stats. The custom period buttons, hand rolled chart cards, and duplicated metric fetch loops are replaced by the ChartCard and Chart primitives, UsageSparkline, and a shared useMetricResourceQuery hook. The ClickHouse list queries take an explicit end bound so fixed ranges query only their window.
Queries using deltaSumTimestampMerge failed with an unknown function error, which broke the queue detail stats and the started counts on the built in Queues dashboard.
The queues list header tiles now render the same line chart, grid, and tooltip as the rest of the metrics charts instead of a row sparkline, with the headline value in the tile header. The env saturation tile draws the environment concurrency limit and burst limit as labeled reference lines. Chart tooltips keep a gap between the series label and the value, and the shared line chart gains showDots and referenceLines options.
Adds an Allocation tab to the Queues page (behind the queue metrics UI flag): overview cards, a burst-aware capacity bar showing each queue allocation and its live usage in a distinct color, an inline-editable limits table with per-queue locks, load-weighted auto-balance, and a review dialog that bulk-applies limits as overrides through the existing concurrency system.

The queue list now defaults to Busiest ordering (with Backlog and Name options). ClickHouse ranks queues by activity over the last 15 minutes and returns just the requested page of names, so the cost per page is one small aggregate regardless of environment size; idle queues follow in name order and any failure falls back to name ordering. The classic page keeps plain name order.
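The resulting order (ranked queues first, idle queues after them in name order, plain name order on failure) can be illustrated with an in-memory sketch. The `pageOfQueueNames` helper is hypothetical; in the PR the ranking and paging are done by ClickHouse and Postgres queries rather than by sorting arrays in the app.

```typescript
// rankedNames: busiest-first names from the metrics store, or null if that query failed.
// allNames: every queue name in the environment.
function pageOfQueueNames(
  rankedNames: string[] | null,
  allNames: string[],
  page: number, // 1-based
  pageSize: number
): string[] {
  const byName = [...allNames].sort((a, b) => (a < b ? -1 : a > b ? 1 : 0));
  let ordered: string[];
  if (rankedNames === null) {
    // Fallback: ranking unavailable, use plain name order.
    ordered = byName;
  } else {
    const ranked = new Set(rankedNames);
    // Idle queues (no recent metrics) follow the ranked head in name order.
    ordered = [...rankedNames, ...byName.filter((name) => !ranked.has(name))];
  }
  const start = (page - 1) * pageSize;
  return ordered.slice(start, start + pageSize);
}

const allQueueNames = ["alpha", "beta", "emails", "reports", "zeta"];
const busiestFirst = ["reports", "emails"];

const firstPage = pageOfQueueNames(busiestFirst, allQueueNames, 1, 3); // ["reports", "emails", "alpha"]
const secondPage = pageOfQueueNames(busiestFirst, allQueueNames, 2, 3); // ["beta", "zeta"]
const fallbackPage = pageOfQueueNames(null, allQueueNames, 1, 3); // ["alpha", "beta", "emails"]
```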
The fallback WHERE injection only targeted the top-level SELECT, so a
query shaped as an outer aggregation over a FROM subquery failed to
compile: the time column only exists inside the subquery. Descend into
the subquery so the fallback lands next to the table reference.
Adds two rollups fed from the raw landing table: a per-queue 5-minute
tier and an environment-level 1-minute tier (gauges plus TDigest wait
quantiles). Ranking now reads the 5m tier and returns the page and the
ranked total in one windowed query instead of two scans.

The 5m materialized view reads raw rather than cascading off the 10s
table: deltaSumTimestamp states hold a single first/last segment, so
merging states in an MV's hash-ordered GROUP BY double-counts bridging
spans. For the same reason the env tier carries no counter columns, and
env-wide counter totals must group by queue before summing.
The built-in queues dashboard's enqueued vs started chart merged counter
states across queues, which mixes unrelated cumulative counters and
returns wrong totals; it now merges per queue and sums outside. Env
header tiles and saturation charts read the environment rollup, so their
cost no longer scales with queue count, and coarse-bucket ranges are
served from the 5m rollup automatically. Queue list ranking runs as one
query, time bounds are aligned to the bucket grid, and repeated
auto-refresh reads share ClickHouse query-cache entries.
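A small numeric illustration of why cumulative counters must be grouped by queue before summing. The helpers below are hypothetical and only mimic the idea of summing positive deltas in timestamp order; they are not the ClickHouse aggregate functions themselves.

```typescript
interface CounterReading {
  queue: string;
  timestampMs: number;
  total: number; // cumulative count for this queue
}

// Sum of positive deltas between consecutive readings, in timestamp order.
function increase(readings: CounterReading[]): number {
  const sorted = [...readings].sort((a, b) => a.timestampMs - b.timestampMs);
  let sum = 0;
  let previous: CounterReading | undefined;
  for (const reading of sorted) {
    if (previous !== undefined && reading.total > previous.total) {
      sum += reading.total - previous.total;
    }
    previous = reading;
  }
  return sum;
}

// Group by queue first, then add the per-queue increases together.
function increasePerQueueThenSum(readings: CounterReading[]): number {
  const byQueue = new Map<string, CounterReading[]>();
  for (const reading of readings) {
    const group = byQueue.get(reading.queue) ?? [];
    group.push(reading);
    byQueue.set(reading.queue, group);
  }
  let total = 0;
  byQueue.forEach((group) => {
    total += increase(group);
  });
  return total;
}

const interleaved: CounterReading[] = [
  { queue: "A", timestampMs: 1000, total: 100 },
  { queue: "B", timestampMs: 1500, total: 5 },
  { queue: "A", timestampMs: 2000, total: 110 },
  { queue: "B", timestampMs: 2500, total: 8 },
];

const mixedAcrossQueues = increase(interleaved); // 105: treats the jump from B's 5 to A's 110 as real work
const groupedByQueue = increasePerQueueThenSum(interleaved); // 13: A grew by 10, B grew by 3
```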
… rollup

The env rollup's win comes from dropping the queue dimension, not from
coarser buckets: row count is queue-independent (~8640/day/env), so full
10-second granularity stays cheap at any range. Env header tiles and
saturation charts now resolve short-range detail exactly like the
per-queue charts, and the current-value tiles read the latest 10-second
bucket instead of a minute-wide one.
The simulator's --reset only cleared the raw and 10s tables, leaving
stale rows in the 5m and env rollups. It also force-merges the rollups
after seeding so current-value widgets read cleanly.
Counter events now emit per queue and op odometer readings with a seeded
zero baseline, matching the production emitter, so throughput and
started counts reconstruct from simulated data instead of reading zero.
Scenario switches prune the previous scenario's queues, a --project flag
seeds each scenario into its own project for side-by-side design review,
and a new many-queues scenario covers pagination and relevance ranking
with one runaway queue, a busy head, a bursty middle, and a sparse tail.
Adds --help.
A --usage flag stages plausible running counts in the local run-queue
Redis for the seeded queues, so the list's Running column and the
Allocation tab's usage bars have data without the run engine. Staged
state is reconciled on every run: present with --usage, cleared without.
Local Redis hosts only.
@ericallam ericallam force-pushed the feat/queue-metrics-and-health branch from 6432d9f to 3c67a0c July 4, 2026 22:16

Labels

None yet

Projects

None yet


2 participants