feat(webapp,run-engine): queue metrics and health dashboard by ericallam · Pull Request #4131 · triggerdotdev/trigger.dev

ericallam · 2026-07-03T08:23:47Z

Summary

Adds per-queue observability to the Queues page: depth (backlog), throughput (enqueued, started, completed), concurrency, whether a queue is throttled, and the scheduling delay (how long runs wait between becoming eligible and actually starting). Each queue shows health at a glance in the list, plus a per-queue detail page with charts, so you can answer "does this queue have enough concurrency to keep up?".

Both the data collection and the dashboard are off by default and gated independently: metric emission is a global switch, and the dashboard is turned on per organization. With both off, the Queues page is unchanged.

Design

Queue operations emit two kinds of signal. Gauges (depth, running, limit, throttled) are read inside the same Redis script that performs the enqueue or dequeue, so each reading is atomic with the operation, and they are returned on the script's reply for the app to forward. Counters (enqueued, started, completed) are cumulative odometers, so a dropped reading self-heals: the next one restates the running total. Both land on one Redis stream on a dedicated metrics instance (falling back to the run queue's Redis when self-hosting), drain through a consumer into ClickHouse (a raw table, a 10-second-bucket materialized view, and a 30-day aggregate), and the dashboards read the aggregate. The run queue's own Redis carries no metrics stream.
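To illustrate why a dropped odometer reading self-heals, here is a minimal sketch. The CounterReading shape and the countInWindow helper are hypothetical names for illustration, not the PR's actual types. Because each reading restates the running total, the count over a window depends only on the first and last readings that arrive.

```typescript
// Hypothetical shape of one cumulative counter reading (illustrative only).
interface CounterReading {
  atMs: number; // when the reading was taken
  total: number; // running total since the counter started
}

// Number of events between the earliest and latest reading in the window.
function countInWindow(readings: CounterReading[]): number {
  if (readings.length < 2) return 0;
  const sorted = [...readings].sort((a, b) => a.atMs - b.atMs);
  return sorted[sorted.length - 1].total - sorted[0].total;
}

const complete: CounterReading[] = [
  { atMs: 0, total: 100 },
  { atMs: 10000, total: 130 },
  { atMs: 20000, total: 175 },
];

// The same window with the middle reading lost in transit.
const withDrop: CounterReading[] = [complete[0], complete[2]];

console.log(countInWindow(complete)); // 75
console.log(countInWindow(withDrop)); // 75
```

A per-event delta counter would undercount by 30 in the second case. The odometer only loses resolution inside the window, not the total.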

The one change that is live the moment this deploys, independent of both flags, is the enqueue/dequeue script reply shape: those scripts now return a 2-tuple so the gauge reading can ride back to the app. That path is exercised on every queue op, so it is the part of run-engine worth the closest review.
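Since the new reply shape is on the hot path, a defensive unpacking step is one way to make a shape mismatch fail loudly. The sketch below is an assumption throughout: GaugeSnapshot, splitReply, and the [depth, running] encoding of the second element are hypothetical, and the PR's real reply layout is not shown on this page.

```typescript
// Hypothetical gauge snapshot; the real field names and encoding may differ.
interface GaugeSnapshot {
  depth: number;
  running: number;
}

// Unpack a [result, gauges] reply, rejecting anything that is not a 2-tuple.
function splitReply<T>(reply: unknown): { result: T; gauges: GaugeSnapshot | null } {
  if (!Array.isArray(reply) || reply.length !== 2) {
    throw new Error("expected a 2-tuple script reply");
  }
  const [result, rawGauges] = reply as [T, unknown];
  // Assumed encoding: gauges arrive as [depth, running] strings, or null when absent.
  if (!Array.isArray(rawGauges) || rawGauges.length < 2) {
    return { result, gauges: null };
  }
  return {
    result,
    gauges: { depth: Number(rawGauges[0]), running: Number(rawGauges[1]) },
  };
}

const parsed = splitReply<string>(["msg_1", ["12", "3"]]);
console.log(parsed.result, parsed.gauges);
```

The point of the sketch is the ordering: validate the tuple shape first, then treat a missing gauge payload as "no reading" rather than as an error, so that metrics problems never change the outcome of the queue operation itself.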

changeset-bot · 2026-07-03T08:23:52Z

⚠️ No Changeset found

Latest commit: 3c67a0c

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

This PR includes no changesets

When changesets are added to this PR, you'll see the packages that this PR includes changesets for and the associated semver types

coderabbitai · 2026-07-03T08:25:40Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

Walkthrough

This PR adds queue-metrics ingestion, storage, query, and UI support. It introduces a Redis/ClickHouse metrics pipeline package, ClickHouse queue-metrics tables and query helpers, run-queue emission hooks, gap-filling support in TSQL, and new webapp admin, dashboard, list, and detail routes. It also adds environment and feature-flag gating, seed tooling, and tests across the pipeline and query layers.

Related PRs: None found.

Suggested labels: enhancement, area: webapp, area: run-engine, area: internal-packages

Suggested reviewers: ericallam, matt-aitken

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Description check	⚠️ Warning	The description omits required template sections, including issue link, checklist, testing steps, changelog, and screenshots.	Add the issue reference and fill in the required Checklist, Testing, Changelog, and Screenshots sections per the template.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title matches the main change: queue metrics observability/dashboard work in webapp and run-engine.
Docstring Coverage	✅ Passed	No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

devin-ai-integration

Devin Review found 1 potential issue.

pkg-pr-new · 2026-07-04T08:31:49Z

@trigger.dev/build

npm i https://pkg.pr.new/@trigger.dev/build@3c67a0c

trigger.dev

npm i https://pkg.pr.new/trigger.dev@3c67a0c

@trigger.dev/core

npm i https://pkg.pr.new/@trigger.dev/core@3c67a0c

@trigger.dev/python

npm i https://pkg.pr.new/@trigger.dev/python@3c67a0c

@trigger.dev/react-hooks

npm i https://pkg.pr.new/@trigger.dev/react-hooks@3c67a0c

@trigger.dev/redis-worker

npm i https://pkg.pr.new/@trigger.dev/redis-worker@3c67a0c

@trigger.dev/rsc

npm i https://pkg.pr.new/@trigger.dev/rsc@3c67a0c

@trigger.dev/schema-to-json

npm i https://pkg.pr.new/@trigger.dev/schema-to-json@3c67a0c

@trigger.dev/sdk

npm i https://pkg.pr.new/@trigger.dev/sdk@3c67a0c

commit: 3c67a0c

devin-ai-integration

Devin Review found 1 new potential issue.

devin-ai-integration · 2026-07-04T08:33:46Z

+      queueMetrics: env.QUEUE_METRICS_EMIT_ENABLED === "1" ? getQueueMetricsEmitter() : undefined,
      processWorkerQueueDebounceMs: env.RUN_ENGINE_PROCESS_WORKER_QUEUE_DEBOUNCE_MS,


🚩 Metrics emitter Redis failure could surface during RunQueue construction

The getQueueMetricsEmitter() at queueMetrics.server.ts:136 creates a Redis client eagerly (via createRedisClient in the MetricsStreamEmitter constructor). This emitter is injected into the RunQueue at runEngine.server.ts:87 during the engine singleton construction. If the metrics Redis (which may be a separate instance per QUEUE_METRICS_REDIS_HOST) is unreachable at boot, the createRedisClient call will emit errors via the onError handler but won't throw synchronously (ioredis reconnects). The emitter's emit() calls are fire-and-forget with .catch(), so a persistently-down metrics Redis won't block the run queue. This is the intended design but worth verifying in a failure scenario.
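The fire-and-forget behaviour described above can be sketched as follows. MetricsEmitter and emitQuietly are hypothetical names, not the PR's API. The sketch assumes emit() returns a promise and shows that a rejected emit is routed to an error handler without reaching the caller.

```typescript
// Hypothetical emitter interface; the real MetricsStreamEmitter API may differ.
interface MetricsEmitter {
  emit(fields: Record<string, string>): Promise<void>;
}

// Emit without awaiting: failures go to onError and never reach the queue op.
function emitQuietly(
  emitter: MetricsEmitter | undefined,
  fields: Record<string, string>,
  onError: (err: unknown) => void
): void {
  if (!emitter) return; // emission disabled, nothing to do
  try {
    emitter.emit(fields).catch(onError); // asynchronous failure
  } catch (err) {
    onError(err); // emit() threw before returning a promise
  }
}

// Stand-in for a persistently unreachable metrics Redis.
const downEmitter: MetricsEmitter = {
  emit: () => Promise.reject(new Error("metrics Redis unreachable")),
};

const errors: unknown[] = [];
emitQuietly(downEmitter, { queue: "email", depth: "12" }, (err) => errors.push(err));
emitQuietly(undefined, { queue: "email", depth: "12" }, (err) => errors.push(err));
```

The try/catch around the call is the part worth checking in a failure scenario: a bare .catch() only covers rejections, so an emitter that throws synchronously would otherwise surface inside the queue operation.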

devin-ai-integration

Devin Review found 1 new potential issue.

devin-ai-integration · 2026-07-04T13:25:37Z

+      tailQueues = await this._replica.taskQueue.findMany({
+        where: { ...where, name: { notIn: excludedNames } },


🔴 Queue search filter is silently dropped for non-ranked queues when sorting by activity

The search filter on queue names is overwritten by the exclusion list ({ ...where, name: { notIn: excludedNames } } at QueueListPresenter.server.ts:285), so the tail portion of a sorted page shows queues that don't match the user's search.

Impact: Users who search for a queue name while sorting by "Busiest" or "Backlog" see unrelated queues at the bottom of the page.

Prisma where-clause overwrite mechanism

When a user types a search query (e.g. "email"), buildQueueListWhere at apps/webapp/app/presenters/v3/QueueListPresenter.server.ts:73-91 sets name: { contains: "email", mode: "insensitive" } in the where object. The getRankedQueues method passes this where to the tail query at line 284-292:

tailQueues = await this._replica.taskQueue.findMany({ where: { ...where, name: { notIn: excludedNames } }, ... });

The object spread replaces the name: { contains: ... } filter with name: { notIn: ... }, so the Prisma query no longer filters by the user's search text. The same overwrite pattern exists in findQueuesByNames at line 317 ({ ...where, name: { in: names } }), though there the ClickHouse ranking already applies nameContains so the practical impact is smaller.

The tail query is reached when the current page extends past the ranked (ClickHouse-known) queues into the alphabetical tail of queues with no recent metrics. Those tail queues are fetched from Postgres without the search filter, so they can be any queue in the environment.
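The overwrite can be reproduced without Prisma, since it is plain object-spread semantics. The sketch below uses a minimal stand-in for the where-clause type. The Where and NameFilter types and the environmentId field are assumptions for illustration. Combining the two name conditions under AND is one possible fix, not necessarily the one applied in this PR.

```typescript
// Minimal stand-in for the subset of a Prisma-style where clause used here.
type NameFilter = {
  contains?: string;
  mode?: "insensitive";
  in?: string[];
  notIn?: string[];
};
type Where = { environmentId?: string; name?: NameFilter; AND?: Where[] };

// What the search box produces (environmentId is a hypothetical field).
const where: Where = {
  environmentId: "env_hypothetical",
  name: { contains: "email", mode: "insensitive" },
};
const excludedNames = ["email-digest"];

// Spread followed by `name`: the later key wins and the search filter is lost.
const overwritten: Where = { ...where, name: { notIn: excludedNames } };

// Keeping both conditions: AND the original clause with the exclusion.
const combined: Where = { AND: [where, { name: { notIn: excludedNames } }] };

console.log(overwritten.name); // { notIn: [ 'email-digest' ] }
console.log(JSON.stringify(combined));
```

The same approach would apply to the { ...where, name: { in: names } } case in findQueuesByNames.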

…peline

…signals Gauges are read inside the enqueue/dequeue Lua and returned on the script reply as a 2-tuple; counters are cumulative odometers. The run-queue Redis carries no metrics stream of its own.

…witch

…counters entryOrderKey returns a string built with BigInt math so ordering stays correct at real epoch magnitudes. Odometer keys are namespaced by definition name. The consumer reports null lag for a missing consumer group instead of 0, and empty gauge values parse as NaN rather than 0.
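The NaN-versus-0 point matters in JavaScript because Number("") evaluates to 0, so an empty field would be indistinguishable from a genuinely empty queue. A minimal sketch of the parsing rule, using a hypothetical parseGauge helper rather than the PR's actual function:

```typescript
// Parse a gauge field from a stream entry; a missing or empty value is
// "no reading" (NaN), not a reading of zero.
function parseGauge(raw: string | undefined): number {
  if (raw === undefined || raw.trim() === "") return NaN;
  return Number(raw);
}

console.log(Number("")); // 0, the pitfall
console.log(parseGauge("")); // NaN
console.log(parseGauge("42")); // 42
```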

…ng order keys The wait-time quantile materialized view now excludes wait_ms = 0 rows so it matches the count aggregation. order_key accepts a string or a number. Migration comments no longer contain semicolons that split the migration into invalid statements.

…rride The queues list tolerates a metrics query failure by rendering without metrics and logging a warning. UsageSparkline renders its total override even when every bucket is zero. The queue detail page returns 404 and its loader skips the metrics query when the feature flag is off. The seed script validates bucket size and only writes ClickHouse against a local host.

A bucket-led ORDER BY DESC combined with fillGaps emitted an ascending WITH FILL (positive step, ascending bounds), which produces invalid or empty fills. Skip the gap-fill rewrite for descending orders and let the plain descending query stand. Adds a DESC fillGaps test.

Packs the stream sequence with a 1e6 factor (was 1e5) so up to 1M entries per millisecond per shard fit before a seq could spill into the next millisecond's range, far above what a single Redis stream can produce. ms*1e6 stays within UInt64. Also fixes the webapp mapping test that still expected a numeric order_key after the switch to a BigInt-derived string.
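The packing arithmetic can be sketched as below. packOrderKey is a hypothetical stand-in for the PR's entryOrderKey, written from the description above. It assumes the input is a Redis stream entry ID of the form ms-seq. The timestamp is an arbitrary illustrative value.

```typescript
// Sequence numbers below this factor fit inside one millisecond's range.
const SEQ_FACTOR = BigInt(1000000);

// Pack "<ms>-<seq>" into ms * 1e6 + seq, returned as a decimal string.
function packOrderKey(streamId: string): string {
  const parts = streamId.split("-");
  const ms = BigInt(parts[0] || "0");
  const seq = BigInt(parts[1] || "0");
  if (seq >= SEQ_FACTOR) {
    throw new RangeError("sequence would spill into the next millisecond");
  }
  return (ms * SEQ_FACTOR + seq).toString();
}

console.log(packOrderKey("1783153427000-5")); // "1783153427000000005"

// The same arithmetic in float64 collapses adjacent sequence numbers,
// because the product is far above Number.MAX_SAFE_INTEGER.
console.log(1783153427000 * 1e6 + 5 === 1783153427000 * 1e6 + 6); // true
```

At this magnitude the packed value is about 1.78e18, below the UInt64 maximum of roughly 1.84e19, which matches the headroom claim in the commit message.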

The queues list and queue detail pages now use the shared TimeFilter (any preset period or a custom date range), and everything on the page follows it: header tiles, per-queue metric columns, charts, and stats. The custom period buttons, hand-rolled chart cards, and duplicated metric fetch loops are replaced by the ChartCard and Chart primitives, UsageSparkline, and a shared useMetricResourceQuery hook. The ClickHouse list queries take an explicit end bound so fixed ranges query only their window.

Queries using deltaSumTimestampMerge failed with an unknown function error, which broke the queue detail stats and the started counts on the built-in Queues dashboard.

The queues list header tiles now render the same line chart, grid, and tooltip as the rest of the metrics charts instead of a row sparkline, with the headline value in the tile header. The env saturation tile draws the environment concurrency limit and burst limit as labeled reference lines. Chart tooltips keep a gap between the series label and the value, and the shared line chart gains showDots and referenceLines options.

Adds an Allocation tab to the Queues page (behind the queue metrics UI flag): overview cards, a burst-aware capacity bar showing each queue allocation and its live usage in a distinct color, an inline-editable limits table with per-queue locks, load-weighted auto-balance, and a review dialog that bulk-applies limits as overrides through the existing concurrency system. The queue list now defaults to Busiest ordering (with Backlog and Name options). ClickHouse ranks queues by activity over the last 15 minutes and returns just the requested page of names, so the cost per page is one small aggregate regardless of environment size; idle queues follow in name order and any failure falls back to name ordering. The classic page keeps plain name order.

The fallback WHERE injection only targeted the top-level SELECT, so a query shaped as an outer aggregation over a FROM subquery failed to compile: the time column only exists inside the subquery. Descend into the subquery so the fallback lands next to the table reference.

Adds two rollups fed from the raw landing table: a per-queue 5-minute tier and an environment-level 1-minute tier (gauges plus TDigest wait quantiles). Ranking now reads the 5m tier and returns the page and the ranked total in one windowed query instead of two scans. The 5m materialized view reads raw rather than cascading off the 10s table: deltaSumTimestamp states hold a single first/last segment, so merging states in an MV's hash-ordered GROUP BY double-counts bridging spans. For the same reason the env tier carries no counter columns, and env-wide counter totals must group by queue before summing.

The built-in queues dashboard's enqueued vs started chart merged counter states across queues, which mixes unrelated cumulative counters and returns wrong totals; it now merges per queue and sums outside. Env header tiles and saturation charts read the environment rollup, so their cost no longer scales with queue count, and coarse-bucket ranges are served from the 5m rollup automatically. Queue list ranking runs as one query, time bounds are aligned to the bucket grid, and repeated auto-refresh reads share ClickHouse query-cache entries.
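A simplified model of why cumulative counters must be merged per queue and summed outside. The Reading type and both helpers are hypothetical, and mergedAsOneCounter is only a rough stand-in for merging counter states across queues, not ClickHouse's exact deltaSumTimestamp semantics.

```typescript
interface Reading {
  queue: string;
  atMs: number;
  total: number; // cumulative odometer value for this queue
}

// Increase of one counter between its first and last reading.
function counterDelta(readings: Reading[]): number {
  if (readings.length < 2) return 0;
  const sorted = [...readings].sort((a, b) => a.atMs - b.atMs);
  return sorted[sorted.length - 1].total - sorted[0].total;
}

// Correct: compute each queue's delta, then sum the deltas.
function sumPerQueue(readings: Reading[]): number {
  const byQueue = new Map<string, Reading[]>();
  for (const r of readings) {
    const list = byQueue.get(r.queue) ?? [];
    list.push(r);
    byQueue.set(r.queue, list);
  }
  let total = 0;
  byQueue.forEach((list) => {
    total += counterDelta(list);
  });
  return total;
}

// Wrong: treat interleaved readings from different queues as one counter
// and add up the positive jumps between consecutive readings.
function mergedAsOneCounter(readings: Reading[]): number {
  const sorted = [...readings].sort((a, b) => a.atMs - b.atMs);
  let total = 0;
  for (let i = 1; i < sorted.length; i++) {
    const diff = sorted[i].total - sorted[i - 1].total;
    if (diff > 0) total += diff;
  }
  return total;
}

const readings: Reading[] = [
  { queue: "a", atMs: 0, total: 1000 },
  { queue: "b", atMs: 1, total: 5 },
  { queue: "a", atMs: 2, total: 1010 },
  { queue: "b", atMs: 3, total: 8 },
];

console.log(sumPerQueue(readings)); // 13
console.log(mergedAsOneCounter(readings)); // 1005
```

Queue a advanced by 10 and queue b by 3, so 13 is the real total. The merged version mistakes the jump from b's odometer to a's for 1005 new events.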

… rollup The env rollup's win comes from dropping the queue dimension, not from coarser buckets: row count is queue-independent (~8640/day/env), so full 10-second granularity stays cheap at any range. Env header tiles and saturation charts now resolve short-range detail exactly like the per-queue charts, and the current-value tiles read the latest 10-second bucket instead of a minute-wide one.
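The ~8640/day/env figure follows from the bucket width alone. A small sketch of the arithmetic, with hypothetical helper names. The per-queue figure is an upper bound that assumes every queue has a row in every bucket.

```typescript
const SECONDS_PER_DAY = 86400;

// Rows per day for one environment in the env rollup: one row per bucket.
function envRollupRowsPerDay(bucketSeconds: number): number {
  return SECONDS_PER_DAY / bucketSeconds;
}

// Upper bound for a per-queue table at the same bucket width.
function perQueueRowsPerDay(bucketSeconds: number, queueCount: number): number {
  return envRollupRowsPerDay(bucketSeconds) * queueCount;
}

console.log(envRollupRowsPerDay(10)); // 8640
console.log(perQueueRowsPerDay(10, 500)); // 4320000
```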

The simulator's --reset only cleared the raw and 10s tables, leaving stale rows in the 5m and env rollups. It also force-merges the rollups after seeding so current-value widgets read cleanly.

Counter events now emit per queue and op odometer readings with a seeded zero baseline, matching the production emitter, so throughput and started counts reconstruct from simulated data instead of reading zero. Scenario switches prune the previous scenario's queues, a --project flag seeds each scenario into its own project for side-by-side design review, and a new many-queues scenario covers pagination and relevance ranking with one runaway queue, a busy head, a bursty middle, and a sparse tail. Adds --help.

A --usage flag stages plausible running counts in the local run-queue Redis for the seeded queues, so the list's Running column and the Allocation tab's usage bars have data without the run engine. Staged state is reconciled on every run: present with --usage, cleared without. Local Redis hosts only.

ericallam marked this pull request as ready for review July 3, 2026 10:26

devin-ai-integration Bot reviewed Jul 3, 2026

Comment thread internal-packages/clickhouse/schema/035_create_queue_metrics_v1.sql

ericallam force-pushed the feat/queue-metrics-and-health branch from a892684 to 9412bf5 on July 4, 2026 08:30

devin-ai-integration Bot reviewed Jul 4, 2026

ericallam added 17 commits July 4, 2026 23:14

feat(metrics-pipeline): generic Redis-stream to ClickHouse metrics pi…

c0d5a54

…peline

feat(clickhouse): queue metrics tables and read queries

09ab1b1

feat(run-engine): emit queue depth, throughput, and scheduling-delay …

c32a1cd

…signals Gauges are read inside the enqueue/dequeue Lua and returned on the script reply as a 2-tuple; counters are cumulative odometers. The run-queue Redis carries no metrics stream of its own.

feat(webapp): queue metrics ingestion, admin controls, and emission s…

2d3d32d

…witch

feat(tsql): opt-in gap-fill for time-bucketed series

305e9a6

feat(webapp): Queues dashboard and per-org metrics UI flag

982be50

chore(webapp): add server-changes note for queue metrics

0143080

chore: apply oxfmt formatting

d193dc3

chore: use import type for type-only imports

bcb017d

fix(tsql): avoid polynomial backtracking in ORDER BY direction strip

946a24d

fix(tsql): strip ORDER BY direction without a backtracking regex

06bb476

fix(clickhouse): remove semicolons from queue metrics migration comments

e848045

test(clickhouse): rewrite queue metrics test for cumulative counters

c7befb3

test(run-engine): import describe from vitest in run-queue metrics test

4b465b7

ericallam added 15 commits July 4, 2026 23:14

fix(tsql): register the deltaSumTimestampMerge aggregate

fa40e59

chore(webapp): use shared primitives on the admin queue metrics page

79ca47f

feat(clickhouse): queue activity ranking queries

ec4d032

fix(webapp): include rollup tables in the queue metrics simulator reset

efdd64f

ericallam force-pushed the feat/queue-metrics-and-health branch from 6432d9f to 3c67a0c on July 4, 2026 22:16

2 participants
