feat(webapp,run-engine): queue metrics and health dashboard #4131
ericallam wants to merge 32 commits
Walkthrough
This PR adds queue-metrics ingestion, storage, query, and UI support. It introduces a Redis/ClickHouse metrics pipeline package, ClickHouse queue-metrics tables and query helpers, run-queue emission hooks, gap-filling support in TSQL, and new webapp admin, dashboard, list, and detail routes. It also adds environment and feature-flag gating, seed tooling, and tests across the pipeline and query layers.
Related PRs: None found.
Suggested labels: enhancement, area: webapp, area: run-engine, area: internal-packages
Suggested reviewers: ericallam, matt-aitken
Pre-merge checks: 4 passed, 1 failed (1 warning)
a892684 to 9412bf5
queueMetrics: env.QUEUE_METRICS_EMIT_ENABLED === "1" ? getQueueMetricsEmitter() : undefined,
processWorkerQueueDebounceMs: env.RUN_ENGINE_PROCESS_WORKER_QUEUE_DEBOUNCE_MS,
🚩 Metrics emitter Redis failure could surface during RunQueue construction
The getQueueMetricsEmitter() at queueMetrics.server.ts:136 creates a Redis client eagerly (via createRedisClient in the MetricsStreamEmitter constructor). This emitter is injected into the RunQueue at runEngine.server.ts:87 during the engine singleton construction. If the metrics Redis (which may be a separate instance per QUEUE_METRICS_REDIS_HOST) is unreachable at boot, the createRedisClient call will emit errors via the onError handler but won't throw synchronously (ioredis reconnects). The emitter's emit() calls are fire-and-forget with .catch(), so a persistently-down metrics Redis won't block the run queue. This is the intended design but worth verifying in a failure scenario.
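To make the fire-and-forget behaviour described in this comment concrete, here is a minimal sketch. `FireAndForgetEmitter`, `Transport`, and the `dropped` counter are hypothetical names for illustration and are not the PR's `MetricsStreamEmitter`; the sketch only assumes that the real emitter swallows rejected sends in a similar way.

```typescript
// Hypothetical stand-in for whatever writes an event to the metrics Redis stream.
type MetricEvent = Record<string, string>;
type Transport = (event: MetricEvent) => Promise<void>;

class FireAndForgetEmitter {
  // Count of sends that failed and were swallowed (illustrative only).
  dropped = 0;
  private readonly send: Transport;

  constructor(send: Transport) {
    this.send = send;
  }

  // Never awaits and never throws: the caller (the run queue) is not blocked
  // by a slow or failing transport. Promise.resolve().then(...) also converts
  // a synchronous throw inside send() into a rejection.
  emit(event: MetricEvent): void {
    Promise.resolve()
      .then(() => this.send(event))
      .catch(() => {
        this.dropped += 1;
      });
  }
}

// Simulates a persistently unreachable metrics Redis.
const unreachable: Transport = async () => {
  throw new Error("metrics Redis unreachable");
};

const emitter = new FireAndForgetEmitter(unreachable);
emitter.emit({ queue: "email", op: "enqueue" }); // returns immediately
```

A failure test along these lines would exercise exactly the scenario the comment asks to verify: the call site returns synchronously and the failure is only observable later through the error handler.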
tailQueues = await this._replica.taskQueue.findMany({
  where: { ...where, name: { notIn: excludedNames } },
🔴 Queue search filter is silently dropped for non-ranked queues when sorting by activity
The search filter on queue names is overwritten by the exclusion list ({ ...where, name: { notIn: excludedNames } } at QueueListPresenter.server.ts:285), so the tail portion of a sorted page shows queues that don't match the user's search.
Impact: Users who search for a queue name while sorting by "Busiest" or "Backlog" see unrelated queues at the bottom of the page.
Prisma where-clause overwrite mechanism
When a user types a search query (e.g. "email"), buildQueueListWhere at apps/webapp/app/presenters/v3/QueueListPresenter.server.ts:73-91 sets name: { contains: "email", mode: "insensitive" } in the where object. The getRankedQueues method passes this where to the tail query at line 284-292:
tailQueues = await this._replica.taskQueue.findMany({
  where: { ...where, name: { notIn: excludedNames } },
  ...
});
The object spread replaces the name: { contains: ... } filter with name: { notIn: ... }, so the Prisma query no longer filters by the user's search text. The same overwrite pattern exists in findQueuesByNames at line 317 ({ ...where, name: { in: names } }), though there the ClickHouse ranking already applies nameContains so the practical impact is smaller.
The tail query is reached when the current page extends past the ranked (ClickHouse-known) queues into the alphabetical tail of queues with no recent metrics. Those tail queues are fetched from Postgres without the search filter, so they can be any queue in the environment.
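The overwrite can be reproduced with plain objects, no database required. The sketch below uses simplified, hypothetical `Where` and `NameFilter` types rather than Prisma's generated ones. The `AND` form is one possible way to keep both conditions (Prisma's `where` accepts an `AND` array); it is not necessarily the fix this PR adopts.

```typescript
// Simplified stand-ins for the Prisma filter types (assumption, for illustration).
type NameFilter = { contains?: string; mode?: "insensitive"; notIn?: string[] };
type Where = { name?: NameFilter; AND?: Where[] };

// What buildQueueListWhere produces for the search text "email".
const where: Where = { name: { contains: "email", mode: "insensitive" } };
const excludedNames = ["email-high", "email-low"];

// Buggy shape: the later `name` key wins, so the `contains` filter is gone.
const overwritten: Where = { ...where, name: { notIn: excludedNames } };

// One possible fix: keep both conditions by combining them under AND.
const combined: Where = {
  AND: [where, { name: { notIn: excludedNames } }],
};

console.log(JSON.stringify(overwritten));
console.log(JSON.stringify(combined));
```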
…signals
Gauges are read inside the enqueue/dequeue Lua and returned on the script reply as a 2-tuple; counters are cumulative odometers. The run-queue Redis carries no metrics stream of its own.
…counters
entryOrderKey returns a string built with BigInt math so ordering stays correct at real epoch magnitudes. Odometer keys are namespaced by definition name. The consumer reports null lag for a missing consumer group instead of 0, and empty gauge values parse as NaN rather than 0.
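The NaN-for-empty rule matters because JavaScript's `Number("")` evaluates to `0`, which would turn a missing gauge into a plausible-looking zero. A minimal sketch of that kind of parsing, where `parseGauge` is a hypothetical name and not the PR's function:

```typescript
// Hypothetical helper: parse a gauge field from a stream entry.
// Missing or blank values become NaN so they cannot be mistaken for a real 0.
function parseGauge(raw: string | null | undefined): number {
  if (raw === null || raw === undefined || raw.trim() === "") {
    return Number.NaN;
  }
  return Number(raw);
}

console.log(parseGauge("12"), parseGauge("0"), parseGauge(""));
```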
…ng order keys
The wait-time quantile materialized view now excludes wait_ms = 0 rows so it matches the count aggregation. order_key accepts a string or a number. Migration comments no longer contain semicolons that split the migration into invalid statements.
…rride
The queues list tolerates a metrics query failure by rendering without metrics and logging a warning. UsageSparkline renders its total override even when every bucket is zero. The queue detail page returns 404 and its loader skips the metrics query when the feature flag is off. The seed script validates bucket size and only writes ClickHouse against a local host.
A bucket-led ORDER BY DESC combined with fillGaps emitted an ascending WITH FILL (positive step, ascending bounds), which produces invalid or empty fills. Skip the gap-fill rewrite for descending orders and let the plain descending query stand. Adds a DESC fillGaps test.
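The rule in this commit can be sketched as a small clause builder. `renderBucketOrderBy` and `FillRange` are hypothetical names, not the TSQL compiler's API. The sketch only assumes ClickHouse's `ORDER BY ... WITH FILL FROM ... TO ... STEP ...` syntax, where a numeric step on a DateTime column is read as seconds.

```typescript
type OrderDirection = "ASC" | "DESC";

// Hypothetical description of the range to fill, with bounds already rendered as SQL.
interface FillRange {
  fromExpr: string;
  toExpr: string;
  stepSeconds: number;
}

// Emit WITH FILL only for ascending bucket order. For DESC, an ascending fill
// (positive step, ascending bounds) would be inconsistent with the sort order,
// so the plain descending ORDER BY is returned unchanged.
function renderBucketOrderBy(
  column: string,
  direction: OrderDirection,
  fill?: FillRange
): string {
  const base = `ORDER BY ${column} ${direction}`;
  if (!fill || direction === "DESC") {
    return base;
  }
  return `${base} WITH FILL FROM ${fill.fromExpr} TO ${fill.toExpr} STEP ${fill.stepSeconds}`;
}

const exampleFill: FillRange = {
  fromExpr: "toDateTime('2024-01-01 00:00:00')",
  toExpr: "toDateTime('2024-01-01 01:00:00')",
  stepSeconds: 10,
};

console.log(renderBucketOrderBy("bucket", "ASC", exampleFill));
console.log(renderBucketOrderBy("bucket", "DESC", exampleFill));
```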
Packs the stream sequence with a 1e6 factor (was 1e5) so up to 1M entries per millisecond per shard fit before a seq could spill into the next millisecond's range, far above what a single Redis stream can produce. ms*1e6 stays within UInt64. Also fixes the webapp mapping test that still expected a numeric order_key after the switch to a BigInt-derived string.
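A worked version of the packing arithmetic, as a sketch: it assumes the key is derived from a Redis stream entry ID of the form `<ms>-<seq>` and that the result is `ms * 1e6 + seq` rendered as a decimal string. The function body below is illustrative and may differ from the PR's `entryOrderKey`. The names `SEQ_FACTOR` and `UINT64_MAX` are introduced here.

```typescript
// 1e6 sequence slots per millisecond, per the commit message.
const SEQ_FACTOR = BigInt(1000000);
const UINT64_MAX = BigInt("18446744073709551615");

// Pack "<ms>-<seq>" into one decimal string using BigInt math.
// A current epoch in ms (about 1.7e12) times 1e6 is about 1.7e18, which is
// beyond Number.MAX_SAFE_INTEGER (about 9e15) but below UInt64 max (about 1.8e19).
function entryOrderKey(streamId: string): string {
  const parts = streamId.split("-");
  const ms = BigInt(parts[0] ?? "0");
  const seq = BigInt(parts[1] ?? "0");
  if (seq >= SEQ_FACTOR) {
    // A larger seq would spill into the next millisecond's range.
    throw new RangeError(`sequence ${seq.toString()} does not fit in the 1e6 slot`);
  }
  return (ms * SEQ_FACTOR + seq).toString();
}

const earlier = entryOrderKey("1700000000000-999999");
const later = entryOrderKey("1700000000001-0");
console.log(earlier, later, BigInt(earlier) < BigInt(later));
```

The last entry of one millisecond (`seq = 999999`) still sorts before the first entry of the next millisecond, which is the property the 1e6 factor is meant to preserve.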
The queues list and queue detail pages now use the shared TimeFilter (any preset period or a custom date range) and everything on the page follows it: header tiles, per-queue metric columns, charts, and stats. The custom period buttons, hand-rolled chart cards, and duplicated metric fetch loops are replaced by the ChartCard and Chart primitives, UsageSparkline, and a shared useMetricResourceQuery hook. The ClickHouse list queries take an explicit end bound so fixed ranges query only their window.
Queries using deltaSumTimestampMerge failed with an unknown-function error, which broke the queue detail stats and the started counts on the built-in Queues dashboard.
The queues list header tiles now render the same line chart, grid, and tooltip as the rest of the metrics charts instead of a row sparkline, with the headline value in the tile header. The env saturation tile draws the environment concurrency limit and burst limit as labeled reference lines. Chart tooltips keep a gap between the series label and the value, and the shared line chart gains showDots and referenceLines options.
Adds an Allocation tab to the Queues page (behind the queue metrics UI flag): overview cards, a burst-aware capacity bar showing each queue allocation and its live usage in a distinct color, an inline-editable limits table with per-queue locks, load-weighted auto-balance, and a review dialog that bulk-applies limits as overrides through the existing concurrency system. The queue list now defaults to Busiest ordering (with Backlog and Name options). ClickHouse ranks queues by activity over the last 15 minutes and returns just the requested page of names, so the cost per page is one small aggregate regardless of environment size; idle queues follow in name order and any failure falls back to name ordering. The classic page keeps plain name order.
The fallback WHERE injection only targeted the top-level SELECT, so a query shaped as an outer aggregation over a FROM subquery failed to compile: the time column only exists inside the subquery. Descend into the subquery so the fallback lands next to the table reference.
Adds two rollups fed from the raw landing table: a per-queue 5-minute tier and an environment-level 1-minute tier (gauges plus TDigest wait quantiles). Ranking now reads the 5m tier and returns the page and the ranked total in one windowed query instead of two scans. The 5m materialized view reads raw rather than cascading off the 10s table: deltaSumTimestamp states hold a single first/last segment, so merging states in an MV's hash-ordered GROUP BY double-counts bridging spans. For the same reason the env tier carries no counter columns, and env-wide counter totals must group by queue before summing.
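Why env-wide counter totals have to be grouped by queue first can be shown with a simplified model. The sketch below is not ClickHouse code: `deltaSum` here is a plain TypeScript function that sums positive differences between consecutive values (negative differences are ignored, as in ClickHouse's `deltaSum`), and the readings are made-up numbers.

```typescript
interface CounterReading {
  queue: string;
  ts: number;
  total: number; // cumulative odometer value for that queue
}

// Sum of positive differences between consecutive values.
function deltaSum(values: number[]): number {
  let sum = 0;
  let prev: number | undefined;
  for (const value of values) {
    if (prev !== undefined && value > prev) {
      sum += value - prev;
    }
    prev = value;
  }
  return sum;
}

const readings: CounterReading[] = [
  { queue: "a", ts: 1, total: 100 },
  { queue: "b", ts: 2, total: 5 },
  { queue: "a", ts: 3, total: 110 },
  { queue: "b", ts: 4, total: 8 },
  { queue: "a", ts: 5, total: 120 },
];

// Correct: delta-sum each queue's own series, then add the results.
function totalPerQueue(rows: CounterReading[]): number {
  const byQueue: Record<string, number[]> = {};
  const sorted = [...rows].sort((x, y) => x.ts - y.ts);
  for (const row of sorted) {
    (byQueue[row.queue] = byQueue[row.queue] || []).push(row.total);
  }
  return Object.keys(byQueue).reduce(
    (acc, queue) => acc + deltaSum(byQueue[queue] || []),
    0
  );
}

// Wrong: treat unrelated odometers as one interleaved series.
function totalMixed(rows: CounterReading[]): number {
  const sorted = [...rows].sort((x, y) => x.ts - y.ts);
  return deltaSum(sorted.map((row) => row.total));
}

console.log(totalPerQueue(readings), totalMixed(readings));
```

In this data queue `a` advanced by 20 and queue `b` by 3, so the per-queue total is 23, while the interleaved series counts the jumps between the two odometers as if they were events.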
The built-in queues dashboard's enqueued vs started chart merged counter states across queues, which mixes unrelated cumulative counters and returns wrong totals; it now merges per queue and sums outside. Env header tiles and saturation charts read the environment rollup, so their cost no longer scales with queue count, and coarse-bucket ranges are served from the 5m rollup automatically. Queue list ranking runs as one query, time bounds are aligned to the bucket grid, and repeated auto-refresh reads share ClickHouse query-cache entries.
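Aligning time bounds to the bucket grid is what lets consecutive auto-refresh reads produce identical query parameters, and so hit the same cache entry. A minimal sketch, assuming the lower bound is floored and the upper bound is ceiled to the bucket size; the rounding direction and the name `alignToBucketGrid` are assumptions, not taken from the PR.

```typescript
interface TimeRangeMs {
  fromMs: number;
  toMs: number;
}

// Snap a [from, to] range (epoch milliseconds) outward to the bucket grid.
function alignToBucketGrid(
  fromMs: number,
  toMs: number,
  bucketSeconds: number
): TimeRangeMs {
  const bucketMs = bucketSeconds * 1000;
  return {
    fromMs: Math.floor(fromMs / bucketMs) * bucketMs,
    toMs: Math.ceil(toMs / bucketMs) * bucketMs,
  };
}

// Two refreshes three seconds apart, both inside the same 10-second bucket,
// resolve to the same aligned range.
const firstRefresh = alignToBucketGrid(1700000001000, 1700000601000, 10);
const secondRefresh = alignToBucketGrid(1700000004000, 1700000604000, 10);
console.log(firstRefresh, secondRefresh);
```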
… rollup
The env rollup's win comes from dropping the queue dimension, not from coarser buckets: row count is queue-independent (~8640/day/env), so full 10-second granularity stays cheap at any range. Env header tiles and saturation charts now resolve short-range detail exactly like the per-queue charts, and the current-value tiles read the latest 10-second bucket instead of a minute-wide one.
The simulator's --reset only cleared the raw and 10s tables, leaving stale rows in the 5m and env rollups. It also force-merges the rollups after seeding so current-value widgets read cleanly.
Counter events now emit per-queue, per-op odometer readings with a seeded zero baseline, matching the production emitter, so throughput and started counts reconstruct from simulated data instead of reading zero. Scenario switches prune the previous scenario's queues, a --project flag seeds each scenario into its own project for side-by-side design review, and a new many-queues scenario covers pagination and relevance ranking with one runaway queue, a busy head, a bursty middle, and a sparse tail. Adds --help.
A --usage flag stages plausible running counts in the local run-queue Redis for the seeded queues, so the list's Running column and the Allocation tab's usage bars have data without the run engine. Staged state is reconciled on every run: present with --usage, cleared without. Local Redis hosts only.
6432d9f to 3c67a0c
Summary
Adds per-queue observability to the Queues page: depth (backlog), throughput (enqueued, started, completed), concurrency, whether a queue is throttled, and the scheduling delay (how long runs wait between becoming eligible and actually starting). Each queue shows health at a glance in the list, plus a per-queue detail page with charts, so you can answer "does this queue have enough concurrency to keep up?".
Both the data collection and the dashboard are off by default and gated independently: metric emission is a global switch, and the dashboard is turned on per organization. With both off, the Queues page is unchanged.
Design
Queue operations emit two kinds of signal. Gauges (depth, running, limit, throttled) are read inside the same Redis script that performs the enqueue or dequeue, so the reading is atomic, and returned on the script's reply for the app to forward. Counters (enqueued, started, completed) are cumulative odometers, so a dropped reading self-heals: the next one restates the running total. Both land on one Redis stream on a dedicated metrics instance (falling back to the run queue's Redis when self-hosting), drain through a consumer into ClickHouse (raw, a 10-second-bucket materialized view, and a 30-day aggregate), and the dashboards read the aggregate. The run queue's own Redis carries no metrics stream.
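The self-healing property of odometer counters can be illustrated with a few made-up readings. This is a sketch, not the pipeline's code: `countInWindow` and `OdometerReading` are hypothetical names, and the function assumes the counter never resets inside the window.

```typescript
interface OdometerReading {
  ts: number;
  total: number; // running total since the counter was created
}

// Events in a window = last running total minus first running total.
// Assumes no counter reset inside the window.
function countInWindow(readings: OdometerReading[]): number {
  const sorted = [...readings].sort((a, b) => a.ts - b.ts);
  const first = sorted[0];
  const last = sorted[sorted.length - 1];
  if (!first || !last) {
    return 0;
  }
  return last.total - first.total;
}

const complete: OdometerReading[] = [
  { ts: 1, total: 10 },
  { ts: 2, total: 14 },
  { ts: 3, total: 19 },
  { ts: 4, total: 25 },
];

// The same window with the two middle readings dropped in transit.
const lossy: OdometerReading[] = [
  { ts: 1, total: 10 },
  { ts: 4, total: 25 },
];

console.log(countInWindow(complete), countInWindow(lossy));
```

Both calls report the same count because the surviving reading at `ts: 4` restates the running total. With per-event increments, by contrast, every dropped message would be a permanently lost count.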
The one change that is live the moment this deploys, independent of both flags, is the enqueue/dequeue script reply shape: those scripts now return a 2-tuple so the gauge reading can ride back to the app. That path is exercised on every queue op, so it is the part of run-engine worth the closest review.