feat: renew in-flight job timestamps via a worker heartbeat by mattstrayer · Pull Request #17 · boringnode/queue

mattstrayer · 2026-06-24T16:50:23Z

Closes #16.

Companion PR for the long-running-job double-execution issue. As described there, when a handler runs longer than stalledThreshold, nothing ever refreshes acquiredAt between claim and completion, so the stalled-recovery path re-delivers the job to a free slot and it runs a second time, concurrently. stalledThreshold effectively becomes a hard cap on how long a job may run rather than a crash-detection window.
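
To make the failure mode concrete, here is a minimal sketch of the kind of check a stalled-recovery pass performs. The `ActiveEntry` shape, the `isStalled` helper and the 30 s threshold are assumptions for illustration, not the library's actual code.

```typescript
// Hypothetical shape of an active-job record; field names are assumptions.
interface ActiveEntry {
  jobId: string
  workerId: string
  acquiredAt: number // epoch milliseconds of the claim or of the last renewal
}

// A job is treated as stalled once its lease is older than the threshold.
function isStalled(entry: ActiveEntry, now: number, stalledThresholdMs: number): boolean {
  return now - entry.acquiredAt > stalledThresholdMs
}

const stalledThresholdMs = 30_000
const entry: ActiveEntry = { jobId: 'job-1', workerId: 'worker-a', acquiredAt: 0 }

// Without renewal, a healthy handler still running at 45 s looks stalled.
const withoutRenewal = isStalled(entry, 45_000, stalledThresholdMs)

// If acquiredAt had been renewed at 30 s, the same job is left alone.
const renewedEntry: ActiveEntry = { ...entry, acquiredAt: 30_000 }
const withRenewal = isStalled(renewedEntry, 45_000, stalledThresholdMs)
```

With the timestamp frozen at claim time, the only way to protect a 45 s handler is to raise the threshold above 45 s, which is exactly the "hard cap" behavior described above.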

What this does

Adds a heartbeat that periodically renews the acquired timestamp of the jobs a worker is actively processing.

renewJobs(queue, jobIds) on the Adapter contract, implemented for every backend:
- Redis — new RENEW_JOBS_SCRIPT Lua that HSETs acquiredAt on the active hash, preserving workerId.
- Knex — UPDATE ... WHERE status = 'active' AND id IN (...).
- Fake / Sync adapters (and the in-memory test mock).
- Only entries still active are renewed, so a job that was already recovered or finalized is never resurrected by a late heartbeat.
A dedicated worker setInterval (~stalledThreshold / 2) that renews the in-flight job ids. It has to be a separate timer rather than piggybacking on the process loop: at full concurrency the loop blocks on waitForNextCompletion() with no idle tick, so the loop is not cycling exactly when long jobs are in flight.
The heartbeat is cleared in stop() (after draining, so jobs still finishing keep being renewed until they complete) as well as in the process() generator's finally, giving deterministic cleanup whether the worker is driven via start() or processCycle(). #startHeartbeat is idempotent so the timer can't leak if the loop is re-entered.
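
The timer lifecycle described above can be sketched roughly as follows. This is an illustrative stand-in, not the code in this PR: the `Heartbeat` class, its `track`/`untrack` helpers and the `RenewFn` type are hypothetical names, and error handling is reduced to swallowing a failed renewal.

```typescript
// Assumed signature for the adapter call; the real contract may differ.
type RenewFn = (queue: string, jobIds: string[]) => Promise<number>

class Heartbeat {
  #timer: ReturnType<typeof setInterval> | undefined
  readonly #inFlight = new Set<string>()
  readonly #queue: string
  readonly #renew: RenewFn
  readonly #stalledThresholdMs: number

  constructor(queue: string, renew: RenewFn, stalledThresholdMs: number) {
    this.#queue = queue
    this.#renew = renew
    this.#stalledThresholdMs = stalledThresholdMs
  }

  track(jobId: string): void {
    this.#inFlight.add(jobId)
  }

  untrack(jobId: string): void {
    this.#inFlight.delete(jobId)
  }

  // Idempotent: a second call while the timer is running does nothing,
  // so re-entering the process loop cannot leak a timer.
  start(): void {
    if (this.#timer !== undefined) return
    const intervalMs = Math.max(1, Math.floor(this.#stalledThresholdMs / 2))
    this.#timer = setInterval(() => {
      void this.beat()
    }, intervalMs)
  }

  // Renews every in-flight job id; a failed renewal is ignored in this sketch.
  async beat(): Promise<number> {
    if (this.#inFlight.size === 0) return 0
    try {
      return await this.#renew(this.#queue, [...this.#inFlight])
    } catch {
      return 0
    }
  }

  // Meant to be called after draining, and from the loop's finally block.
  stop(): void {
    if (this.#timer === undefined) return
    clearInterval(this.#timer)
    this.#timer = undefined
  }

  get running(): boolean {
    return this.#timer !== undefined
  }
}
```

Because the interval is independent of the process loop, it keeps firing while the loop is parked waiting for a slot to free up, which is the situation the separate timer exists for.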

Net effect: "stalled" once again means the worker actually died, so stalledThreshold can stay small without re-delivering healthy long-running jobs.

Tests

renewJobs behavior across all adapters via the shared driver suite: renewal keeps an active job from being recovered, never resurrects an already-recovered job, is queue-scoped, and is a no-op for an empty id list.
Worker-level: a long-running job at full capacity is kept alive by the heartbeat and executes exactly once (verified against a negative control where the double-execution reproduces without renewal), and the heartbeat stops firing once the worker is stopped.

All 647 tests pass locally across Memory, Redis, Knex (SQLite) and Knex (PostgreSQL); tsc, oxlint and oxfmt are clean.

Long-running jobs could be executed twice from a single enqueue. When a handler runs longer than `stalledThreshold`, the stalled-recovery path re-delivers it to a free slot because nothing ever refreshes `acquiredAt` between claim and completion, so the threshold acts as a hard cap on job runtime rather than a crash-detection window.

Add a heartbeat that periodically renews the acquired timestamp of jobs currently in the pool:

- `renewJobs(queue, jobIds)` on the Adapter contract, implemented for the Redis (Lua `HSET` over the active hash), Knex (UPDATE ... WHERE status = 'active'), Fake and Sync adapters. Only entries still active are renewed, so a job that was already recovered or finalized is never resurrected by a late heartbeat.
- A dedicated worker `setInterval` (~`stalledThreshold / 2`) that renews the in-flight job ids. It must be a separate timer: at full concurrency the process loop blocks on `waitForNextCompletion()` with no idle tick, so the loop is not cycling exactly when long jobs are in flight.
- The heartbeat is cleared in `stop()` (after draining, so jobs that are still finishing keep being renewed until they complete) as well as in the process() generator's `finally`, guaranteeing deterministic cleanup whether the worker was driven via start() or processCycle(). `#startHeartbeat` is idempotent so the timer can never leak if the loop is re-entered.

"Stalled" once again means the worker actually died, so `stalledThreshold` can stay small without re-delivering healthy long-running jobs.

Tests cover renewJobs across all adapters (renew keeps an active job from recovery, never resurrects an already-recovered job, is queue-scoped) and two worker-level tests: a long-running job at full capacity is renewed by the heartbeat and executes exactly once, and the heartbeat stops firing once the worker is stopped.

RomainLanz · 2026-06-28T06:46:02Z

Hi!

Thanks for the PR. The approach looks good to me as a first solution.

One caveat is that CPU-bound jobs can still cause issues, since they may block the event loop and prevent the heartbeat from running. That is probably a separate problem, though.

There is one subtle issue I would like to see addressed before merging. A worker can currently renew a job it no longer owns. Could you please pass the workerId to renewJobs and make each adapter renew only jobs whose active lease is still owned by that worker?

renewJobs previously checked only that a job was still active (HEXISTS), so a slow-but-alive worker whose job had been recovered and re-acquired by another worker would keep renewing the new owner's lease, preventing recovery from re-delivering it if that owner later died.

Enforce ownership using the worker id the adapter already holds from setWorkerId (as pop does), without changing the renewJobs signature:

- Redis: RENEW_JOBS_SCRIPT skips entries whose workerId doesn't match.
- Knex: renew UPDATE gains a WHERE worker_id clause.
- Fake/memory adapters: record the worker id on pop and filter on it.

Add a cross-worker driver test asserting a worker cannot renew a lease owned by another worker, while the legitimate owner still can.

mattstrayer · 2026-06-28T16:00:33Z

Thanks for the review! Great catch on the ownership issue.

I've fixed it so each adapter only renews leases it still owns. Rather than add a workerId parameter, I'm enforcing ownership via the worker id the adapter already holds from setWorkerId (the same one pop uses to stamp the active entry), so the contract signature stays unchanged and the check lives where ownership is already tracked:

Redis — RENEW_JOBS_SCRIPT now skips any entry whose workerId doesn't match the calling worker.
Knex — the renew UPDATE gained a WHERE worker_id = ? clause.
Fake / in-memory — these previously ignored the worker id entirely; they now record it on pop and filter on it in renewJobs.

Added a cross-worker test (runs against Redis and Postgres/SQLite): worker A acquires a job, worker B's renewJobs returns 0 and leaves the lease untouched, while A's own renewal still succeeds.
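
As a rough illustration of the ownership rule and the cross-worker scenario, here is a self-contained in-memory sketch. `SketchAdapter`, `LeaseStore` and `claim` are hypothetical names, and `renewJobs` is synchronous here purely to keep the example short; the real adapters are not reproduced.

```typescript
// Hypothetical lease record; the real adapters store more than this.
interface Lease {
  workerId: string
  acquiredAt: number
}

// queue name -> job id -> lease, shared by every adapter instance in this sketch
type LeaseStore = Map<string, Map<string, Lease>>

class SketchAdapter {
  #workerId = ''
  readonly #store: LeaseStore
  readonly #now: () => number

  constructor(store: LeaseStore, now: () => number) {
    this.#store = store
    this.#now = now
  }

  setWorkerId(workerId: string): void {
    this.#workerId = workerId
  }

  // Stand-in for pop(): stamps the active entry with the claiming worker's id.
  claim(queue: string, jobId: string): void {
    const active = this.#store.get(queue) ?? new Map<string, Lease>()
    active.set(jobId, { workerId: this.#workerId, acquiredAt: this.#now() })
    this.#store.set(queue, active)
  }

  // Renews only leases that are still active AND still owned by this worker.
  renewJobs(queue: string, jobIds: string[]): number {
    const active = this.#store.get(queue)
    if (active === undefined) return 0
    let renewed = 0
    for (const jobId of jobIds) {
      const lease = active.get(jobId)
      if (lease === undefined || lease.workerId !== this.#workerId) continue
      lease.acquiredAt = this.#now()
      renewed += 1
    }
    return renewed
  }
}

let clock = 1_000
const store: LeaseStore = new Map()
const workerA = new SketchAdapter(store, () => clock)
const workerB = new SketchAdapter(store, () => clock)
workerA.setWorkerId('worker-a')
workerB.setWorkerId('worker-b')

workerA.claim('emails', 'job-1')

clock = 2_000
const renewedByB = workerB.renewJobs('emails', ['job-1']) // B does not own the lease
const afterB = store.get('emails')?.get('job-1')?.acquiredAt

clock = 3_000
const renewedByA = workerA.renewJobs('emails', ['job-1']) // A still owns it
const afterA = store.get('emails')?.get('job-1')?.acquiredAt
```

Keeping the worker id on the adapter instance rather than in the call arguments mirrors the choice described above: the signature stays `renewJobs(queue, jobIds)`.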

On the CPU-bound point — fully agree that it's separate. The heartbeat is a timer on the same event loop as the handlers, so a synchronous CPU-bound job blocks it from firing and there's no way around that within a single thread (BullMQ's lock renewal has the same limitation). The real fix is sandboxed/worker-thread processors so the renewal loop stays responsive while the handler crunches.
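
A small standalone demonstration of that limitation, using nothing from the library: a timer scheduled every 5 ms gets no chance to fire while a synchronous loop holds the thread. The `busyWait` helper is a stand-in for a CPU-bound handler.

```typescript
let ticks = 0

// Stand-in for the heartbeat: it should fire every 5 ms.
const timer = setInterval(() => {
  ticks += 1
}, 5)

// Stand-in for a synchronous, CPU-bound handler.
function busyWait(ms: number): void {
  const end = Date.now() + ms
  while (Date.now() < end) {
    // spin without yielding to the event loop
  }
}

busyWait(50)

// No timer callback can run until the current synchronous code returns,
// so the counter has not moved even though ten intervals have elapsed.
const ticksDuringBusyWait = ticks

clearInterval(timer)
```

An `await` inside the handler would give the timer a chance to run, which is why ordinary I/O-bound jobs are covered by the heartbeat and only long synchronous stretches are not.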

One related thing I noticed while in here: the same ownership check is missing on completeJob / failJob / retryJob (they also only check HEXISTS). It's probably low priority, since that's the ordinary at-least-once trade-off rather than active sabotage of recovery, but I'm happy to tighten those in a follow-up if you'd like.

RomainLanz · 2026-06-28T18:38:08Z

Thanks, this looks good to me!

For completeJob / failJob / retryJob, if that ever happens, it means two workers are processing the same job at the same time, which should never happen in the happy path :D

RomainLanz merged commit 177ebd5 into boringnode:main Jun 28, 2026
3 checks passed

RomainLanz merged 2 commits into boringnode:main from mattstrayer:feat/renew-jobs-heartbeat
