Decouple actor lock TTL from workflow deadline via heartbeat by jjamroga · Pull Request #14 · kagent-dev/substrate

jjamroga · 2026-06-29T19:14:10Z

ActorWorkflow.ResumeActor and SuspendActor used to derive their workflow ctx from the Redis lock TTL via acquireActorLock(ctx, id, 30s, 2s), so the workflow deadline and the lock TTL were a single 28s knob. Image pulls and restores that legitimately need more than 28s therefore death-looped forever, while raising the knob also lengthened how long peers wait before retrying an actor after an ateapi replica crashes.

Split the two concerns:

- Lock TTL stays short (30s constant, internal). Bounds peer failover.
- Workflow deadline is a separate operator-configurable knob via the new --actor-workflow-deadline pflag (default 5m). Bounds a single Resume/Suspend.
- A heartbeat goroutine refreshes the lock every lockTTL/3 (~10s) for the full workflow duration. On RefreshLock=false or any Redis error (peer stole the lock, Redis blip), the workflow ctx is cancelled with errLostActorLock as the cause so in-flight steps unwind cleanly and the mutual-exclusion invariant is preserved.
- The release function stops the heartbeat (waits for goroutine exit) before best-effort ReleaseLock.

Adds store.Interface.RefreshLock with a Redis CAS Lua script mirroring the existing ReleaseLock script.




EItanya force-pushed the main branch 3 times, most recently from f16d3d7 to 70af9af (July 1, 2026 01:50)

jjamroga force-pushed the jjamroga/decouple-workflow-deadline-kagent branch from 93b64cc to 29512dd (July 1, 2026 18:09)

jjamroga wants to merge 1 commit into kagent-dev:main from jjamroga:jjamroga/decouple-workflow-deadline-kagent
