Decouple actor lock TTL from workflow deadline via heartbeat #14

Open: jjamroga wants to merge 1 commit into kagent-dev:main from jjamroga:jjamroga/decouple-workflow-deadline-kagent

@jjamroga (Collaborator)

ActorWorkflow.ResumeActor and SuspendActor used to derive their workflow ctx from the Redis lock TTL via acquireActorLock(ctx, id, 30s, 2s), so the workflow deadline and the lock TTL were a single 28s knob. As a result, image pulls and restores that legitimately need more than 28s death-looped forever, while raising the knob also increased how long peers wait before retrying an actor after an ateapi replica crashes.

Split the two concerns:

  • Lock TTL stays short (an internal 30s constant). It bounds peer failover time.
  • Workflow deadline is a separate, operator-configurable knob set with the new --actor-workflow-deadline pflag (default 5m). It bounds a single Resume/Suspend.
  • A heartbeat goroutine refreshes the lock every lockTTL/3 (~10s) for the full workflow duration. If RefreshLock returns false or any Redis error occurs (a peer stole the lock, a Redis blip), the workflow ctx is cancelled with errLostActorLock as the cause, so in-flight steps unwind cleanly and the mutual-exclusion invariant is preserved.
  • The release function stops the heartbeat and waits for the goroutine to exit before making a best-effort ReleaseLock call.

Adds store.Interface.RefreshLock with a Redis CAS Lua script mirroring the existing ReleaseLock script.

  • Tests pass
  • Appropriate changes to documentation are included in the PR

@EItanya force-pushed the main branch 3 times, most recently from f16d3d7 to 70af9af (July 1, 2026 01:50)
@jjamroga force-pushed the jjamroga/decouple-workflow-deadline-kagent branch from 93b64cc to 29512dd (July 1, 2026 18:09)