Durable execution
Durable execution is the property that makes Catalyst workflows and agents resilient to crashes, restarts, deploys, and infrastructure failures. Every step a workflow takes is persisted, so when a process dies mid-flight, the execution resumes from the last successful step rather than starting from scratch. This page explains the mechanics — what gets persisted, how state is reconstructed, and what rules your orchestrator code has to follow.
This model is inherited from Dapr Workflow, the open-source runtime Catalyst is built on. The semantics described here apply equally to workflows you author directly (see Develop workflows) and to AI agents running on Catalyst (see AI agents) — agents are workflows underneath.
What "durable" actually means
A durable workflow can crash at any point — process kill, pod eviction, region failover, deploy rollout — and on restart it resumes execution from exactly where it stopped. No partial work is repeated, no in-flight state is lost, no external side effects are duplicated. From the application's perspective the crash is invisible: code that called ctx.call_activity(...) before the crash sees the same return value after the restart and continues to the next line.
This isn't checkpointing in the snapshot sense — Catalyst does not pause your code to serialise the call stack. It's a much more powerful property, achieved through event sourcing and deterministic replay.
Event sourcing: the history log
Every workflow instance has a history: an append-only log of every event that has happened to it. Events look like:
- WorkflowStarted(input=...)
- ActivityScheduled(name="charge_card", input=...)
- ActivityCompleted(name="charge_card", output=...)
- TimerCreated(fires_at=...)
- TimerFired(...)
- ExternalEventReceived(name="human_approval", payload=...)
- WorkflowCompleted(output=...)
The history is written to a durable store (managed by Catalyst) every time the workflow code reaches an await point — that is, every time it pauses to wait for an activity, a timer, or an external event. The history is the workflow's only source of truth. There is no snapshot, no serialised stack, no pickled state — just the log of what has already happened.
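As an illustration, here is a minimal sketch of how an orchestrator's await points map onto history events. The names and payloads are hypothetical and registration with the workflow runtime is omitted; the style follows the Python SDK used in the examples on this page.

```python
from datetime import timedelta
import dapr.ext.workflow as wf

def charge_card(ctx: wf.WorkflowActivityContext, order: dict) -> dict:
    # Real I/O (the payment API call) lives here; stubbed for illustration.
    return {"status": "charged", "order_id": order["id"]}

def order_workflow(ctx: wf.DaprWorkflowContext, order: dict):
    # History so far: WorkflowStarted(input=order)
    payment = yield ctx.call_activity(charge_card, input=order)
    # Appends ActivityScheduled(name="charge_card"), then ActivityCompleted(output=payment)

    yield ctx.create_timer(timedelta(hours=1))
    # Appends TimerCreated(fires_at=...); TimerFired(...) is appended an hour later

    approval = yield ctx.wait_for_external_event("human_approval")
    # Appends ExternalEventReceived(name="human_approval", payload=approval)

    return {"payment": payment, "approved": approval}
    # Appends WorkflowCompleted(output=...)
```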
Replay: reconstructing state without snapshots
When a workflow needs to resume — whether after a normal await, a process crash, or a region failover — the runtime does not restore a snapshot of variables. Instead, it re-runs the orchestrator function from the top, replaying the history event by event:
- The orchestrator starts executing from line 1.
- When it reaches a call like result = yield ctx.call_activity(charge_card, order), the runtime checks the history.
- If the history shows ActivityCompleted(name="charge_card", output={...}), the runtime returns that recorded output immediately — the activity is not re-executed.
- The orchestrator continues to the next await, and the same replay-check happens.
- Eventually the orchestrator catches up to the end of history and either pauses again (to await the next event) or returns (completing the workflow).
The implication is profound: a workflow that has been running for 30 days, has called 200 activities, and has waited on 50 timers can be resumed on a fresh process in milliseconds — by replaying the history through your orchestrator code until execution catches up to the present.
The local variables in your orchestrator function (order_id, customer, accumulated results, etc.) are reconstructed naturally as a side effect of replay. You did not have to ask Catalyst to persist them — they're just whatever your code computes from the inputs and the activity results, both of which are recorded.
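To make that concrete, here is a small sketch (hypothetical names) in which the running total is never persisted explicitly; replay recomputes it from the recorded activity outputs.

```python
import dapr.ext.workflow as wf

def fetch_invoice(ctx: wf.WorkflowActivityContext, customer_id: str) -> float:
    # Stub for a real billing-API call; the returned value is recorded in history.
    return 42.0

def billing_workflow(ctx: wf.DaprWorkflowContext, customer_ids: list):
    total = 0.0  # ordinary local variable, never persisted anywhere
    for customer_id in customer_ids:
        amount = yield ctx.call_activity(fetch_invoice, input=customer_id)
        # On replay, `amount` comes straight from history, so `total` is
        # rebuilt to exactly the same value without re-running the activity.
        total += amount
    return {"customers": len(customer_ids), "total": total}
```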
The determinism requirement
For replay to produce the same state as the original execution, the orchestrator function must be deterministic: given the same history, it must always take the same path. This is the single most important rule of durable execution, and it has direct consequences for the code you write inside an orchestrator.
What you CANNOT do directly in orchestrator code
- Random numbers or UUIDs. random.random(), uuid.uuid4(), etc., produce a different value every call.
- Current time. datetime.now(), time.time(), etc., move forward.
- I/O. Network calls, file reads, environment variables, HTTP clients — anything that depends on the outside world.
- Blocking primitives. time.sleep, Thread.sleep, await asyncio.sleep outside the workflow context.
Each of these would produce a different result on replay than on first execution, which would silently corrupt the workflow's state.
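For example, an orchestrator like the following sketch (hypothetical names) would take a different path on replay than it did on first execution:

```python
# DON'T: every line below produces a different result when the orchestrator is replayed.
import random
import time
import uuid
from datetime import datetime

import requests  # any HTTP client has the same problem

def broken_workflow(ctx, order: dict):
    attempt_id = str(uuid.uuid4())    # new value on every replay
    discount = random.random()        # ditto
    started_at = datetime.now()       # wall clock keeps moving between executions
    price = requests.get("https://example.invalid/price").json()  # I/O: answer can change
    time.sleep(30)                    # blocks the replay loop and persists nothing
    ...
```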
What you CAN do
- Plain in-memory computation. Arithmetic, string manipulation, looping over recorded activity results, building structured outputs.
- Activities. Anything non-deterministic — every line in the "cannot" list above — goes inside an activity. Activities run outside the replay loop; their result is recorded in history and replayed back to the orchestrator deterministically.
- Durable timers. Use ctx.create_timer(timedelta(...)) instead of time.sleep. The timer is persisted; it survives crashes and replays as a recorded event.
- External events. Use ctx.wait_for_external_event(...) to pause the workflow until something outside signals it (a webhook, a CLI command, a UI action).
- The workflow's recorded clock. ctx.current_utc_datetime returns a deterministic timestamp taken from the history rather than the wall clock, so it yields the same value on replay as on the original execution.
The deterministic-replay rule is what makes long-running, fault-tolerant workflows possible at all. Once you put every non-deterministic operation inside an activity, the orchestrator becomes a pure function from history to outcome — and pure functions can be replayed safely forever.
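Rewriting the broken sketch above along these lines might look like this (hypothetical names, registration omitted): every non-deterministic call moves behind an activity or a deterministic workflow primitive.

```python
from datetime import timedelta
import dapr.ext.workflow as wf

def fetch_price(ctx: wf.WorkflowActivityContext, sku: str) -> float:
    # I/O, randomness, and wall-clock reads are all fine inside an activity:
    # the result is recorded in history and replayed to the orchestrator as-is.
    return 19.99  # stub for a real pricing API call

def safe_workflow(ctx: wf.DaprWorkflowContext, order: dict):
    started_at = ctx.current_utc_datetime      # deterministic timestamp, same on replay
    price = yield ctx.call_activity(fetch_price, input=order["sku"])

    yield ctx.create_timer(timedelta(days=1))  # durable timer instead of time.sleep

    approval = yield ctx.wait_for_external_event("human_approval")
    return {"started_at": started_at.isoformat(),
            "price": price,
            "approved": approval}
```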
Activities: the escape hatch for the real world
Activities are where the workflow touches reality. They run in your application process — same code, same runtime as your orchestrator — but each activity invocation is treated as an atomic, recorded unit. The runtime:
1. Schedules the activity (records ActivityScheduled in history).
2. Invokes your activity function (which can call any API, do any I/O, take any time).
3. Records the result in history (ActivityCompleted or ActivityFailed).
4. Resumes the orchestrator, returning the recorded result.
If the process crashes between steps 2 and 3, the runtime will retry the activity when the workflow resumes. This means activities must be idempotent for any operation with external side effects — if an activity sends an email or charges a card, a second invocation must not duplicate the action. Common approaches: idempotency keys derived from the workflow instance ID, conditional writes, or "check-then-act" with an idempotent receiver.
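One common shape for this, sketched below with hypothetical names and a stubbed payment client, is to derive the idempotency key from the workflow instance ID so the key is identical on every retry:

```python
import dapr.ext.workflow as wf

def charge_card(ctx: wf.WorkflowActivityContext, payload: dict) -> dict:
    # The key is stable across retries of this activity, so the payment
    # provider can de-duplicate repeated charge requests. The real API call
    # would pass payload["idempotency_key"] through; it is stubbed here.
    return {"charged": payload["amount"], "key": payload["idempotency_key"]}

def order_workflow(ctx: wf.DaprWorkflowContext, order: dict):
    payload = {
        "amount": order["amount"],
        # Instance ID plus step name: the same value on the first attempt,
        # on a crash-triggered retry, and on replay.
        "idempotency_key": f"{ctx.instance_id}-charge_card",
    }
    receipt = yield ctx.call_activity(charge_card, input=payload)
    return receipt
```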
Durable timers and external events
ctx.create_timer(timedelta(days=7)) does not block a thread or hold a connection. It writes a TimerCreated event to history with a wake-up time, then the workflow pauses entirely — the process can scale down, redeploy, or crash. When the wake-up time arrives, the runtime schedules the workflow to resume; replay walks the history (including the TimerFired event) and execution continues. The same model applies to wait_for_external_event — the workflow sleeps until something raises the named event.
This is how a Catalyst workflow can durably wait for a human approval over the course of a week, polling no resources and consuming no compute while waiting, then resume the moment the approval arrives. See the human-in-the-loop example in Workflow patterns.
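A sketch of that human-approval wait, racing the external event against a durable one-week timeout (hypothetical names; when_any is assumed from the underlying SDK):

```python
from datetime import timedelta
import dapr.ext.workflow as wf

def approval_workflow(ctx: wf.DaprWorkflowContext, request: dict):
    approval = ctx.wait_for_external_event("human_approval")
    timeout = ctx.create_timer(timedelta(days=7))

    # Race the approval event against the timer. Nothing polls and no
    # compute is consumed while the workflow waits.
    winner = yield wf.when_any([approval, timeout])

    if winner == approval:
        return {"approved": True, "decision": approval.get_result()}
    return {"approved": False, "reason": "timed out after 7 days"}
```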
Continue-as-new: bounding history
For workflows that run forever — monitor loops, long-running agents, scheduled poll-and-act patterns — the history would grow unboundedly. The fix is continue_as_new: the workflow ends itself and starts a fresh instance with a new (small) input, truncating the history. From the caller's perspective the workflow keeps running; under the hood the runtime is rotating instances to bound storage. Use it for any workflow that loops forever.
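A minimal monitor-style sketch (hypothetical names, registration omitted):

```python
from datetime import timedelta
import dapr.ext.workflow as wf

def check_health(ctx: wf.WorkflowActivityContext, endpoint: str) -> bool:
    # Stub: a real implementation would probe the endpoint.
    return True

def monitor_workflow(ctx: wf.DaprWorkflowContext, endpoint: str):
    healthy = yield ctx.call_activity(check_health, input=endpoint)
    # ... react to `healthy` via further activities ...

    yield ctx.create_timer(timedelta(minutes=5))

    # End this instance and start a fresh one with the same small input,
    # so history never accumulates across iterations.
    ctx.continue_as_new(endpoint)
```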
What this means in practice
The durable-execution model has three practical implications you should internalise before writing your first workflow:
- Anything that touches the outside world goes in an activity. When in doubt, wrap it.
- The orchestrator function will run many times. Print statements, log lines, and side effects in the orchestrator fire on every replay — move logging into activities, or guard it so each message is emitted only once (see the sketch after this list).
- Code changes need to be replay-safe. A workflow that's been running for a week against version N of your code might replay against version N+1 after a deploy. See Workflow versioning for the safe-change rules and migration strategies.
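If you do log directly from the orchestrator, one option is to suppress output during replay. This sketch assumes the workflow context exposes an is_replaying flag, as the underlying Dapr Workflow SDKs do; the other names are hypothetical.

```python
import logging
import dapr.ext.workflow as wf

logger = logging.getLogger("orders")

def process_order(ctx: wf.WorkflowActivityContext, order: dict) -> dict:
    # Stub for the real work.
    return {"status": "done", "order_id": order["id"]}

def order_workflow(ctx: wf.DaprWorkflowContext, order: dict):
    if not ctx.is_replaying:
        # Emitted only when this point is reached for the first time,
        # not on every subsequent replay.
        logger.info("processing order %s", order["id"])
    result = yield ctx.call_activity(process_order, input=order)
    return result
```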
See also
- Workflow concept — overview of Catalyst Workflows
- Workflow patterns — chaining, fan-out/fan-in, monitor, async HTTP, external events, compensation
- Workflow versioning — evolving workflow code without breaking in-flight instances
- AI agent patterns — applying these mechanics to LLM-driven agents
- Develop workflows — SDK guides in .NET, Go, Java, JavaScript, Python
- Operate workflows — inspecting running workflows in the console
- Dapr Workflow documentation — the open-source runtime underneath