Error sagas in multi-step processes: how to recover from partial failure

Start with an incident you can recognize

Imagine an order checkout process. The system creates a database record, reserves inventory, charges the customer, creates a delivery task, and sends a confirmation email. On a diagram it looks like a straight line: step 1, step 2, step 3, step 4, step 5.

Now picture a real day. The database record is created. The inventory reservation succeeds. The payment service takes too long to respond, and our process receives a timeout. An automatic retry sends the charge request again, but the payment service may already have accepted the first one. The result can be one local order, two payment attempts, one delivery queue entry, and an unclear status for support.

That is a partial failure: some work has already happened, but the process did not reach a clean finish. The main danger is not the failure itself. The danger is that a retry without rules can make the state worse.

Why a simple retry is not enough

Retry is useful when a step changed nothing or can safely recognize a duplicate. For example, if we read data from an API and hit a temporary network error, trying again is usually fine.

Side-effecting steps are different. A side effect means something changed in the world: money was charged, a shipment was created, an email was sent, inventory was reserved, or an event was written to a queue. These actions cannot be repeated blindly, because every repeat may create another real effect.

So each step needs three answers before production:

can this step be repeated without creating a duplicate;
can it be compensated if the next step fails;
does it need manual review because automated recovery would be unsafe.

What a saga means in plain language

A saga is a way to think about a long process as a chain of small actions rather than one giant transaction. For each important action, we describe a reverse action.

Examples:

inventory was reserved — release the reservation;
delivery draft was created — cancel the draft;
payment was charged — trigger a refund or send the case to finance;
email was sent — you cannot “unsend” it, but you can avoid sending a duplicate and explain the state in the next message.

The important detail is order. Recovery runs backward. If the process completed steps 1, 2, and 3, then failed on step 4, compensate step 3 first, then step 2, then step 1. Otherwise you can remove the foundation that a later step still depends on.
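
The backward order can be sketched in a few lines. This is a minimal illustration, not a production runner: the step names, the `run_saga` function, and the log format are all hypothetical, and the sketch assumes that a step which raised an exception had no effect, which a real timeout does not guarantee.

```python
from typing import Callable, List, Tuple

# (name, forward action, compensating action); all names here are hypothetical.
Step = Tuple[str, Callable[[], None], Callable[[], None]]


def run_saga(steps: List[Step]) -> List[str]:
    """Run steps in order; on failure, compensate completed steps in reverse."""
    log: List[str] = []
    completed: List[Step] = []
    for step in steps:
        name, forward, _compensate = step
        try:
            forward()
        except Exception:
            log.append(f"failed:{name}")
            # Walk back from the latest completed step to the first one.
            for done_name, _forward, compensate in reversed(completed):
                compensate()
                log.append(f"compensated:{done_name}")
            return log
        completed.append(step)
        log.append(f"done:{name}")
    return log


def ok() -> None:
    pass


def boom() -> None:
    raise RuntimeError("delivery service unavailable")


example_log = run_saga([
    ("create_order", ok, ok),
    ("reserve_inventory", ok, ok),
    ("charge_payment", ok, ok),
    ("create_delivery", boom, ok),
])
print(example_log)
```

The log shows the three completed steps, the failure on `create_delivery`, and then compensation for `charge_payment`, `reserve_inventory`, and `create_order`, in that order.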

Data that keeps rollback from becoming guesswork

A compensating step must know exactly what it is compensating. It is not enough to store “payment created” or “delivery started”. You need concrete markers:

operation_id: shared identifier for the whole run;
step_id: the step that was executing;
external id: payment, delivery, reservation, or job id;
step status: started, completed, unknown, compensated, needs manual review;
last attempt time and retry count;
short failure reason.

This data is not for pretty logs. It answers the practical question: “Are we allowed to compensate this step now, and if yes, which external object should we touch?”
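
One possible shape for such a record is sketched below. The field names follow the list above, but the class names, the `compensation_target` helper, and the rule it applies (compensate only a completed step with a known external id) are assumptions for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class StepStatus(Enum):
    STARTED = "started"
    COMPLETED = "completed"
    UNKNOWN = "unknown"
    COMPENSATED = "compensated"
    NEEDS_MANUAL_REVIEW = "needs_manual_review"


@dataclass
class StepRecord:
    operation_id: str                      # shared by every step in the run
    step_id: str                           # which step this record describes
    status: StepStatus
    external_id: Optional[str] = None      # payment, delivery, reservation, or job id
    retry_count: int = 0
    last_attempt_at: Optional[str] = None  # timestamp string, for example ISO 8601
    failure_reason: Optional[str] = None


def compensation_target(record: StepRecord) -> Optional[str]:
    """Return the external id to compensate, or None if compensation must not run now."""
    if record.status is StepStatus.COMPLETED and record.external_id:
        return record.external_id
    # Started, unknown, already compensated, or under review: do not touch anything.
    return None
```

A record in the unknown state returns `None` here on purpose: the caller has to check the external service or escalate before anything is rolled back.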

Design the steps before writing handlers

Before implementation, create a simple table. For every step, write down:

what the step does in the forward direction;
which state or external service it changes;
which data must be saved after success;
what the compensating action is;
when compensation must not run;
whether the step supports an idempotency key.

An idempotency key is a stable identifier for retries. If a payment service receives two identical requests with the same key, it should understand that this is one operation repeated, not two separate payments. Not every service supports this, so it must be checked explicitly.
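
To make the behavior concrete, here is a small in-memory stand-in for a payment service that honors idempotency keys. `FakePaymentService` and `make_idempotency_key` are hypothetical names, and deriving the key from `operation_id` and `step_id` is one reasonable convention, not something a particular provider requires.

```python
from typing import Dict


class FakePaymentService:
    """In-memory stand-in for an external service that supports idempotency keys."""

    def __init__(self) -> None:
        self._payment_by_key: Dict[str, str] = {}
        self.charges_made = 0

    def charge(self, idempotency_key: str, amount_cents: int) -> str:
        # A repeated key means "the same operation again": return the first result.
        if idempotency_key in self._payment_by_key:
            return self._payment_by_key[idempotency_key]
        self.charges_made += 1
        payment_id = f"pay_{self.charges_made}"
        self._payment_by_key[idempotency_key] = payment_id
        return payment_id


def make_idempotency_key(operation_id: str, step_id: str) -> str:
    """Build a key that stays the same across retries of one step in one run."""
    return f"{operation_id}:{step_id}"


service = FakePaymentService()
key = make_idempotency_key("op-1001", "charge_payment")
first = service.charge(key, 4999)
second = service.charge(key, 4999)  # simulated retry after a timeout
print(first == second, service.charges_made)
```

The retry returns the same payment id and the charge counter stays at one. A key built from a timestamp or a random value generated per attempt would defeat the purpose, because every retry would look like a new operation.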

Rules that reduce chaos

First rule: one process has one rollback owner. If two workers decide to “fix” the same run at the same time, they can create a new race.
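
A simple way to enforce a single owner is an atomic claim on the run. The sketch below uses an in-memory dictionary as a stand-in; in a real system the same check would have to be an atomic operation in shared storage, such as a conditional update. `RunOwnership` and its methods are hypothetical names.

```python
from typing import Dict


class RunOwnership:
    """In-memory stand-in for an atomic claim on a run's rollback."""

    def __init__(self) -> None:
        self._owners: Dict[str, str] = {}

    def try_claim(self, operation_id: str, worker_id: str) -> bool:
        # setdefault stores worker_id only if no owner exists, then returns the owner.
        current_owner = self._owners.setdefault(operation_id, worker_id)
        return current_owner == worker_id

    def release(self, operation_id: str, worker_id: str) -> None:
        # Only the current owner may release the claim.
        if self._owners.get(operation_id) == worker_id:
            del self._owners[operation_id]


ownership = RunOwnership()
print(ownership.try_claim("op-1001", "worker-a"))  # first claim wins
print(ownership.try_claim("op-1001", "worker-b"))  # second worker must back off
```

A worker that fails to claim the run should do nothing to it. A production version would also need an expiry on the claim so that a crashed owner does not block recovery forever; that part is left out here.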

Second rule: missing step output does not mean the step did not happen. A timeout after an external call often means “we do not know”. Before compensating, check the real state in the external service or move the run into manual review.

Third rule: retries need a limit. Infinite retries can fill queues, create alert noise, and hide the real cause.

Fourth rule: rollback failure is a separate event, not the same failure again. It should be logged and surfaced separately, because after failed compensation the system is often in its riskiest state.
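
The second and third rules can be combined in one small helper. This is a sketch under stated assumptions: `TransientError` is a hypothetical exception meaning "the call definitely did not take effect", and a built-in `TimeoutError` is treated as "we do not know", so the helper stops instead of retrying.

```python
from typing import Callable, Tuple


class TransientError(Exception):
    """Hypothetical error type: the call failed before it could change anything."""


def attempt_step(action: Callable[[], object], max_attempts: int = 3) -> Tuple[str, int]:
    """Return (status, attempts); status is 'completed', 'unknown', or 'failed'."""
    attempts = 0
    while attempts < max_attempts:
        attempts += 1
        try:
            action()
            return ("completed", attempts)
        except TimeoutError:
            # The external service may have done the work. Do not guess, do not retry.
            return ("unknown", attempts)
        except TransientError:
            # Safe to try again, but only until the limit is reached.
            continue
    return ("failed", attempts)
```

A caller would persist the returned status on the step record. An unknown result then leads to a state check against the external service or to manual review, never straight to compensation.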

When to retry, compensate, or stop

The rule of thumb is simple:

Retry the step if it has no side effect or is guaranteed to be idempotent.
Run compensation if the step already changed state and has a tested reverse path.
Stop and hand it to a person if the effect is irreversible, financially or legally sensitive, or the external service state is unknown.

For example, retrying a user profile read is safe. Retrying a payment charge without an idempotency key is not. Canceling a delivery draft can be automatic if you have its id. An already sent email needs a different approach: you cannot roll it back, but you can prevent duplicates and explain the state correctly to the customer.
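
The rule of thumb can be written down as a small decision function. The names `StepFacts`, `Action`, and `decide` are hypothetical, and the order of the checks (retry first, then manual review, then compensation) is one reasonable reading of the rule, not the only one.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    RETRY = "retry"
    COMPENSATE = "compensate"
    MANUAL_REVIEW = "manual_review"


@dataclass
class StepFacts:
    has_side_effect: bool
    idempotent: bool    # safe to repeat, for example via an idempotency key
    state_known: bool   # we know what the external service actually did
    reversible: bool    # a tested compensating action exists
    sensitive: bool     # financially or legally sensitive


def decide(facts: StepFacts) -> Action:
    if not facts.has_side_effect or facts.idempotent:
        return Action.RETRY
    if not facts.state_known or not facts.reversible or facts.sensitive:
        return Action.MANUAL_REVIEW
    return Action.COMPENSATE


profile_read = StepFacts(False, True, True, True, False)
charge_without_key = StepFacts(True, False, False, True, True)
delivery_draft = StepFacts(True, False, True, True, False)
print(decide(profile_read), decide(charge_without_key), decide(delivery_draft))
```

With these inputs the profile read is retried, the payment charge without an idempotency key goes to manual review, and the delivery draft is compensated automatically.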

What to test before production

A successful happy path proves very little. These processes need partial-failure tests:

the process fails after each individual step;
the process fails after an external call but before writing the result to the database;
the external service times out even though it may have completed the action;
a retry arrives with the same operation_id;
two workers pick up the same run at the same time;
the compensating step itself fails.

For every test, define the expected final state: what remains active, what was compensated, what waits for manual review, and which alert should fire.
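
A fault-injection harness for the first and last cases in the list might look like the sketch below. Everything in it is hypothetical: the step names, the status strings, and the choice to stop the rollback when a compensation fails, so that earlier steps are left for a person to review rather than undone on top of a broken state.

```python
from typing import Dict, List, Optional

STEPS = ["create_order", "reserve_inventory", "charge_payment", "create_delivery"]


def run_with_fault(fail_at: Optional[str],
                   broken_compensation: Optional[str] = None) -> Dict[str, str]:
    """Simulate the process failing at `fail_at`; return the final status per step."""
    state: Dict[str, str] = {step: "not_started" for step in STEPS}
    completed: List[str] = []
    for step in STEPS:
        if step == fail_at:
            state[step] = "failed"
            break
        state[step] = "completed"
        completed.append(step)
    else:
        return state  # no failure injected: happy path
    for step in reversed(completed):
        if step == broken_compensation:
            # Rollback failure is its own event: mark it and stop.
            state[step] = "needs_manual_review"
            break
        state[step] = "compensated"
    return state


# Failure at each individual step: nothing may be left in the completed state.
for failing_step in STEPS:
    final_state = run_with_fault(failing_step)
    assert "completed" not in final_state.values(), final_state

# The compensating step itself fails: the run must end up in manual review.
stuck = run_with_fault("create_delivery", broken_compensation="reserve_inventory")
assert stuck["reserve_inventory"] == "needs_manual_review"
assert stuck["create_order"] == "completed"  # untouched, waiting for a person
```

The assertions are the expected final states written down as code. The remaining cases from the list (crash before persistence, timeout, duplicate operation_id, two workers) need the real storage and the real client, so they are not simulated here.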

Where Cloudflare Workflows fits

Cloudflare described rollback for Workflows as a platform feature, but the idea is broader than one provider. It is useful anywhere long-running processes have multiple states: queues, background workers, payment integrations, deployment pipelines, and order processing.

The point is not the tool name. The point is the discipline: every step has explicit state, a saved result, a retry rule, and a compensation rule. Then a failure stops being a late-night manual investigation and becomes a controlled recovery scenario.

Sources

https://blog.cloudflare.com/rollbacks-for-workflows/

Quick checklist

Split the process into steps and record which state each step changes.
For every external action, add a compensating step or clearly mark it as manual recovery.
Persist the result of every critical step before moving forward.
Test failure after a step, retry after partial success, timeout, and rollback failure.

Plan Recovery for a Multi-Step Process After Partial Failure

Your task is to help a team prepare a safe recovery plan for a multi-step process.

Inputs:

1) List of process steps (up to 20), each with a stable identifier.
2) For each step: what it changes (database, external API, queue, email, payment, reservation).
3) Whether the external call can be safely repeated with the same idempotency key (yes/no/unknown).
4) Where process state is stored today: database, key-value store, jobs table, event log, or state machine.
5) What observability exists: logs, metrics, traces, dead-letter queue, alerts.

Produce:

- A step table: identifier, forward action, compensating action, rollback condition, data to persist, success and failure signals.
- Reverse-order rollback sequence: from the latest successful step back to the first.
- Policy for rollback failure: what to retry, what to move into manual review, who to alert.
- Decision matrix: when to retry, when to run compensation, when to stop for manual review.
- Partial-failure tests:
  - failure after each step;
  - crash before state persistence;
  - timeout during an external call;
  - retry after partial success;
  - two parallel runs with the same operation_id.
- Production readiness checklist: at least 8 items, split into engineering and operations.

Be concrete: rules, tables, example states. Do not stop at generic advice like "check the docs".