What to do when a process fails halfway

Recovery plan for a multi-step process that fails halfway

The problem is simpler than the terminology

Imagine a common online store flow:

  1. create an order;
  2. reserve stock;
  3. charge or authorize payment;
  4. send a confirmation to the customer.

Now suppose the payment call in step 3 times out. The order already exists. Stock may already be reserved. Payment is unclear: the bank may have accepted the request, or it may not. The customer has not received a clean confirmation.

This article is about that situation: how to describe recovery actions before the failure happens, so the team does not investigate every partial failure from scratch.

Main idea: every step needs a cleanup plan

When a process has several steps, a simple try/catch is often not enough. The code can see that an error happened, but it may not know what already happened in an external system.

For every step, write down three things:

  • what the step changes;
  • how to check whether the change really happened;
  • what to do if a later step fails.

That is the practical rollback plan. You do not need to start with the word saga. The team needs a simpler question first: “how do we move this process back into a known state?”
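The three items above can be written down as one small record per step. This is a minimal sketch in Python, not a prescribed design: `StepPlan`, `missing_plans`, and the step names are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class StepPlan:
    """Rollback plan for one step. All names here are illustrative."""
    name: str
    changes: str                           # what the step changes
    check: Callable[[], bool]              # True if the change really happened
    on_later_failure: Callable[[], None]   # what to do if a later step fails


def missing_plans(step_names: List[str], plans: List[StepPlan]) -> List[str]:
    """Return the steps that have no written plan yet, in process order."""
    planned = {plan.name for plan in plans}
    return [name for name in step_names if name not in planned]


steps = ["create_order", "reserve_stock", "charge_payment", "send_confirmation"]
plans = [
    StepPlan(
        name="reserve_stock",
        changes="warehouse availability was reduced",
        check=lambda: True,             # placeholder: look up the reservation
        on_later_failure=lambda: None,  # placeholder: release the reservation
    ),
]
unplanned = missing_plans(steps, plans)
```

A helper like `missing_plans` makes the gaps visible: any step it returns is a step the team would still have to investigate from scratch.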

Example: what to record for an order flow

For the “reserve stock” step, the plan could be:

  • change: warehouse availability was reduced;
  • check: look up the reservation by order_id or reservation_id;
  • recovery action: release the reservation if payment is not confirmed;
  • important rule: do not release the same reservation twice.
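The "do not release twice" rule can be enforced in code rather than by convention. The sketch below assumes an in-memory `Warehouse` class invented for this example; a real stock service would need the same guard on its own storage.

```python
class Warehouse:
    """In-memory stand-in for a stock service (hypothetical, for illustration)."""

    def __init__(self, available):
        self.available = available
        self.reservations = {}  # reservation_id -> reserved quantity

    def reserve(self, reservation_id, quantity):
        """Reduce availability once per reservation_id."""
        if reservation_id in self.reservations:
            return  # already reserved, a retry must not reserve again
        if quantity > self.available:
            raise ValueError("not enough stock")
        self.available -= quantity
        self.reservations[reservation_id] = quantity

    def release(self, reservation_id):
        """Release exactly this reservation. Return False if there is nothing to release."""
        quantity = self.reservations.pop(reservation_id, None)
        if quantity is None:
            return False  # unknown or already released: change nothing
        self.available += quantity
        return True


warehouse = Warehouse(available=10)
warehouse.reserve("res-1", 3)
first_release = warehouse.release("res-1")
second_release = warehouse.release("res-1")  # a repeated recovery run
```

Because `release` removes the reservation before returning stock, running recovery twice cannot inflate availability.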

For the “payment” step, the plan is different:

  • change: the bank may have created a payment or authorization;
  • check: ask the payment provider for status by external payment_id;
  • recovery action: cancel the authorization or create a manual review task;
  • important rule: if status is unknown, do not cancel blindly.
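The payment plan above boils down to a decision keyed on the status the provider reports. A minimal sketch follows; the status values and action names are assumptions for this example and do not correspond to any specific payment provider's API.

```python
from enum import Enum


class PaymentStatus(Enum):
    """Hypothetical result of a status lookup by external payment_id."""
    NOT_FOUND = "not_found"      # the provider has no such payment
    AUTHORIZED = "authorized"    # the provider holds an authorization
    UNKNOWN = "unknown"          # lookup failed or gave no clear answer


def payment_recovery_action(status):
    """Map a verified payment status to a recovery action."""
    if status is PaymentStatus.NOT_FOUND:
        return "nothing_to_cancel"
    if status is PaymentStatus.AUTHORIZED:
        return "cancel_authorization"
    # Unknown state: never cancel blindly, hand it to a human.
    return "create_manual_review_task"
```

The fall-through branch is deliberate: any status the code does not recognize ends in manual review instead of an automatic cancel.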

Now the process is not hidden magic inside code. It becomes a table of understandable decisions.

The dangerous case: unknown state

The hardest case is not a clear failure. The hardest case is a timeout.

For example, the payment service did not respond. You have three possible states:

  • payment definitely was not created;
  • payment definitely was created;
  • unknown.

Automation can handle the first two cases. The third needs a separate state, for example SUSPECT or “needs review”. That means: stop the process, do not run later steps, and show the team a specific review task.

This is not weak automation. It is a guardrail that prevents a bigger mistake.
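One way to make the third state explicit is to classify the payment call right where it happens. The sketch below is an illustration under assumptions: `PaymentDeclined`, `charge_step`, and `run_after_payment` are made-up names, and the payment client is assumed to raise Python's built-in `TimeoutError` when it gets no response.

```python
class PaymentDeclined(Exception):
    """Hypothetical error meaning the provider clearly rejected the payment."""


def charge_step(charge, order_id, review_tasks):
    """Run the payment call and classify the result into three states."""
    try:
        charge(order_id)
    except PaymentDeclined:
        return "NOT_CREATED"
    except TimeoutError:
        # No answer: the payment may or may not exist. Do not guess.
        review_tasks.append(f"Check payment status for order {order_id}")
        return "SUSPECT"
    return "CREATED"


def run_after_payment(charge, send_confirmation, order_id, review_tasks):
    """Send the confirmation only when the payment is known to exist."""
    state = charge_step(charge, order_id, review_tasks)
    if state == "CREATED":
        send_confirmation(order_id)
    return state
```

With a `charge` function that times out, `run_after_payment` returns "SUSPECT", sends no confirmation, and leaves one concrete task in `review_tasks`.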

Why rollback usually runs backwards

If a process completed several steps, recovery often needs to happen in reverse order.

Undo what happened last, then move backward. It is like leaving a workspace: if you opened a drawer, took out a box, and then spread tools across the table, you usually clean the last action first.

For a technical process, that means:

  1. if no confirmation was sent, nothing needs to be undone there;
  2. if payment is unclear, check payment status;
  3. if there is no payment, release reserved stock;
  4. if the order no longer makes sense, mark it as canceled or needs review.

The key is to avoid one large “cleanup everything” script. It becomes risky very quickly.
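Instead of one large script, recovery can walk a log of completed steps backwards and run only the small action registered for each step. A minimal sketch, assuming a hypothetical `compensate` helper and made-up step names:

```python
def compensate(completed_steps, recovery_actions):
    """Run the recovery action of each completed step, newest first.

    `completed_steps` is the log of steps that actually finished.
    `recovery_actions` maps a step name to a zero-argument callable.
    Returns the names of the steps that were recovered, in the order run.
    """
    recovered = []
    for step in reversed(completed_steps):
        action = recovery_actions.get(step)
        if action is None:
            continue  # nothing to undo for this step
        action()
        recovered.append(step)
    return recovered


calls = []
actions = {
    "create_order": lambda: calls.append("mark order as needs review"),
    "reserve_stock": lambda: calls.append("release reservation"),
}
# Payment failed, so only the first two steps are in the log.
log = ["create_order", "reserve_stock"]
recovered = compensate(log, actions)
```

Here the reservation is released before the order is touched, mirroring the reverse order described above. The log is what keeps the function from undoing steps that never ran.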

Where saga and compensation fit

Now the terms become easier.

A saga is an approach where a long process is split into steps, and every step has a recovery action for later failures.

Compensation is not always a perfect undo. Sometimes you cannot restore the exact previous state. Compensation means moving the system into an acceptable business state. For example:

  • do not delete the order; mark it as “needs review”;
  • do not retry payment blindly; check bank status first;
  • do not edit stock directly; release the exact reservation.

So this is not an academic topic. It is a way to reduce manual cleanup after partial failures.

What to do in one sprint

Do not rebuild the whole system immediately. Start with one painful process.

  1. Write the real steps in order.
  2. For each step, record what it changes in external systems.
  3. Add a check that proves whether the action happened.
  4. Add a recovery action: roll back, stop, or send to a human.
  5. On staging, deliberately break a middle step and verify that the system does not create duplicates.
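The last item in the list can start as a very small test. The sketch below uses an in-memory `FakeBackend` invented for this example instead of a real staging environment; the `fail_at` parameter is a hypothetical hook for breaking a chosen step.

```python
class FakeBackend:
    """In-memory stand-in for the order and stock systems (hypothetical)."""

    def __init__(self):
        self.orders = []
        self.reservations = []

    def create_order(self, order_id):
        if order_id not in self.orders:        # guard against duplicates on retry
            self.orders.append(order_id)

    def reserve_stock(self, order_id):
        if order_id not in self.reservations:  # same guard for reservations
            self.reservations.append(order_id)

    def charge_payment(self, order_id):
        pass  # payment is out of scope for this sketch


def run_flow(backend, order_id, fail_at=None):
    """Run the steps in order; raise at `fail_at` to simulate a broken step."""
    for name in ("create_order", "reserve_stock", "charge_payment"):
        if name == fail_at:
            raise RuntimeError(f"injected failure at {name}")
        getattr(backend, name)(order_id)


def retry_after_injected_failure():
    """Break the payment step once, then retry the whole flow."""
    backend = FakeBackend()
    try:
        run_flow(backend, "order-1", fail_at="charge_payment")
    except RuntimeError:
        pass  # expected: the middle step was broken on purpose
    run_flow(backend, "order-1")
    return backend
```

After the retry the backend should hold exactly one order and one reservation. If the guards in `create_order` or `reserve_stock` are removed, the same check fails, which is the kind of regression this test is meant to catch.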

After that, the conversation is no longer about “saga workflow chaos”. It is about a practical question: if a process fails halfway, does the team already know what to do?

Anti-patterns

  • One global cleanup script for every situation.
  • Retrying payment without checking whether the first request succeeded.
  • Rollback without a log of completed steps.
  • Automatic cancellation when state is unknown.
  • No separate “needs review” status.

Quick checklist

  • Write all process steps in their real execution order.
  • For every step, record what it changes outside the main system.
  • Add a simple check that proves whether the step really happened.
  • Define the recovery action: rollback automatically, send to manual review, or stop the process.

Create a simple recovery plan for a failed multi-step process

Input:

  • Name of the process that sometimes fails halfway
  • Up to 8 ordered steps
  • What each step changes in external systems
  • Data needed to verify each step result
  • What the team currently fixes manually

Task:

  1. Explain the process in plain language.
  2. For each step, define what should happen after a later failure.
  3. Mark steps where the state can become unknown.
  4. Add a verification check after every rollback action.
  5. Create a small staging test plan.

Output:

  • A short team runbook
  • A step list with:
      • step
      • what changed
      • how to check state
      • what to do after failure
      • when a human is needed