How to investigate a test that does not fail every time

Tags: testing, ci, debugging, ai

How to diagnose flaky CI tests: timing, shared state, randomness, retries, logs, and the smallest reliable repro

What a flaky test is

A flaky test is a test that sometimes passes and sometimes fails without a clear code change explaining the difference.

The cause is usually not just one thing. It often sits at the boundary between:

  • timing or race conditions;
  • shared state between tests;
  • random data or seeds;
  • network, clock, filesystem, or locale dependence;
  • an assertion that checks the wrong thing too rigidly.

If it fails every time, it is probably a regular bug, not a flaky test.

A tiny incident scene

An “order creation” test fails in the nightly pipeline. A rerun passes. The team is tempted to add a retry and move on.

But the log shows the response sometimes arrives 400 ms later than the assertion expects. That is your clue. Do not start by rewriting the test. First figure out whether the problem is timing, state leakage, or randomness.
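One way to test the timing hypothesis without touching the assertion yet is to measure how long the condition actually takes to become true. The sketch below is a minimal probe, not a framework feature: `time_until`, its parameters, and the injectable `clock` and `sleep` arguments are illustrative names chosen for this example.

```python
import time
from typing import Callable, Optional


def time_until(
    predicate: Callable[[], bool],
    timeout_s: float = 2.0,
    interval_s: float = 0.01,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[float]:
    """Return the seconds elapsed until predicate() is true, or None on timeout.

    This measures instead of asserting, so it can be dropped into a failing
    test temporarily to see how late the expected state really arrives.
    """
    start = clock()
    while True:
        if predicate():
            return clock() - start
        if clock() - start >= timeout_s:
            return None
        sleep(interval_s)
```

If the measured values cluster just above the wait the assertion allows, timing is the likely cause. If they do not, look at state leakage or randomness next.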

How to think about it

When a rerun turns green, ask three questions:

  1. Does the same test fail repeatedly if you run it in a loop?
  2. Does it fail only in CI, or also on a local machine?
  3. Does the failure point toward time, order, parallelism, or random input?
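For the second question, it helps to record the same environment facts on both machines and compare them. The following is a minimal sketch using only the Python standard library. The function names are made up for this example, and the `CI` variable is a convention that many CI systems follow, not a guarantee.

```python
import locale
import os
import platform
import time


def environment_fingerprint() -> dict:
    """Collect settings that commonly differ between a laptop and a CI runner."""
    return {
        "python": platform.python_version(),
        "platform": platform.platform(),
        "cpu_count": os.cpu_count(),
        "timezone": time.tzname,
        "tz_env": os.environ.get("TZ"),
        "locale": locale.getlocale(),
        "ci_env": os.environ.get("CI"),  # assumption: set by the CI system
        "hash_seed": os.environ.get("PYTHONHASHSEED"),
    }


def diff_fingerprints(local: dict, ci: dict) -> dict:
    """Return only the keys whose values differ, as (local, ci) pairs."""
    keys = sorted(set(local) | set(ci))
    return {k: (local.get(k), ci.get(k)) for k in keys if local.get(k) != ci.get(k)}
```

Print the fingerprint at the start of the test run in both places, then diff the two outputs. An empty diff does not rule out an environment cause, but a non-empty one tells you where to look first.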

Quick signal-to-cause map:

Symptom                          | What to check                         | Likely cause
---------------------------------+---------------------------------------+---------------------
Fails only sometimes             | Loop runs, reruns, seed               | timing, randomness
Fails only in CI                 | Runner version, timezone, parallelism | environment mismatch
Depends on test order            | Cleanup, leaked state                 | shared state
Times out waiting for UI or API  | Async wait, debounce, delayed events  | timing / race
Breaks with random data          | Seed, factories, generators           | randomness
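The “depends on test order” symptom can be checked directly by running the same tests in every order and seeing which orders fail. The sketch below is a toy illustration in Python: the two checks, the module-level `_cache`, and the helper names are all hypothetical and stand in for whatever state your real tests share.

```python
import itertools

_cache = []  # module-level state shared by both checks (the leak)


def check_add_item():
    _cache.append("order-1")
    assert len(_cache) == 1


def check_cache_starts_empty():
    # Passes only if nothing ran before it, or if state was cleaned up.
    assert _cache == []


def run_in_order(tests, reset_between):
    """Run tests in the given order and return the names of the failing ones."""
    _cache.clear()
    failed = []
    for test in tests:
        if reset_between:
            _cache.clear()  # stands in for a proper setup/teardown step
        try:
            test()
        except AssertionError:
            failed.append(test.__name__)
    return failed


def failures_by_order(tests, reset_between=False):
    """Map each ordering (as a tuple of names) to the tests that failed in it."""
    return {
        tuple(t.__name__ for t in order): run_in_order(order, reset_between)
        for order in itertools.permutations(tests)
    }
```

Without cleanup, `check_cache_starts_empty` fails only when it runs second. With `reset_between=True`, every order passes, which points at leaked state rather than at either check on its own.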

Smallest reliable repro

The goal is not to reproduce the entire pipeline. The goal is to reproduce the failure with as few conditions as possible.

Checklist:

  • capture the test name, branch, commit, and environment;
  • run only that test 20 to 50 times;
  • disable parallelism if you suspect shared state;
  • pin the random seed;
  • keep the full log, not just the last line;
  • compare local output with CI output;
  • check whether a retry is hiding the symptom instead of fixing the cause.
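The loop and seed items in the checklist above can be combined: run the test many times, give each run a known seed, and keep the seeds that fail so they can be replayed. This is a minimal sketch in plain Python, assuming the test takes its random source as an argument. `run_repeatedly` and `check_order_quantity` are hypothetical names, and the second one is a deliberately fragile example test.

```python
import random
from typing import Callable, List


def run_repeatedly(
    test: Callable[[random.Random], None],
    runs: int = 50,
    base_seed: int = 0,
) -> List[int]:
    """Run the test `runs` times with seeds base_seed, base_seed + 1, ...

    Returns the seeds of the failing runs so each failure can be replayed.
    """
    failing_seeds = []
    for i in range(runs):
        seed = base_seed + i
        rng = random.Random(seed)  # isolated generator, no global state
        try:
            test(rng)
        except AssertionError:
            failing_seeds.append(seed)
    return failing_seeds


def check_order_quantity(rng: random.Random) -> None:
    # Hypothetical fragile test: it assumes generated data is never zero.
    quantity = rng.randint(0, 9)
    assert quantity > 0, f"quantity was {quantity}"
```

Calling `run_repeatedly(check_order_quantity, runs=50)` returns the same list of seeds on every invocation. Rerunning with `runs=1` and `base_seed` set to one of those seeds replays that single failure, which is the smallest repro you can hand to someone else.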

When it is not flaky

Sometimes a test only looks flaky because it is exposing a real bug.

It is probably not flaky if:

  • it fails at the same step every time;
  • the rerun behaves exactly the same way;
  • the log points to one consistent root cause;
  • changing the test would only hide broken product behavior.

In that case, the test is not the problem. It is the messenger.
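The distinction can be made mechanical once you have the outcomes of several runs. The sketch below is one possible rule of thumb, not a standard classification: `classify_runs` and its labels are invented for this example, and it assumes you record `None` for a pass and the failure message for a fail.

```python
from typing import List, Optional


def classify_runs(outcomes: List[Optional[str]]) -> str:
    """Classify repeated runs of one test.

    outcomes holds one entry per run: None if the run passed,
    otherwise the failure message from that run.
    """
    if not outcomes:
        raise ValueError("need at least one run")
    failures = [o for o in outcomes if o is not None]
    if not failures:
        return "passing"
    if len(failures) < len(outcomes):
        return "intermittent"  # mixed results: treat as flaky
    if len(set(failures)) == 1:
        return "consistent failure"  # same symptom every time: likely a real bug
    return "always fails, varying symptoms"
```

A result of "consistent failure" matches the list above: same step, same behavior on rerun, one root cause. Only "intermittent" calls for the flaky-test investigation.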

In one line

A flaky test is a sign that the system depends on time, state, or randomness in a way the test does not control.

The right order is: collect symptoms, narrow the cause, build the smallest repro, and only then decide whether to change the test, the code, or both.

Quick checklist

  • Capture the full log from the first failure and the rerun.
  • Check whether the result changes in a loop.
  • Disable parallelism if you suspect shared state.
  • Pin the seed and compare CI with a local run.
  • Do not confuse retry with a real fix.

Diagnose a flaky CI test

You are helping diagnose a flaky CI test.

Context: [framework], [test name], [exact failure], [full log], [whether rerun passes], [parallelism], [random data], [seed], [timeout], [differences between local and CI].

Do three things:

  1. List the 3 most likely causes, most likely first.
  2. Ask for the 3 most useful missing data points.
  3. Propose the smallest reliable repro with minimal code changes.