What are executable specifications?

Runnable specs for critical stateful software.

The short version

An executable spec is a small program that captures the high-level behavior your system must preserve.

It models state, actions, failures, and invariants.
Then it can:

generate scenarios your team would not write by hand
check whether the rules still hold
drive adversarial tests against your system

UML diagrams and Markdown design docs help with discussion. For system verification and software verification, a runnable spec gives you executable evidence: it runs, produces tests, and can go into CI.

Why this matters

Executable specs help when the risk lives in behavior of the system:
ordering, retries, stale reads, crashes, partial failure, and recovery paths.

01

Rare paths cause real incidents

Production failures often happen on paths nobody wrote tests for:
message interleavings, retries, timeouts, stale reads, crashes, and partial failures.
02

AI-generated changes need a reference

AI can produce stateful code faster than reviewers can prove the behavior is intact.
A runnable spec gives humans and agents expected behavior to test against.
03

Reviews need runnable evidence

Important “must never happen” rules become checks you can rerun before a launch, migration, audit, or external review.

Example: a DNS balancer control plane

This example models a DNS load balancer control plane with garbage collection of stale DNS plans.
The full model has several actions; this snippet shows just one of them.

In this kind of system, bugs often come from a sequence of valid actions that creates an invalid state: an old plan is deleted too early, a retry observes stale data, or a recovery path leaves the root DNS record pointing somewhere it should not.

The spec models those actions at the level where the bug can appear, then checks the rule the system must always preserve.

@action(inline=False)
def deployer_gc(c: Context[DnsBalancerState], deployer: Annotated[Expr, str]):
    """Deployer cleans up old plans."""
    s = c.state
    next_plan = s.deployer_next_plan[deployer]
    old_indices = s.planner_updates.keys.filter(
        lambda i: next_plan > i + s.MAX_PLAN_AGE
    )
    old_plan_entries = Set(
        s.planner_updates[i].creates for i in old_indices
    ).flattened

    # avoid unrealistic no-op cleanups
    c.assume(~(old_plan_entries & s.zone_records).is_empty)

    # implementation may do this in a loop;
    # the spec models the abstract effect
    s.zone_records -= old_plan_entries


@invariant
def no_inconsistent_root(s: DnsBalancerState):
    """Root must always point to an existing plan."""
    cnames = SetIf(r.kind == CNAME for r in s.zone_records)
    plans = Set(r.value[0] for r in SetIf(r2.name == ZONE for r2 in cnames))
    return ~((plans & Set(r.name for r in cnames)).is_empty) & (plans.size == 1)

From design intent to executable behavior

The action deployer_gc models the effect of garbage collection: old plan records disappear. The invariant no_inconsistent_root checks the rule the system depends on: the root DNS record must always point to an existing plan. Every generated scenario checks that rule.

What is in an executable spec?

An executable spec is a code-level reference for the behavior you care about. It captures the parts that decide whether the system behaved as expected.

01

State

The information that matters for the behavior under test: plans, versions, balances, leaders, locks, queues, records, or pending work.
02

Actions

The operations that can change that state: requests, retries, failovers, reconciliations, commits, rollbacks, timeouts, or background jobs.
03

Failures and observations

The messy parts that ordinary examples tend to skip: stale reads, delayed messages, duplicate requests, crashes, partial failure, and recovery.
04

System invariants

The rules that must keep holding after every generated scenario. These become executable checks, not comments in a design doc.

How Much Do You Model?

An executable spec is not a second implementation of the whole system. It leaves out details that do not affect the behavior under review.

It is also not a static design document. The spec runs: it can generate scenarios, check invariants, produce counterexamples, and provide expected behavior for tests or review.

One executable model, four uses

Exploration

Generate schedules that people would never write by hand, before the system reaches production.

Test generation

Generate scenarios and expected outcomes, then replay them against APIs, CLIs, simulators, or test harnesses.

Review artifact

Make atomicity, freshness, retries, rollback, and garbage-collection assumptions visible to the team.

Living documentation

Keep a runnable reference for behavior the team depends on.

Who This Is For

1
You own a critical stateful system. A control plane, scheduler, reconciliation loop, database, payment or fiscal workflow, protocol, smart contract, or critical infrastructure.→ Make the critical behavior explicit.
2
A reliability failure would be expensive. A bug that causes financial loss, security exposure, data loss or inconsistency, downtime, missed SLAs, or reputation loss.→ Find risky paths while they are cheap to fix.
3
Existing tests cover known cases. Known examples pass. Rare schedules, stale reads, retries, crashes, partial failure, and recovery paths are still blind spots.→ Generate them instead.
4
Critical invariants are implicit. The "must never happen" properties are scattered across tribal knowledge, design docs, postmortems, review checklists, or AI prompts.→ Turn them into runnable tests.

Common starting points: during design or before launch, when state, ordering, or failure behavior is risky; after an incident; or before a migration, audit, or external review.

Common targets: control planes, schedulers, reconciliation loops, databases, payment flows, protocols, smart contracts, and stateful cloud infrastructure.

Book a 30-minute fit call

Email me with 2-3 sentences about the subsystem, the risk, and the timeline. No preparation is needed.

Based in Vienna, Austria.
Available across Europe and internationally.

I typically respond within one business day.

Email me to book a fit call

Good fit: real testing pain in a critical stateful system, before launch, after an incident, before a migration, or before an external review.
Not a fit: more unit tests, broad code coverage, generic security scanning, or a document nobody will run.
First call: we check fit and a small next step. For a pilot or review, we later agree on scope, deliverables, timeline, and required access.