
Ben Abbitt

11 posts by Ben Abbitt

direnv: Scoping Secrets for Polyglot, Poly-Project Development

Stop putting secrets in your shell profile. If you work across multiple cloud projects, languages, and infrastructure stacks concurrently, direnv scopes your environment per directory—so the right credentials are active for the right project, and nothing leaks sideways.

This is a short post about a small tool that eliminates an entire class of “wrong account” mistakes.

If you work on one project with one cloud account, environment variables in .zshrc are fine. Export your tokens, source your profile, move on.

That stops working when you’re a polyglot developer juggling multiple projects with different cloud providers, different accounts, different API keys, and different infrastructure stacks. The .zshrc approach gives you a flat namespace:

# .zshrc — which project are these for?
export GITHUB_TOKEN="ghp_..."
export DATABASE_URL="postgres://..."
export SUPABASE_ANON_KEY="eyJ..."
export TF_VAR_project="my-gcp-project"
export NEON_API_KEY="..."
export PORKBUN_API_KEY="..."

Every shell session sees every secret. Switch from Project A to Project B? Same DATABASE_URL. Same TF_VAR_project. Same everything. You’re one terraform apply away from modifying the wrong infrastructure because you forgot which project’s credentials are loaded.

This isn’t hypothetical. If you’ve ever run a command against the wrong cloud project, you know the feeling.

direnv loads and unloads environment variables based on your current directory. You put an .envrc file in a project directory, and when you cd into it, those variables are set. When you cd out, they’re unset.

# ~/workspace/travel/infra/.envrc
export TF_VAR_project="travel-prod"
export TF_VAR_db_password="$(bw get password 'travel-db')"
export SUPABASE_ANON_KEY="eyJ..."

# ~/workspace/angzarr/core/.envrc
export DATABASE_URL="postgres://localhost:5432/angzarr_dev"
export RUST_LOG="info"

cd ~/workspace/travel/infra — travel credentials loaded. cd ~/workspace/angzarr/core — travel credentials gone, Angzarr credentials loaded. No manual sourcing, no remembering which project you’re in, no stale variables from the last directory.

Install direnv, hook it into your shell, and trust your .envrc files:

# Install (Debian/Ubuntu)
sudo apt install direnv
# Hook into zsh (~/.zshrc)
eval "$(direnv hook zsh)"
# Create a project .envrc
echo 'export MY_VAR="value"' > ~/workspace/my-project/.envrc
# Trust it (required on first use or after changes)
direnv allow ~/workspace/my-project

That’s it. There’s no daemon, no config file, no service to manage.

The value scales with the number of projects you maintain. With two projects, you might remember which credentials are active. With five—each with its own cloud provider, database, API keys, and Terraform state—you won’t.

direnv gives you:

Isolation by default. Project A’s credentials don’t exist in Project B’s shell. You can’t accidentally terraform apply against the wrong account because the wrong account’s variables aren’t set.

Secret scoping. API keys, database URLs, and cloud credentials live next to the project that uses them, not in a global profile that every project inherits. Add .envrc to .gitignore and secrets stay local.

Composability. .envrc files can source other files, call CLI tools (like bw for Bitwarden or gcloud for GCP tokens), and inherit from parent directories. A team-level .envrc can set shared defaults while a project-level one overrides specifics.

Onboarding simplification. “Clone the repo, create an .envrc with these variables, run direnv allow” is easier to communicate and harder to get wrong than “add these twelve exports to your shell profile and make sure they don’t conflict with your other projects.”

Audit your shell profile. Anything project-specific should move to an .envrc:

| Keep in .zshrc | Move to .envrc |
| --- | --- |
| PATH modifications | DATABASE_URL |
| Shell aliases | TF_VAR_* |
| Tool initialization (nvm, pyenv) | Cloud API keys |
| Editor config | Project-specific tokens |
| General preferences | SUPABASE_*, NEON_*, etc. |

The rule: if it’s specific to a project or cloud account, it belongs in that project’s .envrc. If it’s about your development environment in general, it stays in .zshrc.

Your .envrc files contain secrets. They should not be committed:

# .gitignore
.envrc

If you want to share the structure without the values, commit an .envrc.example:

# .envrc.example — copy to .envrc and fill in values
export DATABASE_URL=""
export TF_VAR_project=""
export API_KEY=""

The Small Tool That Prevents the Big Mistake


direnv is a small tool. It does one thing. But for developers working across multiple cloud projects, languages, and infrastructure stacks, that one thing eliminates an entire category of mistakes that range from embarrassing to catastrophic.

The cost of adopting it is one line in your .zshrc and a few minutes moving exports into .envrc files. The cost of not adopting it is eventually running terraform destroy against production because you forgot you were still pointing at the wrong project.


This is one of those tools where the setup time is measured in minutes and the first “oh, that would have been bad” moment comes within the week.

K8s Pods as AI Architecture Guardrails

Deployment boundaries enforce architectural boundaries—especially when your coworker is an AI that will take any shortcut it can see.

I caught Claude reading directly from another aggregate’s database because the file was visible in the project. Deploying to K8s (via Kind) broke that access and forced it to implement projector services properly. The infrastructure did what code review should have caught.

I started building a board game in angzarr-standalone, the now-deprecated in-process variant of the Angzarr system. The backing stores are SQLite—perfectly fine for prototyping a physical board game on a CQRS framework, though not suitable for real production use of Angzarr. Everything runs in one process, which means every aggregate’s storage is visible to every other component.

Claude, tasked with getting a feature working, found the shortest path: query another aggregate’s database directly. No projector, no gRPC service, no event subscription. Just reach across the aisle and read the data. It works. It’s fast. It violates every principle of aggregate isolation that event sourcing depends on.

I didn’t catch it immediately. We’re in prototype mode—moving fast, revisions planned for later. But the shortcut was building in coupling that would be painful to unwind.

Moving to Kind—a local Kubernetes cluster—broke the access by default. Each aggregate runs in its own pod. There is no shared filesystem. There is no “just read the other database.” If you need data from another aggregate, you go through a projector service.

Claude had no choice but to implement the projector. The architecture became self-enforcing.

This wasn’t the reason I moved to Kind. But it was an immediate, tangible benefit. The deployment topology eliminated an entire class of architectural violations—not through code review, not through linting, not through discipline, but through access control.

LLMs optimize for task completion. Given a goal and visible resources, they will use whatever path gets them there. This is useful—it’s why they’re productive. But it means they will cheerfully violate architectural boundaries that exist only as conventions.

In-process deployment makes everything a convention. Aggregate isolation? Convention. Service boundaries? Convention. Data ownership? Convention. An AI (or a junior developer, or a senior developer under deadline pressure) can bypass any of them because the runtime doesn’t enforce them.

Pod-level isolation converts conventions into constraints:

| Boundary | In-Process | K8s Pod |
| --- | --- | --- |
| Aggregate data access | Convention (filesystem visible) | Enforced (separate storage) |
| Service interfaces | Convention (can call anything) | Enforced (network only) |
| Domain isolation | Convention (shared memory) | Enforced (process boundary) |

This doesn’t mean every development environment needs Kubernetes. But it does mean that the gap between your development topology and your production topology is a source of architectural drift—and AI-assisted development widens that gap faster than human-only development does.

Watch the AIs closer than you think you need to. We’re in prototype and proof-of-concept mode—I’m the only user, the risk is low, and revisions and cleanup will come later. That’s fine. But the shortcut Claude took was building in coupling that would make those revisions harder. Prototype mindset means accepting rough edges in implementation, not in architecture. The architecture is the thing that makes cleanup possible later.

Deploy like production earlier than you think. Kind is cheap. Running a local K8s cluster adds minutes to your feedback loop, not hours. The architectural enforcement you get back is worth it—especially when AI is writing code.

Architecture-as-infrastructure beats architecture-as-policy. You can tell an AI “don’t read from other aggregates’ storage.” You can put it in a CLAUDE.md file. You can review every diff. Or you can make it impossible. One of these scales.

Standalone mode was the right call to deprecate. This experience reinforced why Angzarr dropped angzarr-standalone in 0.3.0. In-process variants are convenient for getting started, but they teach habits—and now train AIs—that break in distributed deployment. Better to start with the real topology.

This is the broader lesson, and it extends well beyond this one incident: the key to making AI developers productive is constraining them.

Without constraints, an AI coding assistant is a useful idiot. It can output code that vaguely satisfies requirements. It has no concept of future maintainability. It has no lessons learned beyond what it copies and mimics from its training corpus. It will cheerfully build a system that works today and is unmaintainable tomorrow, because “tomorrow” isn’t in the prompt.

Constraints change the equation. Container isolation, defined interfaces, enforced service boundaries, typed contracts—these don’t just prevent bad architecture. They channel the AI’s output into shapes that are maintainable by default. The AI doesn’t need to understand why aggregate isolation matters. It just needs to be unable to violate it.

This is the same principle that makes strongly-typed languages productive: the compiler rejects invalid programs before they run. K8s pod boundaries, gRPC interface definitions, and aggregate isolation do the same thing at the architecture level. They reject invalid architectures before they deploy.

The difference between an AI that produces throwaway prototypes and an AI that produces maintainable systems isn’t the AI—it’s the constraints it operates within:

| Without Constraints | With Constraints |
| --- | --- |
| Reads any data it can see | Must use defined query interfaces |
| Couples components by convenience | Couples components by contract |
| Satisfies the immediate requirement | Satisfies the requirement within architectural boundaries |
| Produces code that works | Produces code that works and can be changed later |

None of this requires the AI to be smarter. It requires the environment to be constrained—not just opinionated, but enforced. We must make the right choice the easy choice—this works for human developers too, but it’s non-negotiable for AI ones. The AI fills whatever shape you give it. Give it a flat, open codebase and it will sprawl. Give it containers, interfaces, and enforced boundaries and it will build systems that happen to be well-structured—not because it intended to, but because it had no other option.

This has implications for framework design. Angzarr’s pod-per-aggregate model, its gRPC service interfaces, its saga protocols—these were designed for distributed systems correctness. It turns out they’re also exactly what you want when AI is writing the implementation. The framework’s opinions become the AI’s guardrails.


The irony: the AI that took a shortcut around the architecture is the same one helping me write about why shortcuts around the architecture are dangerous. Supervision remains non-optional.

Should Projectors Serve Data?

Yes. For most systems, projectors should serve their own data. Run the projector and its gRPC query endpoint in the same pod. Split them when scaling demands it—not before.

This applies more broadly: collocate components that don’t yet need separation, define interfaces as if they were separate, and split when reality demands it. Angzarr will soon apply this same principle to sagas, allowing them to run directly inside aggregate command handlers—with a clean extraction path when they outgrow it.

Put the projector, read store, and gRPC query service in one pod:

graph TD
    subgraph Pod
        P[Projector<br/>event consumer] -->|writes| RS[(Read Store)]
        RS -->|reads| G[gRPC Service<br/>query endpoint]
    end

When query load overwhelms the pod, or projection lag degrades query latency, or you need to scale reads independently—pull the gRPC service into its own pod:

graph TD
    subgraph Projector Pod
        P[Projector<br/>event consumer] -->|writes| RS[(Read Store)]
    end
    subgraph Query Service Pod
        G[gRPC Service<br/>query endpoint] -->|reads| RS
    end

The interface doesn’t change. Clients don’t know the difference. You’ve scaled without redesigning.

The gRPC service has its own interface definition from day one. The read store already sits between the projector and the query logic. Splitting them apart is a deployment change, not an architecture change—you’re moving a process boundary, not redesigning a system.
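A sketch of why the split stays mechanical — using a hypothetical PlayerQueries trait in place of the real gRPC definitions. Clients depend only on the interface, so the backing implementation can later move out of process without touching them:

```rust
use std::collections::HashMap;

// Illustrative query interface. In the real system this role is played
// by the gRPC service definition; the trait here is a hypothetical stand-in.
trait PlayerQueries {
    fn balance(&self, player_id: &str) -> Option<u64>;
}

// Collocated implementation: reads the read store directly in-process.
struct InProcessQueries {
    read_store: HashMap<String, u64>,
}

impl PlayerQueries for InProcessQueries {
    fn balance(&self, player_id: &str) -> Option<u64> {
        self.read_store.get(player_id).copied()
    }
}

// Clients program against the trait, so swapping in a network-backed
// implementation later is a deployment change, not a client change.
fn render_balance(q: &dyn PlayerQueries, id: &str) -> String {
    match q.balance(id) {
        Some(b) => format!("balance: {}", b),
        None => "unknown player".to_string(),
    }
}

fn main() {
    let q = InProcessQueries {
        read_store: HashMap::from([("p1".to_string(), 250u64)]),
    };
    assert_eq!(render_balance(&q, "p1"), "balance: 250");
    assert_eq!(render_balance(&q, "p2"), "unknown player");
}
```

When the split happens, only the `impl` moves; `render_balance` and every other caller is untouched.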

Separating the projector from the query service in a low-traffic system buys you an extra pod to deploy, monitor, and debug; network hops you didn’t need; a coordination problem when the read schema changes; and complexity that exists to solve a scaling problem you don’t have.

This is a tradeoff of correctness versus complexity. Complexity reduction should generally win, as long as it can be corrected when it becomes important. The “correct” architecture solves real problems—independent scaling, isolation of projection lag, read model rebuilds without serving impact—but most systems don’t have those problems yet.

Angzarr will likely soon support incorporating sagas directly into aggregate roots and command handlers. Same motivation: for simple sagas tightly coupled to aggregate logic and not under independent load pressure, a separate saga pod is overhead without benefit. The aggregate handles the command, emits events, and performs the coordination—all in one place.

The constraint is identical: it must be easy to peel back out. When the saga becomes complex, when its scaling needs diverge, when a different team needs to own it—extraction should be straightforward. The saga’s interface is already defined. Its coordination logic is already encapsulated. Moving it to its own process is a deployment decision, not a rewrite.

Not everything should start collocated. Split immediately when:

  • Load profiles are already divergent. Hundreds of events per second into the projector, millions of queries per second out—these need independent scaling from the start.
  • Different teams own the read and write paths. Conway’s Law applies. Shared pods across team boundaries create deployment coupling.
  • The read model serves latency-critical paths. If projection rebuilds can’t impact query latency, process isolation is a correctness requirement, not an optimization.
  • Compliance or security boundaries require it. Some read models serve sensitive data through restricted endpoints where process isolation is policy.

These are conditions you can evaluate at design time. If none apply, start simple.

  1. Start simple. Collocate components that don’t yet need separation.
  2. Define interfaces as if they were separate. gRPC services, saga protocols, clear boundaries in code.
  3. Split when the pressure appears. Scaling bottlenecks, team ownership changes, reliability requirements.
  4. The split is mechanical, not architectural. Because the interfaces already exist.

Build for the system you have. Design interfaces for the system you might need. Deploy the simplest topology that works.


This post is part of an ongoing series on pragmatic architecture decisions in event-sourced systems. The opinions are informed by building Angzarr and deploying it in production—where elegance matters less than operability.

Angzarr Core 0.3.0: Facts Over State, Edition Branching, and 18K Lines Deleted

Angzarr Core 0.3.0 ships today with a fundamental shift in how sagas and process managers handle cross-aggregate coordination: they now receive sequence numbers instead of full event books. This “facts over state rebuilding” change aligns coordinators with the framework’s core philosophy. The release also adds explicit divergence support for edition branching, enabling counterfactual “what-if” scenarios at any point in an aggregate’s timeline.

Saga and Process Manager Protocol Update. Handlers now receive destination_sequences—a map of domain to next sequence number—instead of full EventBook state. This is a breaking proto change. Coordinators stamp commands with sequences and let aggregates decide; they no longer rebuild destination state themselves.

Edition Branching with Explicit Divergence. New branches can now specify an exact divergence point from the main timeline. The storage layer reads events from main up to sequence N, returning them as base state for the new branch. Use case: “What if I had folded at sequence 3 instead of calling?”

Cascade Two-Phase Commit. Merged from the core-cascade-improvements branch, adding cascade_id and committed fields to EventPage, stale cascade cleanup via CascadeReaper, and conflict detection for distributed transactions.

OpenTelemetry 0.31 and Tonic 0.14. Updated to latest observability and gRPC stacks with breaking API changes handled throughout.

Security Fixes. Critical gRPC-Go vulnerabilities patched in the gateway (v1.70.0 → v1.79.3), plus high-severity fixes in AWS-LC and quinn-proto.

18,000 Lines Removed. Standalone mode deleted entirely. The framework now exclusively uses the distributed coordinator architecture.

The proto changes require regenerating client code:

// Before (0.2.x)
message SagaHandleRequest {
  repeated EventBook destinations = 4;
}

// After (0.3.0)
message SagaHandleRequest {
  map<string, uint64> destination_sequences = 4;
}

Same pattern for ProcessManagerHandleRequest.destination_sequences.

Handlers that previously iterated over destination event books to determine state must now use sequences directly. The philosophy: coordinators deal in facts (sequences), not state reconstruction.

The previous design had sagas receiving full event books for destination aggregates. This created several problems:

  1. Unnecessary coupling. Sagas knew how to interpret destination domain events.
  2. Performance overhead. Loading event history for every coordination step.
  3. Philosophy violation. Sagas are coordinators, not domain experts.

The new design treats sequences as facts. A saga knows “Player aggregate is at sequence 7” without knowing what happened in sequences 1-6. It stamps the outbound command with sequence 7, and the aggregate validates whether that sequence is still current.

If the sequence has advanced (concurrent modification), the aggregate rejects the command. The saga retries with fresh sequences. No domain logic leaked into the coordinator.
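The stamp-and-validate loop is easy to sketch. Everything below is illustrative — `Command`, `stamp_command`, and `validate` are hypothetical stand-ins, not the Angzarr API:

```rust
use std::collections::HashMap;

// Hypothetical command carrying the sequence the saga observed.
struct Command {
    domain: String,
    expected_sequence: u64,
}

// The saga stamps commands using only the facts it was handed --
// no destination event books, no state reconstruction.
fn stamp_command(domain: &str, destination_sequences: &HashMap<String, u64>) -> Option<Command> {
    destination_sequences.get(domain).map(|&seq| Command {
        domain: domain.to_string(),
        expected_sequence: seq,
    })
}

// The aggregate, not the saga, decides whether the sequence is still current.
fn validate(current_sequence: u64, cmd: &Command) -> Result<(), String> {
    if cmd.expected_sequence == current_sequence {
        Ok(())
    } else {
        Err(format!(
            "stale sequence for {}: expected {}, aggregate is at {}",
            cmd.domain, cmd.expected_sequence, current_sequence
        ))
    }
}

fn main() {
    let sequences = HashMap::from([("player".to_string(), 7u64)]);
    let cmd = stamp_command("player", &sequences).unwrap();
    // Aggregate still at 7: command accepted.
    assert!(validate(7, &cmd).is_ok());
    // Aggregate advanced to 8 concurrently: rejected, saga retries with fresh facts.
    assert!(validate(8, &cmd).is_err());
}
```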

Editions enable counterfactual reasoning within event-sourced systems. The new explicit divergence support makes this practical:

// Create branch diverging at sequence 3
let edition = Edition {
    domain_divergence: Some(DomainDivergence {
        sequence: 3,
        // Branch sees events 1-3 from main, then diverges
    }),
    ..Default::default() // remaining fields elided
};

Key implementation details:

  • EventStore::get_with_divergence() reads main timeline up to the divergence point
  • Snapshots accelerate loading state up to the divergence point
  • The aggregate applies events 1-3, then processes new commands on the branch
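The divergence read reduces to a fold over the main timeline up to sequence N. The sketch below is illustrative, not the actual `EventStore` API — `Event` carries a single integer delta so the state fold fits in one line:

```rust
// Hypothetical event: sequence number plus a numeric state delta.
struct Event {
    sequence: u64,
    delta: i64,
}

// Base state for a branch = fold of main-timeline events with
// sequence <= divergence point. Events after the point are invisible.
fn branch_base_state(main: &[Event], divergence: u64) -> i64 {
    main.iter()
        .filter(|e| e.sequence <= divergence)
        .map(|e| e.delta)
        .sum()
}

fn main() {
    let main_timeline = vec![
        Event { sequence: 1, delta: 100 },
        Event { sequence: 2, delta: -30 },
        Event { sequence: 3, delta: 10 },
        Event { sequence: 4, delta: -50 }, // not seen by a branch diverging at 3
    ];
    // "What if I had folded at sequence 3 instead of calling?"
    assert_eq!(branch_base_state(&main_timeline, 3), 80);
}
```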

Use cases include game replay analysis, regulatory “what-if” scenarios, and training data generation from production event streams.

Two-phase commit for cross-aggregate operations is now first-class:

  • CascadeReaper cleans up stale cascades that failed mid-transaction
  • Conflict detection identifies when concurrent cascades touch the same aggregates
  • Query methods (query_stale_cascades, query_cascade_participants) added to all storage backends

This work also drove mutation testing improvements, pushing mock EventStore coverage from 62.9% to 100% kill rate.

  1. Regenerate protos using Buf or your preferred tooling
  2. Update saga/PM handlers to use destination_sequences map
  3. Replace EventBook iteration with sequence stamping
  4. Test with explicit divergence if using editions

The cascade changes are additive—no migration required unless you’re adopting 2PC.

With standalone mode removed, the framework is fully committed to the distributed architecture. Upcoming work includes:

  • Multi-language client SDK stabilization (Go, Python, Rust, Java)
  • Snapshot retention policies (migration added in 0.3.0)
  • Enhanced cascade conflict resolution strategies

Full changelog: e1335ddc…908f9aad

Mutation Testing: The Deterministic Arbiter of LLM-Generated Tests

Last week I argued that you should build deterministic systems with non-deterministic tools—demand TDD from your LLM, get tests first, then implementation. But there’s a problem with that workflow: passing tests aren’t proof that tests are good.

Enter mutation testing: a deterministic tool that validates whether your tests actually test anything.

The previous post established a workflow: LLM writes tests, you review them, LLM implements, tests pass. But consider this test:

#[test]
fn test_reservation_prevents_double_booking() {
    let mut player = Player::new(500);
    player.reserve(300).unwrap();
    // This test "passes" but proves nothing
    assert!(true);
}

The test runs. It passes. Code coverage tools say the reserve function was called. Everything looks green. But the test validates nothing—you could delete the entire reserve implementation and this test would still pass.

An LLM optimizing for “make the tests pass” might produce exactly this kind of hollow test. Not maliciously—the test looks reasonable at a glance. It calls the right functions. It has assertions. But the assertions don’t constrain the behavior.

This is where mutation testing enters.

Mutation testing systematically breaks your code and checks whether your tests notice.

The mutation testing tool:

  1. Parses your source code
  2. Generates “mutants”—small, targeted changes (flip a > to >=, change + to -, return early, etc.)
  3. Runs your tests against each mutant
  4. Reports which mutants survived (tests still passed despite broken code)

If a mutant survives, your tests don’t adequately cover that behavior. The test suite accepts code that’s demonstrably wrong.

cargo mutants --in-place -f src/player/reservation.rs

Output might show:

src/player/reservation.rs:42: replace `>` with `>=` in available_balance ... SURVIVED
src/player/reservation.rs:47: replace `Ok(())` with `Err(...)` ... KILLED

The first mutant survived—changing > to >= didn’t break any tests. That’s a gap in test coverage that code coverage metrics would never reveal.
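A test that kills that survivor must pin the boundary itself. Here is a sketch with a hypothetical `Player` (the real type isn't shown in the post) — the point is that reserving *exactly* the available balance distinguishes `>` from `>=`:

```rust
// Hypothetical Player with an available-balance check (illustrative only).
struct Player {
    balance: u64,
    reserved: u64,
}

impl Player {
    fn new(balance: u64) -> Self {
        Player { balance, reserved: 0 }
    }

    fn available(&self) -> u64 {
        self.balance - self.reserved
    }

    // Correct behavior: reserving exactly the available balance succeeds.
    // The mutant replaces `>` with `>=`, which would reject that case.
    fn reserve(&mut self, amount: u64) -> Result<(), String> {
        if amount > self.available() {
            return Err("insufficient available balance".into());
        }
        self.reserved += amount;
        Ok(())
    }
}

fn main() {
    let mut player = Player::new(500);
    // The boundary case: under the `>=` mutant this call fails,
    // so this assertion kills that mutant.
    assert!(player.reserve(500).is_ok());
    // One past the boundary must still fail under the correct code.
    assert!(player.reserve(1).is_err());
}
```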

LLMs are pattern matchers. They’ve seen thousands of test files and can generate plausible-looking tests at scale. But “plausible-looking” isn’t “meaningful.”

Consider the failure modes:

Tautological assertions. The LLM generates assertions that restate the setup rather than verify behavior:

let result = calculate(5, 3);
assert_eq!(result, calculate(5, 3)); // Always passes

Missing edge cases. The LLM tests the happy path but misses boundaries:

#[test]
fn test_withdraw() {
    let mut account = Account::new(100);
    account.withdraw(50).unwrap();
    assert_eq!(account.balance(), 50);
}
// Never tests: withdraw(100), withdraw(101), withdraw(0)

Implementation-coupled tests. The LLM tests that code does what it does, not what it should do:

#[test]
fn test_hash() {
    // Tests current behavior, not correct behavior
    assert_eq!(hash("input"), 0x7a3f2b1c);
}

Mutation testing catches all of these. Tautological assertions don’t kill mutants. Missing edge cases leave mutation gaps. Implementation-coupled tests kill mutants but the wrong ones—they’re brittle to refactoring while missing actual bugs.

Integrate mutation testing into the TDD cycle:

  1. Describe what you want
  2. LLM writes tests (not implementation)
  3. You review: “Do these tests capture my requirements?”
  4. LLM implements to make tests pass
  5. Run mutation testing on the implementation
  6. Analyze survivors: Which behaviors aren’t actually tested?
  7. Iterate: Add tests that kill survivors, or accept the gap

Step 6 is the key addition. Mutation testing provides objective feedback: “Your tests claim to verify X, but they’d accept this broken version of X.”

This is a deterministic checkpoint in a non-deterministic workflow. The LLM might generate hollow tests. You might miss them in review. But the mutants don’t lie.

Mutation testing works best when business logic is isolated from infrastructure. That’s a core design principle of Angzarr: aggregates, sagas, and projectors contain pure business logic with no database calls, no network I/O, no framework dependencies. The coordinator handles infrastructure; your code handles decisions.

This isolation makes code easier to test—and easier to test meaningfully. When a function takes state and returns events, every branch is reachable without mocking. Mutation testing thrives in this environment.

Industry research provides concrete benchmarks for mutation kill rates:

| Context | Target Kill Rate | Source |
| --- | --- | --- |
| Google production (at scale) | 87%+ | State of Mutation Testing at Google |
| Mature/production systems | 90% | Pitest best practices |
| Initial adoption baseline | 70-80% | Pitest best practices |
| Critical systems (payments, security) | 80-90% | testRigor guide |
| Less critical areas | 60-70% acceptable | testRigor guide |

A common shock: teams with 80-90% code coverage often discover mutation scores of only 30% when first adopting mutation testing. That’s the gap between “tests executed this code” and “tests verified this code.”

In this codebase, mutation testing revealed qualitative patterns consistent with the research:

Pure utility functions should target 80-90%+ kill rates. Functions that transform data without side effects are fully testable. If mutants survive, the tests are incomplete.

// merge.rs - pure logic
pub fn diff_state_fields(before: &Any, after: &Any) -> HashSet<String> {
    // Every branch here should have a mutant-killing test
}

Framework glue tolerates lower rates. gRPC handlers that delegate to tested core logic don’t need exhaustive mutation coverage. Integration tests cover the composition. Mutation testing primarily targets unit tests; integration-heavy code may have hard-to-detect mutants.

Surviving mutants in logging are acceptable. If removing a debug!() call doesn’t break tests, that’s expected—logging is a side effect that doesn’t affect correctness.

Here’s what this looks like end-to-end (illustrative, not a real transcript):

You: "Add fund reservation to prevent double-booking"

LLM writes tests:
- test_reserve_reduces_available_balance
- test_reserve_fails_on_insufficient_funds
- test_release_restores_available_balance

You review: "Looks like it covers the requirements"

LLM implements Player::reserve() and Player::release()
Tests pass. Coverage: 100%

You run: cargo mutants -f src/player/reservation.rs

Results:
- 12 mutants killed
- 2 mutants survived:
  - "replace > with >=" on line 42
  - "remove bounds check" on line 47

Analysis: The boundary condition (exactly equal to available) isn't tested.

You: "Add a test for reserving exactly the available balance"
LLM adds: test_reserve_exact_available_balance_succeeds
Mutants re-run: 14/14 killed

Done.

The mutation testing step caught a gap that code coverage missed. The test suite now constrains the actual behavior, not just the happy path.

The original post argued: use non-deterministic tools to build deterministic artifacts. Tests are deterministic—they pass or fail reproducibly.

But tests can be deterministic while being worthless. A test that always passes is perfectly reproducible. It just doesn’t prove anything.

Mutation testing adds a second layer of deterministic verification: not just “do tests pass?” but “do tests actually constrain behavior?” The mutation tool doesn’t guess. It systematically breaks things and observes results. Either the tests catch the breakage or they don’t.

This is the deterministic arbiter you need when working with LLM-generated tests. The LLM can generate plausible tests at scale. Mutation testing determines whether those tests mean anything.

Mutation testing is slow. Running it on a full codebase can take hours. For incremental work:

# Only test the file you just changed
cargo mutants --in-place --timeout 120 -f src/player/reservation.rs
# Use feature flags if your tests need them
cargo mutants --in-place -f src/player/reservation.rs -- --features "sqlite test-utils"

The timeout matters—some mutants create infinite loops. 120 seconds catches most real tests while killing pathological cases.

Run mutation testing:

  • After writing new tests (before considering them “done”)
  • When LLM claims high test coverage
  • Before merging significant new functionality
  • When you’re suspicious that tests are hollow

Don’t run it on every commit. That’s overkill. Run it when you need confidence that tests are meaningful.

The previous post established: LLMs draft, humans verify through tests.

This post adds: tests themselves need verification. Mutation testing provides that verification deterministically.

The workflow becomes:

  1. LLM generates tests (constrain before implementing)
  2. You review tests (verify they capture intent)
  3. LLM implements (make tests pass)
  4. Mutation testing validates tests (prove they constrain behavior)
  5. Iterate until mutants are killed

The LLM accelerates drafting. Tests verify the draft. Mutation testing verifies the tests. Each layer is more deterministic than the last, building reliable systems from unreliable components.


Yes, the tests for this post were also mutation-tested. The surviving mutants were in the prose.

DDD: Domains Sized to Contain Decisions

The uncomfortable truth: most DDD teams draw their bounded contexts too small.

Not too large—too small. They slice by CRUD entity, by database table, by team org chart. The result? Contexts that cannot make decisions autonomously. Every meaningful operation requires cross-context coordination. The architecture devolves into a distributed monolith with extra network hops.

This post argues for a different principle: a bounded context is correctly sized when every decision that changes its invariants can be made entirely within it, without synchronous runtime dependency on another context.

Eric Evans defined a bounded context as having “a unified model—that is, internally consistent with no contradictions.” He specified that teams should “explicitly define the context within which a model applies… keep the model strictly consistent within these bounds” [1].

But what does “unified” and “consistent” mean in practice?

Here’s the test: Can this context enforce its own business rules without calling out?

If the answer is “we need to ask the Orders context before we can validate a Payment,” then either:

  1. The Payment context is undersized, or
  2. The concepts are in the wrong context entirely

This maps to Evans’ idea that the model is the decision-making unit, not the data-holding unit. A context doesn’t exist to hold data—it exists to make decisions about that data.

Seven Principles for Decision-Containing Contexts


1. One Invariant, One Owning Context

Vaughn Vernon defines an invariant as “a business rule that must always be consistent, specifically referring to transactional consistency.” He states: “A properly designed Aggregate is one that can be modified in any way required by the business with its invariants completely consistent within a single transaction” [3].

The implication: every business invariant must have exactly one context that owns and enforces it. If two contexts share enforcement of the same rule, you have hidden coupling—a seam that will cause consistency bugs under load.

In Angzarr: Each aggregate enforces its invariants in @handles methods on a CommandHandler<State> subclass. The aggregate receives commands, validates against current state, and emits events—all within a single transaction. Cross-aggregate coordination happens asynchronously via sagas, never synchronously within a command handler.
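As a sketch of that shape (the `handles` decorator and `CommandHandler` base here are illustrative stand-ins, not Angzarr’s actual API), a command handler might look like:

```python
from dataclasses import dataclass

def handles(command_type):
    """Hypothetical stand-in for an @handles decorator: registers a
    method as the handler for one command type."""
    def decorate(fn):
        fn._handles = command_type
        return fn
    return decorate

@dataclass
class ReserveFunds:   # command: a request the aggregate may reject
    amount: int

@dataclass
class FundsReserved:  # event: a fact emitted after validation
    amount: int

@dataclass
class PlayerState:
    bankroll: int
    reserved: int = 0

class CommandHandler:
    """Dispatches each command to the method registered for its type."""
    def __init__(self, state):
        self.state = state

    def handle(self, command):
        for member in type(self).__dict__.values():
            if getattr(member, "_handles", None) is type(command):
                return member(self, command)
        raise TypeError(f"no handler for {type(command).__name__}")

class PlayerAggregate(CommandHandler):
    @handles(ReserveFunds)
    def reserve(self, cmd: ReserveFunds) -> FundsReserved:
        # The invariant is enforced entirely on local state: no call
        # to another context is needed to make the decision.
        if self.state.reserved + cmd.amount > self.state.bankroll:
            raise ValueError("insufficient unreserved funds")
        self.state.reserved += cmd.amount
        return FundsReserved(cmd.amount)
```

Validation, state change, and event emission all happen in one place; anything that would force `reserve` to call another service mid-decision is a boundary signal.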

2. Boundaries Follow the Ubiquitous Language

Fowler, interpreting Evans, notes: “Usually the dominant factor drawing boundaries between contexts is human culture—since models act as ubiquitous language, you need a different model when the language changes. Different groups of people will use subtly different vocabularies in different parts of a large organization” [1].

The practical test:

  • When two teams use the same word to mean different things → context boundary
  • When one team explains concepts using another team’s vocabulary → wrong context

Evans himself has clarified that “one confusion teams often have is differentiating between bounded contexts and subdomains. In an ideal world they coincide, but in reality they are often misaligned” [5].

3. Eventual Consistency Across Boundaries

Vernon provides the architectural pattern: “There is a practical way to support eventual consistency in a DDD model. An Aggregate command method publishes a Domain Event that is in time delivered to one or more asynchronous subscribers. Each subscriber then retrieves a different yet corresponding Aggregate instance and executes its behavior based on it, each in a separate transaction” [3].

Microsoft’s architecture guidance reinforces this: “When a business process spans multiple aggregates, use domain events rather than a single transaction. Reference other aggregates by identity only—this decoupling maps directly to microservice boundaries” [4].

The principle: prefer eventual consistency across context boundaries over synchronous consistency. If strong consistency is required between two aggregates at runtime, they probably belong in the same context—or your transaction boundary is wrong.

In Angzarr: Aggregates modify only themselves per transaction. Cross-domain communication flows through sagas (stateless translation) or process managers (stateful coordination). Both typically operate asynchronously on committed events.

Angzarr does support synchronous modes for cross-domain calls, but discourages their use—they reintroduce the coupling and availability problems eventual consistency solves. Use sync modes only when business requirements genuinely demand it and you’ve accepted the tradeoffs.

4. Aggregate as Unit of Transactional Consistency


Vernon is explicit: “The consistency boundary logically asserts that everything inside adheres to a specific set of business invariant rules no matter what operations are performed. The consistency of everything outside this boundary is irrelevant to the Aggregate. Aggregates are chiefly about consistency boundaries and not driven by a desire to design object graphs” [3].

ArchiLab reinforces: “A properly designed Aggregate is one that can be modified in any way required by the business with its invariants completely consistent within a single transaction. The consequence of this is that in one transaction, you can only modify one aggregate and never more than one aggregate” [11].

The aggregate boundary is not the context boundary—but it’s a lower bound. A context should contain all aggregates whose invariants reference each other.

In Angzarr: Each domain maps to exactly one aggregate type. Each aggregate instance is identified by {domain}:{root_id}. Multiple Angzarr domains may belong to the same DDD bounded context—they share ubiquitous language and team ownership, but are separate deployment units connected by sagas.

If aggregates share invariants, they either belong in the same aggregate (larger boundary) or require explicit coordination via sagas. Angzarr makes this choice visible in infrastructure rather than hiding it in code organization.

5. Anti-Corruption Layer as a Smell at Scale


The Anti-Corruption Layer is the integration pattern where a downstream bounded context translates concepts from an upstream context, protecting its own model from the upstream’s influence.

ACLs are correct and necessary at integration points. But if a context needs a thick ACL—translating many concepts—the boundary may warrant re-examination. Sometimes the downstream context is missing concepts it should own; sometimes the upstream context is leaking internal details; sometimes it’s unavoidable legacy integration.

In Angzarr: Sagas connect Angzarr domains, but not all sagas are ACLs. The distinction:

  • Internal coordination sagas: Connect domains within the same bounded context. Shared ubiquitous language means minimal translation—mostly routing.
  • ACL sagas: Cross bounded context boundaries. Different teams, different language. Some translation expected.

The thickness of translation is the signal:

  • Thin ACL (mapping a few concepts): Normal and expected when crossing BC boundaries
  • Thick ACL (translating many concepts, complex mappings): Smell—suggests the boundary is in the wrong place or concepts are in the wrong context
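To make the thin/thick distinction concrete, here is a hypothetical thin ACL saga (event shapes and field names invented for illustration): it maps only the handful of upstream concepts the downstream model needs, by reference where possible.

```python
def acl_translate(upstream_event: dict) -> dict:
    """Thin ACL: translate an upstream order event into the shipping
    context's vocabulary, mapping only a few concepts.

    If this function grows a large mapping table and conditional logic,
    the boundary itself deserves re-examination.
    """
    return {
        "type": "ShipmentRequested",              # downstream language
        "order_ref": upstream_event["order_id"],  # foreign ID by reference
        "destination": upstream_event["shipping_address"],
    }
```

Everything the upstream context leaks beyond those fields simply never enters the downstream model.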

6. Commands Stay Local, Events Cross Boundaries


The pattern is clear in the literature: domain events stay within the bounded context; integration events are the public contracts for cross-context communication. Commands express intent, and aggregates enforce rules.

Microsoft’s guidance distinguishes domain events (internal notifications) from integration events (cross-context asynchronous communication) 4.

A well-sized context accepts commands and enforces rules locally. It publishes domain events for others to react to. If a context issues commands into another context to complete its own operation, the command’s logic belongs in the first context.

In Angzarr: Aggregates accept commands and emit events. Sagas translate events from one domain into facts (or commands) for another. The default saga output is facts—events the receiving domain must accept.

Whether a saga is “translation” (ACL) or “routing” (internal coordination) depends on whether the domains share a bounded context. Angzarr doesn’t enforce this—it’s an organizational decision tracked via K8s labels:

```yaml
labels:
  angzarr.io/bounded-context: "game-ops"
  angzarr.io/saga-type: "acl"  # or "internal"
```

This makes the distinction queryable and enforceable via policy. ACLs crossing context boundaries justify heavy translation logic; internal sagas should be thin.

7. Contexts Align with Team Ownership

Fowler states: “Domain-Driven Design plays a role with Conway’s Law in helping define organization structures, since a key part of DDD is to identify Bounded Contexts. A key characteristic of a Bounded Context is that it has its own Ubiquitous Language, defined and understood by the group of people working in that context. The key thing to remember about Conway’s Law is that the modular decomposition of a system and the decomposition of the development organization must be done together” [9].

Steve Smith (Ardalis) reinforces this: teams and bounded contexts should correlate, since cross-team ownership of a context risks applying the wrong assumptions or model [10].

Microsoft provides the operational guidance: “If a single team must own multiple unrelated bounded contexts, or a single bounded context requires coordination across many teams, revisit either the boundaries or the team structure” [4].

Domain boundaries should align with team ownership boundaries. A context that spans two teams without a clear seam will degrade—the ubiquitous language will fork, and the model will develop inconsistencies that mirror org chart politics.

Anti-Pattern: The Anemic Context

A context that owns data but no decisions. All business logic lives in an application service that orchestrates across multiple contexts. Looks like a context, acts like a database table.

This is the context-level manifestation of the anemic domain model anti-pattern: domain objects that contain little or no business logic, serving primarily as data structures while business logic lives in separate service layers.

Anti-Pattern: The God Context

One context is sized to “fully contain decisions” by absorbing everything. Correct principle, wrong solution.

Evans himself warned that “total unification of the domain model for a large system will not be feasible or cost-effective” [1]. The fix is decomposing by subdomain (core, supporting, generic) and finding the natural seams—not abandoning the decision containment principle.

Anti-Pattern: The Remote-Validating Aggregate

An aggregate that enforces invariants but references foreign IDs without local projections, so any validation requires an outbound call.

Vernon explicitly warns against this: “Large aggregates are an anti-pattern. A large-cluster Aggregate will never perform or scale well, and is more likely to fail because false invariants and compositional convenience drove the design, to the detriment of transactional success, performance, and scalability” [11].

The aggregate boundary is wrong, not the context boundary.

In Angzarr: Aggregates may query external, non-event systems (third-party APIs, legacy databases) during command handling to gather decision-making information. But they should only read—never write.

The better pattern: external systems holding state relevant to an aggregate should inject that context as facts into the aggregate, rather than the aggregate pulling it. Push beats pull—the aggregate’s state becomes self-contained, and you avoid synchronous dependencies during command handling.

If your aggregate needs data from another Angzarr domain to validate, that’s a smell. Either project that data locally, adjust the aggregate boundary, or reconsider whether the decision belongs in a different aggregate.
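A sketch of the local-projection alternative (all names hypothetical): the aggregate validates against a copy of foreign state that arrives as pushed facts, never by calling out.

```python
class OrderAggregate:
    """Holds a local projection of inventory state, updated by facts
    pushed from the inventory domain rather than pulled on demand."""

    def __init__(self):
        self.on_hand: dict[str, int] = {}

    def apply_fact(self, fact: dict) -> None:
        # External reality injected as a fact; the aggregate cannot reject it.
        if fact["type"] == "StockLevelChanged":
            self.on_hand[fact["sku"]] = fact["quantity"]

    def can_fulfill(self, sku: str, qty: int) -> bool:
        # The decision uses only local state: no synchronous call to the
        # inventory context happens during command handling.
        return self.on_hand.get(sku, 0) >= qty
```

The projection is eventually consistent, which is the trade the post argues for: availability and autonomy at decision time, at the cost of possibly acting on slightly stale stock levels.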

Anti-Pattern: The Premature Split

A single business capability is split into two contexts before the model is stable—typically because of team structure. The two halves immediately develop tight coupling because the model isn’t ready to be separated.

Evans has warned against “the bandwagon effect of jumping into microservices and bounded context splits,” and calls the common belief that a microservice is a bounded context an oversimplification. When subdomains and bounded contexts are misaligned—such as when a business reorganization creates new subdomains that don’t match existing bounded contexts—two teams often end up working in the same context, with increasing risk of a big ball of mud [5].

Practitioner wisdom: keep the model in one context longer than feels comfortable, until the language stabilizes. That “longer than comfortable” state? It may be your legacy system—many monoliths are exactly this, never split because the language never stabilized. That’s not always wrong; sometimes the domain genuinely is one context.

Most of these require architecture review rather than automated measurement, but several can be approximated from code and incident data.

| Metric | Healthy Signal | Warning Signal |
| --- | --- | --- |
| Cross-context synchronous calls per operation | Few | Many |
| Shared database tables between contexts | None | Any |
| Aggregate references to foreign-context IDs without local copy | Rare | Common—suggests incomplete model |
| ACL translation surface (# of concepts mapped) | Thin | Thick |
| Number of context owners per business capability | One | Multiple |

These are heuristics, not empirically-derived thresholds. “Few” vs “many” depends on your latency budget and availability requirements. The point is directional: more cross-context coupling = more boundary debt.

Bounded context sizing is a Goldilocks problem:

  • Too small: Contexts can’t make decisions alone, requiring constant cross-context coordination (the main thesis of this post)
  • Too large: Contexts become unmaintainable, language diverges internally, teams step on each other (the “God Context” failure mode)
  • Just right: Each context contains the decisions it needs to make, no more

Microsoft’s guidance: “Design aggregates to be no smaller than what is required to enforce an invariant within a single transaction. Include only the data that must remain consistent within a single transaction. When you combine unrelated aggregates, you force unrelated updates to compete for the same locks” [4].

| Metric | What It Reveals |
| --- | --- |
| Blast radius of a context failure | How many business capabilities fail when this context is unavailable—a high blast radius suggests the context is too large |
| Deployment coupling frequency | How often deploying context A requires coordinating with context B—frequent coordination = implicit coupling |
| Cross-context incident correlation | When context A degrades, does context B degrade? Correlated failures suggest hidden coupling |
| Time to make a model change | A long lead time means the concept is contested or shared across contexts |

These metrics align with DORA research and Team Topologies guidance on measuring team and system boundary alignment [13].

The Decision Containment Score

For each key business decision the domain owns, ask:

  1. Does making this decision require data from another context at runtime? (synchronous query = −1)
  2. Does enforcing the resulting invariant require another context’s cooperation? (−1)
  3. Does rolling back a failed decision require coordinating with another context? (−1)

A score of 0 across all decisions is the target. Anything below −1 per decision indicates boundary misalignment.
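The scoring is mechanical enough to sketch in a few lines (the function and decision names are mine, constructed for illustration, not from the cited sources):

```python
def containment_score(decisions: dict[str, tuple[bool, bool, bool]]) -> dict[str, int]:
    """Score each decision: -1 apiece for needing remote data at runtime,
    needing another context to enforce the invariant, and needing
    cross-context rollback. Zero everywhere is the target."""
    return {name: -sum(flags) for name, flags in decisions.items()}

scores = containment_score({
    "approve_payment": (False, False, False),  # fully contained
    "validate_order":  (True, True, False),    # boundary misalignment
})
```

A spreadsheet works just as well; the point is to make the runtime dependencies of each decision explicit and countable.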

Start with subdomains, not microservices. There are three types of subdomains: “Core, Supporting, and Generic. The Core subdomain is where the business must put its best efforts and provides competitive advantage. The Supporting subdomain complements the main domain. The Generic subdomain is typically handled by ready-made commercial or open-source software” [14]. Subdomain analysis gives you the strategic cuts first. Bounded contexts then follow subdomain contours.

A context should be deployable and operable by one team. Not one person, not five teams. “Architectural and team evolution must go hand-in-hand throughout the life of an enterprise” [9].

The model should fit in one person’s head. If explaining the context’s model requires a two-hour meeting, it’s too large.

Event volume is not a sizing signal. High event throughput is a scaling concern, not a domain boundary concern.

The principles above represent mainstream DDD thinking. But having built Angzarr—an event-sourcing framework for distributed systems—I’ve encountered cases where rigid adherence to these rules creates its own problems.

Flexibility Has Consequences

Angzarr aims to be fast, reliable, and flexible. That flexibility permits building terrible systems:

  • Synchronous cascades across dozens of aggregates—causing performance problems and availability nightmares
  • Poor aggregate factoring—undersized aggregates that can’t make decisions alone cause explosions in cross-domain messages and degraded performance
  • Sagas emitting commands—the mechanism for cross-aggregate decisions, but adds compensation complexity; overuse often signals poor aggregate factoring
  • God process managers—PMs that orchestrate everything become a single point of failure and a coordination bottleneck; decision logic belongs in aggregates, not PMs
  • Ignoring every principle in this post—Angzarr won’t stop you

The thesis of this post applies to Angzarr itself: aggregates should make decisions with minimum external contact. Violate that, and you’ll pay in latency, throughput, and operational complexity.

Sometimes these anti-patterns are necessary—even the right choice for your constraints. Angzarr supports them for that reason. But it takes no responsibility for the consequences. We warn you in documentation and, often, in code—make sure you’re choosing the tradeoff deliberately, not accidentally.

Here’s the uncomfortable truth the literature rarely addresses: DDD boundaries are architecture, and architecture is expensive to change.

Conway’s Law cuts both ways. Yes, system structure should align with team structure. But once it does, that alignment becomes load-bearing. Refactoring a bounded context boundary likely means some combination of:

  • Reorganizing teams (politics, HR, reporting structures)
  • Migrating data between stores (downtime, consistency risks)
  • Rewriting integration contracts (coordinated deployments)
  • Updating monitoring, alerting, and runbooks (operational knowledge)

The literature says “if your context can’t make decisions autonomously, it’s undersized—fix it.” That’s correct in principle. But fixing it may require executive buy-in, a migration project, and months of coordination. Meanwhile, the business needs to ship features.

Angzarr takes a pragmatic stance: support sub-ideal boundaries with tooling when refactoring isn’t feasible.

This isn’t an endorsement of bad architecture. It’s an acknowledgment that production systems exist, Conway’s Law has inertia, and sometimes the operationally necessary choice is to work within existing constraints while planning longer-term improvements.

The “Commands Stay Local” Oversimplification


The principle that commands should stay local while only events cross boundaries is elegant in theory. In practice, it can force awkward aggregate designs.

Consider a saga that translates an OrderCompleted event into fulfillment work. The fulfillment domain needs to create a shipment. Under strict “events only” thinking, the saga should publish an event like FulfillmentRequested, which the fulfillment context reacts to.

But what happens when fulfillment fails? The saga has no mechanism to compensate—it fired an event and walked away. The fulfillment context now owns the problem entirely, even though the business process spans both domains.

Angzarr takes a different approach. Sagas can emit either commands or facts to other aggregates:

Facts (the default for saga output): Events injected without a preceding command. Structurally, facts are just events with two differences: the receiving domain assigns the sequence number, and they retain source traceability metadata. The receiving aggregate cannot reject them; they represent external realities. Example: “the hand says it’s your turn” is a fact the player aggregate must accept.

Commands (when compensation is needed): Requests that the receiving aggregate can reject. Use commands when:

  1. The receiving aggregate should be able to refuse (insufficient inventory, invalid state)
  2. Rejection must trigger compensation in the originating domain
  3. The saga uses destination state to make business decisions

Both patterns rely on:

  • The destination aggregate’s state, which the saga reads to inform its decisions
  • Sequence validation, which ensures exactly-once delivery semantics

General guidance: prefer facts for saga output unless you need rejection/compensation capability. Facts are simpler—they represent “this happened” rather than “please do this.” Commands add complexity but enable explicit failure handling.
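A sketch of the choice (the event shapes, field names, and routing condition are illustrative, not Angzarr’s wire format):

```python
def order_completed_saga(event: dict) -> dict:
    """Translate an OrderCompleted event into output for the fulfillment
    domain: a fact by default, a command only when fulfillment must be
    able to refuse and trigger compensation."""
    if event.get("requires_inventory_check", False):
        # Command: the receiving aggregate may reject (e.g. out of stock),
        # and the rejection flows back to compensate the order.
        return {"kind": "command", "type": "CreateShipment",
                "order_id": event["order_id"]}
    # Fact: "this happened"; the receiving domain cannot reject it and
    # assigns the sequence number itself.
    return {"kind": "fact", "type": "FulfillmentRequested",
            "order_id": event["order_id"]}
```

Note that the fact branch is the unconditional default; the command branch exists only because a rejection path is needed.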

A warning: If you find yourself reaching for saga-emitted commands frequently, pause and ask whether your bounded contexts are correctly sized. A saga that needs to send commands with compensation capability may be a signal that:

  • The two aggregates belong in the same context (shared invariants requiring coordination)
  • The decision logic is in the wrong aggregate (should move upstream)
  • A process manager is more appropriate than a saga (Angzarr’s process managers are stateful, use the correlation ID as their aggregate root, and explicitly coordinate multi-domain workflows)

Angzarr supports the pattern because sometimes it’s genuinely correct. But “supported” doesn’t mean “encouraged.” Treat saga-emitted commands as a code smell worth investigating, even when it’s the right solution.

This nuance isn’t captured by the simple “commands stay local, events cross boundaries” rule. The question isn’t command-vs-event—it’s whether the receiving domain has veto power over the incoming information.

Vernon’s guidance to keep aggregates small—containing only what’s needed for invariant enforcement in a single transaction—is sound. But it can be taken too far.

An aggregate that’s too small becomes a data container that delegates all decisions outward. Every validation requires a saga or process manager to orchestrate across aggregates. You’ve achieved small aggregates at the cost of coherent decision-making.

The opposite risk is real too: an aggregate that absorbs everything becomes a bottleneck. But in my experience, teams more often err toward undersized aggregates than oversized ones—particularly when influenced by microservices culture that conflates “small services” with good architecture.

The test isn’t aggregate size. It’s: can this aggregate make its decisions without runtime dependencies?

Angzarr supports patterns the literature flags as anti-patterns—but discourages them:

| Pattern | Orthodox View | Angzarr’s Position |
| --- | --- | --- |
| Sagas emitting commands | Avoid—events only | Supported but discouraged; prefer facts |
| Cascading synchronous calls | Never | Supported for legacy boundaries; refactor when possible |
| Undersized aggregates requiring coordination | Anti-pattern | Supported with compensation tooling; indicates boundary debt |
| “Large” aggregates | Anti-pattern | Sometimes correct—if the aggregate owns a cohesive set of decisions |

These are escape hatches, not recommended patterns. Each represents technical debt—a workaround for boundaries that should ideally be redrawn. Angzarr provides the tooling because:

  1. Production systems exist. You inherited boundaries drawn by someone else, possibly years ago.
  2. Conway’s Law has inertia. Fixing the architecture may require fixing the org chart first.
  3. Business doesn’t wait. Features ship while migration projects are planned.

The correct response to needing these patterns is:

  1. Use them to unblock the immediate work
  2. Document the boundary debt
  3. Plan the refactoring (even if it’s quarters away)
  4. Don’t let “supported” become “normalized”

The literature provides excellent defaults. When you deviate, know why you’re deviating and have a plan to stop.

The underlying principle remains: size your contexts and aggregates to contain decisions. When you can’t—because the boundaries are already drawn and load-bearing—Angzarr helps you cope. But coping isn’t thriving. Fix the boundaries when you can.

A domain boundary is a decision boundary, not a data boundary or a service boundary. Draw it where decisions are made, not where data lives.

Most teams err by slicing too thin—creating contexts that own data but cannot decide. The result is an architecture where every operation requires coordination, every deployment requires synchronization, and the system exhibits all the costs of distribution with none of the benefits of autonomy.

Size your contexts to contain decisions. If a context cannot enforce its invariants alone, it’s too small.


References

[1] Martin Fowler, “BoundedContext,” martinfowler.com (includes Evans quotations)

[3] Vaughn Vernon, “Effective Aggregate Design,” Parts I–III, dddcommunity.org (2011); also Implementing Domain-Driven Design (2013), Addison-Wesley

[4] Microsoft Azure Architecture Center, “Design a DDD-oriented microservice,” docs.microsoft.com

[5] Eric Evans at DDD Europe 2019, as covered by InfoQ

[9] Martin Fowler, “Conway’s Law,” martinfowler.com

[10] Steve Smith (Ardalis), writings on bounded contexts and team organization

[11] ArchiLab, aggregate design based on Vernon’s work

[13] Team Topologies literature on team/system boundary alignment

[14] DDD community resources on subdomain classification


The following claims in this post represent synthesis from the cited sources and practitioner experience, rather than direct quotations:

[†] The claim that “two contexts enforcing the same invariant produces hidden coupling” is extrapolated from Vernon’s consistency boundary rules.

[†] The interpretation that thick ACLs warrant boundary re-examination is practitioner intuition, not a direct Evans/Vernon citation.

[†] The extension of the anemic domain model anti-pattern to the context level is derived analysis.

[†] “God Context” as a named failure mode is framing for this post, not canonical DDD terminology.

[†] Common practitioner advice without verified primary source.

[†] The structural metrics table is synthesized from the cited sources and practitioner literature, not a canonical table from Evans or Vernon.

[†] The Decision Containment Score is a framework constructed for this post as an operationalization of the decision containment principle. It does not appear in the primary sources.

[†] Common practitioner wisdom without verified primary source.

[†] “Event volume is not a sizing signal” is synthesis for this post without primary source support.

Building Deterministic Systems with Non-Deterministic Tools


Large Language Models (LLMs)—the technology behind ChatGPT, Claude, and similar AI assistants—are probabilistic text generators. They predict the next most likely token based on patterns learned from training data. This makes them remarkably useful for many tasks, but it also means their raw outputs cannot be trusted for correctness.

Ask an LLM to calculate something, and it might be right. Or it might confidently produce nonsense. Ask it again, and you might get a different answer. This non-determinism is a feature of how these systems work, not a bug to be fixed.

So how do you build reliable software with unreliable assistants?

You don’t ask for answers. You ask for tools that produce answers.

Consider the difference:

Wrong approach:

“What is the sum of all prime numbers under 1000?”

The LLM will likely produce an answer. It might even be correct. But you have no way to verify it without doing the work yourself.

Right approach:

“Write a function that identifies prime numbers, then use it to sum all primes under 1000. Include tests.”

Now you have:

  1. Code you can read and understand
  2. Tests that verify the logic
  3. A tool you can re-run with different inputs
  4. Something deterministic built from something non-deterministic

The LLM’s non-determinism is contained to the code generation step. Once the code exists and tests pass, the system behaves predictably.
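Concretely, the “right approach” prompt above should yield something like this (one possible implementation):

```python
def is_prime(n: int) -> bool:
    """Deterministic trial-division primality check."""
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True

def sum_primes_below(limit: int) -> int:
    """Sum every prime strictly below `limit`."""
    return sum(n for n in range(2, limit) if is_prime(n))
```

Run it and the answer is fixed, reproducible, and checkable against small known cases (the primes below 10 sum to 17). The probabilistic step produced the tool, not the number.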

This is non-negotiable: require test-driven development from your LLM.

Not “write tests.” Not “include tests.” Write the tests first, get my approval, then implement.

Here’s why this matters for non-deterministic systems:

Tests are a contract. When the LLM writes tests first, it’s forced to articulate what it thinks you want. You review that articulation before any implementation exists. Misunderstandings surface when they’re cheap to fix—before hundreds of lines of code encode the wrong assumptions.

Tests constrain the solution space. An LLM with a blank canvas will produce something. An LLM with failing tests to satisfy has a target. The non-determinism still exists, but it’s bounded by concrete assertions.

Tests are reviewable by humans. Implementation code requires understanding algorithms, data structures, edge cases. Test code requires understanding intent: “when X happens, Y should result.” You can review whether tests capture your requirements without being an expert in the implementation language.

The workflow:

  1. Describe what you want
  2. LLM writes tests (not implementation)
  3. You review: “Do these tests capture my requirements?”
  4. Iterate until tests are correct
  5. LLM implements to make tests pass
  6. You verify tests actually pass

If the LLM writes implementation before tests, reject it. “Stop. Tests first. Show me what you think success looks like before you show me how to achieve it.”

This isn’t pedantry. It’s the difference between reviewing a blueprint and reviewing a finished building. One is cheap to change. The other isn’t.

When demanding TDD, demand that tests document the problem, not just the solution:

```python
def test_reservation_prevents_double_spending():
    """
    Problem: Players could join multiple poker tables with the same bankroll,
    creating settlement disputes when they lose at both tables simultaneously.

    Solution: Fund reservation locks a portion of the bankroll, making it
    unavailable for other reservations until released.

    This test verifies that a second reservation fails when insufficient
    unreserved funds remain.
    """
    player = Player(bankroll=500)
    player.reserve(300)  # First table
    with pytest.raises(InsufficientFunds):
        player.reserve(300)  # Second table - should fail
    assert player.available_balance == 200
```

The docstring explains:

  • What problem exists (double-spending across tables)
  • Why this solution (fund locking)
  • What this specific test validates (second reservation fails)

The test code shows how the solution works.

This transforms tests from “verification that code works” into “documentation of why code exists.” The test docstring is the right place for explanations—it’s coupled to the behavior it describes and breaks visibly when the behavior changes.

TDD handles code generation. But what about understanding existing code?

The illuminated code walkthrough is a collaborative reading pattern where AI narrates execution flow while you read the code. Like illuminated manuscripts with their explanatory marginalia, the AI provides context and commentary that helps you understand what you’re seeing—without that commentary becoming permanent (and eventually stale) documentation.

Start with flows, not files. The most valuable walkthroughs trace execution paths: “Walk me through what happens when a user places an order” or “Step through the integration test for hand completion.” You follow complete paths from entry point through all possible endings.

The AI narrates: “The OrderCompleted event triggers the fulfillment saga, which emits a CreateShipment command to the fulfillment aggregate, which…” You read each piece of code as it becomes relevant, understanding the full path rather than isolated functions.

One step at a time. The AI presents each function or handler in execution order, not file order. You see the code in the sequence it actually runs.

AI explains as you go. What data flows in? What transforms? What side effects occur? The AI provides narrative while you read, connecting each step to the last.

AI questions unusual patterns. Not just description—interrogation. “This saga assumes the inventory check already passed, but I don’t see where that’s enforced.” The AI acts as a second set of eyes on the flow, not just the code.

You control the pace. The AI asks “Changes, or continue?” Don’t proceed until you understand how this step connects to the whole.

The interaction:

  1. You name the flow: “Walk me through the table-to-hand event flow”
  2. AI presents the entry point with context
  3. AI follows execution to the next handler, explaining the transition
  4. AI flags potential issues in the flow
  5. AI asks: “Changes, or continue?”
  6. Repeat until the flow completes

This works especially well with integration tests. The test defines the scenario; the illuminated walkthrough reveals every step of execution that makes the test pass. You understand not just that it works, but how it works—and whether the “how” matches your mental model.

Crucially, you’re validating the AI’s understanding in real-time. When it misexplains a transition or loses the thread, you catch it immediately. This trains your calibration of when to trust its output and when to dig deeper.

The illumination is ephemeral by design. It helps you understand the code now. Don’t paste it into comments—as the code changes, the explanations become stale lies. The test docstrings are your durable documentation; the illuminated walkthrough is scaffolding you discard when the session ends.

1. Tests first, always

Whether generating new code or reviewing existing code, start with tests. For generation: “Write tests first, then implement.” For review: “Walk me through the tests, then the implementation.”

2. Require problem documentation in tests

Every test function should document the specific problem it validates. This is the right place for durable explanations—coupled to behavior, visible when behavior changes.
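In Rust, that documentation fits naturally in the test's doc comment. A minimal sketch—the helper function and the scenario are hypothetical, not from any real codebase:

```rust
/// Hypothetical helper under test: normalizes an email for comparison.
fn normalize_email(raw: &str) -> String {
    raw.trim().to_lowercase()
}

/// Problem: duplicate accounts were created when a user signed up with
/// "User@Example.com" and later logged in with "user@example.com".
/// This test pins the normalization that prevents the duplicate.
#[test]
fn normalization_makes_case_variants_equal() {
    assert_eq!(
        normalize_email("  User@Example.com "),
        normalize_email("user@example.com"),
    );
}
```

When the test fails after a refactor, the doc comment tells you which real-world problem you're about to reintroduce.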

3. Let the illumination be ephemeral

AI explanations during illuminated walkthroughs help you understand code in the moment. Don’t preserve them as comments—they’ll rot. Use them, then let them go.

4. Verify incrementally

Don’t let the LLM write 500 lines before you review. Small batches, frequent verification. Errors compound.

5. Run everything

Actually execute the tests. Actually check the output. “It should work” is not the same as “it works.”

LLMs are tools for two things:

  1. Generating artifacts—code, tests
  2. Providing narrative—explanations, analysis, questions

The artifacts can be deterministic even when the generation process isn’t. The narrative helps you understand but shouldn’t be preserved—it’s tied to a moment in time, not to the code itself.

Your job is to:

  1. Demand tests first (constrain before implementing)
  2. Review the tests (verify they capture intent)
  3. Verify the artifacts (run, don’t assume)
  4. Use the narrative to understand (then let it go)
  5. Put durable documentation in test docstrings (coupled to behavior)

The LLM accelerates the drafting and illuminates the reading. You ensure the correctness.

This isn’t a limitation to work around. It’s the appropriate division of labor between a probabilistic generator and a human who needs reliable systems.


The irony of this post being written with AI assistance is not lost on me. The difference: I reviewed every claim, verified it matched my experience, and take responsibility for the result. That’s the model.

Plan, Review, Execute: Getting Better Results from LLMs

The most effective LLM workflows share one trait: they force a pause between planning and execution. You wouldn’t let a contractor start demolition before approving blueprints. The same applies to AI assistants.

LLMs are biased toward action. Given a task, they want to produce output immediately. This leads to:

  • Implementations that don’t match your mental model
  • Refactoring that introduces patterns you don’t want
  • Solutions to problems you didn’t actually have

The fix isn’t more detailed prompts. It’s workflow structure.

Before any implementation, require a plan. Not pseudocode, not a summary of what the LLM intends to do. A concrete list of files to touch, functions to modify, and decisions that need your input.

The plan itself isn’t the value. The review is.

Plans expose assumptions. An LLM might assume you want bcrypt when you’re using Argon2, or assume PostgreSQL when you’re on SQLite. Catching this before code exists saves hours.

More importantly, plans surface questions the LLM should ask but often doesn’t. “Should this be configurable?” and “What happens on failure?” are questions better asked before implementation than discovered during code review.

For existing code, planning becomes reviewing. The illuminated code walkthrough applies the same checkpoint principle: AI narrates execution flow one step at a time while you read along, controlling the pace.

The interaction:

  1. AI presents a function or handler with explanation
  2. AI flags potential issues
  3. AI asks: “Changes, or continue?”
  4. Human responds
  5. Repeat

This works especially well when tracing integration tests or application flows—you follow complete paths from entry point through all possible endings.

For a deeper treatment of illuminated walkthroughs and how they fit with test-driven development, see Building Deterministic Systems with Non-Deterministic Tools.

Long reviews span multiple sessions. Track progress with a simple status document:

  • List of files/functions to review
  • Checkmarks for completed items
  • Notes on decisions made
  • Questions to revisit

Keep this gitignored. It’s session state, not documentation.

Codebase onboarding. Illuminate key flows with the AI narrating as you read.

Code review with approval gates. When every change needs sign-off before the next.

Refactoring sessions. Make one change, verify it works, move to the next.

Teaching and learning. Slow pace with space for questions beats firehose explanations.

LLMs work best with feedback loops, not fire-and-forget prompts. Plan mode creates one checkpoint. Illuminated walkthroughs create many. Both share the same principle: you can’t review what you haven’t seen.

Build the pause into your workflow. The LLM will produce better work, and you’ll catch problems before they become expensive.

Testcontainers Blur the Lines Between Unit and Integration Tests

The old unit/integration distinction assumed “integration” meant “slow, fragile, needs environment setup.” Testcontainers changed the economics.

We used to draw a hard line between unit tests and integration tests:

  • Unit tests: Fast, no external dependencies, run anywhere, colocate with code
  • Integration tests: Slow, need databases/queues/services, run in CI, separate directory

This separation made sense when “integration test” meant “spin up a full environment.” You wouldn’t colocate tests that require PostgreSQL next to your repository implementation; they’d fail on every developer’s machine without the right setup.

Testcontainers (in Rust: testcontainers-rs) spins up real infrastructure in Docker containers, on demand, per test.

#[test]
fn bit_event_store_persists_events() {
    let container = PostgresContainer::new();
    let pool = connect_to(&container);
    let store = PostgresEventStore::new(pool);

    store.append("order-123", vec![event]).unwrap();

    let events = store.read("order-123").unwrap();
    assert_eq!(events.len(), 1);
}

This test spins up a real PostgreSQL instance in Docker, runs the test against it, and tears it down. No shared database. No environment configuration. No “works on my machine.” The container is ephemeral, isolated, and automatic.

Behavioral Interface Tests (BITs) Fit Here


We call these Behavioral Interface Tests (BITs): tests that verify an implementation correctly fulfills its interface’s behavioral contract. Tests that verify trait implementations (EventStore, SnapshotStore, MessageBus) are BITs—not “integration tests” in the traditional sense.

These tests should live near the implementation:

src/
├── storage/
│   ├── postgres.rs      # PostgresEventStore implementation
│   ├── postgres.bit.rs  # BITs against real Postgres
│   ├── sqlite.rs
│   └── sqlite.bit.rs

The “real database” aspect doesn’t change where the test belongs. It’s still testing one module’s behavior. It’s still colocated. It just happens to need a container.

(Why “BIT”? It’s a pun. “The BIT caught a regression.” “That edge case BIT me.” Also: Behavioral Interface Test.)

The old unit/integration split was about how tests run. The better distinction is what they test.

| Test type   | What it tests                              | Where it lives                           |
| ----------- | ------------------------------------------ | ---------------------------------------- |
| Unit        | Pure logic, no dependencies                | Adjacent .test file                      |
| BIT         | Single implementation against its interface | Adjacent .test file (with testcontainers) |
| Integration | Multiple components interacting            | tests/ directory                         |
| End-to-end  | Full system behavior                       | Separate test project                    |

BITs with testcontainers are closer to unit tests than integration tests. They test one thing. They’re fast enough to run frequently. They should be colocated.

See Martin Fowler’s Practical Test Pyramid for more on scope-based test categorization.

Yes, testcontainer tests are slower than pure unit tests. On my machine, a PostgreSQL container adds ~2 seconds of startup. That’s too slow for “run on every save” but fine for “run before commit.”

We handle this with test categories:

#[test]
fn test_pure_logic() { /* runs always */ }

#[test]
#[cfg_attr(not(feature = "testcontainers"), ignore)]
fn test_postgres_storage() { /* runs with --features testcontainers */ }
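The cfg_attr gate assumes a matching opt-in feature declared in Cargo.toml. A minimal sketch—the feature name follows the snippet above; your manifest will differ:

```toml
[features]
# Empty marker feature: enables container-backed tests via
#   cargo test --features testcontainers
testcontainers = []
```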

Local development runs the fast tests continuously. Pre-commit hooks (we like Lefthook) and CI run everything. The slower tests are still colocated; they’re just conditionally executed.

Mocks Are for Boundaries, Not Implementations


This shift changed how I think about mocking. Previously, I’d mock the database to test repository logic. Now I test the repository against a real database (via testcontainers) and reserve mocks for:

  • External services I don’t control (third-party APIs)
  • Failure injection (simulate network errors)

If I can test against the real thing cheaply, I should. Testcontainers made “the real thing” cheap.

The unit/integration distinction was always about economics: unit tests were cheap, integration tests were expensive. Testcontainers collapsed that cost difference for many scenarios.

When the economics change, the categories should too. BITs against real infrastructure aren’t integration tests just because they touch a database. They’re colocatable, fast-enough, single-purpose tests that happen to need Docker.

Organize by what you’re testing, not by what tools you need to test it.


Prior art: This concept aligns with what some call “Behavioral Contract Testing” (jdecool.fr) and the Abstract Test pattern (testingpatterns.net). We prefer “BIT” because it’s punchier and avoids confusion with Consumer-Driven Contract testing (Pact, etc.).

Tests Belong Next to the Code They Test

Tests should live next to the code they test—same directory, separate file. Not inline. Not in a parallel tree.

src/
├── user_service.rs # Production code only
├── user_service.test.rs # Tests only
└── mod.rs

AI context windows changed my thinking. When an AI reads a 500-line file where 300 lines are tests, it wastes 60% of its context budget on code irrelevant to most tasks. Separate files let AI skip tests; inline tests force everything into context.

Java’s split into src/main and src/test goes too far—that was a workaround for the JVM’s inability to exclude code at compile time. Modern languages (Rust, Go) solved this. We get colocation without the baggage.

The principle: Tests belong near code. The implementation: Same directory, separate file, clearly named (.test.rs, _test.go).


I used to prefer Rust’s #[cfg(test)] mod tests pattern: maximum colocation, one scroll shows everything.

Working with AI assistants changed my mind. Every token in an AI context window has a cost. Inline tests create noise: search for business logic, get hits in test assertions, fixtures, helpers. Ask an AI to understand authentication, it loads 47 test cases it doesn’t need.

The problem isn’t that tests exist. It’s that inline tests are in the way.

Separate files preserve colocation (one directory listing shows both) while enabling selective loading. AI tools skip .test files. Humans wanting documentation head for the tests. Choice instead of force.

Java’s split into src/main and src/test was a workaround for tooling limitations, not a design choice made for developer benefit.

The JVM’s class loading model forced physical separation:

  1. No conditional compilation. Unlike Rust’s #[cfg], Java can’t say “compile this class but exclude it from the JAR.” Every .class file could end up in production.

  2. Heavy test dependencies. JUnit, Mockito, assertion libraries add megabytes. You don’t want them shipped.

  3. Classpath-based loading. The only way to exclude code was to put it in a different directory and configure the packager to ignore it.

Maven’s Surefire plugin runs tests from target/test-classes. The JAR plugin packages from target/classes. They never overlap because the source directories never overlapped. Physical separation at source level cascades to physical separation everywhere.

my-project/
├── src/
│   ├── main/java/com/example/UserService.java
│   └── test/java/com/example/UserServiceTest.java
└── pom.xml

To find tests for UserService: up from src/main/java/com/example/, over to src/test/java/, back down through com/example/. That’s not “next to the code.” That’s an archaeological expedition.

.NET went further: separate assemblies.

MySolution/
├── MyApp/MyApp.csproj → MyApp.dll
├── MyApp.Tests/MyApp.Tests.csproj → MyApp.Tests.dll
└── MySolution.sln

Assembly references are explicit. NuGet packages are per-project. Deployment is per-assembly. The tooling expects tests far away from production code.

Both patterns solved real technical problems. But they created organizational ones we’ve been living with for decades.

Rust and Go proved smarter tooling removes the need for separation.

Rust’s #[cfg(test)] eliminates code at compile time:

pub struct UserService { /* ... */ }

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn test_create_user() { /* ... */ }
}

In release builds, the test module doesn’t exist—not compiled, not linked, not present. Test dependencies ([dev-dependencies]) are only linked when building tests.

No deployment risk. No dependency contamination. No separate directories needed.

Go uses file naming: _test.go files are only compiled by go test.

mypackage/
├── user_service.go # Production code
└── user_service_test.go # Test code

No annotations, no configuration. The toolchain examines file names before compilation. go build skips test files entirely.

Both languages achieve colocation because the compiler handles what the filesystem used to handle.

We use Rust’s .test.rs pattern with the #[path] attribute:

src/
├── correlation.rs       # Production code
├── correlation.test.rs  # Tests
└── mod.rs

mod.rs
pub mod correlation;

#[cfg(test)]
#[path = "correlation.test.rs"]
mod correlation_tests;

This gives us:

  • Tests adjacent to code (same directory)
  • Production files focused on implementation
  • Test files skippable when reading for understanding
  • Conditional compilation via #[cfg(test)]
  • Clean mutation testing workflow

Mutation testing benefits: Separate files pair well with tools like cargo-mutants. If a mutation survives (accidentally gets committed), it’s in correlation.rs; the test file is untouched. Revert the production file, keep the tests. With inline tests, reverting means losing both mutated code and test improvements.

Test Support Files: When Production Needs Test Logic


Sometimes production code needs to call test-specific logic—mock handlers, test fixtures, specialized parsers for test data. The #[cfg(test)] block inside the production function works, but what if it’s substantial? Inline test code pollutes the production file.

The solution: test support files using the same #[path] pattern.

src/orchestration/aggregate/
├── merge.rs # Production code (clean)
├── merge_test_support.rs # Test helpers (separate file)
├── tests.rs # Unit tests
└── mod.rs

The production file stays minimal:

// merge.rs - only 4 lines of test boilerplate
#[cfg(test)]
#[path = "merge_test_support.rs"]
pub(crate) mod test_support;

pub(crate) fn diff_state_fields(before: &Any, after: &Any) -> HashSet<String> {
    // ...
    #[cfg(test)]
    if before.type_url == "test.StatefulState" {
        return test_support::diff_test_state_fields(&before.value, &after.value);
    }
    // ... production logic
}

The test support file contains the helpers:

merge_test_support.rs
//! Test support for merge module.
//! Only compiled during tests via #[path] include.
pub(crate) fn parse_test_state_fields(s: &str) -> HashMap<String, String> { /* ... */ }
pub(crate) fn diff_test_state_fields(before: &[u8], after: &[u8]) -> HashSet<String> { /* ... */ }

Unit tests import from the support module:

tests.rs
use super::merge::test_support::{diff_test_state_fields, parse_test_state_fields};

#[test]
fn test_diff_detects_single_change() {
    let changed = diff_test_state_fields(
        r#"{"field_a":100,"field_b":200}"#.as_bytes(),
        r#"{"field_a":100,"field_b":300}"#.as_bytes(),
    );
    assert!(changed.contains("field_b"));
}
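For concreteness, the elided helpers could look something like this—a sketch, not the project's actual implementation, assuming the test state is serialized as a flat JSON object with scalar values (in the real file these would be pub(crate)):

```rust
use std::collections::{HashMap, HashSet};

/// Parse a flat JSON object like {"field_a":100} into field -> raw value text.
/// Naive sketch: assumes no nested objects and no commas or colons inside values.
fn parse_test_state_fields(s: &str) -> HashMap<String, String> {
    let inner = s.trim().trim_start_matches('{').trim_end_matches('}');
    inner
        .split(',')
        .filter_map(|pair| {
            let (key, value) = pair.split_once(':')?;
            Some((key.trim().trim_matches('"').to_string(), value.trim().to_string()))
        })
        .collect()
}

/// Names of fields whose values differ between two serialized states.
fn diff_test_state_fields(before: &[u8], after: &[u8]) -> HashSet<String> {
    let before = parse_test_state_fields(std::str::from_utf8(before).unwrap_or(""));
    let after = parse_test_state_fields(std::str::from_utf8(after).unwrap_or(""));
    let mut changed = HashSet::new();
    // Fields removed, or present with a different value.
    for (key, value) in &before {
        if after.get(key) != Some(value) {
            changed.insert(key.clone());
        }
    }
    // Fields that exist only in the new state.
    for key in after.keys() {
        if !before.contains_key(key) {
            changed.insert(key.clone());
        }
    }
    changed
}
```

The point isn't the parsing strategy—it's that this logic lives in merge_test_support.rs, not in merge.rs.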

When to use this pattern:

  • Production code needs conditional test behavior
  • Test helpers exceed ~20 lines
  • You want readers to see business logic, not test fixtures

Visibility note: Use pub(crate) if sibling test modules need access; pub(super) if only the parent module calls the helpers.

This reduced our merge.rs from ~300 lines to ~205 lines—all test code now lives in adjacent files, still colocated but not inline.

Context window impact: The same principle from the intro applies here. When an AI assistant reads merge.rs to understand commutative merge logic, it gets 205 lines of business logic—not 300 lines where a third is test fixture parsing. The merge_test_support.rs file exists for when the context genuinely needs test helpers; otherwise it’s skipped. Every line of test code in a production file is a line competing for attention in a context window that could hold actual implementation details.

This isn’t absolutism. Some tests benefit from separation:

Integration tests exercising multiple modules belong in tests/. They’re not testing one file.

End-to-end tests spinning up the whole system are genuinely different. Different lifecycle, different dependencies.

Shared fixtures might warrant their own module—though I’d put them in src/test_utils/, not a parallel tree.

The principle: don’t separate without reason. Colocation is the default. Separation is a deliberate choice.

What’s lost with separate files:

  • Visibility. Inline tests were impossible to miss. Separate files require knowing to look.
  • Encouragement. Scroll down, see tests. With separate files, there’s an extra step.
  • Atomic versioning. Change function and test in one commit. Separate files technically allow drift.

What’s gained:

  • Cleaner production files. Implementation without test noise.
  • Efficient AI assistance. Context windows focused on relevant code.
  • Faster codebase search. Grep for logic, not test assertions.
  • Flexible reading. Choose when to engage with tests.

For me, in 2026, with AI assistants as daily collaborators, the tradeoff favors separate files.

Every position in this article emerged from tooling constraints of its era. Java’s parallel directories made sense when the JVM couldn’t exclude code. Rust’s inline tests made sense when file size didn’t compete with AI context budgets.

Tomorrow’s tradeoffs will differ. AI context windows will grow. IDE integrations will get smarter. When constraints change, optimal organization changes too.

What won’t change: tests belong near the code they test. The definition of “near” adapts to tooling. The principle doesn’t.

The Container Overlay Pattern: Same Makefile Command, Different Context

How we eliminated conditionals from our Makefile while supporting both host and containerized builds with a single command interface.

We wanted containerized builds for consistency across developer machines and CI. But every approach we tried had friction:

Dual Makefiles (Makefile and Makefile.docker): Works, but now everyone has to remember which file to use. Documentation says “run make -f Makefile.docker build” and someone inevitably runs make build instead.

Conditional detection: Check for /.dockerenv or an environment variable:

IN_DOCKER := $(shell test -f /.dockerenv && echo 1 || echo 0)

ifeq ($(IN_DOCKER),1)
build:
	cargo build
else
build:
	docker run ... make build
endif

This works but clutters the Makefile. Every target needs the conditional. The file becomes a maze of ifeq/else/endif blocks.

Different commands: make build on host, make container-build for Docker. Now you have parallel target names, duplicate documentation, and cognitive overhead.

We wanted something simpler: same command, different behavior based on context.

Docker bind mounts can replace individual files inside the container. The Docker documentation even mentions this—if you mount over an existing file, the original is “obscured.”

What if we mount a different Makefile over the host’s Makefile inside the container?

Two files, one interface:

project/
├── Makefile # Host version: delegates to container
└── Makefile.container # Container version: runs commands directly

Host Makefile starts the container and mounts the overlay:

DOCKER_RUN := docker run --rm \
	-v ./:/workspace \
	-v ./Makefile.container:/workspace/Makefile:ro \
	-w /workspace \
	myimage

build:
	$(DOCKER_RUN) make build

test:
	$(DOCKER_RUN) make test

Container Makefile runs commands directly:

build:
	cargo build

test:
	cargo test

The key line: -v ./Makefile.container:/workspace/Makefile:ro

This mounts Makefile.container over Makefile inside the container. When the container runs make build, it sees Makefile.container as Makefile.

The file swap is the detection mechanism.

  • On host: make build → runs Docker → mounts overlay → runs make build inside container
  • In container: make build → runs cargo build directly (because Makefile is now Makefile.container)

No conditionals. No environment variable checks. No remembering which command to run. The mount handles everything.

Before (conditional detection):

IN_DOCKER := $(shell test -f /.dockerenv && echo 1 || echo 0)

ifeq ($(IN_DOCKER),1)
build:
	cargo build
test:
	cargo test
lint:
	cargo clippy
else
build:
	docker run ... make build
test:
	docker run ... make test
lint:
	docker run ... make lint
endif

After (overlay pattern):

# Host Makefile - just delegation
build:
	$(DOCKER_RUN) make build
test:
	$(DOCKER_RUN) make test
lint:
	$(DOCKER_RUN) make lint

# Container Makefile - just execution
build:
	cargo build
test:
	cargo test
lint:
	cargo clippy

Same number of lines, but separated by concern. Host file handles orchestration. Container file handles execution. No mixing.

With separate Makefile and Makefile.docker, users must know which to invoke. CI scripts use one, developers might use another. Documentation has to explain both.

With the overlay pattern, there’s one command: make build. It works everywhere. The context determines the implementation.

Host Makefile responsibilities:

  • Container image selection
  • Volume mounts
  • Network configuration
  • Environment variables

Container Makefile responsibilities:

  • Compilation
  • Testing
  • Linting
  • Any actual build logic

These concerns don’t mix. When build logic changes, edit the container file. When container orchestration changes, edit the host file.

If you’re using VS Code devcontainers or similar, you might already be inside a container. Running Docker-in-Docker works but adds overhead.

Optional escape hatch:

ifdef DEVCONTAINER
DOCKER_RUN :=
else
DOCKER_RUN := docker run --rm -v ... myimage
endif

build:
	$(DOCKER_RUN) make build

When DEVCONTAINER is set, DOCKER_RUN becomes empty and commands run directly. This is the one conditional we allow—and it’s optional.

Same pattern, swap docker for podman. We use Podman with the :Z SELinux flag:

PODMAN_RUN := podman run --rm \
	-v ./:/workspace:Z \
	-v ./Makefile.container:/workspace/Makefile:ro \
	-w /workspace \
	myimage

The pattern works identically with just—and the code is cleaner:

# justfile (host)
_run +ARGS:
    podman run -v ./justfile.container:/workspace/justfile:ro ... just {{ARGS}}

build:
    just _run build

# justfile.container
build:
    cargo build

just’s module system (mod examples "examples/justfile") composes naturally—module commands route through the container transparently.

Compared to Make, just:

  • No .PHONY declarations
  • Shell variables work naturally ($(hostname) vs $$(hostname))
  • Recipes can take arguments (just _run build)
  • Better error messages
  • Cross-platform without gnumake vs BSD make quirks

The Make version works. The just version… just works.

(No affiliation with just—just a happy user.)

This pattern is better than the alternatives, but let’s not oversell it. There’s still duplication:

  • Target names repeated in both files
  • Two files to maintain instead of one
  • Container orchestration logic repeated per-target (though DRY-able with variables)

It’s not perfect. It’s just… less bad. The duplication is mechanical rather than logical—you’re not mixing concerns, just listing the same names twice. That’s easier to maintain than conditional spaghetti, but it’s still more than ideal.

That said, mechanical duplication is exactly the kind of work AI assistants handle well. “Add a lint target that runs cargo clippy” is a constrained, rule-following task: add it to the container file with the actual command, add a delegation stub to the host file. No judgment calls, no architectural decisions—just pattern application. If you’re already using AI-assisted development, this maintenance overhead largely disappears.

If someone invents a cleaner approach, we’re all ears.

  • Simple projects: If you don’t need containerized builds, don’t add complexity.
  • Uniform environments: If all developers run the same OS with the same toolchain, containers may be overkill.
  • Single-target deployments: If you only deploy to one platform, you might not need the isolation.

The pattern shines for polyglot projects, mixed dev teams (Linux/macOS/WSL), and CI/CD pipelines where consistency matters.

  1. Create Makefile with container delegation
  2. Create Makefile.container with direct commands
  3. Add the mount: -v ./Makefile.container:/workspace/Makefile:ro
  4. Run make build on host and in container—same command, appropriate behavior

For full implementation details, see the angzarr repository for working examples of this pattern.