Stop putting secrets in your shell profile. If you work across multiple cloud projects, languages, and infrastructure stacks concurrently, direnv scopes your environment per directory—so the right credentials are active for the right project, and nothing leaks sideways.
This is a short post about a small tool that eliminates an entire class of “wrong account” mistakes.
If you work on one project with one cloud account, environment variables in .zshrc are fine. Export your tokens, source your profile, move on.
That stops working when you’re a polyglot developer juggling multiple projects with different cloud providers, different accounts, different API keys, and different infrastructure stacks. The .zshrc approach gives you a flat namespace:
# .zshrc — which project are these for?
export GITHUB_TOKEN="ghp_..."
export DATABASE_URL="postgres://..."
export SUPABASE_ANON_KEY="eyJ..."
export TF_VAR_project="my-gcp-project"
export NEON_API_KEY="..."
export PORKBUN_API_KEY="..."
Every shell session sees every secret. Switch from Project A to Project B? Same DATABASE_URL. Same TF_VAR_project. Same everything. You’re one terraform apply away from modifying the wrong infrastructure because you forgot which project’s credentials are loaded.
This isn’t hypothetical. If you’ve ever run a command against the wrong cloud project, you know the feeling.
direnv loads and unloads environment variables based on your current directory. You put an .envrc file in a project directory, and when you cd into it, those variables are set. When you cd out, they’re unset.
cd ~/workspace/travel/infra — travel credentials loaded. cd ~/workspace/angzarr/core — travel credentials gone, Angzarr credentials loaded. No manual sourcing, no remembering which project you’re in, no stale variables from the last directory.
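Concretely, an .envrc is just shell. A minimal sketch (the project values and the Bitwarden item name are made up for illustration):

```shell
# ~/workspace/travel/infra/.envrc — loaded on cd in, unloaded on cd out
export TF_VAR_project="travel-staging"
export DATABASE_URL="postgres://localhost:5432/travel"

# Secrets can be fetched at load time instead of stored on disk,
# e.g. from Bitwarden via the bw CLI (hypothetical item name):
export NEON_API_KEY="$(bw get password neon-travel)"
```

Note that direnv refuses to load an .envrc until you approve it with `direnv allow`, so a malicious file in a cloned repo can't silently execute.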
The value scales with the number of projects you maintain. With two projects, you might remember which credentials are active. With five—each with its own cloud provider, database, API keys, and Terraform state—you won’t.
direnv gives you:
Isolation by default. Project A’s credentials don’t exist in Project B’s shell. You can’t accidentally terraform apply against the wrong account because the wrong account’s variables aren’t set.
Secret scoping. API keys, database URLs, and cloud credentials live next to the project that uses them, not in a global profile that every project inherits. Add .envrc to .gitignore and secrets stay local.
Composability. .envrc files can source other files, call CLI tools (like bw for Bitwarden or gcloud for GCP tokens), and inherit from parent directories. A team-level .envrc can set shared defaults while a project-level one overrides specifics.
Onboarding simplification. “Clone the repo, create an .envrc with these variables, run direnv allow” is easier to communicate and harder to get wrong than “add these twelve exports to your shell profile and make sure they don’t conflict with your other projects.”
Audit your shell profile. Anything project-specific should move to an .envrc:
| Keep in .zshrc | Move to .envrc |
| --- | --- |
| PATH modifications | DATABASE_URL |
| Shell aliases | TF_VAR_* |
| Tool initialization (nvm, pyenv) | Cloud API keys |
| Editor config | Project-specific tokens |
| General preferences | SUPABASE_*, NEON_*, etc. |
The rule: if it’s specific to a project or cloud account, it belongs in that project’s .envrc. If it’s about your development environment in general, it stays in .zshrc.
direnv is a small tool. It does one thing. But for developers working across multiple cloud projects, languages, and infrastructure stacks, that one thing eliminates an entire category of mistakes that range from embarrassing to catastrophic.
The cost of adopting it is one line in your .zshrc and a few minutes moving exports into .envrc files. The cost of not adopting it is eventually running terraform destroy against production because you forgot you were still pointing at the wrong project.
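That one line is direnv's shell hook:

```shell
# .zshrc — hook direnv into the shell; this is the entire global footprint
eval "$(direnv hook zsh)"
```

Everything project-specific then lives in per-directory .envrc files instead of the profile.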
This is one of those tools where the setup time is measured in minutes and the first “oh, that would have been bad” moment comes within the week.
Deployment boundaries enforce architectural boundaries—especially when your coworker is an AI that will take any shortcut it can see.
I caught Claude reading directly from another aggregate’s database because the file was visible in the project. Deploying to K8s (via Kind) broke that access and forced it to implement projector services properly. The infrastructure did what code review should have caught.
I started building a board game in angzarr-standalone, the now-deprecated in-process variant of the Angzarr system. The backing stores are SQLite—perfectly fine for prototyping a physical board game on a CQRS framework, though not suitable for real production use of Angzarr. Everything runs in one process, which means every aggregate’s storage is visible to every other component.
Claude, tasked with getting a feature working, found the shortest path: query another aggregate’s database directly. No projector, no gRPC service, no event subscription. Just reach across the aisle and read the data. It works. It’s fast. It violates every principle of aggregate isolation that event sourcing depends on.
I didn’t catch it immediately. We’re in prototype mode—moving fast, revisions planned for later. But the shortcut was building in coupling that would be painful to unwind.
Moving to Kind—a local Kubernetes cluster—broke the access by default. Each aggregate runs in its own pod. There is no shared filesystem. There is no “just read the other database.” If you need data from another aggregate, you go through a projector service.
Claude had no choice but to implement the projector. The architecture became self-enforcing.
This wasn’t the reason I moved to Kind. But it was an immediate, tangible benefit. The deployment topology eliminated an entire class of architectural violations—not through code review, not through linting, not through discipline, but through access control.
LLMs optimize for task completion. Given a goal and visible resources, they will use whatever path gets them there. This is useful—it’s why they’re productive. But it means they will cheerfully violate architectural boundaries that exist only as conventions.
In-process deployment makes everything a convention. Aggregate isolation? Convention. Service boundaries? Convention. Data ownership? Convention. An AI (or a junior developer, or a senior developer under deadline pressure) can bypass any of them because the runtime doesn’t enforce them.
Pod-level isolation converts conventions into constraints:
| Boundary | In-Process | K8s Pod |
| --- | --- | --- |
| Aggregate data access | Convention (filesystem visible) | Enforced (separate storage) |
| Service interfaces | Convention (can call anything) | Enforced (network only) |
| Domain isolation | Convention (shared memory) | Enforced (process boundary) |
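In Kubernetes terms, nothing exotic enforces this: each aggregate's Deployment simply mounts storage that no other pod can reach. A sketch, with illustrative names and image:

```yaml
# Each aggregate runs as its own Deployment with its own volume.
# No other pod can mount player-data, so "just read the other
# aggregate's database" is not an available move.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: player-aggregate
spec:
  replicas: 1
  selector:
    matchLabels:
      app: player-aggregate
  template:
    metadata:
      labels:
        app: player-aggregate
    spec:
      containers:
        - name: aggregate
          image: example/player-aggregate:dev   # hypothetical image
          volumeMounts:
            - name: player-data
              mountPath: /var/lib/angzarr
      volumes:
        - name: player-data
          persistentVolumeClaim:
            claimName: player-data   # storage scoped to this aggregate only
```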
This doesn’t mean every development environment needs Kubernetes. But it does mean that the gap between your development topology and your production topology is a source of architectural drift—and AI-assisted development widens that gap faster than human-only development does.
Watch the AIs closer than you think you need to. We’re in prototype and proof-of-concept mode—I’m the only user, the risk is low, and revisions and cleanup will come later. That’s fine. But the shortcut Claude took was building in coupling that would make those revisions harder. Prototype mindset means accepting rough edges in implementation, not in architecture. The architecture is the thing that makes cleanup possible later.
Deploy like production earlier than you think. Kind is cheap. Running a local K8s cluster adds minutes to your feedback loop, not hours. The architectural enforcement you get back is worth it—especially when AI is writing code.
Architecture-as-infrastructure beats architecture-as-policy. You can tell an AI “don’t read from other aggregates’ storage.” You can put it in a CLAUDE.md file. You can review every diff. Or you can make it impossible. One of these scales.
Standalone mode was the right call to deprecate. This experience reinforced why Angzarr dropped angzarr-standalone in 0.3.0. In-process variants are convenient for getting started, but they teach habits—and now train AIs—that break in distributed deployment. Better to start with the real topology.
This is the broader lesson, and it extends well beyond this one incident: the key to making AI developers productive is constraining them.
Without constraints, an AI coding assistant is a useful idiot. It can output code that vaguely satisfies requirements. It has no concept of future maintainability. It has no lessons learned beyond what it copies and mimics from its training corpus. It will cheerfully build a system that works today and is unmaintainable tomorrow, because “tomorrow” isn’t in the prompt.
Constraints change the equation. Container isolation, defined interfaces, enforced service boundaries, typed contracts—these don’t just prevent bad architecture. They channel the AI’s output into shapes that are maintainable by default. The AI doesn’t need to understand why aggregate isolation matters. It just needs to be unable to violate it.
This is the same principle that makes strongly-typed languages productive: the compiler rejects invalid programs before they run. K8s pod boundaries, gRPC interface definitions, and aggregate isolation do the same thing at the architecture level. They reject invalid architectures before they deploy.
The difference between an AI that produces throwaway prototypes and an AI that produces maintainable systems isn’t the AI—it’s the constraints it operates within:
| Without Constraints | With Constraints |
| --- | --- |
| Reads any data it can see | Must use defined query interfaces |
| Couples components by convenience | Couples components by contract |
| Satisfies the immediate requirement | Satisfies the requirement within architectural boundaries |
| Produces code that works | Produces code that works and can be changed later |
None of this requires the AI to be smarter. It requires the environment to be constrained—not just opinionated, but enforced. We must make the right choice the easy choice—this works for human developers too, but it’s non-negotiable for AI ones. The AI fills whatever shape you give it. Give it a flat, open codebase and it will sprawl. Give it containers, interfaces, and enforced boundaries and it will build systems that happen to be well-structured—not because it intended to, but because it had no other option.
This has implications for framework design. Angzarr’s pod-per-aggregate model, its gRPC service interfaces, its saga protocols—these were designed for distributed systems correctness. It turns out they’re also exactly what you want when AI is writing the implementation. The framework’s opinions become the AI’s guardrails.
The irony: the AI that took a shortcut around the architecture is the same one helping me write about why shortcuts around the architecture are dangerous. Supervision remains non-optional.
Short answer: yes. For most systems, projectors should serve their own data. Run the projector and its gRPC query endpoint in the same pod. Split them when scaling demands it—not before.
This applies more broadly: collocate components that don’t yet need separation, define interfaces as if they were separate, and split when reality demands it. Angzarr will soon apply this same principle to sagas, allowing them to run directly inside aggregate command handlers—with a clean extraction path when they outgrow it.
Put the projector, read store, and gRPC query service in one pod:
graph TD
subgraph Pod
P[Projector<br/>event consumer] -->|writes| RS[(Read Store)]
RS -->|reads| G[gRPC Service<br/>query endpoint]
end
When query load overwhelms the pod, or projection lag degrades query latency, or you need to scale reads independently—pull the gRPC service into its own pod:
graph TD
subgraph Projector Pod
P[Projector<br/>event consumer] -->|writes| RS[(Read Store)]
end
subgraph Query Service Pod
G[gRPC Service<br/>query endpoint] -->|reads| RS
end
The interface doesn’t change. Clients don’t know the difference. You’ve scaled without redesigning.
The gRPC service has its own interface definition from day one. The read store already sits between the projector and the query logic. Splitting them apart is a deployment change, not an architecture change—you’re moving a process boundary, not redesigning a system.
Separating the projector from the query service in a low-traffic system buys you an extra pod to deploy, monitor, and debug; network hops you didn’t need; a coordination problem when the read schema changes; and complexity that exists to solve a scaling problem you don’t have.
This is a tradeoff of correctness versus complexity. Complexity reduction should generally win, as long as it can be corrected when it becomes important. The “correct” architecture solves real problems—independent scaling, isolation of projection lag, read model rebuilds without serving impact—but most systems don’t have those problems yet.
Angzarr will likely soon support incorporating sagas directly into aggregate roots and command handlers. Same motivation: for simple sagas tightly coupled to aggregate logic and not under independent load pressure, a separate saga pod is overhead without benefit. The aggregate handles the command, emits events, and performs the coordination—all in one place.
The constraint is identical: it must be easy to peel back out. When the saga becomes complex, when its scaling needs diverge, when a different team needs to own it—extraction should be straightforward. The saga’s interface is already defined. Its coordination logic is already encapsulated. Moving it to its own process is a deployment decision, not a rewrite.
Not everything should start collocated. Split immediately when:
Load profiles are already divergent. Hundreds of events per second into the projector, millions of queries per second out—these need independent scaling from the start.
Different teams own the read and write paths. Conway’s Law applies. Shared pods across team boundaries create deployment coupling.
The read model serves latency-critical paths. If projection rebuilds can’t impact query latency, process isolation is a correctness requirement, not an optimization.
Compliance or security boundaries require it. Some read models serve sensitive data through restricted endpoints where process isolation is policy.
These are conditions you can evaluate at design time. If none apply, start simple.
Start simple. Collocate components that don’t yet need separation.
Define interfaces as if they were separate. gRPC services, saga protocols, clear boundaries in code.
Split when the pressure appears. Scaling bottlenecks, team ownership changes, reliability requirements.
The split is mechanical, not architectural. Because the interfaces already exist.
Build for the system you have. Design interfaces for the system you might need. Deploy the simplest topology that works.
This post is part of an ongoing series on pragmatic architecture decisions in event-sourced systems. The opinions are informed by building Angzarr and deploying it in production—where elegance matters less than operability.
Angzarr Core 0.3.0 ships today with a fundamental shift in how sagas and process managers handle cross-aggregate coordination: they now receive sequence numbers instead of full event books. This “facts over state rebuilding” change aligns coordinators with the framework’s core philosophy. The release also adds explicit divergence support for edition branching, enabling counterfactual “what-if” scenarios at any point in an aggregate’s timeline.
Saga and Process Manager Protocol Update. Handlers now receive destination_sequences—a map of domain to next sequence number—instead of full EventBook state. This is a breaking proto change. Coordinators stamp commands with sequences and let aggregates decide; they no longer rebuild destination state themselves.
Edition Branching with Explicit Divergence. New branches can now specify an exact divergence point from the main timeline. The storage layer reads events from main up to sequence N, returning them as base state for the new branch. Use case: “What if I had folded at sequence 3 instead of calling?”
Cascade Two-Phase Commit. Merged from the core-cascade-improvements branch, adding cascade_id and committed fields to EventPage, stale cascade cleanup via CascadeReaper, and conflict detection for distributed transactions.
OpenTelemetry 0.31 and Tonic 0.14. Updated to latest observability and gRPC stacks with breaking API changes handled throughout.
Security Fixes. Critical gRPC-Go vulnerabilities patched in the gateway (v1.70.0 → v1.79.3), plus high-severity fixes in AWS-LC and quinn-proto.
18,000 Lines Removed. Standalone mode deleted entirely. The framework now exclusively uses the distributed coordinator architecture.
The proto changes require regenerating client code:
// Before (0.2.x)
message SagaHandleRequest {
  repeated EventBook destinations = 4;
}

// After (0.3.0)
message SagaHandleRequest {
  map<string, uint64> destination_sequences = 4;
}
Same pattern for ProcessManagerHandleRequest.destination_sequences.
Handlers that previously iterated over destination event books to determine state must now use sequences directly. The philosophy: coordinators deal in facts (sequences), not state reconstruction.
The previous design had sagas receiving full event books for destination aggregates. This created several problems:
Unnecessary coupling. Sagas knew how to interpret destination domain events.
Performance overhead. Loading event history for every coordination step.
Philosophy violation. Sagas are coordinators, not domain experts.
The new design treats sequences as facts. A saga knows “Player aggregate is at sequence 7” without knowing what happened in sequences 1-6. It stamps the outbound command with sequence 7, and the aggregate validates whether that sequence is still current.
If the sequence has advanced (concurrent modification), the aggregate rejects the command. The saga retries with fresh sequences. No domain logic leaked into the coordinator.
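The stamp-and-validate loop can be sketched in a few lines of Rust (the types and names here are illustrative, not Angzarr's actual API):

```rust
// Illustrative sketch of sequence-stamped commands: the saga stamps the
// sequence it last observed as a fact; the aggregate rejects the command
// if its own sequence has moved on (a concurrent modification happened).
#[derive(Debug, PartialEq)]
enum HandleResult {
    Accepted { new_sequence: u64 },
    Rejected { current_sequence: u64 },
}

struct Aggregate {
    sequence: u64,
}

impl Aggregate {
    fn handle(&mut self, stamped_sequence: u64) -> HandleResult {
        if stamped_sequence != self.sequence {
            // Stale stamp: the coordinator retries with a fresh sequence.
            return HandleResult::Rejected { current_sequence: self.sequence };
        }
        self.sequence += 1; // command accepted, one event appended
        HandleResult::Accepted { new_sequence: self.sequence }
    }
}

fn main() {
    let mut player = Aggregate { sequence: 7 };
    // The saga stamped sequence 7; it is still current, so the command lands.
    assert_eq!(player.handle(7), HandleResult::Accepted { new_sequence: 8 });
    // A second coordinator still holding 7 is rejected and must retry.
    assert_eq!(player.handle(7), HandleResult::Rejected { current_sequence: 8 });
    println!("sequence check ok");
}
```

The saga never inspects what happened in sequences 1 through 6; it only carries the fact "the aggregate was at 7" and lets the aggregate decide.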
Last week I argued that you should build deterministic systems with non-deterministic tools—demand TDD from your LLM, get tests first, then implementation. But there’s a problem with that workflow: passing tests aren’t proof that tests are good.
Enter mutation testing: a deterministic tool that validates whether your tests actually test anything.
The previous post established a workflow: LLM writes tests, you review them, LLM implements, tests pass. But consider this test:
#[test]
fn test_reservation_prevents_double_booking() {
    let mut player = Player::new(500);
    player.reserve(300).unwrap();
    // This test "passes" but proves nothing
    assert!(true);
}
The test runs. It passes. Code coverage tools say the reserve function was called. Everything looks green. But the test validates nothing—you could delete the entire reserve implementation and this test would still pass.
An LLM optimizing for “make the tests pass” might produce exactly this kind of hollow test. Not maliciously—the test looks reasonable at a glance. It calls the right functions. It has assertions. But the assertions don’t constrain the behavior.
LLMs are pattern matchers. They’ve seen thousands of test files and can generate plausible-looking tests at scale. But “plausible-looking” isn’t “meaningful.”
Consider the failure modes:
Tautological assertions. The LLM generates assertions that restate the setup rather than verify behavior:
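For instance, using a minimal stand-in for the Player type from the earlier example (the reserve logic is invented for illustration, and a plain main is used so the assertions actually run):

```rust
// Minimal stand-in for the Player aggregate from the example above.
struct Player {
    balance: u64,
    reserved: u64,
}

impl Player {
    fn new(balance: u64) -> Self {
        Player { balance, reserved: 0 }
    }
    fn reserve(&mut self, amount: u64) -> Result<(), String> {
        if self.reserved + amount > self.balance {
            return Err("insufficient funds".into());
        }
        self.reserved += amount;
        Ok(())
    }
}

fn main() {
    let mut player = Player::new(500);
    let requested = 300;
    player.reserve(requested).unwrap();
    // Tautological: compares the setup value to itself. Delete the
    // reserve() call above and this assertion still passes.
    assert_eq!(requested, 300);
    // A meaningful assertion constrains the resulting state instead:
    assert_eq!(player.reserved, 300);
    println!("ok");
}
```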
Missing edge cases. The LLM tests the happy path but misses boundaries:
#[test]
fn test_withdraw() {
    let mut account = Account::new(100);
    account.withdraw(50).unwrap();
    assert_eq!(account.balance(), 50);
}
// Never tests: withdraw(100), withdraw(101), withdraw(0)
Implementation-coupled tests. The LLM tests that code does what it does, not what it should do:
#[test]
fn test_hash() {
    // Tests current behavior, not correct behavior
    assert_eq!(hash("input"), 0x7a3f2b1c);
}
Mutation testing catches all of these. Tautological assertions don’t kill mutants. Missing edge cases leave mutation gaps. Implementation-coupled tests kill mutants but the wrong ones—they’re brittle to refactoring while missing actual bugs.
1. LLM generates tests first, constraining the implementation before it exists
2. You review: "Do these tests capture my requirements?"
3. LLM implements to make tests pass
4. Run mutation testing on the implementation
5. Analyze survivors: which behaviors aren't actually tested?
6. Iterate: add tests that kill survivors, or accept the gap

Steps 4–6 are the key addition. Mutation testing provides objective feedback: "Your tests claim to verify X, but they'd accept this broken version of X."
This is a deterministic checkpoint in a non-deterministic workflow. The LLM might generate hollow tests. You might miss them in review. But the mutants don’t lie.
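For Rust, cargo-mutants is one such tool; the commands below show the basic loop (any mutation framework fits the same workflow):

```shell
# Install the mutation testing tool (one-time)
cargo install cargo-mutants

# Generate mutants (e.g. replace function bodies, flip operators)
# and rerun the test suite against each one
cargo mutants
```

Results land in the mutants.out directory: "caught" mutants were killed by a failing test; "missed" mutants survived and point at untested behavior.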
Mutation testing works best when business logic is isolated from infrastructure. That’s a core design principle of Angzarr: aggregates, sagas, and projectors contain pure business logic with no database calls, no network I/O, no framework dependencies. The coordinator handles infrastructure; your code handles decisions.
This isolation makes code easier to test—and easier to test meaningfully. When a function takes state and returns events, every branch is reachable without mocking. Mutation testing thrives in this environment.
Industry research provides concrete benchmarks for mutation kill rates.
A common shock: teams with 80-90% code coverage often discover mutation scores of only 30% when first adopting mutation testing. That’s the gap between “tests executed this code” and “tests verified this code.”
In this codebase, mutation testing revealed qualitative patterns consistent with the research:
Pure utility functions should target 80-90%+ kill rates. Functions that transform data without side effects are fully testable. If mutants survive, the tests are incomplete.
Framework glue tolerates lower rates. gRPC handlers that delegate to tested core logic don’t need exhaustive mutation coverage. Integration tests cover the composition. Mutation testing primarily targets unit tests; integration-heavy code may have hard-to-detect mutants.
Surviving mutants in logging are acceptable. If removing a debug!() call doesn’t break tests, that’s expected—logging is a side effect that doesn’t affect correctness.
The original post argued: use non-deterministic tools to build deterministic artifacts. Tests are deterministic—they pass or fail reproducibly.
But tests can be deterministic while being worthless. A test that always passes is perfectly reproducible. It just doesn’t prove anything.
Mutation testing adds a second layer of deterministic verification: not just “do tests pass?” but “do tests actually constrain behavior?” The mutation tool doesn’t guess. It systematically breaks things and observes results. Either the tests catch the breakage or they don’t.
This is the deterministic arbiter you need when working with LLM-generated tests. The LLM can generate plausible tests at scale. Mutation testing determines whether those tests mean anything.
The previous post established: LLMs draft, humans verify through tests.
This post adds: tests themselves need verification. Mutation testing provides that verification deterministically.
The workflow becomes:
LLM generates tests (constrain before implementing)
You review tests (verify they capture intent)
LLM implements (make tests pass)
Mutation testing validates tests (prove they constrain behavior)
Iterate until mutants are killed
The LLM accelerates drafting. Tests verify the draft. Mutation testing verifies the tests. Each layer is more deterministic than the last, building reliable systems from unreliable components.
Yes, the tests for this post were also mutation-tested. The surviving mutants were in the prose.
The uncomfortable truth: most DDD teams draw their bounded contexts too small.
Not too large—too small. They slice by CRUD entity, by database table, by team org chart. The result? Contexts that cannot make decisions autonomously. Every meaningful operation requires cross-context coordination. The architecture devolves into a distributed monolith with extra network hops.
This post argues for a different principle: a bounded context is correctly sized when every decision that changes its invariants can be made entirely within it, without synchronous runtime dependency on another context.
Eric Evans defined a bounded context as having “a unified model—that is, internally consistent with no contradictions.” He specified that teams should “explicitly define the context within which a model applies… keep the model strictly consistent within these bounds” 1.
But what does “unified” and “consistent” mean in practice?
Here’s the test: Can this context enforce its own business rules without calling out?
If the answer is “we need to ask the Orders context before we can validate a Payment,” then either:
The Payment context is undersized, or
The concepts are in the wrong context entirely
This maps to Evans’ idea that the model is the decision-making unit, not the data-holding unit. A context doesn’t exist to hold data—it exists to make decisions about that data.
Vaughn Vernon defines an invariant as “a business rule that must always be consistent, specifically referring to transactional consistency.” He states: “A properly designed Aggregate is one that can be modified in any way required by the business with its invariants completely consistent within a single transaction” 3.
The implication: every business invariant must have exactly one context that owns and enforces it. If two contexts share enforcement of the same rule, you have hidden coupling—a seam that will cause consistency bugs under load.†
In Angzarr: Each aggregate enforces its invariants in @handles methods on a CommandHandler<State> subclass. The aggregate receives commands, validates against current state, and emits events—all within a single transaction. Cross-aggregate coordination happens asynchronously via sagas, never synchronously within a command handler.
Fowler, interpreting Evans, notes: “Usually the dominant factor drawing boundaries between contexts is human culture—since models act as ubiquitous language, you need a different model when the language changes. Different groups of people will use subtly different vocabularies in different parts of a large organization” 1.
The practical test:
When two teams use the same word to mean different things → context boundary
When one team explains concepts using another team’s vocabulary → wrong context
Evans himself has clarified that “one confusion teams often have is differentiating between bounded contexts and subdomains. In an ideal world they coincide, but in reality they are often misaligned” 5.
Vernon provides the architectural pattern: “There is a practical way to support eventual consistency in a DDD model. An Aggregate command method publishes a Domain Event that is in time delivered to one or more asynchronous subscribers. Each subscriber then retrieves a different yet corresponding Aggregate instance and executes its behavior based on it, each in a separate transaction” 3.
Microsoft’s architecture guidance reinforces this: “When a business process spans multiple aggregates, use domain events rather than a single transaction. Reference other aggregates by identity only—this decoupling maps directly to microservice boundaries” 4.
The principle: prefer eventual consistency across context boundaries over synchronous consistency. If strong consistency is required between two aggregates at runtime, they probably belong in the same context—or your transaction boundary is wrong.
In Angzarr: Aggregates modify only themselves per transaction. Cross-domain communication flows through sagas (stateless translation) or process managers (stateful coordination). Both typically operate asynchronously on committed events.
Angzarr does support synchronous modes for cross-domain calls, but discourages their use—they reintroduce the coupling and availability problems eventual consistency solves. Use sync modes only when business requirements genuinely demand it and you’ve accepted the tradeoffs.
Vernon is explicit: “The consistency boundary logically asserts that everything inside adheres to a specific set of business invariant rules no matter what operations are performed. The consistency of everything outside this boundary is irrelevant to the Aggregate. Aggregates are chiefly about consistency boundaries and not driven by a desire to design object graphs” 3.
ArchiLab reinforces: “A properly designed Aggregate is one that can be modified in any way required by the business with its invariants completely consistent within a single transaction. The consequence of this is that in one transaction, you can only modify one aggregate and never more than one aggregate” 11.
The aggregate boundary is not the context boundary—but it’s a lower bound. A context should contain all aggregates whose invariants reference each other.
In Angzarr: Each domain maps to exactly one aggregate type. Each aggregate instance is identified by {domain}:{root_id}. Multiple Angzarr domains may belong to the same DDD bounded context—they share ubiquitous language and team ownership, but are separate deployment units connected by sagas.
If aggregates share invariants, they either belong in the same aggregate (larger boundary) or require explicit coordination via sagas. Angzarr makes this choice visible in infrastructure rather than hiding it in code organization.
The Anti-Corruption Layer is the integration pattern where a downstream bounded context translates concepts from an upstream context, protecting its own model from the upstream’s influence.
ACLs are correct and necessary at integration points. But if a context needs a thick ACL—translating many concepts—the boundary may warrant re-examination. Sometimes the downstream context is missing concepts it should own; sometimes the upstream context is leaking internal details; sometimes it’s unavoidable legacy integration.†
In Angzarr: Sagas connect Angzarr domains, but not all sagas are ACLs. The distinction:
Internal coordination sagas: Connect domains within the same bounded context. Shared ubiquitous language means minimal translation—mostly routing.
ACL sagas: Cross bounded context boundaries. Different teams, different language. Some translation expected.
The thickness of translation is the signal:
Thin ACL (mapping a few concepts): Normal and expected when crossing BC boundaries
Thick ACL (translating many concepts, complex mappings): Smell—suggests the boundary is in the wrong place or concepts are in the wrong context
The pattern is clear in the literature: domain events stay within the bounded context; integration events are the public contracts for cross-context communication. Commands express intent, and aggregates enforce rules.
A well-sized context accepts commands and enforces rules locally. It publishes domain events for others to react to. If a context issues commands into another context to complete its own operation, the command’s logic belongs in the first context.
In Angzarr: Aggregates accept commands and emit events. Sagas translate events from one domain into facts (or commands) for another. The default saga output is facts—events the receiving domain must accept.
Whether a saga is “translation” (ACL) or “routing” (internal coordination) depends on whether the domains share a bounded context. Angzarr doesn’t enforce this—it’s an organizational decision tracked via K8s labels:
labels:
  angzarr.io/bounded-context: "game-ops"
  angzarr.io/saga-type: "acl"  # or "internal"
This makes the distinction queryable and enforceable via policy. ACLs crossing context boundaries justify heavy translation logic; internal sagas should be thin.
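For example, listing the ACL sagas of one bounded context with a label selector (label values as above):

```shell
# List the deployments labeled as ACL sagas in the game-ops bounded context
kubectl get deployments \
  -l angzarr.io/bounded-context=game-ops,angzarr.io/saga-type=acl
```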
Fowler states: “Domain-Driven Design plays a role with Conway’s Law in helping define organization structures, since a key part of DDD is to identify Bounded Contexts. A key characteristic of a Bounded Context is that it has its own Ubiquitous Language, defined and understood by the group of people working in that context. The key thing to remember about Conway’s Law is that the modular decomposition of a system and the decomposition of the development organization must be done together” 9.
Steve Smith (Ardalis) reinforces this: teams and bounded contexts should correlate, since cross-team ownership of a context risks applying the wrong assumptions or model 10.
Microsoft provides the operational guidance: “If a single team must own multiple unrelated bounded contexts, or a single bounded context requires coordination across many teams, revisit either the boundaries or the team structure” 4.
Domain boundaries should align with team ownership boundaries. A context that spans two teams without a clear seam will degrade—the ubiquitous language will fork, and the model will develop inconsistencies that mirror org chart politics.
A context that owns data but no decisions. All business logic lives in an application service that orchestrates across multiple contexts. Looks like a context, acts like a database table.
This is the context-level manifestation of the anemic domain model anti-pattern: domain objects that contain little or no business logic, serving primarily as data structures while business logic lives in separate service layers.†
One context is sized to “fully contain decisions” by absorbing everything. Correct principle, wrong solution.
Evans himself warned that “total unification of the domain model for a large system will not be feasible or cost-effective” 1. The fix is decomposing by subdomain (core, supporting, generic) and finding the natural seams—not abandoning the decision containment principle.†
An aggregate that enforces invariants but references foreign IDs without local projections, so any validation requires an outbound call.
Vernon explicitly warns against this: “Large aggregates are an anti-pattern. A large-cluster Aggregate will never perform or scale well, and is more likely to fail because false invariants and compositional convenience drove the design, to the detriment of transactional success, performance, and scalability” 11.
The aggregate boundary is wrong, not the context boundary.
In Angzarr: Aggregates may query external, non-event systems (third-party APIs, legacy databases) during command handling to gather decision-making information. But they should only read—never write.
The better pattern: external systems holding state relevant to an aggregate should inject that context as facts into the aggregate, rather than the aggregate pulling it. Push beats pull—the aggregate’s state becomes self-contained, and you avoid synchronous dependencies during command handling.
If your aggregate needs data from another Angzarr domain to validate, that’s a smell. Either project that data locally, adjust the aggregate boundary, or reconsider whether the decision belongs in a different aggregate.
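Projecting the data locally can be sketched roughly like this (all names here—`PlayerAggregate`, `apply_table_opened`, and so on—are invented for illustration, not Angzarr's API): incoming facts keep a local read model current, so command validation never makes an outbound call.

```python
# Hypothetical sketch: an aggregate validates against a locally projected
# copy of foreign data instead of calling another domain at command time.

class PlayerAggregate:
    def __init__(self):
        self.open_tables = set()   # local projection of the tables domain
        self.seated_at = set()

    # Facts from the tables domain keep the projection current.
    def apply_table_opened(self, table_id):
        self.open_tables.add(table_id)

    def apply_table_closed(self, table_id):
        self.open_tables.discard(table_id)

    # Command handling: decided entirely from local state, no outbound call.
    def handle_join_table(self, table_id):
        if table_id not in self.open_tables:
            raise ValueError(f"table {table_id} is not open")
        self.seated_at.add(table_id)
        return {"type": "PlayerSeated", "table_id": table_id}

player = PlayerAggregate()
player.apply_table_opened("t-1")
event = player.handle_join_table("t-1")
```

The projection is eventually consistent with the source domain, but the decision itself is made from state the aggregate already owns.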
A single business capability is split into two contexts before the model is stable—typically because of team structure. The two halves immediately develop tight coupling because the model isn’t ready to be separated.
Evans has warned against “the bandwagon effect of jumping into microservices and bounded context splits.” He notes “a common misconception is that a microservice is a bounded context, which he calls an oversimplification. When subdomains and bounded contexts are misaligned—such as when a business reorganization creates new subdomains that don’t match existing bounded contexts—this often results in two teams having to work in the same context with increasing risk of ending up with a big ball of mud” 5.
Practitioner wisdom: keep the model in one context longer than feels comfortable, until the language stabilizes.† That “longer than comfortable” state? It may be your legacy system—many monoliths are exactly this, never split because the language never stabilized. That’s not always wrong; sometimes the domain genuinely is one context.
| Signal | Healthy | Smell |
| --- | --- | --- |
| Aggregate references to foreign-context IDs without local copy | Rare | Common—suggests incomplete model |
| ACL translation surface (# of concepts mapped) | Thin | Thick |
| Number of context owners per business capability | One | Multiple |
These are heuristics, not empirically-derived thresholds. “Few” vs “many” depends on your latency budget and availability requirements. The point is directional: more cross-context coupling = more boundary debt.
Bounded context sizing is a Goldilocks problem:
Too small: Contexts can’t make decisions alone, requiring constant cross-context coordination (the main thesis of this post)
Too large: Contexts become unmaintainable, language diverges internally, teams step on each other (the “God Context” failure mode)
Just right: Each context contains the decisions it needs to make, no more†
Microsoft’s guidance: “Design aggregates to be no smaller than what is required to enforce an invariant within a single transaction. Include only the data that must remain consistent within a single transaction. When you combine unrelated aggregates, you force unrelated updates to compete for the same locks” 4.
Start with subdomains, not microservices. There are three types of subdomains: “Core, Supporting, and Generic. The Core subdomain is where the business must put its best efforts and provides competitive advantage. The Supporting subdomain complements the main domain. The Generic subdomain is typically handled by ready-made commercial or open-source software” 14. Subdomain analysis gives you the strategic cuts first. Bounded contexts then follow subdomain contours.
A context should be deployable and operable by one team. Not one person, not five teams. “Architectural and team evolution must go hand-in-hand throughout the life of an enterprise” 9.
The model should fit in one person’s head. If explaining the context’s model requires a two-hour meeting, it’s too large.†
Event volume is not a sizing signal. High event throughput is a scaling concern, not a domain boundary concern.†
The principles above represent mainstream DDD thinking. But having built Angzarr—an event-sourcing framework for distributed systems—I’ve encountered cases where rigid adherence to these rules creates its own problems.
Flexibility Has Consequences
Angzarr aims to be fast, reliable, and flexible. That flexibility permits building terrible systems:
Synchronous cascades across dozens of aggregates—causing performance problems and availability nightmares
Poor aggregate factoring—undersized aggregates that can’t make decisions alone cause explosions in cross-domain messages and degraded performance
Sagas emitting commands—the mechanism for cross-aggregate decisions, but adds compensation complexity; overuse often signals poor aggregate factoring
God process managers—PMs that orchestrate everything become a single point of failure and a coordination bottleneck; decision logic belongs in aggregates, not PMs
Ignoring every principle in this post—Angzarr won’t stop you
The thesis of this post applies to Angzarr itself: aggregates should make decisions with minimum external contact. Violate that, and you’ll pay in latency, throughput, and operational complexity.
Sometimes these anti-patterns are necessary—even the right choice for your constraints. Angzarr supports them for that reason. But it takes no responsibility for the consequences. We warn you in documentation and, often, in code—make sure you’re choosing the tradeoff deliberately, not accidentally.
Here’s the uncomfortable truth the literature rarely addresses: DDD boundaries are architecture, and architecture is expensive to change.
Conway’s Law cuts both ways. Yes, system structure should align with team structure. But once it does, that alignment becomes load-bearing. Refactoring a bounded context boundary likely means some combination of:
Reorganizing teams (politics, HR, reporting structures)
Migrating data between stores (downtime, consistency risks)
Updating monitoring, alerting, and runbooks (operational knowledge)
The literature says “if your context can’t make decisions autonomously, it’s undersized—fix it.” That’s correct in principle. But fixing it may require executive buy-in, a migration project, and months of coordination. Meanwhile, the business needs to ship features.
Angzarr takes a pragmatic stance: support sub-ideal boundaries with tooling when refactoring isn’t feasible.
This isn’t an endorsement of bad architecture. It’s an acknowledgment that production systems exist, Conway’s Law has inertia, and sometimes the operationally necessary choice is to work within existing constraints while planning longer-term improvements.
The principle that commands should stay local while only events cross boundaries is elegant in theory. In practice, it can force awkward aggregate designs.
Consider a saga that translates an OrderCompleted event into fulfillment work. The fulfillment domain needs to create a shipment. Under strict “events only” thinking, the saga should publish an event like FulfillmentRequested, which the fulfillment context reacts to.
But what happens when fulfillment fails? The saga has no mechanism to compensate—it fired an event and walked away. The fulfillment context now owns the problem entirely, even though the business process spans both domains.
Angzarr takes a different approach. Sagas can emit either commands or facts to other aggregates:
Facts (the default for saga output): Events injected without a preceding command. Structurally, facts are just events with two differences: the receiving domain assigns the sequence number, and they retain source traceability metadata. The receiving aggregate cannot reject them; they represent external realities. Example: “the hand says it’s your turn” is a fact the player aggregate must accept.
Commands (when compensation is needed): Requests that the receiving aggregate can reject. Use commands when:
The receiving aggregate should be able to refuse (insufficient inventory, invalid state)
Rejection must trigger compensation in the originating domain
The saga uses destination state to make business decisions
Both patterns require the saga to use the destination aggregate’s state to inform its decisions.
General guidance: prefer facts for saga output unless you need rejection/compensation capability. Facts are simpler—they represent “this happened” rather than “please do this.” Commands add complexity but enable explicit failure handling.
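The distinction can be sketched as follows (hypothetical names, not Angzarr's actual API): a fact is applied unconditionally, while a command runs the receiver's validation and may be rejected, giving the sender something to compensate against.

```python
# Hypothetical sketch of the fact/command distinction on the receiving side.

class FulfillmentAggregate:
    def __init__(self, stock):
        self.stock = stock
        self.events = []

    def apply_fact(self, fact):
        # Facts represent external reality; the aggregate cannot refuse them.
        self.events.append(fact)

    def handle_command(self, command):
        # Commands can be rejected; rejection drives compensation upstream.
        if command["type"] == "CreateShipment":
            if self.stock < command["qty"]:
                return {"type": "ShipmentRejected", "reason": "insufficient stock"}
            self.stock -= command["qty"]
            event = {"type": "ShipmentCreated", "qty": command["qty"]}
            self.events.append(event)
            return event
        raise ValueError("unknown command")

f = FulfillmentAggregate(stock=5)
f.apply_fact({"type": "OrderCompleted"})                   # always accepted
ok = f.handle_command({"type": "CreateShipment", "qty": 3})  # accepted
no = f.handle_command({"type": "CreateShipment", "qty": 9})  # rejected
```

The `ShipmentRejected` result is what the originating saga would translate into a compensating action; with a fact there is no such return channel.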
A warning: If you find yourself reaching for saga-emitted commands frequently, pause and ask whether your bounded contexts are correctly sized. A saga that needs to send commands with compensation capability may be a signal that:
The two aggregates belong in the same context (shared invariants requiring coordination)
The decision logic is in the wrong aggregate (should move upstream)
A process manager is more appropriate than a saga (Angzarr’s process managers are stateful, use the correlation ID as their aggregate root, and explicitly coordinate multi-domain workflows)
Angzarr supports the pattern because sometimes it’s genuinely correct. But “supported” doesn’t mean “encouraged.” Treat saga-emitted commands as a code smell worth investigating, even when it’s the right solution.
This nuance isn’t captured by the simple “commands stay local, events cross boundaries” rule. The question isn’t command-vs-event—it’s whether the receiving domain has veto power over the incoming information.
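For contrast, a process manager in this style might look like the sketch below (invented names; Angzarr's real PM API may differ): it is stateful, keyed by correlation ID, and explicitly coordinates a multi-domain workflow.

```python
# Hypothetical sketch: a process manager keyed by correlation ID.
# It accumulates events from multiple domains and decides the next step;
# unlike a stateless saga, it holds state across the whole workflow.

class CheckoutProcessManager:
    def __init__(self):
        self.state = {}  # correlation_id -> workflow progress

    def on_event(self, correlation_id, event):
        progress = self.state.setdefault(
            correlation_id, {"paid": False, "shipped": False}
        )
        if event == "PaymentCaptured":
            progress["paid"] = True
        elif event == "ShipmentCreated":
            progress["shipped"] = True
        # The PM coordinates; the decision logic still lives in aggregates.
        if progress["paid"] and progress["shipped"]:
            return {"type": "CompleteOrder", "correlation_id": correlation_id}
        return None

pm = CheckoutProcessManager()
pm.on_event("order-1", "PaymentCaptured")            # workflow not done yet
done = pm.on_event("order-1", "ShipmentCreated")     # both legs complete
```

Note the PM only tracks progress and emits the next step; it does not absorb the aggregates' invariant enforcement.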
Vernon’s guidance to keep aggregates small—containing only what’s needed for invariant enforcement in a single transaction—is sound. But it can be taken too far.
An aggregate that’s too small becomes a data container that delegates all decisions outward. Every validation requires a saga or process manager to orchestrate across aggregates. You’ve achieved small aggregates at the cost of coherent decision-making.
The opposite risk is real too: an aggregate that absorbs everything becomes a bottleneck. But in my experience, teams more often err toward undersized aggregates than oversized ones—particularly when influenced by microservices culture that conflates “small services” with good architecture.
The test isn’t aggregate size. It’s: can this aggregate make its decisions without runtime dependencies?
| Pattern | Literature says | Angzarr’s stance |
| --- | --- | --- |
| … | … | Supported for legacy boundaries; refactor when possible |
| Undersized aggregates requiring coordination | Anti-pattern | Supported with compensation tooling; indicates boundary debt |
| “Large” aggregates | Anti-pattern | Sometimes correct—if the aggregate owns a cohesive set of decisions |
These are escape hatches, not recommended patterns. Each represents technical debt—a workaround for boundaries that should ideally be redrawn. Angzarr provides the tooling because:
Production systems exist. You inherited boundaries drawn by someone else, possibly years ago.
Conway’s Law has inertia. Fixing the architecture may require fixing the org chart first.
Business doesn’t wait. Features ship while migration projects are planned.
The correct response to needing these patterns is:
Use them to unblock the immediate work
Document the boundary debt
Plan the refactoring (even if it’s quarters away)
Don’t let “supported” become “normalized”
The literature provides excellent defaults. When you deviate, know why you’re deviating and have a plan to stop.
The underlying principle remains: size your contexts and aggregates to contain decisions. When you can’t—because the boundaries are already drawn and load-bearing—Angzarr helps you cope. But coping isn’t thriving. Fix the boundaries when you can.
A domain boundary is a decision boundary, not a data boundary or a service boundary. Draw it where decisions are made, not where data lives.
Most teams err by slicing too thin—creating contexts that own data but cannot decide. The result is an architecture where every operation requires coordination, every deployment requires synchronization, and the system exhibits all the costs of distribution with none of the benefits of autonomy.
Size your contexts to contain decisions. If a context cannot enforce its invariants alone, it’s too small.
The following claims in this post represent synthesis from the cited sources and practitioner experience, rather than direct quotations:
[†] The claim that “two contexts enforcing the same invariant produces hidden coupling” is extrapolated from Vernon’s consistency boundary rules.
[†] The interpretation that thick ACLs warrant boundary re-examination is practitioner intuition, not a direct Evans/Vernon citation.
[†] The extension of the anemic domain model anti-pattern to the context level is derived analysis.
[†] “God Context” as a named failure mode is framing for this post, not canonical DDD terminology.
[†] Common practitioner advice without verified primary source.
[†] The structural metrics table is synthesized from the cited sources and practitioner literature, not a canonical table from Evans or Vernon.
[†] The Decision Containment Score is a framework constructed for this post as an operationalization of the decision containment principle. It does not appear in the primary sources.
[†] Common practitioner wisdom without verified primary source.
[†] “Event volume is not a sizing signal” is synthesis for this post without primary source support.
Large Language Models (LLMs)—the technology behind ChatGPT, Claude, and similar AI assistants—are probabilistic text generators. They predict the next most likely token based on patterns learned from training data. This makes them remarkably useful for many tasks, but it also means their raw outputs cannot be trusted for correctness.
Ask an LLM to calculate something, and it might be right. Or it might confidently produce nonsense. Ask it again, and you might get a different answer. This non-determinism is a feature of how these systems work, not a bug to be fixed.
So how do you build reliable software with unreliable assistants?
You don’t ask for answers. You ask for tools that produce answers.
This is non-negotiable: require test-driven development from your LLM.
Not “write tests.” Not “include tests.” Write the tests first, get my approval, then implement.
Here’s why this matters for non-deterministic systems:
Tests are a contract. When the LLM writes tests first, it’s forced to articulate what it thinks you want. You review that articulation before any implementation exists. Misunderstandings surface when they’re cheap to fix—before hundreds of lines of code encode the wrong assumptions.
Tests constrain the solution space. An LLM with a blank canvas will produce something. An LLM with failing tests to satisfy has a target. The non-determinism still exists, but it’s bounded by concrete assertions.
Tests are reviewable by humans. Implementation code requires understanding algorithms, data structures, edge cases. Test code requires understanding intent: “when X happens, Y should result.” You can review whether tests capture your requirements without being an expert in the implementation language.
The workflow:
Describe what you want
LLM writes tests (not implementation)
You review: “Do these tests capture my requirements?”
Iterate until tests are correct
LLM implements to make tests pass
You verify tests actually pass
If the LLM writes implementation before tests, reject it. “Stop. Tests first. Show me what you think success looks like before you show me how to achieve it.”
This isn’t pedantry. It’s the difference between reviewing a blueprint and reviewing a finished building. One is cheap to change. The other isn’t.
When demanding TDD, demand that tests document the problem, not just the solution:
```python
def test_reservation_prevents_double_spending():
    """
    Problem: Players could join multiple poker tables with the same bankroll,
    creating settlement disputes when they lose at both tables simultaneously.

    Solution: Fund reservation locks a portion of the bankroll, making it
    unavailable for other reservations until released.

    This test verifies that a second reservation fails when insufficient
    unreserved funds remain.
    """
    player = Player(bankroll=500)
    player.reserve(300)  # First table
    with pytest.raises(InsufficientFunds):
        player.reserve(300)  # Second table - should fail
    assert player.available_balance == 200
```
The docstring explains:
What problem exists (double-spending across tables)
Why this solution (fund locking)
What this specific test validates (second reservation fails)
The test code shows how the solution works.
This transforms tests from “verification that code works” into “documentation of why code exists.” The test docstring is the right place for explanations—it’s coupled to the behavior it describes and breaks visibly when the behavior changes.
TDD handles code generation. But what about understanding existing code?
The illuminated code walkthrough is a collaborative reading pattern where AI narrates execution flow while you read the code. Like illuminated manuscripts with their explanatory marginalia, the AI provides context and commentary that helps you understand what you’re seeing—without that commentary becoming permanent (and eventually stale) documentation.
Start with flows, not files. The most valuable walkthroughs trace execution paths: “Walk me through what happens when a user places an order” or “Step through the integration test for hand completion.” You follow complete paths from entry point through all possible endings.
The AI narrates: “The OrderCompleted event triggers the fulfillment saga, which emits a CreateShipment command to the fulfillment aggregate, which…” You read each piece of code as it becomes relevant, understanding the full path rather than isolated functions.
One step at a time. The AI presents each function or handler in execution order, not file order. You see the code in the sequence it actually runs.
AI explains as you go. What data flows in? What transforms? What side effects occur? The AI provides narrative while you read, connecting each step to the last.
AI questions unusual patterns. Not just description—interrogation. “This saga assumes the inventory check already passed, but I don’t see where that’s enforced.” The AI acts as a second set of eyes on the flow, not just the code.
You control the pace. The AI asks “Changes, or continue?” Don’t proceed until you understand how this step connects to the whole.
The interaction:
You name the flow: “Walk me through the table-to-hand event flow”
AI presents the entry point with context
AI follows execution to the next handler, explaining the transition
AI flags potential issues in the flow
AI asks: “Changes, or continue?”
Repeat until the flow completes
This works especially well with integration tests. The test defines the scenario; the illuminated walkthrough reveals every step of execution that makes the test pass. You understand not just that it works, but how it works—and whether the “how” matches your mental model.
Crucially, you’re validating the AI’s understanding in real-time. When it misexplains a transition or loses the thread, you catch it immediately. This trains your calibration of when to trust its output and when to dig deeper.
The illumination is ephemeral by design. It helps you understand the code now. Don’t paste it into comments—as the code changes, the explanations become stale lies. The test docstrings are your durable documentation; the illuminated walkthrough is scaffolding you discard when the session ends.
1. Start with tests
Whether generating new code or reviewing existing code, start with tests. For generation: “Write tests first, then implement.” For review: “Walk me through the tests, then the implementation.”
2. Require problem documentation in tests
Every test function should document the specific problem it validates. This is the right place for durable explanations—coupled to behavior, visible when behavior changes.
3. Let the illumination be ephemeral
AI explanations during illuminated walkthroughs help you understand code in the moment. Don’t preserve them as comments—they’ll rot. Use them, then let them go.
4. Verify incrementally
Don’t let the LLM write 500 lines before you review. Small batches, frequent verification. Errors compound.
5. Run everything
Actually execute the tests. Actually check the output. “It should work” is not the same as “it works.”
The artifacts can be deterministic even when the generation process isn’t. The narrative helps you understand but shouldn’t be preserved—it’s tied to a moment in time, not to the code itself.
Your job is to:
Demand tests first (constrain before implementing)
Review the tests (verify they capture intent)
Verify the artifacts (run, don’t assume)
Use the narrative to understand (then let it go)
Put durable documentation in test docstrings (coupled to behavior)
The LLM accelerates the drafting and illuminates the reading. You ensure the correctness.
This isn’t a limitation to work around. It’s the appropriate division of labor between a probabilistic generator and a human who needs reliable systems.
The irony of this post being written with AI assistance is not lost on me. The difference: I reviewed every claim, verified it matched my experience, and take responsibility for the result. That’s the model.
The most effective LLM workflows share one trait: they force a pause between planning and execution. You wouldn’t let a contractor start demolition before approving blueprints. The same applies to AI assistants.
Before any implementation, require a plan. Not pseudocode, not a summary of what the LLM intends to do. A concrete list of files to touch, functions to modify, and decisions that need your input.
Plans expose assumptions. An LLM might assume you want bcrypt when you’re using Argon2, or assume PostgreSQL when you’re on SQLite. Catching this before code exists saves hours.
More importantly, plans surface questions the LLM should ask but often doesn’t. “Should this be configurable?” and “What happens on failure?” are questions better asked before implementation than discovered during code review.
For existing code, planning becomes reviewing. The illuminated code walkthrough applies the same checkpoint principle: AI narrates execution flow one step at a time while you read along, controlling the pace.
The interaction:
AI presents a function or handler with explanation
AI flags potential issues
AI asks: “Changes, or continue?”
Human responds
Repeat
This works especially well when tracing integration tests or application flows—you follow complete paths from entry point through all possible endings.
LLMs work best with feedback loops, not fire-and-forget prompts. Plan mode creates one checkpoint. Illuminated walkthroughs create many. Both share the same principle: you can’t review what you haven’t seen.
Build the pause into your workflow. The LLM will produce better work, and you’ll catch problems before they become expensive.
We used to draw a hard line between unit tests and integration tests:
Unit tests: Fast, no external dependencies, run anywhere, colocate with code
Integration tests: Slow, need databases/queues/services, run in CI, separate directory
This separation made sense when “integration test” meant “spin up a full environment.” You wouldn’t colocate tests that require PostgreSQL next to your repository implementation; they’d fail on every developer’s machine without the right setup.
Testcontainers changed the economics. A testcontainers-based test spins up a real PostgreSQL instance in Docker, runs the test against it, and tears it down. No shared database. No environment configuration. No “works on my machine.” The container is ephemeral, isolated, and automatic.
We call these Behavioral Interface Tests (BITs): tests that verify an implementation correctly fulfills its interface’s behavioral contract. Tests that verify trait implementations (EventStore, SnapshotStore, MessageBus) are BITs—not “integration tests” in the traditional sense.
│ ├── postgres.bit.rs # BITs against real Postgres
│ ├── sqlite.rs
│ └── sqlite.bit.rs
The “real database” aspect doesn’t change where the test belongs. It’s still testing one module’s behavior. It’s still colocated. It just happens to need a container.
(Why “BIT”? It’s a pun. “The BIT caught a regression.” “That edge case BIT me.” Also: Behavioral Interface Test.)
The old unit/integration split was about how tests run. The better distinction is what they test.
| Test Type | What It Tests | Where It Lives |
| --- | --- | --- |
| Unit | Pure logic, no dependencies | Adjacent `.test` file |
| BIT | Single implementation against its interface | Adjacent `.test` file (with testcontainers) |
| Integration | Multiple components interacting | `tests/` directory |
| End-to-end | Full system behavior | Separate test project |
BITs with testcontainers are closer to unit tests than integration tests. They test one thing. They’re fast enough to run frequently. They should be colocated.
Yes, testcontainer tests are slower than pure unit tests. On my machine, a PostgreSQL container adds ~2 seconds of startup. That’s too slow for “run on every save” but fine for “run before commit.”
```rust
#[cfg(feature = "testcontainers")]
#[test]
fn test_postgres_storage() { /* runs with --features testcontainers */ }
```
Local development runs the fast tests continuously. Pre-commit hooks (we like Lefthook) and CI run everything. The slower tests are still colocated; they’re just conditionally executed.
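Wiring this into a pre-commit hook might look like the following `lefthook.yml` fragment (illustrative; adjust the command to your project):

```yaml
# Illustrative Lefthook config: the full suite, including
# container-backed BITs, runs before every commit.
pre-commit:
  commands:
    full-tests:
      run: cargo test --features testcontainers
```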
This shift changed how I think about mocking. Previously, I’d mock the database to test repository logic. Now I test the repository against a real database (via testcontainers) and reserve mocks for:
External services I don’t control (third-party APIs)
Failure injection (simulate network errors)
If I can test against the real thing cheaply, I should. Testcontainers made “the real thing” cheap.
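Failure injection is where mocks still earn their keep. A small sketch with Python's `unittest.mock` (the client and its `fetch_rates` method are invented for illustration):

```python
# Sketch: inject a network failure with a mock to exercise the caller's
# fallback path without any real infrastructure.
import socket
from unittest.mock import Mock

def get_rate_with_fallback(client, default=1.0):
    try:
        return client.fetch_rates()
    except OSError:  # socket.timeout is an OSError subclass
        return default

flaky = Mock()
flaky.fetch_rates.side_effect = socket.timeout("simulated network stall")
assert get_rate_with_fallback(flaky) == 1.0
```

You can't ask a real testcontainer to time out on demand; a mock does it in one line.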
The unit/integration distinction was always about economics: unit tests were cheap, integration tests were expensive. Testcontainers collapsed that cost difference for many scenarios.
When the economics change, the categories should too. BITs against real infrastructure aren’t integration tests just because they touch a database. They’re colocatable, fast-enough, single-purpose tests that happen to need Docker.
Organize by what you’re testing, not by what tools you need to test it.
Prior art: This concept aligns with what some call “Behavioral Contract Testing” (jdecool.fr) and the Abstract Test pattern (testingpatterns.net). We prefer “BIT” because it’s punchier and avoids confusion with Consumer-Driven Contract testing (Pact, etc.).
Tests should live next to the code they test—same directory, separate file. Not inline. Not in a parallel tree.
src/
├── user_service.rs # Production code only
├── user_service.test.rs # Tests only
└── mod.rs
AI context windows changed my thinking. When an AI reads a 500-line file where 300 lines are tests, it wastes 60% of its context budget on code irrelevant to most tasks. Separate files let AI skip tests; inline tests force everything into context.
Java’s src/main vs. src/test split goes too far—that was a workaround for the JVM’s inability to exclude code at compile time. Modern languages (Rust, Go) solved this. We get colocation without the baggage.
The principle: Tests belong near code. The implementation: Same directory, separate file, clearly named (.test.rs, _test.go).
I used to prefer Rust’s #[cfg(test)] mod tests pattern: maximum colocation, one scroll shows everything.
Working with AI assistants changed my mind. Every token in an AI context window has a cost. Inline tests create noise: search for business logic, get hits in test assertions, fixtures, helpers. Ask an AI to understand authentication, it loads 47 test cases it doesn’t need.
The problem isn’t that tests exist. It’s that inline tests are in the way.
Separate files preserve colocation (one directory listing shows both) while enabling selective loading. AI tools skip .test files. Humans wanting documentation head for the tests. Choice instead of force.
The JVM’s class loading model forced physical separation:
No conditional compilation. Unlike Rust’s #[cfg], Java can’t say “compile this class but exclude it from the JAR.” Every .class file could end up in production.
Heavy test dependencies. JUnit, Mockito, assertion libraries add megabytes. You don’t want them shipped.
Classpath-based loading. The only way to exclude code was to put it in a different directory and configure the packager to ignore it.
Maven’s Surefire plugin runs tests from target/test-classes. The JAR plugin packages from target/classes. They never overlap because the source directories never overlapped. Physical separation at source level cascades to physical separation everywhere.
my-project/
├── src/
│ ├── main/java/com/example/UserService.java
│ └── test/java/com/example/UserServiceTest.java
└── pom.xml
To find tests for UserService: up from src/main/java/com/example/, over to src/test/java/, back down through com/example/. That’s not “next to the code.” That’s an archaeological expedition.
Rust’s #[cfg(test)] eliminates code at compile time:
```rust
pub struct UserService { /* ... */ }

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn test_create_user() { /* ... */ }
}
```
In release builds, the test module doesn’t exist—not compiled, not linked, not present. Test dependencies ([dev-dependencies]) are only linked when building tests.
No deployment risk. No dependency contamination. No separate directories needed.
We use Rust’s .test.rs pattern with the #[path] attribute:
src/
├── correlation.rs # Production code
├── correlation.test.rs # Tests
└── mod.rs
In mod.rs:
```rust
pub mod correlation;

#[cfg(test)]
#[path = "correlation.test.rs"]
mod correlation_tests;
```
This gives us:
Tests adjacent to code (same directory)
Production files focused on implementation
Test files skippable when reading for understanding
Conditional compilation via #[cfg(test)]
Clean mutation testing workflow
Mutation testing benefits: Separate files pair well with tools like cargo-mutants. If a mutation survives (accidentally gets committed), it’s in correlation.rs; the test file is untouched. Revert the production file, keep the tests. With inline tests, reverting means losing both mutated code and test improvements.
Test Support Files: When Production Needs Test Logic
Sometimes production code needs to call test-specific logic: mock handlers, test fixtures, specialized parsers for test data. An inline #[cfg(test)] block in the production file works, but when it grows substantial, the test code pollutes the production file.
The solution: test support files using the same #[path] pattern.
```
src/orchestration/aggregate/
├── merge.rs                 # Production code (clean)
└── merge_test_support.rs    # Test helpers (separate file)
```
Use this pattern when you want readers to see business logic, not test fixtures.
Visibility note: Use pub(crate) if sibling test modules need access; pub(super) if only the parent module calls the helpers.
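A compressed single-file sketch of that visibility choice (in the real layout, test_support would live in merge_test_support.rs behind #[path]; all names here are illustrative):

```rust
pub struct Order {
    pub total: u32,
}

// Production logic: the only thing a reader of this file should see.
pub fn merge_totals(a: &Order, b: &Order) -> u32 {
    a.total + b.total
}

// Test support, compiled only under `cargo test`.
// `pub(crate)` lets sibling test modules use the helper;
// `pub(super)` would restrict it to the parent module.
#[cfg(test)]
mod test_support {
    use super::Order;

    pub(crate) fn fixture(total: u32) -> Order {
        Order { total }
    }
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn merges_totals() {
        let a = test_support::fixture(2);
        let b = test_support::fixture(3);
        assert_eq!(merge_totals(&a, &b), 5);
    }
}
```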
This reduced our merge.rs from ~300 lines to ~205 lines—all test code now lives in adjacent files, still colocated but not inline.
Context window impact: The same principle from the intro applies here. When an AI assistant reads merge.rs to understand commutative merge logic, it gets 205 lines of business logic—not 300 lines where a third is test fixture parsing. The _test_support.rs file exists for when context needs test helpers; otherwise it’s skipped. Every line of test code in a production file is a line competing for attention in a context window that could hold actual implementation details.
Every position in this article emerged from tooling constraints of its era. Java’s parallel directories made sense when the JVM couldn’t exclude code. Rust’s inline tests made sense when file size didn’t compete with AI context budgets.
Tomorrow’s tradeoffs will differ. AI context windows will grow. IDE integrations will get smarter. When constraints change, optimal organization changes too.
What won’t change: tests belong near the code they test. The definition of “near” adapts to tooling. The principle doesn’t.
We wanted containerized builds for consistency across developer machines and CI. But every approach we tried had friction:
Dual Makefiles (Makefile and Makefile.docker): Works, but now everyone has to remember which file to use. Documentation says “run make -f Makefile.docker build” and someone inevitably runs make build instead.
Conditional detection: Check for /.dockerenv or an environment variable:
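A sketch of what that conditional looks like (the /.dockerenv check is the common marker; the image name and targets are illustrative, and recipe lines must be tab-indented):

```make
IN_CONTAINER := $(shell test -f /.dockerenv && echo yes)

build:
ifeq ($(IN_CONTAINER),yes)
	cargo build --release
else
	docker run --rm -v "$(PWD):/work" -w /work build-image make build
endif
```

Multiply that ifeq/else/endif by every target and the maze becomes obvious.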
This works but clutters the Makefile. Every target needs the conditional. The file becomes a maze of ifeq/else/endif blocks.
Different commands: make build on host, make container-build for Docker. Now you have parallel target names, duplicate documentation, and cognitive overhead.
We wanted something simpler: same command, different behavior based on context.
Docker bind mounts can replace individual files inside the container. The Docker documentation even mentions this—if you mount over an existing file, the original is “obscured.”
What if we mount a different Makefile over the host’s Makefile inside the container?
With separate Makefile and Makefile.docker, users must know which to invoke. CI scripts use one, developers might use another. Documentation has to explain both.
With the overlay pattern, there’s one command: make build. It works everywhere. The context determines the implementation.
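Concretely, the host Makefile delegates every target into the container and bind-mounts Makefile.docker over itself, so `make` inside the container reads the other file. A sketch (image name and targets are illustrative):

```make
# Makefile (host): delegate into the container, mounting
# Makefile.docker over this very file at /work/Makefile.
build test lint:
	docker run --rm \
	  -v "$(PWD):/work" \
	  -v "$(PWD)/Makefile.docker:/work/Makefile" \
	  -w /work build-image make $@

# --- Makefile.docker: what `make` sees inside the container ---
# build:
#	cargo build --release
# test:
#	cargo test
```

Inside the container, /.dockerenv checks and ifeq blocks are unnecessary: the mounted file simply is the Makefile.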
This pattern is better than the alternatives, but let’s not oversell it. There’s still duplication:
Target names repeated in both files
Two files to maintain instead of one
Container orchestration logic repeated per-target (though DRY-able with variables)
It’s not perfect. It’s just… less bad. The duplication is mechanical rather than logical—you’re not mixing concerns, just listing the same names twice. That’s easier to maintain than conditional spaghetti, but it’s still more than ideal.
That said, mechanical duplication is exactly the kind of work AI assistants handle well. “Add a lint target that runs cargo clippy” is a constrained, rule-following task: add it to the container file with the actual command, add a delegation stub to the host file. No judgment calls, no architectural decisions—just pattern application. If you’re already using AI-assisted development, this maintenance overhead largely disappears.
If someone invents a cleaner approach, we’re all ears.