# Replay Debugging
Bug report: “Player’s bet was accepted but their chips didn’t decrease.” You need to debug it. Traditional approach: set up a dev database, seed it with approximate data, try to recreate the conditions. Hope you got close enough.
Event sourcing approach: fetch the exact events, replay them through your code. Debug the actual failure.
## The Problem with Traditional Debugging

Debugging production issues typically requires:
- Database setup — Provision a dev database, configure connections
- Data seeding — Create test data that approximates production state
- State recreation — Manually construct the scenario that triggered the bug
- Guesswork — Hope your approximation matches what actually happened
You’re debugging a simulation of the problem, not the problem itself.
## Event Sourcing Changes Everything

In event-sourced systems, the event history IS the state. To debug any aggregate:
- Fetch its events from production
- Replay them through your code
- Debug the exact sequence that failed
No database setup. No seed data. No guesswork.
```mermaid
flowchart LR
    subgraph Prod["Production"]
        ES["EventStore<br/>(read-only fetch,<br/>unchanged)"]
    end
    subgraph Dev["Local Development"]
        IDE[Your IDE]
        R[Replay events]
        B[Hit breakpoints]
        I[Inspect state]
        F[Fix bug]
        IDE --> R
        IDE --> B
        IDE --> I
        IDE --> F
    end
    ES -->|fetch events| IDE
```
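At its core, rebuilding an aggregate is a left fold over its event history. The sketch below illustrates the idea in plain Python; the `PlayerState` dataclass and dict-shaped events are hypothetical stand-ins for the real protobuf pages:

```python
from dataclasses import dataclass

@dataclass
class PlayerState:
    chips: int = 0

def apply_event(state: PlayerState, event: dict) -> PlayerState:
    # Each event deterministically produces the next state.
    if event["type"] == "ChipsDeposited":
        return PlayerState(chips=state.chips + event["amount"])
    if event["type"] == "BetPlaced":
        return PlayerState(chips=state.chips - event["amount"])
    return state

def replay(events: list[dict]) -> PlayerState:
    # Fold the full history, starting from the empty state.
    state = PlayerState()
    for event in events:
        state = apply_event(state, event)
    return state

history = [
    {"type": "ChipsDeposited", "amount": 100},
    {"type": "BetPlaced", "amount": 30},
]
print(replay(history).chips)  # 70
```

Because the fold is deterministic, replaying the same fetched history in your IDE reconstructs exactly the state the production aggregate was in when the command failed.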
## Editions + Speculative Execution

Two features combine to make this safe and repeatable:
| Feature | Purpose |
|---|---|
| Editions | Isolate debugging sessions from production |
| Speculative Execution | Run commands without persistence |
Together they let you replay production scenarios locally, modify code, and re-run—as many times as needed—without side effects.
## The Workflow

### 1. Capture the Failure Context

When an error occurs, capture the aggregate root and sequence:
```python
# Error handler captures context
try:
    result = client.execute(command)
except CommandRejectedError as e:
    logger.error(
        "Command failed",
        aggregate_root=command.cover.root.value,
        domain=command.cover.domain,
        sequence=e.sequence,
        error=str(e),
    )
```

### 2. Fetch Production Events

Pull the event history that led to the failure:
```python
from angzarr_client import QueryClient

# Connect to production (read-only)
client = QueryClient.connect(PRODUCTION_ENDPOINT)

# Fetch the exact event history
query = EventQuery(
    cover=Cover(domain="player", root=Uuid(value=aggregate_root)),
)
event_book = client.get_event_book(query)

print(f"Fetched {len(event_book.pages)} events")
for page in event_book.pages:
    print(f"  [{page.sequence}] {page.event.type_url}")
```

### 3. Replay Locally with Speculative Execution

Run the failing command against the fetched state—locally, with your debugger attached:
```python
from angzarr_client import SpeculativeClient
from angzarr_client.proto.angzarr import SpeculateAggregateRequest, Edition

# Connect to LOCAL coordinator (your dev environment)
client = SpeculativeClient.connect("localhost:1310")

# Create a debug edition (isolates from everything)
edition = Edition(name=f"debug-{issue_id}-{datetime.now().isoformat()}")

# Replay the exact production events + failing command
request = SpeculateAggregateRequest(
    command=failing_command,
    events=event_book.pages,  # Production events
    edition=edition,
)

# SET YOUR BREAKPOINTS HERE
# Step through the @handles method and any @applies reducers
response = client.aggregate(request)

# Inspect what happened
if response.error:
    print(f"Reproduced error: {response.error}")
else:
    print("Command succeeded - bug may be environment-specific")
    for page in response.events.pages:
        print(f"  Would emit: {page.event.type_url}")
```

### 4. Iterate Until Fixed

The beauty: nothing persists. You can:
- Set breakpoints in your handler code
- Step through the `@handles` method and any `@applies` reducers it depends on
- Identify the bug
- Modify your code
- Re-run the same replay
- Verify the fix
- Repeat as needed
No database resets. No re-seeding. No cleanup.
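The edit-and-re-run loop can be sketched as a tiny pure-Python harness. The handlers here are hypothetical illustrations, not the client API; the point is that replaying the same captured scenario immediately tells you whether the bug still reproduces:

```python
def reproduce(handler, state, command):
    """Run the failing command against replayed state.
    Returns the error message if the bug still reproduces, else None."""
    try:
        handler(state, command)
        return None
    except ValueError as e:
        return str(e)

# Buggy version: rejects every bet, even affordable ones.
def buggy_handler(state, command):
    raise ValueError("insufficient funds")

# Fixed version: only rejects bets the player cannot cover.
def fixed_handler(state, command):
    if command["amount"] > state["chips"]:
        raise ValueError("insufficient funds")

state = {"chips": 100}    # state replayed from captured events
command = {"amount": 30}  # the failing command

print(reproduce(buggy_handler, state, command))  # insufficient funds
print(reproduce(fixed_handler, state, command))  # None
```

Because the captured events never change, every run of the harness exercises the identical scenario: fix the handler, re-run, and watch the error disappear.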
## Debugging Sagas

The same pattern works for saga issues: fetch events from the source domain, then replay them through your saga logic:
```python
# Fetch events that triggered the saga
source_events = query_client.get_event_book(source_query)

# Fetch destination state the saga would have seen
dest_events = query_client.get_event_book(dest_query)

# Replay through saga
request = SpeculateSagaRequest(
    events=source_events.pages,
    destination=dest_events,
    edition=debug_edition,
)

# Debug your saga logic
response = client.saga(request)

for cmd in response.commands:
    print(f"Saga would emit: {cmd.cover.domain}/{cmd}")
```

## Debugging Process Managers

For stateful process managers, fetch both the PM’s own event stream and the domain events it received:
```python
# Fetch PM's state (its own events keyed by correlation_id)
pm_events = query_client.get_event_book(
    EventQuery(cover=Cover(domain="pm-hand-flow", root=correlation_id))
)

# Fetch domain events the PM received
domain_events = query_client.get_event_book(domain_query)

# Replay through PM
request = SpeculatePMRequest(
    pm_events=pm_events.pages,
    domain_events=domain_events.pages,
    edition=debug_edition,
)

response = client.process_manager(request)
```

## Cross-Domain Debugging

For bugs that span multiple domains, replay the entire flow:
```python
# 1. Fetch Player aggregate events
player_events = query_client.get_event_book(player_query)

# 2. Fetch Table aggregate events
table_events = query_client.get_event_book(table_query)

# 3. Replay Player command
player_result = speculative_client.aggregate(
    SpeculateAggregateRequest(
        command=reserve_funds_cmd,
        events=player_events.pages,
    )
)

# 4. Replay saga translation
saga_result = speculative_client.saga(
    SpeculateSagaRequest(
        events=player_result.events.pages,  # FundsReserved
        destination=table_events,
    )
)

# 5. Replay Table command (from saga)
table_result = speculative_client.aggregate(
    SpeculateAggregateRequest(
        command=saga_result.commands[0],
        events=table_events.pages,
    )
)

# Now you can see exactly where the cross-domain flow broke
```

## Why This Works

Event sourcing properties that enable this:
| Property | Debugging Benefit |
|---|---|
| Immutability | Events never change—same input, same replay |
| Complete history | Every state transition is recorded |
| Determinism | Same events + same command = same result |
| Isolation | Editions and speculation prevent side effects |
Your production event store becomes a library of reproducible test cases.
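For example, a captured event history can be frozen into a permanent regression test. This is a hedged sketch: the JSON event shapes and the `chips_after` reducer are hypothetical, standing in for a serialized EventBook and your real `@applies` logic:

```python
import json

# Events captured from production, serialized so the failing
# scenario can live alongside your unit tests.
CAPTURED = json.loads("""
[
  {"type": "ChipsDeposited", "amount": 100},
  {"type": "BetPlaced", "amount": 30}
]
""")

def chips_after(events):
    # Reduce the event history to the player's chip count.
    chips = 0
    for e in events:
        if e["type"] == "ChipsDeposited":
            chips += e["amount"]
        elif e["type"] == "BetPlaced":
            chips -= e["amount"]
    return chips

def test_bet_decreases_chips():
    # Determinism guarantees this replay behaves identically every run.
    assert chips_after(CAPTURED) == 70

test_bet_decreases_chips()
```

Once the bug from a production incident is fixed, checking the captured history into the test suite ensures it can never silently regress.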
## Best Practices

### Capture Context on Errors

Log enough to fetch the event history later:
```python
@error_handler
def on_command_rejected(error, command, context):
    logger.error(
        "command_rejected",
        domain=command.cover.domain,
        root=command.cover.root.value.hex(),
        sequence=context.sequence,
        correlation_id=context.correlation_id,
        error_code=error.code,
        error_message=str(error),
    )
```

### Use Unique Edition Names

Include issue IDs and timestamps to avoid collisions:
```python
edition = Edition(name=f"debug-{jira_ticket}-{developer}-{timestamp}")
```

### Clean Up Editions

If you persist speculative results during debugging, clean up afterward:
```python
# After debugging session
delete_edition_events(domain="player", edition=edition.name)
delete_edition_events(domain="table", edition=edition.name)
```

### Script Common Debug Scenarios

Create helper scripts for your team:
```python
# scripts/debug_player_command.py
def debug_player_issue(aggregate_root: str, command: CommandBook):
    """Replay a player aggregate issue locally."""
    events = fetch_production_events("player", aggregate_root)
    result = speculative_replay(command, events)
    print_debug_report(result)
```

## See Also

- Editions — Temporal branching for isolation
- Speculative Execution — Commands without persistence
- Testing Strategy — Unit-testing `@handles` methods directly
- Error Handling — Capturing error context