Observability and Recovery
Every event carries a correlation ID. Every failure lands in a queue. Nothing silently disappears.
The Philosophy
Event-sourced systems are inherently observable—every state change is recorded. Angzarr extends this with:
- Correlation IDs linking events across domains
- Dead letter queues capturing failures
- Structured tracing through the event flow
When something goes wrong in a poker hand, you can trace the entire flow: from player action to saga to table update to final outcome.
Correlation IDs
Every workflow has a correlation ID that threads through all related events:
Session starts: correlation_id = "session-abc-123"
PlayerSeated (player domain) → correlation_id
TableJoined (table domain) → correlation_id
HandStarted (hand domain) → correlation_id
PlayerActed (hand domain) → correlation_id
ChipsTransferred (player domain) → correlation_id
Query by correlation ID to see the complete story:
SELECT * FROM events
WHERE correlation_id = 'session-abc-123'
ORDER BY timestamp;
Propagation
The framework propagates correlation IDs automatically:
# Initial command sets the correlation
send_command(
JoinTable(table_id=table_id),
correlation_id="session-abc-123",
)
# Subsequent sagas inherit it
# TableJoined → saga → PlayerSeated
# All carry the same correlation_id
Dead Letter Queues
Failed events don't disappear. They land in a dead letter queue, tagged with error details:
┌─────────────────────────────────────────────────┐
│ Dead Letter Entry │
├─────────────────────────────────────────────────┤
│ domain: "hand" │
│ event_type: "PlayerActed" │
│ correlation_id: "session-abc-123" │
│ error: "Sequence mismatch: expected 5, got 6" │
│ attempts: 3 │
│ first_failed: 2024-01-15T10:30:00Z │
│ last_failed: 2024-01-15T10:30:02Z │
│ payload: <original event> │
└─────────────────────────────────────────────────┘
Per-Domain Queues
Each domain has its own DLQ:
angzarr.dlq.player
angzarr.dlq.table
angzarr.dlq.hand
Replay from DLQ
Once you've fixed the issue, replay failed events:
# Replay all failed events for a domain
angzarr dlq replay --domain=hand
# Replay a specific correlation
angzarr dlq replay --correlation-id=session-abc-123
# Discard after review
angzarr dlq discard --domain=hand --before=2024-01-15
Saga Rejection Tracking
When a saga command is rejected, the framework records:
┌─────────────────────────────────────────────────┐
│ RejectionNotification │
├─────────────────────────────────────────────────┤
│ rejected_command: DeductFromPlayerStack │
│ rejection_reason: "Insufficient balance" │
│ issuer_name: "saga-hand-player" │
│ issuer_type: "saga" │
│ source_aggregate: {domain: "hand", root: ...} │
│ source_event_sequence: 42 │
└─────────────────────────────────────────────────┘
The rejection flows back to the source aggregate for compensation. The audit trail shows:
- What was attempted
- Why it failed
- How it was compensated
Structured Logging
The framework emits structured logs at key points:
{
"level": "info",
"component": "coordinator",
"domain": "hand",
"aggregate_id": "hand-xyz-789",
"correlation_id": "session-abc-123",
"event": "command_received",
"command_type": "PlayerActed",
"timestamp": "2024-01-15T10:30:00.123Z"
}
{
"level": "warn",
"component": "coordinator",
"domain": "hand",
"aggregate_id": "hand-xyz-789",
"correlation_id": "session-abc-123",
"event": "command_rejected",
"command_type": "PlayerActed",
"reason": "Not player's turn",
"timestamp": "2024-01-15T10:30:00.125Z"
}
Log Levels
| Level | Use |
|---|---|
error | Unrecoverable failures, DLQ entries |
warn | Rejections, retries, compensation |
info | Command/event flow, lifecycle |
debug | State reconstruction, saga routing |
Metrics
The framework exposes metrics for monitoring:
| Metric | Description |
|---|---|
angzarr_commands_total | Commands received, by domain and type |
angzarr_commands_rejected_total | Rejected commands, by reason |
angzarr_events_total | Events persisted |
angzarr_replay_duration_seconds | State reconstruction time |
angzarr_dlq_depth | Dead letter queue depth, by domain |
angzarr_saga_latency_seconds | Saga processing time |
Expose via Prometheus endpoint:
observability:
metrics:
enabled: true
port: 9090
path: /metrics
Tracing Integration
Connect to distributed tracing systems:
observability:
tracing:
enabled: true
exporter: otlp
endpoint: http://jaeger:4317
Each command becomes a span. Child spans track:
- State reconstruction
- Business logic execution
- Event persistence
- Saga dispatch
- Projector updates
Correlation IDs link traces across services.
Debugging a Poker Hand
When a player disputes an outcome:
# Find all events for this hand
angzarr events --domain=hand --root=hand-xyz-789
# Trace the full session
angzarr events --correlation-id=session-abc-123
# Check for rejections
angzarr events --domain=hand --root=hand-xyz-789 --type=RejectionNotification
# Replay the hand to verify
angzarr replay --domain=hand --root=hand-xyz-789 --output=json
The event history is the audit trail. Disputes resolve with facts, not guesses.
Alerting Patterns
Configure alerts for operational issues:
alerts:
- name: dlq_depth_high
condition: angzarr_dlq_depth > 100
severity: warning
- name: rejection_rate_high
condition: rate(angzarr_commands_rejected_total[5m]) > 0.1
severity: warning
- name: replay_slow
condition: angzarr_replay_duration_seconds > 1
severity: info
See Also
- Error recovery — DLQ handling and replay
- Compensation — Saga failure handling
- Operations: Observability — Full observability setup