Observability and Recovery
Every event carries a correlation ID. Every failure lands in a queue. Nothing silently disappears.
The Philosophy
Event-sourced systems are inherently observable: every state change is recorded. Angzarr extends this with:
- Correlation IDs linking events across domains
- Dead letter queues capturing failures
- Structured tracing through the event flow
When something goes wrong in a poker hand, you can trace the entire flow: from player action to saga to table update to final outcome.
Correlation IDs
Every workflow has a correlation ID that threads through all related events:

    Session starts: correlation_id = "session-abc-123"

    PlayerSeated (player domain) → correlation_id
    TableJoined (table domain) → correlation_id
    HandStarted (hand domain) → correlation_id
    PlayerActed (hand domain) → correlation_id
    ChipsTransferred (player domain) → correlation_id

Query by correlation ID to see the complete story:

    events = event_service.by_correlation("session-abc-123")

The EventService projector exposes correlation lookups over gRPC; the same call is available through the REST gateway.
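The lookup above can be pictured with a small in-memory model. This is only a sketch: `InMemoryEventLog`, `Event`, and `follow_up` are hypothetical names for illustration, not Angzarr's API, and a real store would persist events rather than keep them in a dictionary.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    domain: str
    event_type: str
    correlation_id: str


class InMemoryEventLog:
    """Hypothetical stand-in for an event store, indexed by correlation ID."""

    def __init__(self) -> None:
        self._by_correlation = defaultdict(list)

    def append(self, event: Event) -> None:
        self._by_correlation[event.correlation_id].append(event)

    def follow_up(self, cause: Event, domain: str, event_type: str) -> Event:
        # A follow-on event copies the correlation ID of the event that caused it.
        event = Event(domain, event_type, cause.correlation_id)
        self.append(event)
        return event

    def by_correlation(self, correlation_id: str) -> list:
        # Return a copy so callers cannot mutate the index.
        return list(self._by_correlation.get(correlation_id, []))


log = InMemoryEventLog()
seated = Event("player", "PlayerSeated", "session-abc-123")
log.append(seated)
joined = log.follow_up(seated, "table", "TableJoined")
log.follow_up(joined, "hand", "HandStarted")
log.append(Event("player", "PlayerSeated", "session-other-999"))

story = log.by_correlation("session-abc-123")
```

Here `story` holds only the three events from the first session, in the order they were appended; the unrelated session is filtered out by the index.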
Propagation
The framework propagates correlation IDs automatically:

    # Initial command sets the correlation
    send_command(
        JoinTable(table_id=table_id),
        correlation_id="session-abc-123",
    )

    # Subsequent sagas inherit it
    # TableJoined → saga → PlayerSeated
    # All carry the same correlation_id

Dead Letter Queues
Failed events don’t disappear. They land in a dead letter queue, tagged with error details:
Illustrative — Dead Letter Entry:
| Field | Example Value |
|---|---|
| domain | "hand" |
| event_type | "PlayerActed" |
| correlation_id | "session-abc-123" |
| error | "Sequence mismatch: expected 5, got 6" |
| attempts | 3 |
| first_failed | 2024-01-15T10:30:00Z |
| last_failed | 2024-01-15T10:30:02Z |
| payload | <original event> |
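To make the entry's fields concrete, here is a minimal sketch of how such a record could be produced. The retry loop, the `deliver` helper, and the dictionary-shaped event are assumptions for illustration; the source does not describe the framework's retry policy, and the limit of three attempts simply mirrors the example value in the table.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Callable


@dataclass
class DeadLetterEntry:
    domain: str
    event_type: str
    correlation_id: str
    error: str
    attempts: int
    first_failed: datetime
    last_failed: datetime
    payload: dict


def deliver(event: dict, handler: Callable[[dict], None],
            dlq: list, max_attempts: int = 3) -> bool:
    """Try the handler up to max_attempts times; dead-letter on exhaustion."""
    first_failed = None
    last_failed = None
    error = ""
    for _ in range(max_attempts):
        try:
            handler(event)
            return True
        except Exception as exc:  # sketch only: real code would narrow this
            last_failed = datetime.now(timezone.utc)
            first_failed = first_failed or last_failed
            error = str(exc)
    dlq.append(DeadLetterEntry(
        domain=event["domain"],
        event_type=event["event_type"],
        correlation_id=event["correlation_id"],
        error=error,
        attempts=max_attempts,
        first_failed=first_failed,
        last_failed=last_failed,
        payload=event,  # the original event is kept for later replay
    ))
    return False


def always_out_of_sequence(event: dict) -> None:
    raise ValueError("Sequence mismatch: expected 5, got 6")


hand_dlq: list = []  # hypothetical in-memory stand-in for a DLQ
event = {
    "domain": "hand",
    "event_type": "PlayerActed",
    "correlation_id": "session-abc-123",
}
delivered = deliver(event, always_out_of_sequence, hand_dlq)
```

After the call, `hand_dlq` contains one entry carrying the last error message, the attempt count, and the untouched payload.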
Per-Domain Queues
Each domain has its own DLQ:

    angzarr.dlq.player
    angzarr.dlq.table
    angzarr.dlq.hand

Replay from DLQ
Once you’ve fixed the issue, replay failed events:

    # Replay all failed events for a domain
    angzarr dlq replay --domain=hand

    # Replay a specific correlation
    angzarr dlq replay --correlation-id=session-abc-123

    # Discard after review
    angzarr dlq discard --domain=hand --before=2024-01-15

Saga Rejection Tracking
When a saga command is rejected, the framework records:
Illustrative — RejectionNotification:
| Field | Example Value |
|---|---|
| rejected_command | DeductFromPlayerStack |
| rejection_reason | "Insufficient balance" |
| issuer_name | "saga-hand-player" |
| issuer_type | "saga" |
| source_aggregate | {domain: "hand", root: ...} |
| source_event_sequence | 42 |
The rejection flows back to the source aggregate for compensation. The audit trail shows:
- What was attempted
- Why it failed
- How it was compensated
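The three audit questions can be sketched as a single record built from the notification. This is illustrative only: the `compensate` helper and the `ActionReverted` compensation type are hypothetical, and the `root` value is a placeholder borrowed from the hand ID used elsewhere on this page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RejectionNotification:
    rejected_command: str
    rejection_reason: str
    issuer_name: str
    issuer_type: str
    source_aggregate: dict
    source_event_sequence: int


def compensate(notification: RejectionNotification, audit_trail: list) -> dict:
    """Append an audit record covering attempt, reason, and compensation."""
    record = {
        "attempted": notification.rejected_command,
        "reason": notification.rejection_reason,
        # Hypothetical compensation: mark the triggering event as reverted.
        "compensation": {
            "type": "ActionReverted",
            "domain": notification.source_aggregate["domain"],
            "reverts_sequence": notification.source_event_sequence,
        },
    }
    audit_trail.append(record)
    return record


audit_trail: list = []
notification = RejectionNotification(
    rejected_command="DeductFromPlayerStack",
    rejection_reason="Insufficient balance",
    issuer_name="saga-hand-player",
    issuer_type="saga",
    source_aggregate={"domain": "hand", "root": "hand-xyz-789"},  # placeholder root
    source_event_sequence=42,
)
record = compensate(notification, audit_trail)
```

The point of the design is that the source aggregate receives enough context (`source_aggregate` and `source_event_sequence`) to decide on its own how to undo the step that triggered the rejected command.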
Structured Logging
The framework emits structured logs at key points:

    {
      "level": "info",
      "component": "coordinator",
      "domain": "hand",
      "aggregate_id": "hand-xyz-789",
      "correlation_id": "session-abc-123",
      "event": "command_received",
      "command_type": "PlayerActed",
      "timestamp": "2024-01-15T10:30:00.123Z"
    }
    {
      "level": "warn",
      "component": "coordinator",
      "domain": "hand",
      "aggregate_id": "hand-xyz-789",
      "correlation_id": "session-abc-123",
      "event": "command_rejected",
      "command_type": "PlayerActed",
      "reason": "Not player's turn",
      "timestamp": "2024-01-15T10:30:00.125Z"
    }

Log Levels

| Level | Use |
|---|---|
| error | Unrecoverable failures, DLQ entries |
| warn | Rejections, retries, compensation |
| info | Command/event flow, lifecycle |
| debug | State reconstruction, saga routing |
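If your own handlers should log in the same shape, one possible approach is a JSON formatter on top of Python's standard `logging` module. This is a sketch under assumptions: the `JsonFormatter` class, the `fields` key passed through `extra`, and the logger name are all hypothetical, and the framework's actual logger is not shown in the source.

```python
import io
import json
import logging
from datetime import datetime, timezone

# Map stdlib levels onto the level names used in the table above.
LEVEL_NAMES = {
    logging.ERROR: "error",
    logging.WARNING: "warn",
    logging.INFO: "info",
    logging.DEBUG: "debug",
}


class JsonFormatter(logging.Formatter):
    def format(self, record: logging.LogRecord) -> str:
        timestamp = datetime.fromtimestamp(record.created, timezone.utc)
        entry = {
            "level": LEVEL_NAMES.get(record.levelno, record.levelname.lower()),
            "event": record.getMessage(),
            "timestamp": timestamp.isoformat(timespec="milliseconds")
            .replace("+00:00", "Z"),
        }
        # Structured context arrives via logger.info(..., extra={"fields": {...}}).
        entry.update(getattr(record, "fields", {}))
        return json.dumps(entry)


stream = io.StringIO()  # capture output here; a real setup would use stderr
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("angzarr.sketch")  # hypothetical logger name
logger.setLevel(logging.DEBUG)
logger.addHandler(handler)
logger.propagate = False

context = {
    "component": "coordinator",
    "domain": "hand",
    "aggregate_id": "hand-xyz-789",
    "correlation_id": "session-abc-123",
    "command_type": "PlayerActed",
}
logger.info("command_received", extra={"fields": context})
logger.warning(
    "command_rejected",
    extra={"fields": {**context, "reason": "Not player's turn"}},
)

lines = [json.loads(line) for line in stream.getvalue().splitlines()]
```

Keeping the correlation ID in every entry is what lets a log search and an event query line up on the same session.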
Metrics
The framework exposes metrics for monitoring:
| Metric | Description |
|---|---|
| angzarr_commands_total | Commands received, by domain and type |
| angzarr_commands_rejected_total | Rejected commands, by reason |
| angzarr_events_total | Events persisted |
| angzarr_replay_duration_seconds | State reconstruction time |
| angzarr_dlq_depth | Dead letter queue depth, by domain |
| angzarr_saga_latency_seconds | Saga processing time |
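For a sense of what a labeled counter such as `angzarr_commands_total` looks like when scraped, here is a standard-library sketch that renders the Prometheus text exposition format. The `LabeledCounter` class and the label names `domain` and `command_type` are assumptions for illustration (the table only says "by domain and type"), and label values are not escaped here as a production client library would do.

```python
from collections import Counter


class LabeledCounter:
    """Hypothetical minimal counter; not the framework's metrics implementation."""

    def __init__(self, name: str, help_text: str) -> None:
        self.name = name
        self.help_text = help_text
        self._values = Counter()

    def inc(self, **labels: str) -> None:
        # Sort label pairs so the same label set always maps to the same key.
        self._values[tuple(sorted(labels.items()))] += 1

    def render(self) -> str:
        lines = [
            f"# HELP {self.name} {self.help_text}",
            f"# TYPE {self.name} counter",
        ]
        for labels, value in sorted(self._values.items()):
            label_text = ",".join(f'{key}="{val}"' for key, val in labels)
            lines.append(f"{self.name}{{{label_text}}} {value}")
        return "\n".join(lines)


commands_total = LabeledCounter(
    "angzarr_commands_total", "Commands received, by domain and type"
)
commands_total.inc(domain="hand", command_type="PlayerActed")
commands_total.inc(domain="hand", command_type="PlayerActed")
commands_total.inc(domain="table", command_type="JoinTable")

exposition = commands_total.render()
print(exposition)
```

Each distinct label combination becomes its own time series, which is why high-cardinality values such as correlation IDs are usually kept out of metric labels and left to logs and traces.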
Expose via Prometheus endpoint:
    observability:
      metrics:
        enabled: true
        port: 9090
        path: /metrics

Tracing Integration
Connect to distributed tracing systems:

    observability:
      tracing:
        enabled: true
        exporter: otlp
        endpoint: http://jaeger:4317

Each command becomes a span. Child spans track:
- State reconstruction
- Business logic execution
- Event persistence
- Saga dispatch
- Projector updates
Correlation IDs link traces across services.
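The parent and child span structure described above can be modeled in a few lines. This is a standard-library sketch, not OpenTelemetry and not the framework's tracer: the `Tracer` and `Span` classes and the step names are hypothetical, and a real setup would export spans through the OTLP exporter configured above.

```python
import time
from contextlib import contextmanager
from dataclasses import dataclass, field


@dataclass
class Span:
    name: str
    correlation_id: str
    children: list = field(default_factory=list)
    duration_s: float = 0.0


class Tracer:
    """Hypothetical tracer that nests spans using a simple stack."""

    def __init__(self, correlation_id: str) -> None:
        self.correlation_id = correlation_id
        self.root = None
        self._stack = []

    @contextmanager
    def span(self, name: str):
        current = Span(name, self.correlation_id)
        if self._stack:
            # Attach to whichever span is currently open.
            self._stack[-1].children.append(current)
        else:
            self.root = current
        self._stack.append(current)
        started = time.perf_counter()
        try:
            yield current
        finally:
            current.duration_s = time.perf_counter() - started
            self._stack.pop()


STEPS = (
    "state_reconstruction",
    "business_logic",
    "event_persistence",
    "saga_dispatch",
    "projector_update",
)

tracer = Tracer("session-abc-123")
with tracer.span("PlayerActed"):  # one span per command
    for step in STEPS:
        with tracer.span(step):
            pass  # the real work for each step would happen here

child_names = [child.name for child in tracer.root.children]
```

Because every span carries the same correlation ID, spans recorded by different services can be joined into one view of the workflow.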
Debugging a Poker Hand
When a player disputes an outcome:

    # Find all events for this hand
    angzarr events --domain=hand --root=hand-xyz-789

    # Trace the full session
    angzarr events --correlation-id=session-abc-123

    # Check for rejections
    angzarr events --domain=hand --root=hand-xyz-789 --type=RejectionNotification

    # Replay the hand to verify
    angzarr replay --domain=hand --root=hand-xyz-789 --output=json

The event history is the audit trail. Disputes resolve with facts, not guesses.
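Replaying a hand amounts to folding its events, in order, into a state. The sketch below shows the idea under stated assumptions: the event dictionaries, their `player` and `amount` fields, the `HandState` shape, and sequences starting at 1 are all hypothetical and chosen only to keep the example small.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HandState:
    sequence: int = 0
    pot: int = 0
    actions: tuple = ()


def apply(state: HandState, event: dict) -> HandState:
    """Apply one event, refusing gaps or reordering in the sequence."""
    expected = state.sequence + 1
    if event["sequence"] != expected:
        raise ValueError(
            f"Sequence mismatch: expected {expected}, got {event['sequence']}"
        )
    if event["type"] == "PlayerActed":
        action = (event["player"], event["amount"])
        return HandState(expected, state.pot + event["amount"],
                         state.actions + (action,))
    # Other event types only advance the sequence in this sketch.
    return HandState(expected, state.pot, state.actions)


def replay(events: list) -> HandState:
    state = HandState()
    for event in events:
        state = apply(state, event)
    return state


history = [
    {"sequence": 1, "type": "HandStarted"},
    {"sequence": 2, "type": "PlayerActed", "player": "alice", "amount": 50},
    {"sequence": 3, "type": "PlayerActed", "player": "bob", "amount": 100},
]
final_state = replay(history)
```

Since the state is derived purely from the recorded events, replaying the same history always yields the same result, which is what makes the history usable as evidence in a dispute.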
Alerting Patterns
Configure alerts for operational issues:

    alerts:
      - name: dlq_depth_high
        condition: angzarr_dlq_depth > 100
        severity: warning

      - name: rejection_rate_high
        condition: rate(angzarr_commands_rejected_total[5m]) > 0.1
        severity: warning

      - name: replay_slow
        condition: angzarr_replay_duration_seconds > 1
        severity: info

See Also
- Error recovery: DLQ handling and replay
- Compensation: Saga failure handling
- Operations: Observability (full observability setup)