Skip to main content

Observability and Recovery

Every event carries a correlation ID. Every failure lands in a queue. Nothing silently disappears.


The Philosophy

Event-sourced systems are inherently observable—every state change is recorded. Angzarr extends this with:

  • Correlation IDs linking events across domains
  • Dead letter queues capturing failures
  • Structured tracing through the event flow

When something goes wrong in a poker hand, you can trace the entire flow: from player action to saga to table update to final outcome.


Correlation IDs

Every workflow has a correlation ID that threads through all related events:

illustrative - correlation ID flow
Session starts: correlation_id = "session-abc-123"

PlayerSeated (player domain) → correlation_id
TableJoined (table domain) → correlation_id
HandStarted (hand domain) → correlation_id
PlayerActed (hand domain) → correlation_id
ChipsTransferred (player domain) → correlation_id

Query by correlation ID to see the complete story:

illustrative - correlation query
SELECT * FROM events
WHERE correlation_id = 'session-abc-123'
ORDER BY timestamp;

Propagation

The framework propagates correlation IDs automatically:

illustrative - correlation propagation
# Initial command sets the correlation
send_command(
JoinTable(table_id=table_id),
correlation_id="session-abc-123",
)

# Subsequent sagas inherit it
# TableJoined → saga → PlayerSeated
# All carry the same correlation_id

Dead Letter Queues

Failed events don't disappear. They land in a dead letter queue, tagged with error details:

illustrative - dead letter entry
┌─────────────────────────────────────────────────┐
│ Dead Letter Entry │
├─────────────────────────────────────────────────┤
│ domain: "hand" │
│ event_type: "PlayerActed" │
│ correlation_id: "session-abc-123" │
│ error: "Sequence mismatch: expected 5, got 6" │
│ attempts: 3 │
│ first_failed: 2024-01-15T10:30:00Z │
│ last_failed: 2024-01-15T10:30:02Z │
│ payload: <original event> │
└─────────────────────────────────────────────────┘

Per-Domain Queues

Each domain has its own DLQ:

illustrative - DLQ per domain
angzarr.dlq.player
angzarr.dlq.table
angzarr.dlq.hand

Replay from DLQ

Once you've fixed the issue, replay failed events:

illustrative - DLQ replay
# Replay all failed events for a domain
angzarr dlq replay --domain=hand

# Replay a specific correlation
angzarr dlq replay --correlation-id=session-abc-123

# Discard after review
angzarr dlq discard --domain=hand --before=2024-01-15

Saga Rejection Tracking

When a saga command is rejected, the framework records:

illustrative - rejection notification
┌─────────────────────────────────────────────────┐
│ RejectionNotification │
├─────────────────────────────────────────────────┤
│ rejected_command: DeductFromPlayerStack │
│ rejection_reason: "Insufficient balance" │
│ issuer_name: "saga-hand-player" │
│ issuer_type: "saga" │
│ source_aggregate: {domain: "hand", root: ...} │
│ source_event_sequence: 42 │
└─────────────────────────────────────────────────┘

The rejection flows back to the source aggregate for compensation. The audit trail shows:

  1. What was attempted
  2. Why it failed
  3. How it was compensated

Structured Logging

The framework emits structured logs at key points:

illustrative - structured log (info)
{
"level": "info",
"component": "coordinator",
"domain": "hand",
"aggregate_id": "hand-xyz-789",
"correlation_id": "session-abc-123",
"event": "command_received",
"command_type": "PlayerActed",
"timestamp": "2024-01-15T10:30:00.123Z"
}
illustrative - structured log (warn)
{
"level": "warn",
"component": "coordinator",
"domain": "hand",
"aggregate_id": "hand-xyz-789",
"correlation_id": "session-abc-123",
"event": "command_rejected",
"command_type": "PlayerActed",
"reason": "Not player's turn",
"timestamp": "2024-01-15T10:30:00.125Z"
}

Log Levels

LevelUse
errorUnrecoverable failures, DLQ entries
warnRejections, retries, compensation
infoCommand/event flow, lifecycle
debugState reconstruction, saga routing

Metrics

The framework exposes metrics for monitoring:

MetricDescription
angzarr_commands_totalCommands received, by domain and type
angzarr_commands_rejected_totalRejected commands, by reason
angzarr_events_totalEvents persisted
angzarr_replay_duration_secondsState reconstruction time
angzarr_dlq_depthDead letter queue depth, by domain
angzarr_saga_latency_secondsSaga processing time

Expose via Prometheus endpoint:

illustrative - metrics configuration
observability:
metrics:
enabled: true
port: 9090
path: /metrics

Tracing Integration

Connect to distributed tracing systems:

illustrative - tracing configuration
observability:
tracing:
enabled: true
exporter: otlp
endpoint: http://jaeger:4317

Each command becomes a span. Child spans track:

  • State reconstruction
  • Business logic execution
  • Event persistence
  • Saga dispatch
  • Projector updates

Correlation IDs link traces across services.


Debugging a Poker Hand

When a player disputes an outcome:

illustrative - debugging commands
# Find all events for this hand
angzarr events --domain=hand --root=hand-xyz-789

# Trace the full session
angzarr events --correlation-id=session-abc-123

# Check for rejections
angzarr events --domain=hand --root=hand-xyz-789 --type=RejectionNotification

# Replay the hand to verify
angzarr replay --domain=hand --root=hand-xyz-789 --output=json

The event history is the audit trail. Disputes resolve with facts, not guesses.


Alerting Patterns

Configure alerts for operational issues:

illustrative - alerting rules
alerts:
- name: dlq_depth_high
condition: angzarr_dlq_depth > 100
severity: warning

- name: rejection_rate_high
condition: rate(angzarr_commands_rejected_total[5m]) > 0.1
severity: warning

- name: replay_slow
condition: angzarr_replay_duration_seconds > 1
severity: info

See Also