Skip to content

Observability and Recovery

Every event carries a correlation ID. Every failure lands in a queue. Nothing silently disappears.


Event-sourced systems are inherently observable—every state change is recorded. Angzarr extends this with:

  • Correlation IDs linking events across domains
  • Dead letter queues capturing failures
  • Structured tracing through the event flow

When something goes wrong in a poker hand, you can trace the entire flow: from player action to saga to table update to final outcome.


Every workflow has a correlation ID that threads through all related events:

illustrative - correlation ID flow
Session starts: correlation_id = "session-abc-123"
PlayerSeated (player domain) → correlation_id
TableJoined (table domain) → correlation_id
HandStarted (hand domain) → correlation_id
PlayerActed (hand domain) → correlation_id
ChipsTransferred (player domain) → correlation_id

Query by correlation ID to see the complete story:

illustrative - correlation query
events = event_service.by_correlation("session-abc-123")

The EventService projector exposes correlation lookups over gRPC; the same call is available through the REST gateway.

The framework propagates correlation IDs automatically:

illustrative - correlation propagation
# Initial command sets the correlation
send_command(
JoinTable(table_id=table_id),
correlation_id="session-abc-123",
)
# Subsequent sagas inherit it
# TableJoined → saga → PlayerSeated
# All carry the same correlation_id

Failed events don’t disappear. They land in a dead letter queue, tagged with error details:

Illustrative — Dead Letter Entry:

FieldExample Value
domain"hand"
event_type"PlayerActed"
correlation_id"session-abc-123"
error"Sequence mismatch: expected 5, got 6"
attempts3
first_failed2024-01-15T10:30:00Z
last_failed2024-01-15T10:30:02Z
payload<original event>

Each domain has its own DLQ:

illustrative - DLQ per domain
angzarr.dlq.player
angzarr.dlq.table
angzarr.dlq.hand

Once you’ve fixed the issue, replay failed events:

illustrative - DLQ replay
# Replay all failed events for a domain
angzarr dlq replay --domain=hand
# Replay a specific correlation
angzarr dlq replay --correlation-id=session-abc-123
# Discard after review
angzarr dlq discard --domain=hand --before=2024-01-15

When a saga command is rejected, the framework records:

Illustrative — RejectionNotification:

FieldExample Value
rejected_commandDeductFromPlayerStack
rejection_reason"Insufficient balance"
issuer_name"saga-hand-player"
issuer_type"saga"
source_aggregate{domain: "hand", root: ...}
source_event_sequence42

The rejection flows back to the source aggregate for compensation. The audit trail shows:

  1. What was attempted
  2. Why it failed
  3. How it was compensated

The framework emits structured logs at key points:

illustrative - structured log (info)
{
"level": "info",
"component": "coordinator",
"domain": "hand",
"aggregate_id": "hand-xyz-789",
"correlation_id": "session-abc-123",
"event": "command_received",
"command_type": "PlayerActed",
"timestamp": "2024-01-15T10:30:00.123Z"
}
illustrative - structured log (warn)
{
"level": "warn",
"component": "coordinator",
"domain": "hand",
"aggregate_id": "hand-xyz-789",
"correlation_id": "session-abc-123",
"event": "command_rejected",
"command_type": "PlayerActed",
"reason": "Not player's turn",
"timestamp": "2024-01-15T10:30:00.125Z"
}
LevelUse
errorUnrecoverable failures, DLQ entries
warnRejections, retries, compensation
infoCommand/event flow, lifecycle
debugState reconstruction, saga routing

The framework exposes metrics for monitoring:

MetricDescription
angzarr_commands_totalCommands received, by domain and type
angzarr_commands_rejected_totalRejected commands, by reason
angzarr_events_totalEvents persisted
angzarr_replay_duration_secondsState reconstruction time
angzarr_dlq_depthDead letter queue depth, by domain
angzarr_saga_latency_secondsSaga processing time

Expose via Prometheus endpoint:

illustrative - metrics configuration
observability:
metrics:
enabled: true
port: 9090
path: /metrics

Connect to distributed tracing systems:

illustrative - tracing configuration
observability:
tracing:
enabled: true
exporter: otlp
endpoint: http://jaeger:4317

Each command becomes a span. Child spans track:

  • State reconstruction
  • Business logic execution
  • Event persistence
  • Saga dispatch
  • Projector updates

Correlation IDs link traces across services.


When a player disputes an outcome:

illustrative - debugging commands
# Find all events for this hand
angzarr events --domain=hand --root=hand-xyz-789
# Trace the full session
angzarr events --correlation-id=session-abc-123
# Check for rejections
angzarr events --domain=hand --root=hand-xyz-789 --type=RejectionNotification
# Replay the hand to verify
angzarr replay --domain=hand --root=hand-xyz-789 --output=json

The event history is the audit trail. Disputes resolve with facts, not guesses.


Configure alerts for operational issues:

illustrative - alerting rules
alerts:
- name: dlq_depth_high
condition: angzarr_dlq_depth > 100
severity: warning
- name: rejection_rate_high
condition: rate(angzarr_commands_rejected_total[5m]) > 0.1
severity: warning
- name: replay_slow
condition: angzarr_replay_duration_seconds > 1
severity: info