# Observability Infrastructure
How we instrument, trace, and debug z0 in production.
Prerequisites: PRINCIPLES.md, PRIMITIVES.md
## Overview
z0 has two layers of observability:
| Layer | What It Captures | Storage | Query Pattern |
|---|---|---|---|
| Business Events | Facts (invocation, outcome, charge) | Durable Object ledgers → D1 | "What happened economically?" |
| Execution Events | Traces, spans, metrics | Workers Trace Events → Analytics Engine | "How did execution perform?" |
Key Insight: Facts tell you what happened. Traces tell you how it happened. You need both to debug production issues.
## The Three Pillars
### 1. Facts (Business Events)
Facts are the source of truth for business events. They answer:
- What invocations occurred?
- What outcomes resolved?
- What charges, costs, payouts were recorded?
Facts are already defined in Layer 0; they are stored in DO ledgers and replicated to D1 for reporting.
Limitation: Facts don’t capture execution details. When a charge Fact write fails, Facts alone can’t tell you why.
### 2. Traces (Execution Events)
Traces capture the execution path of a request across services. They answer:
- How did this request flow through the system?
- Where did latency accumulate?
- Which component failed and why?
Infrastructure: Cloudflare Workers Trace Events
```
Request → Worker → Durable Object → D1 → Response
   │         │           │           │       │
   └─ span ──┴── span ───┴── span ───┴─ span ─┘
   └───────────────── trace_id ───────────────┘
```

Trace ID Propagation (a Worker sketch follows the list below):
- Every incoming request gets a trace_id (from header or generated)
- trace_id flows through all downstream calls
- Facts may optionally include trace_id for correlation
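
A minimal sketch of this propagation inside a Worker, assuming a hypothetical `x-trace-id` header name and an illustrative `LEDGER` Durable Object binding (neither is part of z0's documented surface):

```ts
interface Env {
  LEDGER: DurableObjectNamespace;
}

export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    // Reuse the caller's trace_id if present, otherwise mint one.
    const traceId = request.headers.get("x-trace-id") ?? crypto.randomUUID();

    // Forward the trace_id on every downstream call so all spans share one trace.
    const stub = env.LEDGER.get(env.LEDGER.idFromName("account_123"));
    const downstream = await stub.fetch("https://do/append_fact", {
      method: "POST",
      headers: { "x-trace-id": traceId },
      body: JSON.stringify({ type: "invocation", data: { trace_id: traceId } }),
    });

    // Echo the trace_id so the caller can correlate its own logs with Facts.
    return new Response(await downstream.text(), {
      headers: { "x-trace-id": traceId },
    });
  },
};
```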
### 3. Metrics (Aggregate Measurements)
Metrics capture aggregate measurements over time. They answer:
- How many invocations per second?
- What’s the p99 latency for routing decisions?
- How many reconciliation failures this hour?
Infrastructure: Cloudflare Analytics Engine
- High-cardinality support (tenant_id, asset_id, tool_id dimensions)
- 90-day retention for operational queries
- Sub-second query latency
## Instrumentation Strategy
### What to Instrument
| Event Type | Instrumentation | Storage |
|---|---|---|
| Fact writes | Fact + metric | DO ledger + Analytics Engine |
| Fact write failures | Trace span + metric + alert | Workers Trace + Analytics Engine |
| Config reads | Metric (count, latency) | Analytics Engine |
| Config updates | Fact (lifecycle) + metric | DO ledger + Analytics Engine |
| Cached state reads | Metric (hit/miss, latency) | Analytics Engine |
| Cached state reconciliation | Fact (if mismatch) + metric | DO ledger + Analytics Engine |
| External tool calls | Trace span + Fact (invocation) | Workers Trace + DO ledger |
| Routing decisions | Trace span + metric | Workers Trace + Analytics Engine |
| Errors | Trace span + metric + alert | Workers Trace + Analytics Engine |
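
As a sketch of the "Fact + metric" pattern in the first two rows: write the Fact to the DO ledger, record a data point either way, and surface failures. The `LEDGER` and `FACT_METRICS` binding names and the `append_fact` endpoint are illustrative; `writeDataPoint` is the Analytics Engine binding API.

```ts
interface Env {
  LEDGER: DurableObjectNamespace;
  FACT_METRICS: AnalyticsEngineDataset;
}

async function writeFactWithMetric(env: Env, tenantId: string, fact: unknown) {
  const start = Date.now();
  const stub = env.LEDGER.get(env.LEDGER.idFromName(tenantId));
  const res = await stub.fetch("https://do/append_fact", {
    method: "POST",
    body: JSON.stringify(fact),
  });

  // One data point per write: outcome label and duration, keyed by tenant.
  env.FACT_METRICS.writeDataPoint({
    indexes: [tenantId],               // high-cardinality dimension
    blobs: [res.ok ? "ok" : "failed"], // outcome label
    doubles: [Date.now() - start],     // feeds z0_facts_write_duration_ms
  });

  if (!res.ok) throw new Error(`fact write failed: ${res.status}`);
}
```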
### Span Naming Convention
```
z0.{layer}.{component}.{operation}
```
Examples:

```
z0.worker.router.evaluate_eligibility
z0.do.ledger.append_fact
z0.do.cache.read_budget_state
z0.d1.query.facts_by_tenant
z0.external.twilio.create_call
z0.external.openai.completion
```
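
A tiny helper can keep span names on this convention. This is only a sketch; the `spanName` function and the layer list are illustrative, not part of z0:

```ts
type Layer = "worker" | "do" | "d1" | "external";

// Builds names of the form z0.{layer}.{component}.{operation}.
function spanName(layer: Layer, component: string, operation: string): string {
  return `z0.${layer}.${component}.${operation}`;
}

// spanName("do", "ledger", "append_fact") === "z0.do.ledger.append_fact"
```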
### Metric Naming Convention
```
z0_{component}_{measurement}_{unit}
```
Examples:

```
z0_facts_written_total              // counter
z0_facts_write_duration_ms          // histogram
z0_cache_hits_total                 // counter
z0_cache_misses_total               // counter
z0_routing_decisions_total          // counter
z0_reconciliation_mismatches_total  // counter
```
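
To show how a named metric lands in Analytics Engine, here is a hedged sketch of recording cache hits and misses with a tenant dimension. The `CACHE_METRICS` binding name and the blob/double layout are choices made for this sketch, not z0 requirements:

```ts
function recordCacheRead(
  metrics: AnalyticsEngineDataset,
  tenantId: string,
  hit: boolean,
  durationMs: number,
) {
  metrics.writeDataPoint({
    indexes: [tenantId],                                            // dimension
    blobs: [hit ? "z0_cache_hits_total" : "z0_cache_misses_total"], // metric name
    doubles: [1, durationMs],                                       // count, latency
  });
}
```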
## Correlation: Facts ↔ Traces
### Linking Business Events to Execution
Facts can optionally include trace_id for correlation:
```
Fact {
  id: "fact_123",
  type: "invocation",
  ...
  data: {
    trace_id: "abc-123-def",   // Optional: links to execution trace
    ...
  }
}
```

When to include trace_id (see the sketch after this list):
- Always for invocation Facts (links to tool call trace)
- Always for error/failure scenarios
- Optional for outcome Facts (outcome may be async)
- Never required for charge/cost/payout (derived from outcomes)
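
A sketch of enforcing the "always for invocation Facts" rule when building a Fact. The Fact shape here is abbreviated from the example above, and the `makeInvocationFact` helper is illustrative:

```ts
interface Fact {
  id: string;
  type: "invocation" | "outcome" | "charge" | "cost" | "payout" | "reconciliation";
  data: Record<string, unknown> & { trace_id?: string };
}

function makeInvocationFact(traceId: string, data: Record<string, unknown>): Fact {
  return {
    id: `fact_${crypto.randomUUID()}`,
    type: "invocation",
    // Invocation Facts always carry the trace_id so the tool-call trace is reachable.
    data: { ...data, trace_id: traceId },
  };
}
```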
### Debugging Workflow
```
1. Alert fires: "charge write failures elevated"

2. Check metrics:
   → z0_facts_write_failures_total by tenant_id, fact_type
   → Identify affected tenants

3. Query failed Facts:
   → SELECT * FROM facts WHERE type = 'charge' AND status = 'failed'
   → Get trace_ids from data field

4. Pull traces:
   → Query Workers Trace Events by trace_id
   → See full execution path, identify failure point

5. Root cause:
   → Trace shows: DO.append_fact → timeout after 30s
   → DO was overloaded, need to shard
```
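
Step 3 can be scripted against D1. A sketch, assuming the replicated `facts` table stores the Fact's `data` field as JSON text and has `status` and `timestamp` columns; the `DB` binding name is illustrative:

```ts
interface Env {
  DB: D1Database;
}

async function failedChargeTraceIds(env: Env, sinceIso: string): Promise<string[]> {
  const { results } = await env.DB
    .prepare(
      `SELECT json_extract(data, '$.trace_id') AS trace_id
         FROM facts
        WHERE type = 'charge'
          AND status = 'failed'
          AND timestamp >= ?`,
    )
    .bind(sinceIso)
    .all<{ trace_id: string | null }>();

  // Drop Facts that were written without a trace_id.
  return results.map((r) => r.trace_id).filter((t): t is string => t !== null);
}
```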
## Reconciliation Observability
When cached state diverges from Facts, we need to know.
### Reconciliation Fact
When reconciliation finds a mismatch, record it:
```
Fact {
  type: "reconciliation",
  subtype: "mismatch_detected",
  entity_id: "account_123",
  timestamp: T,
  data: {
    cache_type: "BudgetState",
    cached_value: { remaining: 500 },
    calculated_value: { remaining: 450 },
    delta: { remaining: -50 },
    resolution: "cache_updated"   // or "alert_raised", "manual_review"
  }
}
```
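
A sketch of the check that produces this Fact: compare the cached BudgetState against the value recomputed from Facts, and on mismatch emit the Fact plus the mismatch metric. The `appendFact` callback and the metrics binding are assumptions made for this sketch:

```ts
interface BudgetState { remaining: number }

async function reconcileBudget(
  entityId: string,
  cached: BudgetState,
  calculated: BudgetState,                    // derived by replaying Facts
  metrics: AnalyticsEngineDataset,
  appendFact: (fact: object) => Promise<void>,
) {
  if (cached.remaining === calculated.remaining) return;

  // Record the divergence as a Fact so it is part of the durable history.
  await appendFact({
    type: "reconciliation",
    subtype: "mismatch_detected",
    entity_id: entityId,
    timestamp: new Date().toISOString(),
    data: {
      cache_type: "BudgetState",
      cached_value: cached,
      calculated_value: calculated,
      delta: { remaining: calculated.remaining - cached.remaining },
      resolution: "cache_updated",
    },
  });

  // Feeds z0_reconciliation_mismatches_total{cache_type, resolution}.
  metrics.writeDataPoint({
    indexes: [entityId],
    blobs: ["BudgetState", "cache_updated"],
    doubles: [1],
  });
}
```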
### Reconciliation Metrics
```
z0_reconciliation_runs_total
z0_reconciliation_mismatches_total{cache_type, resolution}
z0_reconciliation_duration_ms
z0_reconciliation_facts_scanned_total
```
### Alert Thresholds
| Metric | Warning | Critical |
|---|---|---|
| Mismatch rate | > 0.1% | > 1% |
| Reconciliation duration | > 60s | > 300s |
| Reconciliation failures | > 0 | > 10/hour |
## Fact Replication Observability
Facts flow: DO Ledger → Queue → D1
### Replication Metrics
```
z0_replication_lag_ms           // Time from DO write to D1 availability
z0_replication_queue_depth      // Pending Facts in queue
z0_replication_failures_total   // Failed D1 writes
z0_replication_retries_total    // Retried writes
```
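
A sketch of the Queue → D1 leg with the lag and failure metrics wired in. It assumes each queued Fact carries a `written_at` timestamp set by the DO, and the `DB` / `REPL_METRICS` binding names and table columns are illustrative:

```ts
interface ReplicatedFact { id: string; type: string; written_at: number; payload: string }

interface Env {
  DB: D1Database;
  REPL_METRICS: AnalyticsEngineDataset;
}

export default {
  async queue(batch: MessageBatch<ReplicatedFact>, env: Env): Promise<void> {
    for (const msg of batch.messages) {
      const fact = msg.body;
      try {
        await env.DB
          .prepare("INSERT OR IGNORE INTO facts (id, type, data) VALUES (?, ?, ?)")
          .bind(fact.id, fact.type, fact.payload)
          .run();

        // z0_replication_lag_ms: time from DO write to D1 availability.
        env.REPL_METRICS.writeDataPoint({
          blobs: ["z0_replication_lag_ms"],
          doubles: [Date.now() - fact.written_at],
        });
        msg.ack();
      } catch {
        // z0_replication_failures_total; Queues redelivers the message on retry().
        env.REPL_METRICS.writeDataPoint({
          blobs: ["z0_replication_failures_total"],
          doubles: [1],
        });
        msg.retry();
      }
    }
  },
};
```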
### Replication Health
| Metric | Healthy | Degraded | Critical |
|---|---|---|---|
| Lag | < 1s | < 10s | > 60s |
| Queue depth | < 1000 | < 10000 | > 100000 |
| Failure rate | 0% | < 0.1% | > 1% |
## Alerting Strategy
### Alert Hierarchy
```
Level 1: Metrics threshold breach
  → Auto-generated from metric rules
  → Goes to on-call

Level 2: Anomaly detection
  → Deviation from baseline
  → Goes to on-call + engineering lead

Level 3: Business impact
  → "No charges recorded for tenant X in 1 hour"
  → Goes to on-call + tenant success + engineering lead
```
### Critical Alerts (Page Immediately)
- Fact write failure rate > 1%
- Reconciliation mismatch rate > 1%
- Replication lag > 60s
- External tool (Twilio, OpenAI) error rate > 5%
- Zero invocations for active tenant > 15 minutes
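
The first threshold above can be evaluated from the Analytics Engine SQL API. A sketch, assuming a `fact_metrics` dataset whose `blob1` holds the write outcome label from the earlier instrumentation sketch; the dataset name, column mapping, and wherever this runs (for example a cron Worker) are assumptions:

```ts
async function factWriteFailureRate(accountId: string, apiToken: string): Promise<number> {
  const sql = `
    SELECT blob1 AS outcome, sum(_sample_interval) AS events
    FROM fact_metrics
    WHERE timestamp > NOW() - INTERVAL '5' MINUTE
    GROUP BY outcome`;

  const res = await fetch(
    `https://api.cloudflare.com/client/v4/accounts/${accountId}/analytics_engine/sql`,
    { method: "POST", headers: { Authorization: `Bearer ${apiToken}` }, body: sql },
  );
  const { data } = (await res.json()) as { data: { outcome: string; events: number }[] };

  const total = data.reduce((n, r) => n + r.events, 0);
  const failed = data.find((r) => r.outcome === "failed")?.events ?? 0;
  return total === 0 ? 0 : failed / total; // page when this exceeds 0.01
}
```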
### Warning Alerts (Slack, Review in Business Hours)
- Fact write failure rate > 0.1%
- Reconciliation mismatch rate > 0.1%
- Replication lag > 10s
- Cache miss rate > 50%
- Config version gaps detected
## Dashboards
### Operations Dashboard
Real-time view of system health:
```
┌────────────────────────────────────────────────────────────┐
│ z0 Operations                                    [Last 1h] │
├────────────────────────────────────────────────────────────┤
│ Facts/sec    Invocations   Outcomes   Charges   Errors     │
│ [====    ]   [======  ]    [====    ] [=====]   [      ]   │
│ 1,234        890           456        234       2          │
├────────────────────────────────────────────────────────────┤
│ Latency p50/p99      Cache Hit Rate      Replication       │
│ 12ms / 89ms          94.2%               Lag: 0.3s         │
├────────────────────────────────────────────────────────────┤
│ Active Tenants: 47    Active DOs: 2,341    Queue Depth: 89 │
└────────────────────────────────────────────────────────────┘
```
### Tenant Dashboard
Per-tenant view for debugging:
```
┌────────────────────────────────────────────────────────────┐
│ Tenant: Acme Agency                             [Last 24h] │
├────────────────────────────────────────────────────────────┤
│ Invocations: 12,345    Outcomes: 5,678    Revenue: $4,521  │
├────────────────────────────────────────────────────────────┤
│ Top Assets          │ Top Errors            │ Budget Usage │
│ +1-555-0001: 4,521  │ timeout: 12           │ [========  ] │
│ +1-555-0002: 3,210  │ qualification_fail: 89│ $8,000/$10k  │
│ +1-555-0003: 2,100  │ routing_no_buyer: 45  │              │
└────────────────────────────────────────────────────────────┘
```
## Implementation Checklist
Before shipping products on z0:
- Workers Trace Events enabled on all Workers
- Analytics Engine namespace configured
- Fact write metrics instrumented
- Cache read/write metrics instrumented
- Reconciliation metrics instrumented
- Replication metrics instrumented
- trace_id propagation implemented
- Critical alerts configured
- Operations dashboard deployed
- Runbook for each critical alert
## Summary
| Question | Tool |
|---|---|
| What happened? | Facts (D1 query) |
| How did it happen? | Traces (Workers Trace Events) |
| How often/fast? | Metrics (Analytics Engine) |
| Is something wrong? | Alerts (metric thresholds) |
| What’s the current state? | Dashboards (real-time) |
Facts alone cannot answer “why did this fail?” Traces alone cannot answer “what was the business impact?” You need both layers working together for production debugging.