# Observability Infrastructure
How we instrument, trace, and debug z0 in production.
Prerequisites: PRINCIPLES.md, PRIMITIVES.md
## Overview
z0 has two layers of observability:
| Layer | What It Captures | Storage | Query Pattern |
|---|---|---|---|
| Business Events | Facts (invocation, outcome, charge) | Durable Object ledgers → D1 | "What happened economically?" |
| Execution Events | Traces, spans, metrics | Workers Trace Events → Analytics Engine | "How did execution perform?" |
Key Insight: Facts tell you what happened. Traces tell you how it happened. You need both to debug production issues.
## The Three Pillars
### 1. Facts (Business Events)
Facts are the source of truth for business events. They answer:
- What invocations occurred?
- What outcomes resolved?
- What charges, costs, payouts were recorded?
Facts are already defined in Layer 0; they are stored in DO ledgers and replicated to D1 for reporting.
Limitation: Facts don’t capture execution details. When a charge Fact write fails, Facts alone can’t tell you why.
### 2. Traces (Execution Events)
Traces capture the execution path of a request across services. They answer:
- How did this request flow through the system?
- Where did latency accumulate?
- Which component failed and why?
Infrastructure: Cloudflare Workers Trace Events
```
Request → Worker → Durable Object → D1 → Response
   │         │           │           │       │
   └─ span ──┴── span ───┴── span ───┴─ span ─┘
   └───────────────── trace_id ───────────────┘
```

Trace ID Propagation (a Worker sketch follows the list below):
- Every incoming request gets a trace_id (from header or generated)
- trace_id flows through all downstream calls
- Facts may optionally include trace_id for correlation
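
A minimal sketch of this propagation inside a Worker, assuming a hypothetical `x-trace-id` header name and an illustrative `LEDGER` Durable Object binding (neither is part of z0's documented surface):

```ts
interface Env {
  LEDGER: DurableObjectNamespace;
}

export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    // Reuse the caller's trace_id if present, otherwise mint one.
    const traceId = request.headers.get("x-trace-id") ?? crypto.randomUUID();

    // Forward the trace_id on every downstream call so all spans share one trace.
    const stub = env.LEDGER.get(env.LEDGER.idFromName("account_123"));
    const downstream = await stub.fetch("https://do/append_fact", {
      method: "POST",
      headers: { "x-trace-id": traceId },
      body: JSON.stringify({ type: "invocation", data: { trace_id: traceId } }),
    });

    // Echo the trace_id so the caller can correlate its own logs with Facts.
    return new Response(await downstream.text(), {
      headers: { "x-trace-id": traceId },
    });
  },
};
```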
### 3. Metrics (Aggregate Measurements)
Metrics capture aggregate measurements over time. They answer:
- How many invocations per second?
- What’s the p99 latency for routing decisions?
- How many reconciliation failures this hour?
Infrastructure: Cloudflare Analytics Engine
- High-cardinality support (tenant_id, asset_id, tool_id dimensions)
- 90-day retention for operational queries
- Sub-second query latency
## Instrumentation Strategy
### What to Instrument
| Event Type | Instrumentation | Storage |
|---|---|---|
| Fact writes | Fact + metric | DO ledger + Analytics Engine |
| Fact write failures | Trace span + metric + alert | Workers Trace + Analytics Engine |
| Config reads | Metric (count, latency) | Analytics Engine |
| Config updates | Fact (lifecycle) + metric | DO ledger + Analytics Engine |
| Cached state reads | Metric (hit/miss, latency) | Analytics Engine |
| Cached state reconciliation | Fact (if mismatch) + metric | DO ledger + Analytics Engine |
| External tool calls | Trace span + Fact (invocation) | Workers Trace + DO ledger |
| Routing decisions | Trace span + metric | Workers Trace + Analytics Engine |
| Errors | Trace span + metric + alert | Workers Trace + Analytics Engine |
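
As a sketch of the "Fact + metric" pattern in the first two rows: write the Fact to the DO ledger, record a data point either way, and surface failures. The `LEDGER` and `FACT_METRICS` binding names and the `append_fact` endpoint are illustrative; `writeDataPoint` is the Analytics Engine binding API.

```ts
interface Env {
  LEDGER: DurableObjectNamespace;
  FACT_METRICS: AnalyticsEngineDataset;
}

async function writeFactWithMetric(env: Env, tenantId: string, fact: unknown) {
  const start = Date.now();
  const stub = env.LEDGER.get(env.LEDGER.idFromName(tenantId));
  const res = await stub.fetch("https://do/append_fact", {
    method: "POST",
    body: JSON.stringify(fact),
  });

  // One data point per write: outcome label and duration, keyed by tenant.
  env.FACT_METRICS.writeDataPoint({
    indexes: [tenantId],               // high-cardinality dimension
    blobs: [res.ok ? "ok" : "failed"], // outcome label
    doubles: [Date.now() - start],     // feeds z0_facts_write_duration_ms
  });

  if (!res.ok) throw new Error(`fact write failed: ${res.status}`);
}
```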
### Span Naming Convention
```
z0.{layer}.{component}.{operation}
```
Examples:

```
z0.worker.router.evaluate_eligibility
z0.do.ledger.append_fact
z0.do.cache.read_budget_state
z0.d1.query.facts_by_tenant
z0.external.twilio.create_call
z0.external.openai.completion
```
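
A tiny helper can keep span names on this convention. This is only a sketch; the `spanName` function and the layer list are illustrative, not part of z0:

```ts
type Layer = "worker" | "do" | "d1" | "external";

// Builds names of the form z0.{layer}.{component}.{operation}.
function spanName(layer: Layer, component: string, operation: string): string {
  return `z0.${layer}.${component}.${operation}`;
}

// spanName("do", "ledger", "append_fact") === "z0.do.ledger.append_fact"
```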
### Metric Naming Convention
```
z0_{component}_{measurement}_{unit}
```
Examples:

```
z0_facts_written_total              // counter
z0_facts_write_duration_ms          // histogram
z0_cache_hits_total                 // counter
z0_cache_misses_total               // counter
z0_routing_decisions_total          // counter
z0_reconciliation_mismatches_total  // counter
```
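
To show how a named metric lands in Analytics Engine, here is a hedged sketch of recording cache hits and misses with a tenant dimension. The `CACHE_METRICS` binding name and the blob/double layout are choices made for this sketch, not z0 requirements:

```ts
function recordCacheRead(
  metrics: AnalyticsEngineDataset,
  tenantId: string,
  hit: boolean,
  durationMs: number,
) {
  metrics.writeDataPoint({
    indexes: [tenantId],                                            // dimension
    blobs: [hit ? "z0_cache_hits_total" : "z0_cache_misses_total"], // metric name
    doubles: [1, durationMs],                                       // count, latency
  });
}
```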
## Correlation: Facts ↔ Traces
### Linking Business Events to Execution
Facts can optionally include trace_id for correlation:
```
Fact {
  id: "fact_123",
  type: "invocation",
  ...
  data: {
    trace_id: "abc-123-def",   // Optional: links to execution trace
    ...
  }
}
```

When to include trace_id (see the sketch after this list):
- Always for invocation Facts (links to tool call trace)
- Always for error/failure scenarios
- Optional for outcome Facts (outcome may be async)
- Never required for charge/cost/payout (derived from outcomes)
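
A sketch of enforcing the "always for invocation Facts" rule when building a Fact. The Fact shape here is abbreviated from the example above, and the `makeInvocationFact` helper is illustrative:

```ts
interface Fact {
  id: string;
  type: "invocation" | "outcome" | "charge" | "cost" | "payout" | "reconciliation";
  data: Record<string, unknown> & { trace_id?: string };
}

function makeInvocationFact(traceId: string, data: Record<string, unknown>): Fact {
  return {
    id: `fact_${crypto.randomUUID()}`,
    type: "invocation",
    // Invocation Facts always carry the trace_id so the tool-call trace is reachable.
    data: { ...data, trace_id: traceId },
  };
}
```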
### Debugging Workflow
```
1. Alert fires: "charge write failures elevated"

2. Check metrics:
   → z0_facts_write_failures_total by tenant_id, fact_type
   → Identify affected tenants

3. Query failed Facts:
   → SELECT * FROM facts WHERE type = 'charge' AND status = 'failed'
   → Get trace_ids from data field

4. Pull traces:
   → Query Workers Trace Events by trace_id
   → See full execution path, identify failure point

5. Root cause:
   → Trace shows: DO.append_fact → timeout after 30s
   → DO was overloaded, need to shard
```
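
Step 3 can be scripted against D1. A sketch, assuming the replicated `facts` table stores the Fact's `data` field as JSON text and has `status` and `timestamp` columns; the `DB` binding name is illustrative:

```ts
interface Env {
  DB: D1Database;
}

async function failedChargeTraceIds(env: Env, sinceIso: string): Promise<string[]> {
  const { results } = await env.DB
    .prepare(
      `SELECT json_extract(data, '$.trace_id') AS trace_id
         FROM facts
        WHERE type = 'charge'
          AND status = 'failed'
          AND timestamp >= ?`,
    )
    .bind(sinceIso)
    .all<{ trace_id: string | null }>();

  // Drop Facts that were written without a trace_id.
  return results.map((r) => r.trace_id).filter((t): t is string => t !== null);
}
```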
## Reconciliation Observability
When cached state diverges from Facts, we need to know.
### Reconciliation Fact
When reconciliation finds a mismatch, record it:
```
Fact {
  type: "reconciliation",
  subtype: "mismatch_detected",
  entity_id: "account_123",
  timestamp: T,
  data: {
    cache_type: "BudgetState",
    cached_value: { remaining: 500 },
    calculated_value: { remaining: 450 },
    delta: { remaining: -50 },
    resolution: "cache_updated"   // or "alert_raised", "manual_review"
  }
}
```
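
A sketch of the check that produces this Fact: compare the cached BudgetState against the value recomputed from Facts, and on mismatch emit the Fact plus the mismatch metric. The `appendFact` callback and the metrics binding are assumptions made for this sketch:

```ts
interface BudgetState { remaining: number }

async function reconcileBudget(
  entityId: string,
  cached: BudgetState,
  calculated: BudgetState,                    // derived by replaying Facts
  metrics: AnalyticsEngineDataset,
  appendFact: (fact: object) => Promise<void>,
) {
  if (cached.remaining === calculated.remaining) return;

  // Record the divergence as a Fact so it is part of the durable history.
  await appendFact({
    type: "reconciliation",
    subtype: "mismatch_detected",
    entity_id: entityId,
    timestamp: new Date().toISOString(),
    data: {
      cache_type: "BudgetState",
      cached_value: cached,
      calculated_value: calculated,
      delta: { remaining: calculated.remaining - cached.remaining },
      resolution: "cache_updated",
    },
  });

  // Feeds z0_reconciliation_mismatches_total{cache_type, resolution}.
  metrics.writeDataPoint({
    indexes: [entityId],
    blobs: ["BudgetState", "cache_updated"],
    doubles: [1],
  });
}
```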
### Reconciliation Metrics
```
z0_reconciliation_runs_total
z0_reconciliation_mismatches_total{cache_type, resolution}
z0_reconciliation_duration_ms
z0_reconciliation_facts_scanned_total
```
### Alert Thresholds
| Metric | Warning | Critical |
|---|---|---|
| Mismatch rate | > 0.1% | > 1% |
| Reconciliation duration | > 60s | > 300s |
| Reconciliation failures | > 0 | > 10/hour |
## Fact Replication Observability
Facts flow: DO Ledger → Queue → D1
### Replication Metrics
```
z0_replication_lag_ms           // Time from DO write to D1 availability
z0_replication_queue_depth      // Pending Facts in queue
z0_replication_failures_total   // Failed D1 writes
z0_replication_retries_total    // Retried writes
```
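
A sketch of the Queue → D1 leg with the lag and failure metrics wired in. It assumes each queued Fact carries a `written_at` timestamp set by the DO, and the `DB` / `REPL_METRICS` binding names and table columns are illustrative:

```ts
interface ReplicatedFact { id: string; type: string; written_at: number; payload: string }

interface Env {
  DB: D1Database;
  REPL_METRICS: AnalyticsEngineDataset;
}

export default {
  async queue(batch: MessageBatch<ReplicatedFact>, env: Env): Promise<void> {
    for (const msg of batch.messages) {
      const fact = msg.body;
      try {
        await env.DB
          .prepare("INSERT OR IGNORE INTO facts (id, type, data) VALUES (?, ?, ?)")
          .bind(fact.id, fact.type, fact.payload)
          .run();

        // z0_replication_lag_ms: time from DO write to D1 availability.
        env.REPL_METRICS.writeDataPoint({
          blobs: ["z0_replication_lag_ms"],
          doubles: [Date.now() - fact.written_at],
        });
        msg.ack();
      } catch {
        // z0_replication_failures_total; Queues redelivers the message on retry().
        env.REPL_METRICS.writeDataPoint({
          blobs: ["z0_replication_failures_total"],
          doubles: [1],
        });
        msg.retry();
      }
    }
  },
};
```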
### Replication Health
| Metric | Healthy | Degraded | Critical |
|---|---|---|---|
| Lag | < 1s | < 10s | > 60s |
| Queue depth | < 1000 | < 10000 | > 100000 |
| Failure rate | 0% | < 0.1% | > 1% |
## Alerting Strategy
### Alert Hierarchy
```
Level 1: Metrics threshold breach
  → Auto-generated from metric rules
  → Goes to on-call

Level 2: Anomaly detection
  → Deviation from baseline
  → Goes to on-call + engineering lead

Level 3: Business impact
  → "No charges recorded for tenant X in 1 hour"
  → Goes to on-call + tenant success + engineering lead
```
### Critical Alerts (Page Immediately)
- Fact write failure rate > 1%
- Reconciliation mismatch rate > 1%
- Replication lag > 60s
- External tool (Twilio, OpenAI) error rate > 5%
- Zero invocations for active tenant > 15 minutes
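
The first threshold above can be evaluated from the Analytics Engine SQL API. A sketch, assuming a `fact_metrics` dataset whose `blob1` holds the write outcome label from the earlier instrumentation sketch; the dataset name, column mapping, and wherever this runs (for example a cron Worker) are assumptions:

```ts
async function factWriteFailureRate(accountId: string, apiToken: string): Promise<number> {
  const sql = `
    SELECT blob1 AS outcome, sum(_sample_interval) AS events
    FROM fact_metrics
    WHERE timestamp > NOW() - INTERVAL '5' MINUTE
    GROUP BY outcome`;

  const res = await fetch(
    `https://api.cloudflare.com/client/v4/accounts/${accountId}/analytics_engine/sql`,
    { method: "POST", headers: { Authorization: `Bearer ${apiToken}` }, body: sql },
  );
  const { data } = (await res.json()) as { data: { outcome: string; events: number }[] };

  const total = data.reduce((n, r) => n + r.events, 0);
  const failed = data.find((r) => r.outcome === "failed")?.events ?? 0;
  return total === 0 ? 0 : failed / total; // page when this exceeds 0.01
}
```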
### Warning Alerts (Slack, Review in Business Hours)
- Fact write failure rate > 0.1%
- Reconciliation mismatch rate > 0.1%
- Replication lag > 10s
- Cache miss rate > 50%
- Config version gaps detected
## Dashboards
### Operations Dashboard
Real-time view of system health:
```
┌────────────────────────────────────────────────────────────┐
│ z0 Operations                                    [Last 1h] │
├────────────────────────────────────────────────────────────┤
│ Facts/sec    Invocations   Outcomes   Charges   Errors     │
│ [====    ]   [======  ]    [====    ] [=====]   [      ]   │
│ 1,234        890           456        234       2          │
├────────────────────────────────────────────────────────────┤
│ Latency p50/p99      Cache Hit Rate      Replication       │
│ 12ms / 89ms          94.2%               Lag: 0.3s         │
├────────────────────────────────────────────────────────────┤
│ Active Tenants: 47    Active DOs: 2,341    Queue Depth: 89 │
└────────────────────────────────────────────────────────────┘
```
### Tenant Dashboard
Per-tenant view for debugging:
```
┌────────────────────────────────────────────────────────────┐
│ Tenant: Acme Agency                             [Last 24h] │
├────────────────────────────────────────────────────────────┤
│ Invocations: 12,345    Outcomes: 5,678    Revenue: $4,521  │
├────────────────────────────────────────────────────────────┤
│ Top Assets          │ Top Errors            │ Budget Usage │
│ +1-555-0001: 4,521  │ timeout: 12           │ [========  ] │
│ +1-555-0002: 3,210  │ qualification_fail: 89│ $8,000/$10k  │
│ +1-555-0003: 2,100  │ routing_no_buyer: 45  │              │
└────────────────────────────────────────────────────────────┘
```
## Implementation Checklist
Before shipping products on z0:
- Workers Trace Events enabled on all Workers
- Analytics Engine namespace configured
- Fact write metrics instrumented
- Cache read/write metrics instrumented
- Reconciliation metrics instrumented
- Replication metrics instrumented
- trace_id propagation implemented
- Critical alerts configured
- Operations dashboard deployed
- Runbook for each critical alert
## Summary
| Question | Tool |
|---|---|
| What happened? | Facts (D1 query) |
| How did it happen? | Traces (Workers Trace Events) |
| How often/fast? | Metrics (Analytics Engine) |
| Is something wrong? | Alerts (metric thresholds) |
| What’s the current state? | Dashboards (real-time) |
Facts alone cannot answer “why did this fail?” Traces alone cannot answer “what was the business impact?” You need both layers working together for production debugging.