
Observability Infrastructure

How we instrument, trace, and debug z0 in production.

Prerequisites: PRINCIPLES.md, PRIMITIVES.md


z0 has two layers of observability:

| Layer | What It Captures | Storage | Query Pattern |
| --- | --- | --- | --- |
| Business Events | Facts (invocation, outcome, charge) | Durable Object ledgers → D1 | “What happened economically?” |
| Execution Events | Traces, spans, metrics | Workers Trace Events → Analytics Engine | “How did execution perform?” |

Key Insight: Facts tell you what happened. Traces tell you how it happened. You need both to debug production issues.


Facts are the source of truth for business events. They answer:

  • What invocations occurred?
  • What outcomes resolved?
  • What charges, costs, payouts were recorded?

Already defined in Layer 0. Facts are stored in DO ledgers and replicated to D1 for reporting.

Limitation: Facts don’t capture execution details. When a charge Fact write fails, Facts alone can’t tell you why.

Traces capture the execution path of a request across services. They answer:

  • How did this request flow through the system?
  • Where did latency accumulate?
  • Which component failed and why?

Infrastructure: Cloudflare Workers Trace Events

```
Request → Worker → Durable Object → D1 → Response
   │         │            │          │       │
   └─ span ──┴─── span ───┴── span ──┴─ span ┘
   └──────────────── trace_id ───────────────┘
```

Trace ID Propagation:

  • Every incoming request gets a trace_id (from header or generated)
  • trace_id flows through all downstream calls
  • Facts may optionally include trace_id for correlation

Metrics capture aggregate measurements over time. They answer:

  • How many invocations per second?
  • What’s the p99 latency for routing decisions?
  • How many reconciliation failures this hour?

Infrastructure: Cloudflare Analytics Engine

  • High-cardinality support (tenant_id, asset_id, tool_id dimensions)
  • 90-day retention for operational queries
  • Sub-second query latency
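
As a sketch of how a metric write might look: the `blobs`/`doubles`/`indexes` shape matches the Workers Analytics Engine `writeDataPoint()` API, but the binding name (`METRICS`) and the ordering of dimensions here are illustrative assumptions:

```typescript
// Shape of an Analytics Engine data point (matches the writeDataPoint API).
interface AnalyticsDataPoint {
  indexes?: string[]; // sampling key (high-cardinality dimension, e.g. tenant_id)
  blobs?: string[];   // string dimensions
  doubles?: number[]; // numeric measurements
}

// Build a data point for a cached-state read (hit/miss + latency).
function cacheReadDataPoint(tenantId: string, hit: boolean, latencyMs: number): AnalyticsDataPoint {
  return {
    indexes: [tenantId],
    blobs: ["z0.do.cache.read_budget_state", hit ? "hit" : "miss"],
    doubles: [latencyMs],
  };
}

// In a Worker with an Analytics Engine binding (hypothetical name METRICS):
//   env.METRICS.writeDataPoint(cacheReadDataPoint(tenantId, true, 4.2));
```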

| Event Type | Instrumentation | Storage |
| --- | --- | --- |
| Fact writes | Fact + metric | DO ledger + Analytics Engine |
| Fact write failures | Trace span + metric + alert | Workers Trace + Analytics Engine |
| Config reads | Metric (count, latency) | Analytics Engine |
| Config updates | Fact (lifecycle) + metric | DO ledger + Analytics Engine |
| Cached state reads | Metric (hit/miss, latency) | Analytics Engine |
| Cached state reconciliation | Fact (if mismatch) + metric | DO ledger + Analytics Engine |
| External tool calls | Trace span + Fact (invocation) | Workers Trace + DO ledger |
| Routing decisions | Trace span + metric | Workers Trace + Analytics Engine |
| Errors | Trace span + metric + alert | Workers Trace + Analytics Engine |

Span names follow the convention:

```
z0.{layer}.{component}.{operation}
```

Examples:

```
z0.worker.router.evaluate_eligibility
z0.do.ledger.append_fact
z0.do.cache.read_budget_state
z0.d1.query.facts_by_tenant
z0.external.twilio.create_call
z0.external.openai.completion
```

Metric names follow the convention:

```
z0_{component}_{measurement}_{unit}
```

Examples:

```
z0_facts_written_total              // counter
z0_facts_write_duration_ms          // histogram
z0_cache_hits_total                 // counter
z0_cache_misses_total               // counter
z0_routing_decisions_total          // counter
z0_reconciliation_mismatches_total  // counter
```
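
A tiny helper (illustrative, not part of z0) can enforce the naming scheme and keep ad-hoc metric strings out of the codebase:

```typescript
// Units used in the examples above.
type MetricUnit = "total" | "ms";

// Normalize each segment and assemble z0_{component}_{measurement}_{unit}.
function metricName(component: string, measurement: string, unit: MetricUnit): string {
  const norm = (s: string) => s.toLowerCase().replace(/[^a-z0-9]+/g, "_");
  return `z0_${norm(component)}_${norm(measurement)}_${unit}`;
}
```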

Facts can optionally include trace_id for correlation:

```
Fact {
  id: "fact_123",
  type: "invocation",
  ...
  data: {
    trace_id: "abc-123-def",  // Optional: links to execution trace
    ...
  }
}
```

When to include trace_id:

  • Always for invocation Facts (links to tool call trace)
  • Always for error/failure scenarios
  • Optional for outcome Facts (outcome may be async)
  • Never required for charge/cost/payout (derived from outcomes)
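
Applying the rules above when constructing an invocation Fact might look like this. Field names beyond `id`, `type`, and `data.trace_id` (the ones shown in this document) are assumptions:

```typescript
// Minimal Fact shape for this sketch.
interface Fact {
  id: string;
  type: string;
  data: Record<string, unknown>;
}

// Invocation Facts always carry trace_id so the tool-call trace is reachable.
function invocationFact(id: string, traceId: string, payload: Record<string, unknown>): Fact {
  return { id, type: "invocation", data: { trace_id: traceId, ...payload } };
}
```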

Example: tracing a spike in charge write failures from alert to root cause:

```
1. Alert fires: "charge write failures elevated"
2. Check metrics:
   → z0_facts_write_failures_total by tenant_id, fact_type
   → Identify affected tenants
3. Query failed Facts:
   → SELECT * FROM facts WHERE type = 'charge' AND status = 'failed'
   → Get trace_ids from data field
4. Pull traces:
   → Query Workers Trace Events by trace_id
   → See full execution path, identify failure point
5. Root cause:
   → Trace shows: DO.append_fact → timeout after 30s
   → DO was overloaded, need to shard
```
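
The glue between steps 3 and 4 can be sketched as a pure function: given rows returned by the SELECT (a `data` column holding each Fact's JSON payload is an assumption), collect the trace_ids to look up in Workers Trace Events:

```typescript
// Pull trace_ids out of failed-Fact rows; Facts without one are skipped.
function extractTraceIds(rows: Array<{ data: string }>): string[] {
  return rows
    .map((row) => (JSON.parse(row.data) as { trace_id?: string }).trace_id)
    .filter((id): id is string => typeof id === "string");
}
```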

When cached state diverges from Facts, we need to know.

When reconciliation finds a mismatch, record it:

```
Fact {
  type: "reconciliation",
  subtype: "mismatch_detected",
  entity_id: "account_123",
  timestamp: T,
  data: {
    cache_type: "BudgetState",
    cached_value: { remaining: 500 },
    calculated_value: { remaining: 450 },
    delta: { remaining: -50 },
    resolution: "cache_updated"  // or "alert_raised", "manual_review"
  }
}
```
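
The mismatch check that produces this Fact can be sketched as a field-by-field comparison of cached versus recalculated state (the numeric-state shape is an assumption; z0's actual cache types may be richer):

```typescript
// State compared during reconciliation, e.g. { remaining: 500 }.
type NumericState = Record<string, number>;

// Compare cached and recalculated values; report any nonzero delta.
function reconcile(cached: NumericState, calculated: NumericState): { mismatch: boolean; delta: NumericState } {
  const delta: NumericState = {};
  for (const key of Object.keys({ ...cached, ...calculated })) {
    const diff = (calculated[key] ?? 0) - (cached[key] ?? 0);
    if (diff !== 0) delta[key] = diff;
  }
  return { mismatch: Object.keys(delta).length > 0, delta };
}
```

With the example above, `reconcile({ remaining: 500 }, { remaining: 450 })` yields the delta `{ remaining: -50 }` recorded in the Fact.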

Reconciliation metrics:

```
z0_reconciliation_runs_total
z0_reconciliation_mismatches_total{cache_type, resolution}
z0_reconciliation_duration_ms
z0_reconciliation_facts_scanned_total
```

| Metric | Warning | Critical |
| --- | --- | --- |
| Mismatch rate | > 0.1% | > 1% |
| Reconciliation duration | > 60s | > 300s |
| Reconciliation failures | > 0 | > 10/hour |

Facts flow: DO Ledger → Queue → D1

```
z0_replication_lag_ms          // Time from DO write to D1 availability
z0_replication_queue_depth     // Pending Facts in queue
z0_replication_failures_total  // Failed D1 writes
z0_replication_retries_total   // Retried writes
```

| Metric | Healthy | Degraded | Critical |
| --- | --- | --- | --- |
| Lag | < 1s | < 10s | > 60s |
| Queue depth | < 1,000 | < 10,000 | > 100,000 |
| Failure rate | 0% | < 0.1% | > 1% |
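
Classifying lag against these thresholds is straightforward; note the table leaves the 10s–60s band unspecified, so treating it as degraded is an assumption in this sketch:

```typescript
// Map a measured replication lag (ms) onto the health bands above.
function replicationLagStatus(lagMs: number): "healthy" | "degraded" | "critical" {
  if (lagMs < 1_000) return "healthy";    // < 1s
  if (lagMs > 60_000) return "critical";  // > 60s
  return "degraded";                      // < 10s per the table; 10–60s assumed degraded
}
```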

Level 1: Metrics threshold breach
→ Auto-generated from metric rules
→ Goes to on-call
Level 2: Anomaly detection
→ Deviation from baseline
→ Goes to on-call + engineering lead
Level 3: Business impact
→ "No charges recorded for tenant X in 1 hour"
→ Goes to on-call + tenant success + engineering lead

Critical Alerts (Page On-Call)

  • Fact write failure rate > 1%
  • Reconciliation mismatch rate > 1%
  • Replication lag > 60s
  • External tool (Twilio, OpenAI) error rate > 5%
  • Zero invocations for active tenant > 15 minutes
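
Evaluating the critical alert rules above against a metrics snapshot can be sketched as follows; the field and alert names are illustrative assumptions:

```typescript
// Snapshot of the metrics the critical rules consume. Rates are fractions
// (0.01 = 1%); lag is in milliseconds.
interface MetricsSnapshot {
  factWriteFailureRate: number;
  reconciliationMismatchRate: number;
  replicationLagMs: number;
  externalToolErrorRate: number;
  minutesSinceLastInvocation: number; // for an active tenant
}

// Return the names of every critical rule the snapshot breaches.
function criticalAlerts(m: MetricsSnapshot): string[] {
  const alerts: string[] = [];
  if (m.factWriteFailureRate > 0.01) alerts.push("fact_write_failures");
  if (m.reconciliationMismatchRate > 0.01) alerts.push("reconciliation_mismatches");
  if (m.replicationLagMs > 60_000) alerts.push("replication_lag");
  if (m.externalToolErrorRate > 0.05) alerts.push("external_tool_errors");
  if (m.minutesSinceLastInvocation > 15) alerts.push("tenant_silent");
  return alerts;
}
```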

Warning Alerts (Slack, Review in Business Hours)

  • Fact write failure rate > 0.1%
  • Reconciliation mismatch rate > 0.1%
  • Replication lag > 10s
  • Cache miss rate > 50%
  • Config version gaps detected

Real-time view of system health:

```
┌─────────────────────────────────────────────────────────────┐
│ z0 Operations                                     [Last 1h] │
├─────────────────────────────────────────────────────────────┤
│ Facts/sec   Invocations   Outcomes   Charges   Errors       │
│ [====    ]  [======   ]   [====  ]   [=====]   [     ]      │
│ 1,234       890           456        234       2            │
├─────────────────────────────────────────────────────────────┤
│ Latency p50/p99     Cache Hit Rate     Replication          │
│ 12ms / 89ms         94.2%              Lag: 0.3s            │
├─────────────────────────────────────────────────────────────┤
│ Active Tenants: 47    Active DOs: 2,341    Queue Depth: 89  │
└─────────────────────────────────────────────────────────────┘
```

Per-tenant view for debugging:

```
┌─────────────────────────────────────────────────────────────┐
│ Tenant: Acme Agency                              [Last 24h] │
├─────────────────────────────────────────────────────────────┤
│ Invocations: 12,345    Outcomes: 5,678    Revenue: $4,521   │
├─────────────────────────────────────────────────────────────┤
│ Top Assets         │ Top Errors              │ Budget Usage │
│ +1-555-0001: 4,521 │ timeout: 12             │ [========  ] │
│ +1-555-0002: 3,210 │ qualification_fail: 89  │ $8,000/$10k  │
│ +1-555-0003: 2,100 │ routing_no_buyer: 45    │              │
└─────────────────────────────────────────────────────────────┘
```

Before shipping products on z0:

  • Workers Trace Events enabled on all Workers
  • Analytics Engine namespace configured
  • Fact write metrics instrumented
  • Cache read/write metrics instrumented
  • Reconciliation metrics instrumented
  • Replication metrics instrumented
  • trace_id propagation implemented
  • Critical alerts configured
  • Operations dashboard deployed
  • Runbook for each critical alert

| Question | Tool |
| --- | --- |
| What happened? | Facts (D1 query) |
| How did it happen? | Traces (Workers Trace Events) |
| How often/fast? | Metrics (Analytics Engine) |
| Is something wrong? | Alerts (metric thresholds) |
| What’s the current state? | Dashboards (real-time) |

Facts alone cannot answer “why did this fail?” Traces alone cannot answer “what was the business impact?” You need both layers working together for production debugging.