# Error Handling & Retry Patterns
Graceful failure, automatic recovery, and data integrity for the z0 platform.
Prerequisites: PRINCIPLES.md (Principle 2: Facts Are Immutable, Principle 8: Errors Are First-Class), PRIMITIVES.md, ERROR-HANDLING.md
## Overview
Error handling in z0 is not exceptional—it’s expected. Every external call can fail. Every network request can time out. Every rate limit can be exceeded. The question is not “will this fail?” but “how will this fail safely?”
| Principle | Error Handling Implication |
|---|---|
| Principle 2: Facts Are Immutable | Retries must be idempotent; can’t “undo” written Facts |
| Principle 8: Errors Are First-Class | Failures are Facts, tracked for economics and debugging |
| Principle 1: Economics Must Close | Failed operations with costs still need tracking |
| Principle 6: Configs Are Versioned | Retries must use original config_version |
Key Insight: The z0 platform distinguishes between retryable errors (transient failures) and terminal errors (won’t succeed on retry). Retrying terminal errors wastes time and money. Not retrying transient errors loses revenue.
## Error Taxonomy
### Categories
Section titled “Categories”| Category | Retry? | Examples | HTTP Status |
|---|---|---|---|
| Transient | Yes | Network timeout, 503, rate limited | 429, 503, 504 |
| Client Error | No | 400, 404, validation failure | 400, 404, 422 |
| Server Error | Maybe | 500 (depends on idempotency) | 500, 502 |
| External Failure | Yes (with backoff) | Twilio down, CRM timeout | Varies |
| Data Integrity | No | Duplicate ID, constraint violation | 409, 422 |
| Authorization | No | Insufficient permissions, expired token | 401, 403 |
| Budget | No | Budget exhausted (Principle 10) | 402 |
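
Later examples in this doc call a `determineErrorType` helper when recording these categories on error Facts, but never define it. A minimal sketch of one possible mapping is below; the `status`/`code` error shape and the exact category strings are assumptions, not a prescribed API:

```typescript
// Hypothetical sketch: map an error to its taxonomy category.
type ErrorCategory =
  | 'transient' | 'client' | 'server'
  | 'external' | 'data_integrity' | 'authorization' | 'budget';

function determineErrorType(error: { status?: number; code?: string }): ErrorCategory {
  // Network-level failures have no HTTP status
  if (error.code === 'ETIMEDOUT' || error.code === 'ECONNRESET') return 'transient';

  switch (error.status) {
    case 429: case 503: case 504: return 'transient';
    case 401: case 403:           return 'authorization';
    case 402:                     return 'budget';
    case 409:                     return 'data_integrity';
    case 500: case 502:           return 'server';
    default:
      if (error.status !== undefined && error.status >= 400 && error.status < 500) {
        return 'client';
      }
      return 'external'; // Everything else: treat as an external failure
  }
}
```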
### Decision Matrix
```typescript
function isRetryable(error: APIError): boolean {
  // Transient errors: retry
  if ([429, 503, 504].includes(error.status)) {
    return true;
  }

  // Server errors: retry if idempotent
  if ([500, 502].includes(error.status)) {
    return true; // Assuming idempotency key used
  }

  // Client errors: don't retry
  if (error.status >= 400 && error.status < 500) {
    return false;
  }

  // Network errors: retry
  if (error.code === 'ETIMEDOUT' || error.code === 'ECONNRESET') {
    return true;
  }

  return false;
}
```
## Fact Recording on Error

Per Principle 8, errors are recorded as Facts:
```typescript
interface ErrorFact {
  type: 'error';
  subtype: 'tool_failure' | 'timeout' | 'validation' | 'external' | 'rate_limit';
  timestamp: number;

  tenant_id: string;
  entity_id: string;
  tool_id?: string;

  data: {
    error_code: string;
    error_message: string;
    error_type: string; // From taxonomy
    retry_attempt: number;
    max_retries: number;
    resolved: boolean;
    resolution_fact_id?: string;
    trace_id?: string;

    // Economics (Principle 1: track costs even on failure)
    cost_incurred?: number;
    currency?: string;
  };
}
```

Example:
```typescript
await appendFact({
  type: 'error',
  subtype: 'external',
  timestamp: Date.now(),
  tenant_id: 'ten_abc123',
  entity_id: 'asset_xyz789',
  tool_id: 'tool_twilio_voice',

  data: {
    error_code: 'ETIMEDOUT',
    error_message: 'Twilio API timeout after 30s',
    error_type: 'transient',
    retry_attempt: 2,
    max_retries: 5,
    resolved: false,
    cost_incurred: 0.005, // Twilio charged us anyway
    currency: 'USD',
    trace_id: 'trace_abc123'
  }
});
```
## Idempotency

### The Problem
Facts are immutable (Principle 2). Retrying a Fact append without idempotency creates duplicates:
```
T1: Client sends Fact → timeout before response
T2: Client retries → Fact appended again
Result: Duplicate Facts, incorrect economics
```
### Idempotency Keys

Every write operation must include an idempotency key:
```typescript
interface IdempotencyKey {
  key: string;         // Unique per request
  expires_at: number;  // TTL for storage
  created_at: number;
  result?: any;        // Cached response
  status: 'pending' | 'completed' | 'failed';
}
```
### Storage Mechanism

Critical: Idempotency keys MUST be stored in SQLite, not just in memory. In-memory Maps are lost on DO eviction.
```sql
-- Add to DO schema initialization
CREATE TABLE IF NOT EXISTS idempotency_keys (
  key TEXT PRIMARY KEY,
  status TEXT NOT NULL CHECK(status IN ('pending', 'completed', 'failed')),
  result TEXT, -- JSON blob of cached response
  created_at INTEGER NOT NULL,
  expires_at INTEGER NOT NULL
);

CREATE INDEX IF NOT EXISTS idx_idempotency_expiry
  ON idempotency_keys(expires_at);
```

### Implementation Pattern
Section titled “Implementation Pattern”class DurableObjectWithIdempotency { // In-memory cache for hot path (backed by SQLite) private idempotencyHotCache: Map<string, IdempotencyKey> = new Map();
async appendFact(fact: Fact, idempotencyKey: string): Promise<FactResult> { // 1. Check hot cache first (fast path) let existing = this.idempotencyHotCache.get(idempotencyKey);
// 2. Fall back to SQLite (persistent storage) if (!existing) { const row = await this.sql.exec( `SELECT key, status, result, created_at, expires_at FROM idempotency_keys WHERE key = ?`, [idempotencyKey] ).first();
if (row) { existing = { key: row.key, status: row.status, result: row.result ? JSON.parse(row.result) : undefined, created_at: row.created_at, expires_at: row.expires_at }; // Populate hot cache this.idempotencyHotCache.set(idempotencyKey, existing); } }
if (existing) { if (existing.status === 'pending') { // Request in progress - return 409 Conflict throw new IdempotencyConflictError(idempotencyKey); } // Return cached result (handles retry of completed request) return existing.result; }
// 3. Mark as pending (persist to SQLite) const now = Date.now(); const expiresAt = now + 86400000; // 24 hours
await this.sql.exec( `INSERT INTO idempotency_keys (key, status, created_at, expires_at) VALUES (?, 'pending', ?, ?)`, [idempotencyKey, now, expiresAt] );
this.idempotencyHotCache.set(idempotencyKey, { key: idempotencyKey, status: 'pending', created_at: now, expires_at: expiresAt });
try { // 4. Append fact (immutable) const result = await this.doAppendFact(fact);
// 5. Record success (persist to SQLite) await this.sql.exec( `UPDATE idempotency_keys SET status = 'completed', result = ? WHERE key = ?`, [JSON.stringify(result), idempotencyKey] );
this.idempotencyHotCache.set(idempotencyKey, { key: idempotencyKey, status: 'completed', created_at: now, expires_at: expiresAt, result });
return result;
} catch (error) { // 6. Handle failure if (!isRetryable(error)) { // Terminal error - cache to prevent retries await this.sql.exec( `UPDATE idempotency_keys SET status = 'failed', result = ? WHERE key = ?`, [JSON.stringify({ error: error.message }), idempotencyKey] );
this.idempotencyHotCache.set(idempotencyKey, { key: idempotencyKey, status: 'failed', created_at: now, expires_at: expiresAt, result: { error: error.message } }); } else { // Retryable error - remove pending status to allow retry await this.sql.exec( `DELETE FROM idempotency_keys WHERE key = ?`, [idempotencyKey] ); this.idempotencyHotCache.delete(idempotencyKey); } throw error; } }
// Cleanup expired keys (called by reconciliation alarm) async cleanupIdempotencyCache(): Promise<void> { const now = Date.now();
// Delete from SQLite await this.sql.exec( `DELETE FROM idempotency_keys WHERE expires_at < ?`, [now] );
// Clear expired from hot cache for (const [key, value] of this.idempotencyHotCache.entries()) { if (value.expires_at < now) { this.idempotencyHotCache.delete(key); } } }}Client-Side Usage
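
The examples above and below throw `IdempotencyConflictError` (and later `CircuitOpenError` and `MaxRetriesExceededError`) without defining them. Minimal sketches, assuming plain `Error` subclasses with no platform-specific base:

```typescript
// Hypothetical sketches of the custom error classes used in this doc.
class IdempotencyConflictError extends Error {
  constructor(public readonly idempotencyKey: string) {
    super(`Request with idempotency key ${idempotencyKey} is already in progress`);
    this.name = 'IdempotencyConflictError';
  }
}

class CircuitOpenError extends Error {
  constructor(message: string) {
    super(message);
    this.name = 'CircuitOpenError';
  }
}

class MaxRetriesExceededError extends Error {
  constructor(message: string, public readonly lastError?: Error) {
    super(message);
    this.name = 'MaxRetriesExceededError';
  }
}
```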
### Client-Side Usage

```typescript
async function createFactWithIdempotency(factData: Fact): Promise<FactResult> {
  // Generate idempotency key (deterministic per logical operation)
  const idempotencyKey = `fact_${factData.type}_${factData.timestamp}_${factData.entity_id}`;

  const response = await fetch('/v1/facts', {
    method: 'POST',
    headers: {
      'X-API-Key': apiKey,
      'Idempotency-Key': idempotencyKey,
      'Content-Type': 'application/json'
    },
    body: JSON.stringify(factData)
  });

  if (response.ok) {
    return await response.json();
  }

  const error = await response.json();

  // If conflict, request in progress - wait and retry with SAME key
  if (response.status === 409 && error.type.includes('idempotency-conflict')) {
    await sleep(2000);
    return createFactWithIdempotency(factData); // Retry with same key
  }

  throw new APIError(error);
}
```
### Idempotency Key Generation

| Operation | Key Format | Example |
|---|---|---|
| Fact append | `fact_{type}_{timestamp}_{entity_id}` | `fact_invocation_1705500000000_ent_abc123` |
| Entity create | `entity_{type}_{identifier}_{timestamp}` | `entity_asset_+15550001_1705500000000` |
| Config update | `config_{id}_{version}_{timestamp}` | `config_cfg_123_5_1705500000000` |
| External API | `external_{tool}_{operation}_{nonce}` | `external_twilio_call_abc123xyz` |
Rules (see the sketch below):
- Deterministic: Same logical operation always generates same key
- Unique: Different operations never collide
- Bounded: Include timestamp to enable expiration
- Human-readable: Debug-friendly format
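
A minimal generator following these rules might look like the sketch below; the function names and input shapes are illustrative, not a prescribed API:

```typescript
// Hypothetical sketch: build keys per the formats in the table above.
function factIdempotencyKey(fact: { type: string; timestamp: number; entity_id: string }): string {
  return `fact_${fact.type}_${fact.timestamp}_${fact.entity_id}`;
}

function externalIdempotencyKey(tool: string, operation: string, nonce: string): string {
  return `external_${tool}_${operation}_${nonce}`;
}

// Deterministic: the same logical operation always yields the same key
factIdempotencyKey({ type: 'invocation', timestamp: 1705500000000, entity_id: 'ent_abc123' });
// => 'fact_invocation_1705500000000_ent_abc123'
```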
## Retry Strategies
### Exponential Backoff with Jitter
```typescript
interface RetryConfig {
  baseDelayMs: number;        // Starting delay (e.g., 1000)
  maxDelayMs: number;         // Ceiling (e.g., 60000)
  maxAttempts: number;        // Give up after N attempts
  jitterRatio: number;        // Random variance (0-1)
  backoffMultiplier: number;  // Growth rate (typically 2)
}

function calculateBackoff(
  attempt: number,
  config: RetryConfig
): number {
  // Exponential: delay = base * multiplier^attempt
  const exponential = config.baseDelayMs * Math.pow(
    config.backoffMultiplier,
    attempt
  );

  // Cap at max delay
  const capped = Math.min(exponential, config.maxDelayMs);

  // Add jitter: random variance to prevent thundering herd
  const jitter = capped * config.jitterRatio * Math.random();

  return capped + jitter;
}
```

Example progression:
```
Attempt 1: 1000ms + jitter (0-500ms)   = 1000-1500ms
Attempt 2: 2000ms + jitter (0-1000ms)  = 2000-3000ms
Attempt 3: 4000ms + jitter (0-2000ms)  = 4000-6000ms
Attempt 4: 8000ms + jitter (0-4000ms)  = 8000-12000ms
Attempt 5: 16000ms + jitter (0-8000ms) = 16000-24000ms
```
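
The client-side example above and the retry wrapper below both call a `sleep` helper that is never defined in this doc; the standard promise-based one-liner is assumed:

```typescript
// Assumed helper: promise-based delay used by the retry examples.
function sleep(ms: number): Promise<void> {
  return new Promise((resolve) => setTimeout(resolve, ms));
}
```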
### Retry Wrapper

```typescript
async function withRetry<T>(
  operation: () => Promise<T>,
  config: RetryConfig,
  context: {
    operationName: string;
    entityId?: string;
    tenantId?: string;
  }
): Promise<T> {
  let lastError: Error;

  for (let attempt = 0; attempt < config.maxAttempts; attempt++) {
    try {
      // Attempt operation
      const result = await operation();

      // Success - record if this was a retry
      if (attempt > 0) {
        await recordRetrySuccess(context, attempt);
      }

      return result;

    } catch (error) {
      lastError = error;

      // Check if retryable
      if (!isRetryable(error)) {
        await recordTerminalError(context, error, attempt);
        throw error;
      }

      // Check if final attempt
      if (attempt === config.maxAttempts - 1) {
        await recordMaxRetriesExceeded(context, error, attempt);
        throw new MaxRetriesExceededError(
          `${context.operationName} failed after ${config.maxAttempts} attempts`,
          lastError
        );
      }

      // Calculate backoff
      const delayMs = calculateBackoff(attempt, config);

      // Record retry
      await recordRetryAttempt(context, error, attempt, delayMs);

      // Wait before retry
      await sleep(delayMs);
    }
  }

  throw lastError;
}
```
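
The `record*` helpers used above (`recordRetrySuccess`, `recordTerminalError`, `recordMaxRetriesExceeded`, `recordRetryAttempt`) are not defined in this doc. Per Principle 8 they would append error Facts; a sketch of one, with fields beyond the ErrorFact interface flagged as assumptions:

```typescript
// Hypothetical sketch: record a single retry attempt as an error Fact.
async function recordRetryAttempt(
  context: { operationName: string; entityId?: string; tenantId?: string },
  error: Error & { code?: string },
  attempt: number,
  delayMs: number
): Promise<void> {
  await appendFact({
    type: 'error',
    subtype: 'tool_failure',
    timestamp: Date.now(),
    tenant_id: context.tenantId,
    entity_id: context.entityId,
    data: {
      error_code: error.code || 'UNKNOWN',
      error_message: error.message,
      error_type: 'transient', // Retryable by definition if we got here
      retry_attempt: attempt,
      resolved: false,
      operation: context.operationName,  // assumed field
      next_retry_delay_ms: delayMs       // assumed field
    }
  });
}
```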
### Retry Policies by Operation

| Operation | Base Delay | Max Delay | Max Attempts | Rationale |
|---|---|---|---|---|
| Fact append | 1000ms | 60000ms | 3 | Fast fail, Facts critical |
| Entity create | 1000ms | 60000ms | 3 | Low retry, avoid duplicates |
| Config update | 1000ms | 30000ms | 3 | Fast fail, conflicts likely |
| External API (Twilio) | 2000ms | 120000ms | 5 | Higher tolerance, expensive to lose |
| Webhook delivery | 5000ms | 300000ms | 5 | Eventual consistency OK |
| CRM sync | 10000ms | 600000ms | 10 | Very eventual, high value |
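
One way to encode this table is a lookup keyed by operation name; the keys below are illustrative (the doc does not prescribe them):

```typescript
// Illustrative encoding of the policy table; operation keys are assumptions.
const RETRY_POLICIES: Record<string, RetryConfig> = {
  fact_append:      { baseDelayMs: 1000,  maxDelayMs: 60000,  maxAttempts: 3,  jitterRatio: 0.5, backoffMultiplier: 2 },
  entity_create:    { baseDelayMs: 1000,  maxDelayMs: 60000,  maxAttempts: 3,  jitterRatio: 0.5, backoffMultiplier: 2 },
  config_update:    { baseDelayMs: 1000,  maxDelayMs: 30000,  maxAttempts: 3,  jitterRatio: 0.5, backoffMultiplier: 2 },
  external_twilio:  { baseDelayMs: 2000,  maxDelayMs: 120000, maxAttempts: 5,  jitterRatio: 0.5, backoffMultiplier: 2 },
  webhook_delivery: { baseDelayMs: 5000,  maxDelayMs: 300000, maxAttempts: 5,  jitterRatio: 0.5, backoffMultiplier: 2 },
  crm_sync:         { baseDelayMs: 10000, maxDelayMs: 600000, maxAttempts: 10, jitterRatio: 0.5, backoffMultiplier: 2 }
};
```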
### Context-Aware Retry
Different contexts need different retry behavior:
```typescript
function getRetryConfig(context: OperationContext): RetryConfig {
  // Real-time operations: fail fast
  if (context.priority === 'realtime') {
    return {
      baseDelayMs: 500,
      maxDelayMs: 5000,
      maxAttempts: 2,
      jitterRatio: 0.5,
      backoffMultiplier: 2
    };
  }

  // Background operations: tolerate more delay
  if (context.priority === 'background') {
    return {
      baseDelayMs: 5000,
      maxDelayMs: 300000,
      maxAttempts: 10,
      jitterRatio: 0.5,
      backoffMultiplier: 2
    };
  }

  // Default: balanced
  return {
    baseDelayMs: 1000,
    maxDelayMs: 60000,
    maxAttempts: 5,
    jitterRatio: 0.5,
    backoffMultiplier: 2
  };
}
```
## Circuit Breaker

Circuit breakers prevent cascading failures when external services are down.
### States
```
CLOSED → Normal operation, requests pass through
  │
  ├─ Too many failures → OPEN
  │
OPEN → Fast-fail all requests, don't call service
  │
  ├─ Timeout elapsed → HALF-OPEN
  │
HALF-OPEN → Allow one test request
  │
  ├─ Success → CLOSED
  └─ Failure → OPEN
```
### Implementation

```typescript
interface CircuitBreakerConfig {
  failureThreshold: number;  // Open after N failures (e.g., 5)
  successThreshold: number;  // Close after N successes (e.g., 2)
  timeout: number;           // Stay open for N ms (e.g., 60000)
  volumeThreshold: number;   // Min requests before opening (e.g., 10)
}

class CircuitBreaker {
  private state: 'CLOSED' | 'OPEN' | 'HALF-OPEN' = 'CLOSED';
  private failureCount = 0;
  private successCount = 0;
  private lastFailureTime = 0;
  private requestCount = 0;

  constructor(
    private name: string,
    private config: CircuitBreakerConfig
  ) {}

  async execute<T>(operation: () => Promise<T>): Promise<T> {
    // Check circuit state
    if (this.state === 'OPEN') {
      // Check if timeout elapsed
      if (Date.now() - this.lastFailureTime > this.config.timeout) {
        this.state = 'HALF-OPEN';
        this.successCount = 0;
      } else {
        throw new CircuitOpenError(
          `Circuit breaker ${this.name} is OPEN`
        );
      }
    }

    try {
      // Execute operation
      const result = await operation();

      // Record success
      this.onSuccess();

      return result;

    } catch (error) {
      // Record failure
      this.onFailure();

      throw error;
    }
  }

  private onSuccess(): void {
    this.requestCount++;

    if (this.state === 'HALF-OPEN') {
      this.successCount++;

      if (this.successCount >= this.config.successThreshold) {
        // Enough successes - close circuit
        const previous = this.state;
        this.state = 'CLOSED';
        this.failureCount = 0;
        this.successCount = 0;
        void this.recordStateChange(previous, 'CLOSED', 'threshold_met');
      }
    } else if (this.state === 'CLOSED') {
      // Reset failure count on success
      this.failureCount = 0;
    }
  }

  private onFailure(): void {
    this.requestCount++;
    this.failureCount++;
    this.lastFailureTime = Date.now();

    if (this.state === 'HALF-OPEN') {
      // Test request failed - reopen
      const previous = this.state;
      this.state = 'OPEN';
      this.successCount = 0;
      void this.recordStateChange(previous, 'OPEN', 'half_open_failed');

    } else if (this.state === 'CLOSED') {
      // Check if should open
      if (
        this.requestCount >= this.config.volumeThreshold &&
        this.failureCount >= this.config.failureThreshold
      ) {
        const previous = this.state;
        this.state = 'OPEN';
        void this.recordStateChange(previous, 'OPEN', 'threshold_exceeded');
      }
    }
  }

  private async recordStateChange(
    previousState: string,
    newState: string,
    reason: string
  ): Promise<void> {
    // Record circuit breaker state change as Fact
    // (previous state is captured before mutation so the Fact is accurate)
    await appendFact({
      type: 'lifecycle',
      subtype: 'circuit_breaker_state_changed',
      timestamp: Date.now(),
      data: {
        circuit_name: this.name,
        previous_state: previousState,
        new_state: newState,
        reason,
        failure_count: this.failureCount,
        success_count: this.successCount,
        request_count: this.requestCount
      }
    });
  }

  getState(): string {
    return this.state;
  }

  getMetrics() {
    return {
      state: this.state,
      failureCount: this.failureCount,
      successCount: this.successCount,
      requestCount: this.requestCount,
      lastFailureTime: this.lastFailureTime
    };
  }
}
```
### Circuit Breaker Registry

```typescript
class CircuitBreakerRegistry {
  private breakers: Map<string, CircuitBreaker> = new Map();

  get(name: string, config?: CircuitBreakerConfig): CircuitBreaker {
    if (!this.breakers.has(name)) {
      const defaultConfig: CircuitBreakerConfig = {
        failureThreshold: 5,
        successThreshold: 2,
        timeout: 60000,
        volumeThreshold: 10
      };

      this.breakers.set(
        name,
        new CircuitBreaker(name, config || defaultConfig)
      );
    }

    return this.breakers.get(name)!;
  }

  getMetrics(): Record<string, any> {
    const metrics: Record<string, any> = {};
    for (const [name, breaker] of this.breakers.entries()) {
      metrics[name] = breaker.getMetrics();
    }
    return metrics;
  }
}

// Global registry
const circuitBreakers = new CircuitBreakerRegistry();
```
### Usage with External APIs

```typescript
async function callTwilioAPI(params: TwilioCallParams): Promise<CallResult> {
  const breaker = circuitBreakers.get('twilio_voice', {
    failureThreshold: 5,
    successThreshold: 2,
    timeout: 60000,
    volumeThreshold: 10
  });

  try {
    return await breaker.execute(async () => {
      // Wrapped with retry logic
      return await withRetry(
        () => twilioClient.calls.create(params),
        {
          baseDelayMs: 2000,
          maxDelayMs: 120000,
          maxAttempts: 5,
          jitterRatio: 0.5,
          backoffMultiplier: 2
        },
        {
          operationName: 'twilio_call_create',
          entityId: params.assetId,
          tenantId: params.tenantId
        }
      );
    });

  } catch (error) {
    if (error instanceof CircuitOpenError) {
      // Circuit breaker is open - record as error Fact
      await appendFact({
        type: 'error',
        subtype: 'external',
        timestamp: Date.now(),
        tool_id: 'tool_twilio_voice',
        data: {
          error_code: 'CIRCUIT_OPEN',
          error_message: 'Twilio circuit breaker is open',
          error_type: 'circuit_breaker',
          circuit_state: 'OPEN',
          retry_attempt: 0,
          max_retries: 0,
          resolved: false
        }
      });

      // Use fallback behavior
      return handleTwilioUnavailable(params);
    }

    throw error;
  }
}
```
## Dead Letter Queue

When all retries fail, messages go to the dead letter queue for manual intervention.
### DLQ Pattern
```typescript
interface DeadLetterMessage {
  id: string;
  original_queue: string;
  message: any;
  error: {
    code: string;
    message: string;
    stack?: string;
  };
  attempts: number;
  first_attempt_at: number;
  last_attempt_at: number;
  sent_to_dlq_at: number;
  metadata: {
    tenant_id?: string;
    entity_id?: string;
    trace_id?: string;
  };
}

async function handleMaxRetriesExceeded(
  message: QueueMessage,
  error: Error,
  attempts: number
): Promise<void> {
  // 1. Record failure as Fact (Principle 8: Errors Are First-Class)
  await appendFact({
    type: 'error',
    subtype: 'max_retries_exceeded',
    timestamp: Date.now(),
    tenant_id: message.tenantId,
    entity_id: message.entityId,
    data: {
      error_code: error.code || 'UNKNOWN',
      error_message: error.message,
      error_type: 'terminal',
      retry_attempt: attempts,
      max_retries: attempts,
      resolved: false,
      queue_name: message.queue,
      message_id: message.id
    }
  });

  // 2. Send to dead letter queue
  const dlqMessage: DeadLetterMessage = {
    id: `dlq_${message.id}`,
    original_queue: message.queue,
    message: message.body,
    error: {
      code: error.code || 'UNKNOWN',
      message: error.message,
      stack: error.stack
    },
    attempts,
    first_attempt_at: message.timestamp,
    last_attempt_at: Date.now(),
    sent_to_dlq_at: Date.now(),
    metadata: {
      tenant_id: message.tenantId,
      entity_id: message.entityId,
      trace_id: message.traceId
    }
  };

  await env.DLQ.send(dlqMessage);

  // 3. Alert operations team
  await alertOps({
    severity: 'warning',
    title: 'Message sent to DLQ',
    message: `Queue ${message.queue} message ${message.id} failed after ${attempts} attempts`,
    metadata: dlqMessage.metadata
  });
}
```
### DLQ Processing Worker

```typescript
export default {
  async queue(batch: MessageBatch<DeadLetterMessage>, env: Env): Promise<void> {
    for (const message of batch.messages) {
      try {
        // Analyze failure pattern
        const pattern = await analyzeFailurePattern(message);

        if (pattern.type === 'transient_resolved') {
          // Service recovered - retry (pass the DLQ payload, not the queue envelope)
          await retryFromDLQ(message.body, env);
        } else if (pattern.type === 'configuration_error') {
          // Alert for manual fix
          await alertForManualIntervention(message, pattern);
        } else if (pattern.type === 'data_corruption') {
          // Archive for analysis
          await archiveCorruptedMessage(message);
        }

        message.ack();

      } catch (error) {
        // DLQ processing failed - log and continue
        console.error('DLQ processing error:', error);
        message.retry();
      }
    }
  }
};
```
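
`analyzeFailurePattern` is referenced but never defined. A naive sketch that classifies from the recorded error code; the codes checked here are illustrative, and a real implementation would likely also probe the downstream service:

```typescript
// Hypothetical sketch: classify a DLQ message for triage.
interface FailurePattern {
  type: 'transient_resolved' | 'configuration_error' | 'data_corruption';
}

async function analyzeFailurePattern(
  message: { body: DeadLetterMessage }
): Promise<FailurePattern> {
  const { error } = message.body;

  // Codes suggesting a config/validation fix is needed before any retry
  if (['VALIDATION_FAILED', 'CONFIG_INVALID'].includes(error.code)) {
    return { type: 'configuration_error' };
  }
  // Codes suggesting the payload itself is bad
  if (['PARSE_ERROR', 'CONSTRAINT_VIOLATION'].includes(error.code)) {
    return { type: 'data_corruption' };
  }
  // Otherwise assume the transient condition may have cleared
  return { type: 'transient_resolved' };
}
```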
### DLQ Retry Interface

```typescript
async function retryFromDLQ(
  dlqMessage: DeadLetterMessage,
  env: Env
): Promise<void> {
  // Record retry decision
  await appendFact({
    type: 'lifecycle',
    subtype: 'dlq_retry',
    timestamp: Date.now(),
    data: {
      dlq_message_id: dlqMessage.id,
      original_queue: dlqMessage.original_queue,
      reason: 'manual_retry'
    }
  });

  // Re-enqueue to original queue
  await env.QUEUES[dlqMessage.original_queue].send(
    dlqMessage.message,
    {
      contentType: 'json',
      headers: {
        'X-Retry-From-DLQ': 'true',
        'X-Original-Message-ID': dlqMessage.id,
        'X-DLQ-Attempts': dlqMessage.attempts.toString()
      }
    }
  );
}
```
## Failure Facts

Per Principle 1 (Economics Must Close the Loop), failures with economic impact need Facts.
### Cost Recording on Failure
```typescript
async function recordFailureWithCost(
  operation: string,
  error: Error,
  context: {
    tool_id: string;
    asset_id: string;
    tenant_id: string;
    config_id: string;
    config_version: number;
  },
  cost: {
    amount: number;
    currency: string;
  }
): Promise<void> {
  // 1. Record error Fact
  const errorFactId = await appendFact({
    type: 'error',
    subtype: 'tool_failure',
    timestamp: Date.now(),
    tenant_id: context.tenant_id,
    tool_id: context.tool_id,
    asset_id: context.asset_id,
    config_id: context.config_id,
    config_version: context.config_version,
    data: {
      error_code: error.code || 'UNKNOWN',
      error_message: error.message,
      error_type: determineErrorType(error),
      operation,
      resolved: false
    }
  });

  // 2. Record cost Fact (even though operation failed)
  await appendFact({
    type: 'cost',
    subtype: 'tool_usage',
    timestamp: Date.now(),
    tenant_id: context.tenant_id,
    tool_id: context.tool_id,
    asset_id: context.asset_id,
    from_entity: context.tenant_id,
    to_entity: 'vendor_twilio',
    amount: cost.amount,
    currency: cost.currency,
    config_id: context.config_id,
    config_version: context.config_version,
    data: {
      operation,
      failed: true,
      error_fact_id: errorFactId
    }
  });
}
```
### Example: Twilio Call Failure

```typescript
async function handleTwilioCallFailure(
  callParams: TwilioCallParams,
  error: TwilioError
): Promise<void> {
  // Twilio charges us for failed calls
  const cost = calculateTwilioCost(callParams.duration || 0);

  await recordFailureWithCost(
    'twilio_call_create',
    error,
    {
      tool_id: 'tool_twilio_voice',
      asset_id: callParams.assetId,
      tenant_id: callParams.tenantId,
      config_id: callParams.configId,
      config_version: callParams.configVersion
    },
    {
      amount: cost,
      currency: 'USD'
    }
  );

  // Economic loop still closes - we track the cost even though call failed
}
```
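
`calculateTwilioCost` is assumed above but not shown. A sketch with an illustrative rate and billing rule (real rates come from Twilio pricing, not this doc):

```typescript
// Hypothetical sketch; the $0.013/min rate and 1-minute minimum are illustrative.
function calculateTwilioCost(durationSeconds: number): number {
  const RATE_PER_MINUTE_USD = 0.013;
  const billedMinutes = Math.max(1, Math.ceil(durationSeconds / 60));
  return billedMinutes * RATE_PER_MINUTE_USD;
}
```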
## Config Versioning on Retry

Per Principle 6 (Configs Are Versioned), retries must use the original `config_version`.
### Problem
```
T1: Client sends request using pricing Config v3
T2: Request times out
T3: Pricing Config updated to v4
T4: Retry uses v4 (WRONG)
Result: Inconsistent pricing applied to same operation
```
### Solution

```typescript
interface RetryContext {
  operation: string;
  idempotencyKey: string;
  config_id: string;
  config_version: number;  // Lock to original version
  tenant_id: string;
  entity_id: string;
  created_at: number;
}

async function executeWithConfigVersion<T>(
  operation: (config: Config) => Promise<T>,
  context: RetryContext,
  retryConfig: RetryConfig
): Promise<T> {
  return await withRetry(
    async () => {
      // Always use original config_version
      const config = await getConfigVersion(
        context.config_id,
        context.config_version
      );

      // Execute with locked config
      return await operation(config);
    },
    retryConfig,
    {
      operationName: context.operation,
      entityId: context.entity_id,
      tenantId: context.tenant_id
    }
  );
}
```
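
`getConfigVersion` is the version-pinned counterpart to `getLatestConfig`; neither is defined here. A sketch assuming a REST-style versioned config endpoint (the route, `Config` type, and `apiKey` are assumptions, not a documented API):

```typescript
// Hypothetical sketch: fetch a specific, immutable config version.
async function getConfigVersion(configId: string, version: number): Promise<Config> {
  const response = await fetch(`/v1/configs/${configId}/versions/${version}`, {
    headers: { 'X-API-Key': apiKey }
  });
  if (!response.ok) {
    throw new Error(`Failed to load config ${configId} v${version}: ${response.status}`);
  }
  return await response.json();
}
```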
### Fact Recording with Config Version

```typescript
async function appendFactWithRetry(
  fact: Fact,
  idempotencyKey: string
): Promise<FactResult> {
  // Lock config version at operation start
  const context: RetryContext = {
    operation: 'append_fact',
    idempotencyKey,
    config_id: fact.config_id,
    config_version: fact.config_version, // Lock to this version
    tenant_id: fact.tenant_id,
    entity_id: fact.entity_id,
    created_at: Date.now()
  };

  return await executeWithConfigVersion(
    () => durableObject.appendFact(fact, idempotencyKey),
    context,
    {
      baseDelayMs: 1000,
      maxDelayMs: 60000,
      maxAttempts: 3,
      jitterRatio: 0.5,
      backoffMultiplier: 2
    }
  );
}
```
## Observability

### Metrics
Track error patterns for operational insight:
```typescript
interface ErrorMetrics {
  // Counters
  errors_total: Counter;
  errors_by_type: Counter;
  errors_by_tool: Counter;
  retries_total: Counter;
  retries_successful: Counter;
  circuit_breaker_opens: Counter;
  dlq_messages: Counter;

  // Histograms
  retry_duration: Histogram;
  retry_attempts: Histogram;

  // Gauges
  circuit_breaker_state: Gauge;
  dlq_depth: Gauge;
}

// Record error
metrics.errors_total.inc();
metrics.errors_by_type.inc({ type: error.type });
metrics.errors_by_tool.inc({ tool_id: error.tool_id });

// Record retry
metrics.retries_total.inc();
metrics.retry_duration.observe(duration);
metrics.retry_attempts.observe(attempts);

if (success) {
  metrics.retries_successful.inc();
}
```
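
The `Counter`, `Histogram`, and `Gauge` types are not tied to a specific metrics library here; minimal interfaces consistent with the calls above might look like:

```typescript
// Assumed metric primitives, consistent with the usage above.
interface Counter {
  inc(labels?: Record<string, string>): void;
}

interface Histogram {
  observe(value: number, labels?: Record<string, string>): void;
}

interface Gauge {
  set(value: number, labels?: Record<string, string>): void;
}
```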
### Alerts

Configure alerts for error patterns:
```typescript
const alertRules = [
  {
    name: 'high_error_rate',
    condition: 'errors_total rate > 0.01', // 1% error rate
    severity: 'warning',
    message: 'Error rate exceeds threshold'
  },
  {
    name: 'circuit_breaker_open',
    condition: 'circuit_breaker_state == OPEN',
    severity: 'critical',
    message: 'Circuit breaker open for {circuit_name}'
  },
  {
    name: 'dlq_growing',
    condition: 'dlq_depth > 100 AND rate(dlq_depth) > 0',
    severity: 'warning',
    message: 'Dead letter queue growing'
  },
  {
    name: 'retry_exhaustion',
    condition: 'rate(dlq_messages) > 0.001', // Messages hitting DLQ
    severity: 'warning',
    message: 'Messages failing all retries'
  }
];
```
### Tracing

Link errors across retries with trace IDs:
```typescript
interface TraceContext {
  trace_id: string;
  span_id: string;
  parent_span_id?: string;
  operation: string;
  start_time: number;
}

async function tracedOperation<T>(
  operation: () => Promise<T>,
  context: TraceContext
): Promise<T> {
  const start = Date.now();

  try {
    const result = await operation();

    // Record successful span
    await recordSpan({
      ...context,
      duration: Date.now() - start,
      status: 'success'
    });

    return result;

  } catch (error) {
    // Record error span
    await recordSpan({
      ...context,
      duration: Date.now() - start,
      status: 'error',
      error: {
        code: error.code,
        message: error.message
      }
    });

    throw error;
  }
}
```
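
`recordSpan` is left abstract above. One plausible shape, emitting the completed span as a structured log line (a real deployment would hand this to its tracing exporter instead):

```typescript
// Hypothetical sketch: emit a completed span as a structured log line.
interface SpanRecord extends TraceContext {
  duration: number;
  status: 'success' | 'error';
  error?: { code?: string; message: string };
}

async function recordSpan(span: SpanRecord): Promise<void> {
  console.log(JSON.stringify({ kind: 'span', ...span }));
}
```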
## Anti-Patterns

### 1. Retrying Terminal Errors
Wrong:
```typescript
// DON'T: Retry validation errors
async function createEntity(data) {
  return await withRetry(
    () => fetch('/v1/entities', { method: 'POST', body: data }),
    { maxAttempts: 5 } // Will retry 400 errors
  );
}
```

Right:
```typescript
// DO: Only retry transient errors
async function createEntity(data) {
  const response = await fetch('/v1/entities', { method: 'POST', body: data });

  if (!response.ok) {
    const error = await response.json();

    // Don't retry client errors
    if (response.status >= 400 && response.status < 500) {
      throw new ValidationError(error);
    }

    // Retry server errors
    if (response.status >= 500) {
      return await withRetry(
        () => fetch('/v1/entities', { method: 'POST', body: data }),
        { maxAttempts: 3 }
      );
    }
  }

  return await response.json();
}
```
### 2. Missing Idempotency Keys

Wrong:
```typescript
// DON'T: Retry without idempotency
async function appendFact(fact) {
  return await withRetry(
    () => durableObject.appendFact(fact), // No idempotency key
    { maxAttempts: 3 }
  );
  // Can create duplicate Facts on timeout
}
```

Right:
```typescript
// DO: Always use idempotency keys
async function appendFact(fact) {
  const idempotencyKey = generateIdempotencyKey(fact);

  return await withRetry(
    () => durableObject.appendFact(fact, idempotencyKey),
    { maxAttempts: 3 }
  );
}
```
### 3. Ignoring Config Version

Wrong:
```typescript
// DON'T: Fetch latest config on retry
async function processWithConfig(entity) {
  return await withRetry(async () => {
    const config = await getLatestConfig(entity.id); // Might change between retries
    return await process(entity, config);
  });
}
```

Right:
```typescript
// DO: Lock config version at operation start
async function processWithConfig(entity) {
  const config = await getLatestConfig(entity.id);
  const configVersion = config.version; // Lock version

  return await withRetry(async () => {
    // Always use original version
    const lockedConfig = await getConfigVersion(config.id, configVersion);
    return await process(entity, lockedConfig);
  });
}
```
### 4. Silent Failures

Wrong:
```typescript
// DON'T: Swallow errors without recording
async function callExternalAPI() {
  try {
    return await externalClient.call();
  } catch (error) {
    console.error('API call failed:', error); // Only logged
    return null; // Silent failure
  }
}
```

Right:
```typescript
// DO: Record errors as Facts (Principle 8)
async function callExternalAPI() {
  try {
    return await externalClient.call();
  } catch (error) {
    // Record error Fact
    await appendFact({
      type: 'error',
      subtype: 'external',
      timestamp: Date.now(),
      data: {
        error_code: error.code,
        error_message: error.message,
        error_type: determineErrorType(error)
      }
    });

    throw error; // Don't swallow
  }
}
```
### 5. Infinite Retries

Wrong:
```typescript
// DON'T: Retry forever
async function sendWebhook(url, data) {
  while (true) {
    try {
      return await fetch(url, { method: 'POST', body: data });
    } catch (error) {
      await sleep(1000); // Retry forever
    }
  }
}
```

Right:
```typescript
// DO: Set max attempts, use DLQ
async function sendWebhook(url, data) {
  try {
    return await withRetry(
      () => fetch(url, { method: 'POST', body: data }),
      { maxAttempts: 5 }
    );
  } catch (error) {
    // Send to DLQ after max retries
    await sendToDLQ({ url, data, error });
    throw error;
  }
}
```
## Summary

| Concept | Implementation |
|---|---|
| Error Taxonomy | Transient (retry), Client (don’t retry), Server (maybe), External (retry with backoff) |
| Idempotency | Unique keys per operation, cached results, 24-hour TTL |
| Retry Strategy | Exponential backoff with jitter, max attempts, context-aware policies |
| Circuit Breaker | CLOSED/OPEN/HALF-OPEN states, failure thresholds, timeout recovery |
| Dead Letter Queue | Max retries exceeded → DLQ → manual intervention |
| Failure Facts | Errors are Facts (Principle 8), economic impact tracked (Principle 1) |
| Config Versioning | Lock to original version on retries (Principle 6) |
| Observability | Metrics, alerts, tracing, failure pattern analysis |
Error handling in z0 respects the principles: Facts Are Immutable (idempotency prevents duplicates), Errors Are First-Class (tracked as Facts), Economics Must Close (costs recorded even on failure), and Configs Are Versioned (retries use the original version). The system fails gracefully, recovers automatically where possible, and never loses data.