# Error Handling & Retry Patterns
Graceful failure, automatic recovery, and data integrity for the z0 platform.
Prerequisites: PRINCIPLES.md (Principle 2: Facts Are Immutable, Principle 8: Errors Are First-Class), PRIMITIVES.md, ERROR-HANDLING.md
## Overview
Error handling in z0 is not exceptional—it’s expected. Every external call can fail. Every network request can time out. Every rate limit can be exceeded. The question is not “will this fail?” but “how will this fail safely?”
| Principle | Error Handling Implication |
|---|---|
| Principle 2: Facts Are Immutable | Retries must be idempotent; can’t “undo” written Facts |
| Principle 8: Errors Are First-Class | Failures are Facts, tracked for economics and debugging |
| Principle 1: Economics Must Close | Failed operations with costs still need tracking |
| Principle 6: Configs Are Versioned | Retries must use original config_version |
Key Insight: The z0 platform distinguishes between retryable errors (transient failures) and terminal errors (won’t succeed on retry). Retrying terminal errors wastes time and money. Not retrying transient errors loses revenue.
## Error Taxonomy
### Categories
Section titled “Categories”| Category | Retry? | Examples | HTTP Status |
|---|---|---|---|
| Transient | Yes | Network timeout, 503, rate limited | 429, 503, 504 |
| Client Error | No | 400, 404, validation failure | 400, 404, 422 |
| Server Error | Maybe | 500 (depends on idempotency) | 500, 502 |
| External Failure | Yes (with backoff) | Twilio down, CRM timeout | Varies |
| Data Integrity | No | Duplicate ID, constraint violation | 409, 422 |
| Authorization | No | Insufficient permissions, expired token | 401, 403 |
| Budget | No | Budget exhausted (Principle 10) | 402 |
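
Later examples in this doc call a `determineErrorType` helper when recording these categories on error Facts, but never define it. A minimal sketch of one possible mapping is below; the `status`/`code` error shape and the exact category strings are assumptions, not a prescribed API:

```typescript
// Hypothetical sketch: map an error to its taxonomy category.
type ErrorCategory =
  | 'transient' | 'client' | 'server'
  | 'external' | 'data_integrity' | 'authorization' | 'budget';

function determineErrorType(error: { status?: number; code?: string }): ErrorCategory {
  // Network-level failures have no HTTP status
  if (error.code === 'ETIMEDOUT' || error.code === 'ECONNRESET') return 'transient';

  switch (error.status) {
    case 429: case 503: case 504: return 'transient';
    case 401: case 403:           return 'authorization';
    case 402:                     return 'budget';
    case 409:                     return 'data_integrity';
    case 500: case 502:           return 'server';
    default:
      if (error.status !== undefined && error.status >= 400 && error.status < 500) {
        return 'client';
      }
      return 'external'; // Everything else: treat as an external failure
  }
}
```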
### Decision Matrix
```typescript
function isRetryable(error: APIError): boolean {
  // Transient errors: retry
  if ([429, 503, 504].includes(error.status)) {
    return true;
  }

  // Server errors: retry if idempotent
  if ([500, 502].includes(error.status)) {
    return true; // Assuming idempotency key used
  }

  // Client errors: don't retry
  if (error.status >= 400 && error.status < 500) {
    return false;
  }

  // Network errors: retry
  if (error.code === 'ETIMEDOUT' || error.code === 'ECONNRESET') {
    return true;
  }

  return false;
}
```
## Fact Recording on Error

Per Principle 8, errors are recorded as Facts:
```typescript
interface ErrorFact {
  type: 'error';
  subtype: 'tool_failure' | 'timeout' | 'validation' | 'external' | 'rate_limit';
  timestamp: number;

  tenant_id: string;
  entity_id: string;
  tool_id?: string;

  data: {
    error_code: string;
    error_message: string;
    error_type: string; // From taxonomy
    retry_attempt: number;
    max_retries: number;
    resolved: boolean;
    resolution_fact_id?: string;
    trace_id?: string;

    // Economics (Principle 1: track costs even on failure)
    cost_incurred?: number;
    currency?: string;
  };
}
```

Example:
```typescript
await appendFact({
  type: 'error',
  subtype: 'external',
  timestamp: Date.now(),
  tenant_id: 'ten_abc123',
  entity_id: 'asset_xyz789',
  tool_id: 'tool_twilio_voice',

  data: {
    error_code: 'ETIMEDOUT',
    error_message: 'Twilio API timeout after 30s',
    error_type: 'transient',
    retry_attempt: 2,
    max_retries: 5,
    resolved: false,
    cost_incurred: 0.005, // Twilio charged us anyway
    currency: 'USD',
    trace_id: 'trace_abc123'
  }
});
```
## Idempotency

### The Problem
Facts are immutable (Principle 2). Retrying a Fact append without idempotency creates duplicates:
```
T1: Client sends Fact → timeout before response
T2: Client retries → Fact appended again
Result: Duplicate Facts, incorrect economics
```
### Idempotency Keys

Every write operation must include an idempotency key:
```typescript
interface IdempotencyKey {
  key: string;         // Unique per request
  expires_at: number;  // TTL for storage
  created_at: number;
  result?: any;        // Cached response
  status: 'pending' | 'completed' | 'failed';
}
```
### Storage Mechanism

Critical: Idempotency keys MUST be stored in SQLite, not just in memory. In-memory Maps are lost on DO eviction.
```sql
-- Add to DO schema initialization
CREATE TABLE IF NOT EXISTS idempotency_keys (
  key TEXT PRIMARY KEY,
  status TEXT NOT NULL CHECK(status IN ('pending', 'completed', 'failed')),
  result TEXT, -- JSON blob of cached response
  created_at INTEGER NOT NULL,
  expires_at INTEGER NOT NULL
);

CREATE INDEX IF NOT EXISTS idx_idempotency_expiry
  ON idempotency_keys(expires_at);
```

### Implementation Pattern
Section titled “Implementation Pattern”class DurableObjectWithIdempotency { // In-memory cache for hot path (backed by SQLite) private idempotencyHotCache: Map<string, IdempotencyKey> = new Map();
async appendFact(fact: Fact, idempotencyKey: string): Promise<FactResult> { // 1. Check hot cache first (fast path) let existing = this.idempotencyHotCache.get(idempotencyKey);
// 2. Fall back to SQLite (persistent storage) if (!existing) { const row = await this.sql.exec( `SELECT key, status, result, created_at, expires_at FROM idempotency_keys WHERE key = ?`, [idempotencyKey] ).first();
if (row) { existing = { key: row.key, status: row.status, result: row.result ? JSON.parse(row.result) : undefined, created_at: row.created_at, expires_at: row.expires_at }; // Populate hot cache this.idempotencyHotCache.set(idempotencyKey, existing); } }
if (existing) { if (existing.status === 'pending') { // Request in progress - return 409 Conflict throw new IdempotencyConflictError(idempotencyKey); } // Return cached result (handles retry of completed request) return existing.result; }
// 3. Mark as pending (persist to SQLite) const now = Date.now(); const expiresAt = now + 86400000; // 24 hours
await this.sql.exec( `INSERT INTO idempotency_keys (key, status, created_at, expires_at) VALUES (?, 'pending', ?, ?)`, [idempotencyKey, now, expiresAt] );
this.idempotencyHotCache.set(idempotencyKey, { key: idempotencyKey, status: 'pending', created_at: now, expires_at: expiresAt });
try { // 4. Append fact (immutable) const result = await this.doAppendFact(fact);
// 5. Record success (persist to SQLite) await this.sql.exec( `UPDATE idempotency_keys SET status = 'completed', result = ? WHERE key = ?`, [JSON.stringify(result), idempotencyKey] );
this.idempotencyHotCache.set(idempotencyKey, { key: idempotencyKey, status: 'completed', created_at: now, expires_at: expiresAt, result });
return result;
} catch (error) { // 6. Handle failure if (!isRetryable(error)) { // Terminal error - cache to prevent retries await this.sql.exec( `UPDATE idempotency_keys SET status = 'failed', result = ? WHERE key = ?`, [JSON.stringify({ error: error.message }), idempotencyKey] );
this.idempotencyHotCache.set(idempotencyKey, { key: idempotencyKey, status: 'failed', created_at: now, expires_at: expiresAt, result: { error: error.message } }); } else { // Retryable error - remove pending status to allow retry await this.sql.exec( `DELETE FROM idempotency_keys WHERE key = ?`, [idempotencyKey] ); this.idempotencyHotCache.delete(idempotencyKey); } throw error; } }
// Cleanup expired keys (called by reconciliation alarm) async cleanupIdempotencyCache(): Promise<void> { const now = Date.now();
// Delete from SQLite await this.sql.exec( `DELETE FROM idempotency_keys WHERE expires_at < ?`, [now] );
// Clear expired from hot cache for (const [key, value] of this.idempotencyHotCache.entries()) { if (value.expires_at < now) { this.idempotencyHotCache.delete(key); } } }}Client-Side Usage
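
The examples above and below throw `IdempotencyConflictError` (and later `CircuitOpenError` and `MaxRetriesExceededError`) without defining them. Minimal sketches, assuming plain `Error` subclasses with no platform-specific base:

```typescript
// Hypothetical sketches of the custom error classes used in this doc.
class IdempotencyConflictError extends Error {
  constructor(public readonly idempotencyKey: string) {
    super(`Request with idempotency key ${idempotencyKey} is already in progress`);
    this.name = 'IdempotencyConflictError';
  }
}

class CircuitOpenError extends Error {
  constructor(message: string) {
    super(message);
    this.name = 'CircuitOpenError';
  }
}

class MaxRetriesExceededError extends Error {
  constructor(message: string, public readonly lastError?: Error) {
    super(message);
    this.name = 'MaxRetriesExceededError';
  }
}
```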
### Client-Side Usage

```typescript
async function createFactWithIdempotency(factData: Fact): Promise<FactResult> {
  // Generate idempotency key (deterministic per logical operation)
  const idempotencyKey = `fact_${factData.type}_${factData.timestamp}_${factData.entity_id}`;

  const response = await fetch('/v1/facts', {
    method: 'POST',
    headers: {
      'X-API-Key': apiKey,
      'Idempotency-Key': idempotencyKey,
      'Content-Type': 'application/json'
    },
    body: JSON.stringify(factData)
  });

  if (response.ok) {
    return await response.json();
  }

  const error = await response.json();

  // If conflict, request in progress - wait and retry with SAME key
  if (response.status === 409 && error.type.includes('idempotency-conflict')) {
    await sleep(2000);
    return createFactWithIdempotency(factData); // Retry with same key
  }

  throw new APIError(error);
}
```
### Idempotency Key Generation

| Operation | Key Format | Example |
|---|---|---|
| Fact append | `fact_{type}_{timestamp}_{entity_id}` | `fact_invocation_1705500000000_ent_abc123` |
| Entity create | `entity_{type}_{identifier}_{timestamp}` | `entity_asset_+15550001_1705500000000` |
| Config update | `config_{id}_{version}_{timestamp}` | `config_cfg_123_5_1705500000000` |
| External API | `external_{tool}_{operation}_{nonce}` | `external_twilio_call_abc123xyz` |
Rules (see the sketch below):
- Deterministic: Same logical operation always generates same key
- Unique: Different operations never collide
- Bounded: Include timestamp to enable expiration
- Human-readable: Debug-friendly format
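
A minimal generator following these rules might look like the sketch below; the function names and input shapes are illustrative, not a prescribed API:

```typescript
// Hypothetical sketch: build keys per the formats in the table above.
function factIdempotencyKey(fact: { type: string; timestamp: number; entity_id: string }): string {
  return `fact_${fact.type}_${fact.timestamp}_${fact.entity_id}`;
}

function externalIdempotencyKey(tool: string, operation: string, nonce: string): string {
  return `external_${tool}_${operation}_${nonce}`;
}

// Deterministic: the same logical operation always yields the same key
factIdempotencyKey({ type: 'invocation', timestamp: 1705500000000, entity_id: 'ent_abc123' });
// => 'fact_invocation_1705500000000_ent_abc123'
```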
## Retry Strategies
### Exponential Backoff with Jitter
```typescript
interface RetryConfig {
  baseDelayMs: number;        // Starting delay (e.g., 1000)
  maxDelayMs: number;         // Ceiling (e.g., 60000)
  maxAttempts: number;        // Give up after N attempts
  jitterRatio: number;        // Random variance (0-1)
  backoffMultiplier: number;  // Growth rate (typically 2)
}

function calculateBackoff(
  attempt: number,
  config: RetryConfig
): number {
  // Exponential: delay = base * multiplier^attempt
  const exponential = config.baseDelayMs * Math.pow(
    config.backoffMultiplier,
    attempt
  );

  // Cap at max delay
  const capped = Math.min(exponential, config.maxDelayMs);

  // Add jitter: random variance to prevent thundering herd
  const jitter = capped * config.jitterRatio * Math.random();

  return capped + jitter;
}
```

Example progression:
```
Attempt 1: 1000ms + jitter (0-500ms)   = 1000-1500ms
Attempt 2: 2000ms + jitter (0-1000ms)  = 2000-3000ms
Attempt 3: 4000ms + jitter (0-2000ms)  = 4000-6000ms
Attempt 4: 8000ms + jitter (0-4000ms)  = 8000-12000ms
Attempt 5: 16000ms + jitter (0-8000ms) = 16000-24000ms
```
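
The client-side example above and the retry wrapper below both call a `sleep` helper that is never defined in this doc; the standard promise-based one-liner is assumed:

```typescript
// Assumed helper: promise-based delay used by the retry examples.
function sleep(ms: number): Promise<void> {
  return new Promise((resolve) => setTimeout(resolve, ms));
}
```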
### Retry Wrapper

```typescript
async function withRetry<T>(
  operation: () => Promise<T>,
  config: RetryConfig,
  context: {
    operationName: string;
    entityId?: string;
    tenantId?: string;
  }
): Promise<T> {
  let lastError: Error;

  for (let attempt = 0; attempt < config.maxAttempts; attempt++) {
    try {
      // Attempt operation
      const result = await operation();

      // Success - record if this was a retry
      if (attempt > 0) {
        await recordRetrySuccess(context, attempt);
      }

      return result;

    } catch (error) {
      lastError = error;

      // Check if retryable
      if (!isRetryable(error)) {
        await recordTerminalError(context, error, attempt);
        throw error;
      }

      // Check if final attempt
      if (attempt === config.maxAttempts - 1) {
        await recordMaxRetriesExceeded(context, error, attempt);
        throw new MaxRetriesExceededError(
          `${context.operationName} failed after ${config.maxAttempts} attempts`,
          lastError
        );
      }

      // Calculate backoff
      const delayMs = calculateBackoff(attempt, config);

      // Record retry
      await recordRetryAttempt(context, error, attempt, delayMs);

      // Wait before retry
      await sleep(delayMs);
    }
  }

  throw lastError;
}
```
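
The `record*` helpers used above (`recordRetrySuccess`, `recordTerminalError`, `recordMaxRetriesExceeded`, `recordRetryAttempt`) are not defined in this doc. Per Principle 8 they would append error Facts; a sketch of one, with fields beyond the ErrorFact interface flagged as assumptions:

```typescript
// Hypothetical sketch: record a single retry attempt as an error Fact.
async function recordRetryAttempt(
  context: { operationName: string; entityId?: string; tenantId?: string },
  error: Error & { code?: string },
  attempt: number,
  delayMs: number
): Promise<void> {
  await appendFact({
    type: 'error',
    subtype: 'tool_failure',
    timestamp: Date.now(),
    tenant_id: context.tenantId,
    entity_id: context.entityId,
    data: {
      error_code: error.code || 'UNKNOWN',
      error_message: error.message,
      error_type: 'transient', // Retryable by definition if we got here
      retry_attempt: attempt,
      resolved: false,
      operation: context.operationName,  // assumed field
      next_retry_delay_ms: delayMs       // assumed field
    }
  });
}
```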
### Retry Policies by Operation

| Operation | Base Delay | Max Delay | Max Attempts | Rationale |
|---|---|---|---|---|
| Fact append | 1000ms | 60000ms | 3 | Fast fail, Facts critical |
| Entity create | 1000ms | 60000ms | 3 | Low retry, avoid duplicates |
| Config update | 1000ms | 30000ms | 3 | Fast fail, conflicts likely |
| External API (Twilio) | 2000ms | 120000ms | 5 | Higher tolerance, expensive to lose |
| Webhook delivery | 5000ms | 300000ms | 5 | Eventual consistency OK |
| CRM sync | 10000ms | 600000ms | 10 | Very eventual, high value |
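
One way to encode this table is a lookup keyed by operation name; the keys below are illustrative (the doc does not prescribe them):

```typescript
// Illustrative encoding of the policy table; operation keys are assumptions.
const RETRY_POLICIES: Record<string, RetryConfig> = {
  fact_append:      { baseDelayMs: 1000,  maxDelayMs: 60000,  maxAttempts: 3,  jitterRatio: 0.5, backoffMultiplier: 2 },
  entity_create:    { baseDelayMs: 1000,  maxDelayMs: 60000,  maxAttempts: 3,  jitterRatio: 0.5, backoffMultiplier: 2 },
  config_update:    { baseDelayMs: 1000,  maxDelayMs: 30000,  maxAttempts: 3,  jitterRatio: 0.5, backoffMultiplier: 2 },
  external_twilio:  { baseDelayMs: 2000,  maxDelayMs: 120000, maxAttempts: 5,  jitterRatio: 0.5, backoffMultiplier: 2 },
  webhook_delivery: { baseDelayMs: 5000,  maxDelayMs: 300000, maxAttempts: 5,  jitterRatio: 0.5, backoffMultiplier: 2 },
  crm_sync:         { baseDelayMs: 10000, maxDelayMs: 600000, maxAttempts: 10, jitterRatio: 0.5, backoffMultiplier: 2 }
};
```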
### Context-Aware Retry
Different contexts need different retry behavior:
```typescript
function getRetryConfig(context: OperationContext): RetryConfig {
  // Real-time operations: fail fast
  if (context.priority === 'realtime') {
    return {
      baseDelayMs: 500,
      maxDelayMs: 5000,
      maxAttempts: 2,
      jitterRatio: 0.5,
      backoffMultiplier: 2
    };
  }

  // Background operations: tolerate more delay
  if (context.priority === 'background') {
    return {
      baseDelayMs: 5000,
      maxDelayMs: 300000,
      maxAttempts: 10,
      jitterRatio: 0.5,
      backoffMultiplier: 2
    };
  }

  // Default: balanced
  return {
    baseDelayMs: 1000,
    maxDelayMs: 60000,
    maxAttempts: 5,
    jitterRatio: 0.5,
    backoffMultiplier: 2
  };
}
```
## Circuit Breaker

Circuit breakers prevent cascading failures when external services are down.
### States
```
CLOSED → Normal operation, requests pass through
  │
  ├─ Too many failures → OPEN
  │
OPEN → Fast-fail all requests, don't call service
  │
  ├─ Timeout elapsed → HALF-OPEN
  │
HALF-OPEN → Allow one test request
  │
  ├─ Success → CLOSED
  └─ Failure → OPEN
```
### Implementation

```typescript
interface CircuitBreakerConfig {
  failureThreshold: number;  // Open after N failures (e.g., 5)
  successThreshold: number;  // Close after N successes (e.g., 2)
  timeout: number;           // Stay open for N ms (e.g., 60000)
  volumeThreshold: number;   // Min requests before opening (e.g., 10)
}

class CircuitBreaker {
  private state: 'CLOSED' | 'OPEN' | 'HALF-OPEN' = 'CLOSED';
  private failureCount = 0;
  private successCount = 0;
  private lastFailureTime = 0;
  private requestCount = 0;

  constructor(
    private name: string,
    private config: CircuitBreakerConfig
  ) {}

  async execute<T>(operation: () => Promise<T>): Promise<T> {
    // Check circuit state
    if (this.state === 'OPEN') {
      // Check if timeout elapsed
      if (Date.now() - this.lastFailureTime > this.config.timeout) {
        this.state = 'HALF-OPEN';
        this.successCount = 0;
      } else {
        throw new CircuitOpenError(
          `Circuit breaker ${this.name} is OPEN`
        );
      }
    }

    try {
      // Execute operation
      const result = await operation();

      // Record success
      this.onSuccess();

      return result;

    } catch (error) {
      // Record failure
      this.onFailure();

      throw error;
    }
  }

  private onSuccess(): void {
    this.requestCount++;

    if (this.state === 'HALF-OPEN') {
      this.successCount++;

      if (this.successCount >= this.config.successThreshold) {
        // Enough successes - close circuit
        const previous = this.state;
        this.state = 'CLOSED';
        this.failureCount = 0;
        this.successCount = 0;
        void this.recordStateChange(previous, 'CLOSED', 'threshold_met');
      }
    } else if (this.state === 'CLOSED') {
      // Reset failure count on success
      this.failureCount = 0;
    }
  }

  private onFailure(): void {
    this.requestCount++;
    this.failureCount++;
    this.lastFailureTime = Date.now();

    if (this.state === 'HALF-OPEN') {
      // Test request failed - reopen
      const previous = this.state;
      this.state = 'OPEN';
      this.successCount = 0;
      void this.recordStateChange(previous, 'OPEN', 'half_open_failed');

    } else if (this.state === 'CLOSED') {
      // Check if should open
      if (
        this.requestCount >= this.config.volumeThreshold &&
        this.failureCount >= this.config.failureThreshold
      ) {
        const previous = this.state;
        this.state = 'OPEN';
        void this.recordStateChange(previous, 'OPEN', 'threshold_exceeded');
      }
    }
  }

  private async recordStateChange(
    previousState: string,
    newState: string,
    reason: string
  ): Promise<void> {
    // Record circuit breaker state change as Fact
    // (previous state is captured before mutation so the Fact is accurate)
    await appendFact({
      type: 'lifecycle',
      subtype: 'circuit_breaker_state_changed',
      timestamp: Date.now(),
      data: {
        circuit_name: this.name,
        previous_state: previousState,
        new_state: newState,
        reason,
        failure_count: this.failureCount,
        success_count: this.successCount,
        request_count: this.requestCount
      }
    });
  }

  getState(): string {
    return this.state;
  }

  getMetrics() {
    return {
      state: this.state,
      failureCount: this.failureCount,
      successCount: this.successCount,
      requestCount: this.requestCount,
      lastFailureTime: this.lastFailureTime
    };
  }
}
```
### Circuit Breaker Registry

```typescript
class CircuitBreakerRegistry {
  private breakers: Map<string, CircuitBreaker> = new Map();

  get(name: string, config?: CircuitBreakerConfig): CircuitBreaker {
    if (!this.breakers.has(name)) {
      const defaultConfig: CircuitBreakerConfig = {
        failureThreshold: 5,
        successThreshold: 2,
        timeout: 60000,
        volumeThreshold: 10
      };

      this.breakers.set(
        name,
        new CircuitBreaker(name, config || defaultConfig)
      );
    }

    return this.breakers.get(name)!;
  }

  getMetrics(): Record<string, any> {
    const metrics: Record<string, any> = {};
    for (const [name, breaker] of this.breakers.entries()) {
      metrics[name] = breaker.getMetrics();
    }
    return metrics;
  }
}

// Global registry
const circuitBreakers = new CircuitBreakerRegistry();
```
### Usage with External APIs

```typescript
async function callTwilioAPI(params: TwilioCallParams): Promise<CallResult> {
  const breaker = circuitBreakers.get('twilio_voice', {
    failureThreshold: 5,
    successThreshold: 2,
    timeout: 60000,
    volumeThreshold: 10
  });

  try {
    return await breaker.execute(async () => {
      // Wrapped with retry logic
      return await withRetry(
        () => twilioClient.calls.create(params),
        {
          baseDelayMs: 2000,
          maxDelayMs: 120000,
          maxAttempts: 5,
          jitterRatio: 0.5,
          backoffMultiplier: 2
        },
        {
          operationName: 'twilio_call_create',
          entityId: params.assetId,
          tenantId: params.tenantId
        }
      );
    });

  } catch (error) {
    if (error instanceof CircuitOpenError) {
      // Circuit breaker is open - record as error Fact
      await appendFact({
        type: 'error',
        subtype: 'external',
        timestamp: Date.now(),
        tool_id: 'tool_twilio_voice',
        data: {
          error_code: 'CIRCUIT_OPEN',
          error_message: 'Twilio circuit breaker is open',
          error_type: 'circuit_breaker',
          circuit_state: 'OPEN',
          retry_attempt: 0,
          max_retries: 0,
          resolved: false
        }
      });

      // Use fallback behavior
      return handleTwilioUnavailable(params);
    }

    throw error;
  }
}
```
## Dead Letter Queue

When all retries fail, messages go to the dead letter queue for manual intervention.
### DLQ Pattern
```typescript
interface DeadLetterMessage {
  id: string;
  original_queue: string;
  message: any;
  error: {
    code: string;
    message: string;
    stack?: string;
  };
  attempts: number;
  first_attempt_at: number;
  last_attempt_at: number;
  sent_to_dlq_at: number;
  metadata: {
    tenant_id?: string;
    entity_id?: string;
    trace_id?: string;
  };
}

async function handleMaxRetriesExceeded(
  message: QueueMessage,
  error: Error,
  attempts: number
): Promise<void> {
  // 1. Record failure as Fact (Principle 8: Errors Are First-Class)
  await appendFact({
    type: 'error',
    subtype: 'max_retries_exceeded',
    timestamp: Date.now(),
    tenant_id: message.tenantId,
    entity_id: message.entityId,
    data: {
      error_code: error.code || 'UNKNOWN',
      error_message: error.message,
      error_type: 'terminal',
      retry_attempt: attempts,
      max_retries: attempts,
      resolved: false,
      queue_name: message.queue,
      message_id: message.id
    }
  });

  // 2. Send to dead letter queue
  const dlqMessage: DeadLetterMessage = {
    id: `dlq_${message.id}`,
    original_queue: message.queue,
    message: message.body,
    error: {
      code: error.code || 'UNKNOWN',
      message: error.message,
      stack: error.stack
    },
    attempts,
    first_attempt_at: message.timestamp,
    last_attempt_at: Date.now(),
    sent_to_dlq_at: Date.now(),
    metadata: {
      tenant_id: message.tenantId,
      entity_id: message.entityId,
      trace_id: message.traceId
    }
  };

  await env.DLQ.send(dlqMessage);

  // 3. Alert operations team
  await alertOps({
    severity: 'warning',
    title: 'Message sent to DLQ',
    message: `Queue ${message.queue} message ${message.id} failed after ${attempts} attempts`,
    metadata: dlqMessage.metadata
  });
}
```
### DLQ Processing Worker

```typescript
export default {
  async queue(batch: MessageBatch<DeadLetterMessage>, env: Env): Promise<void> {
    for (const message of batch.messages) {
      try {
        // Analyze failure pattern
        const pattern = await analyzeFailurePattern(message);

        if (pattern.type === 'transient_resolved') {
          // Service recovered - retry (pass the DLQ payload, not the queue envelope)
          await retryFromDLQ(message.body, env);
        } else if (pattern.type === 'configuration_error') {
          // Alert for manual fix
          await alertForManualIntervention(message, pattern);
        } else if (pattern.type === 'data_corruption') {
          // Archive for analysis
          await archiveCorruptedMessage(message);
        }

        message.ack();

      } catch (error) {
        // DLQ processing failed - log and continue
        console.error('DLQ processing error:', error);
        message.retry();
      }
    }
  }
};
```
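
`analyzeFailurePattern` is referenced but never defined. A naive sketch that classifies from the recorded error code; the codes checked here are illustrative, and a real implementation would likely also probe the downstream service:

```typescript
// Hypothetical sketch: classify a DLQ message for triage.
interface FailurePattern {
  type: 'transient_resolved' | 'configuration_error' | 'data_corruption';
}

async function analyzeFailurePattern(
  message: { body: DeadLetterMessage }
): Promise<FailurePattern> {
  const { error } = message.body;

  // Codes suggesting a config/validation fix is needed before any retry
  if (['VALIDATION_FAILED', 'CONFIG_INVALID'].includes(error.code)) {
    return { type: 'configuration_error' };
  }
  // Codes suggesting the payload itself is bad
  if (['PARSE_ERROR', 'CONSTRAINT_VIOLATION'].includes(error.code)) {
    return { type: 'data_corruption' };
  }
  // Otherwise assume the transient condition may have cleared
  return { type: 'transient_resolved' };
}
```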
### DLQ Retry Interface

```typescript
async function retryFromDLQ(
  dlqMessage: DeadLetterMessage,
  env: Env
): Promise<void> {
  // Record retry decision
  await appendFact({
    type: 'lifecycle',
    subtype: 'dlq_retry',
    timestamp: Date.now(),
    data: {
      dlq_message_id: dlqMessage.id,
      original_queue: dlqMessage.original_queue,
      reason: 'manual_retry'
    }
  });

  // Re-enqueue to original queue
  await env.QUEUES[dlqMessage.original_queue].send(
    dlqMessage.message,
    {
      contentType: 'json',
      headers: {
        'X-Retry-From-DLQ': 'true',
        'X-Original-Message-ID': dlqMessage.id,
        'X-DLQ-Attempts': dlqMessage.attempts.toString()
      }
    }
  );
}
```
## Failure Facts

Per Principle 1 (Economics Must Close the Loop), failures with economic impact need Facts.
### Cost Recording on Failure
```typescript
async function recordFailureWithCost(
  operation: string,
  error: Error,
  context: {
    tool_id: string;
    asset_id: string;
    tenant_id: string;
    config_id: string;
    config_version: number;
  },
  cost: {
    amount: number;
    currency: string;
  }
): Promise<void> {
  // 1. Record error Fact
  const errorFactId = await appendFact({
    type: 'error',
    subtype: 'tool_failure',
    timestamp: Date.now(),
    tenant_id: context.tenant_id,
    tool_id: context.tool_id,
    asset_id: context.asset_id,
    config_id: context.config_id,
    config_version: context.config_version,
    data: {
      error_code: error.code || 'UNKNOWN',
      error_message: error.message,
      error_type: determineErrorType(error),
      operation,
      resolved: false
    }
  });

  // 2. Record cost Fact (even though operation failed)
  await appendFact({
    type: 'cost',
    subtype: 'tool_usage',
    timestamp: Date.now(),
    tenant_id: context.tenant_id,
    tool_id: context.tool_id,
    asset_id: context.asset_id,
    from_entity: context.tenant_id,
    to_entity: 'vendor_twilio',
    amount: cost.amount,
    currency: cost.currency,
    config_id: context.config_id,
    config_version: context.config_version,
    data: {
      operation,
      failed: true,
      error_fact_id: errorFactId
    }
  });
}
```
### Example: Twilio Call Failure

```typescript
async function handleTwilioCallFailure(
  callParams: TwilioCallParams,
  error: TwilioError
): Promise<void> {
  // Twilio charges us for failed calls
  const cost = calculateTwilioCost(callParams.duration || 0);

  await recordFailureWithCost(
    'twilio_call_create',
    error,
    {
      tool_id: 'tool_twilio_voice',
      asset_id: callParams.assetId,
      tenant_id: callParams.tenantId,
      config_id: callParams.configId,
      config_version: callParams.configVersion
    },
    {
      amount: cost,
      currency: 'USD'
    }
  );

  // Economic loop still closes - we track the cost even though call failed
}
```
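
`calculateTwilioCost` is assumed above but not shown. A sketch with an illustrative rate and billing rule (real rates come from Twilio pricing, not this doc):

```typescript
// Hypothetical sketch; the $0.013/min rate and 1-minute minimum are illustrative.
function calculateTwilioCost(durationSeconds: number): number {
  const RATE_PER_MINUTE_USD = 0.013;
  const billedMinutes = Math.max(1, Math.ceil(durationSeconds / 60));
  return billedMinutes * RATE_PER_MINUTE_USD;
}
```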
## Config Versioning on Retry

Per Principle 6 (Configs Are Versioned), retries must use the original `config_version`.
### Problem
```
T1: Client sends request using pricing Config v3
T2: Request times out
T3: Pricing Config updated to v4
T4: Retry uses v4 (WRONG)
Result: Inconsistent pricing applied to same operation
```
### Solution

```typescript
interface RetryContext {
  operation: string;
  idempotencyKey: string;
  config_id: string;
  config_version: number;  // Lock to original version
  tenant_id: string;
  entity_id: string;
  created_at: number;
}

async function executeWithConfigVersion<T>(
  operation: (config: Config) => Promise<T>,
  context: RetryContext,
  retryConfig: RetryConfig
): Promise<T> {
  return await withRetry(
    async () => {
      // Always use original config_version
      const config = await getConfigVersion(
        context.config_id,
        context.config_version
      );

      // Execute with locked config
      return await operation(config);
    },
    retryConfig,
    {
      operationName: context.operation,
      entityId: context.entity_id,
      tenantId: context.tenant_id
    }
  );
}
```
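
`getConfigVersion` is the version-pinned counterpart to `getLatestConfig`; neither is defined here. A sketch assuming a REST-style versioned config endpoint (the route, `Config` type, and `apiKey` are assumptions, not a documented API):

```typescript
// Hypothetical sketch: fetch a specific, immutable config version.
async function getConfigVersion(configId: string, version: number): Promise<Config> {
  const response = await fetch(`/v1/configs/${configId}/versions/${version}`, {
    headers: { 'X-API-Key': apiKey }
  });
  if (!response.ok) {
    throw new Error(`Failed to load config ${configId} v${version}: ${response.status}`);
  }
  return await response.json();
}
```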
### Fact Recording with Config Version

```typescript
async function appendFactWithRetry(
  fact: Fact,
  idempotencyKey: string
): Promise<FactResult> {
  // Lock config version at operation start
  const context: RetryContext = {
    operation: 'append_fact',
    idempotencyKey,
    config_id: fact.config_id,
    config_version: fact.config_version, // Lock to this version
    tenant_id: fact.tenant_id,
    entity_id: fact.entity_id,
    created_at: Date.now()
  };

  return await executeWithConfigVersion(
    () => durableObject.appendFact(fact, idempotencyKey),
    context,
    {
      baseDelayMs: 1000,
      maxDelayMs: 60000,
      maxAttempts: 3,
      jitterRatio: 0.5,
      backoffMultiplier: 2
    }
  );
}
```
## Observability

### Metrics
Track error patterns for operational insight:
```typescript
interface ErrorMetrics {
  // Counters
  errors_total: Counter;
  errors_by_type: Counter;
  errors_by_tool: Counter;
  retries_total: Counter;
  retries_successful: Counter;
  circuit_breaker_opens: Counter;
  dlq_messages: Counter;

  // Histograms
  retry_duration: Histogram;
  retry_attempts: Histogram;

  // Gauges
  circuit_breaker_state: Gauge;
  dlq_depth: Gauge;
}

// Record error
metrics.errors_total.inc();
metrics.errors_by_type.inc({ type: error.type });
metrics.errors_by_tool.inc({ tool_id: error.tool_id });

// Record retry
metrics.retries_total.inc();
metrics.retry_duration.observe(duration);
metrics.retry_attempts.observe(attempts);

if (success) {
  metrics.retries_successful.inc();
}
```
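
The `Counter`, `Histogram`, and `Gauge` types are not tied to a specific metrics library here; minimal interfaces consistent with the calls above might look like:

```typescript
// Assumed metric primitives, consistent with the usage above.
interface Counter {
  inc(labels?: Record<string, string>): void;
}

interface Histogram {
  observe(value: number, labels?: Record<string, string>): void;
}

interface Gauge {
  set(value: number, labels?: Record<string, string>): void;
}
```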
### Alerts

Configure alerts for error patterns:
```typescript
const alertRules = [
  {
    name: 'high_error_rate',
    condition: 'errors_total rate > 0.01', // 1% error rate
    severity: 'warning',
    message: 'Error rate exceeds threshold'
  },
  {
    name: 'circuit_breaker_open',
    condition: 'circuit_breaker_state == OPEN',
    severity: 'critical',
    message: 'Circuit breaker open for {circuit_name}'
  },
  {
    name: 'dlq_growing',
    condition: 'dlq_depth > 100 AND rate(dlq_depth) > 0',
    severity: 'warning',
    message: 'Dead letter queue growing'
  },
  {
    name: 'retry_exhaustion',
    condition: 'rate(dlq_messages) > 0.001', // Messages hitting DLQ
    severity: 'warning',
    message: 'Messages failing all retries'
  }
];
```
### Tracing

Link errors across retries with trace IDs:
```typescript
interface TraceContext {
  trace_id: string;
  span_id: string;
  parent_span_id?: string;
  operation: string;
  start_time: number;
}

async function tracedOperation<T>(
  operation: () => Promise<T>,
  context: TraceContext
): Promise<T> {
  const start = Date.now();

  try {
    const result = await operation();

    // Record successful span
    await recordSpan({
      ...context,
      duration: Date.now() - start,
      status: 'success'
    });

    return result;

  } catch (error) {
    // Record error span
    await recordSpan({
      ...context,
      duration: Date.now() - start,
      status: 'error',
      error: {
        code: error.code,
        message: error.message
      }
    });

    throw error;
  }
}
```
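
`recordSpan` is left abstract above. One plausible shape, emitting the completed span as a structured log line (a real deployment would hand this to its tracing exporter instead):

```typescript
// Hypothetical sketch: emit a completed span as a structured log line.
interface SpanRecord extends TraceContext {
  duration: number;
  status: 'success' | 'error';
  error?: { code?: string; message: string };
}

async function recordSpan(span: SpanRecord): Promise<void> {
  console.log(JSON.stringify({ kind: 'span', ...span }));
}
```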
## Anti-Patterns

### 1. Retrying Terminal Errors
Wrong:
```typescript
// DON'T: Retry validation errors
async function createEntity(data) {
  return await withRetry(
    () => fetch('/v1/entities', { method: 'POST', body: data }),
    { maxAttempts: 5 } // Will retry 400 errors
  );
}
```

Right:
```typescript
// DO: Only retry transient errors
async function createEntity(data) {
  const response = await fetch('/v1/entities', { method: 'POST', body: data });

  if (!response.ok) {
    const error = await response.json();

    // Don't retry client errors
    if (response.status >= 400 && response.status < 500) {
      throw new ValidationError(error);
    }

    // Retry server errors
    if (response.status >= 500) {
      return await withRetry(
        () => fetch('/v1/entities', { method: 'POST', body: data }),
        { maxAttempts: 3 }
      );
    }
  }

  return await response.json();
}
```
### 2. Missing Idempotency Keys

Wrong:
```typescript
// DON'T: Retry without idempotency
async function appendFact(fact) {
  return await withRetry(
    () => durableObject.appendFact(fact), // No idempotency key
    { maxAttempts: 3 }
  );
  // Can create duplicate Facts on timeout
}
```

Right:
```typescript
// DO: Always use idempotency keys
async function appendFact(fact) {
  const idempotencyKey = generateIdempotencyKey(fact);

  return await withRetry(
    () => durableObject.appendFact(fact, idempotencyKey),
    { maxAttempts: 3 }
  );
}
```
### 3. Ignoring Config Version

Wrong:
```typescript
// DON'T: Fetch latest config on retry
async function processWithConfig(entity) {
  return await withRetry(async () => {
    const config = await getLatestConfig(entity.id); // Might change between retries
    return await process(entity, config);
  });
}
```

Right:
```typescript
// DO: Lock config version at operation start
async function processWithConfig(entity) {
  const config = await getLatestConfig(entity.id);
  const configVersion = config.version; // Lock version

  return await withRetry(async () => {
    // Always use original version
    const lockedConfig = await getConfigVersion(config.id, configVersion);
    return await process(entity, lockedConfig);
  });
}
```
### 4. Silent Failures

Wrong:
```typescript
// DON'T: Swallow errors without recording
async function callExternalAPI() {
  try {
    return await externalClient.call();
  } catch (error) {
    console.error('API call failed:', error); // Only logged
    return null; // Silent failure
  }
}
```

Right:
```typescript
// DO: Record errors as Facts (Principle 8)
async function callExternalAPI() {
  try {
    return await externalClient.call();
  } catch (error) {
    // Record error Fact
    await appendFact({
      type: 'error',
      subtype: 'external',
      timestamp: Date.now(),
      data: {
        error_code: error.code,
        error_message: error.message,
        error_type: determineErrorType(error)
      }
    });

    throw error; // Don't swallow
  }
}
```
### 5. Infinite Retries

Wrong:
```typescript
// DON'T: Retry forever
async function sendWebhook(url, data) {
  while (true) {
    try {
      return await fetch(url, { method: 'POST', body: data });
    } catch (error) {
      await sleep(1000); // Retry forever
    }
  }
}
```

Right:
```typescript
// DO: Set max attempts, use DLQ
async function sendWebhook(url, data) {
  try {
    return await withRetry(
      () => fetch(url, { method: 'POST', body: data }),
      { maxAttempts: 5 }
    );
  } catch (error) {
    // Send to DLQ after max retries
    await sendToDLQ({ url, data, error });
    throw error;
  }
}
```
## Summary

| Concept | Implementation |
|---|---|
| Error Taxonomy | Transient (retry), Client (don’t retry), Server (maybe), External (retry with backoff) |
| Idempotency | Unique keys per operation, cached results, 24-hour TTL |
| Retry Strategy | Exponential backoff with jitter, max attempts, context-aware policies |
| Circuit Breaker | CLOSED/OPEN/HALF-OPEN states, failure thresholds, timeout recovery |
| Dead Letter Queue | Max retries exceeded → DLQ → manual intervention |
| Failure Facts | Errors are Facts (Principle 8), economic impact tracked (Principle 1) |
| Config Versioning | Lock to original version on retries (Principle 6) |
| Observability | Metrics, alerts, tracing, failure pattern analysis |
Error handling in z0 respects the principles: Facts Are Immutable (idempotency prevents duplicates), Errors Are First-Class (tracked as Facts), Economics Must Close (costs recorded even on failure), and Configs Are Versioned (retries use the original version). The system fails gracefully, recovers automatically where possible, and never loses data.