
Error Handling & Retry Patterns

Graceful failure, automatic recovery, and data integrity for the z0 platform.

Prerequisites: PRINCIPLES.md (Principle 2: Facts Are Immutable, Principle 8: Errors Are First-Class), PRIMITIVES.md, ERROR-HANDLING.md


Error handling in z0 is not exceptional—it’s expected. Every external call can fail. Every network request can timeout. Every rate limit can be exceeded. The question is not “will this fail?” but “how will this fail safely?”

| Principle | Error Handling Implication |
| --- | --- |
| Principle 2: Facts Are Immutable | Retries must be idempotent; can't "undo" written Facts |
| Principle 8: Errors Are First-Class | Failures are Facts, tracked for economics and debugging |
| Principle 1: Economics Must Close | Failed operations with costs still need tracking |
| Principle 6: Configs Are Versioned | Retries must use original config_version |

Key Insight: The z0 platform distinguishes between retryable errors (transient failures) and terminal errors (won’t succeed on retry). Retrying terminal errors wastes time and money. Not retrying transient errors loses revenue.


| Category | Retry? | Examples | HTTP Status |
| --- | --- | --- | --- |
| Transient | Yes | Network timeout, 503, rate limited | 429, 503, 504 |
| Client Error | No | 400, 404, validation failure | 400, 404, 422 |
| Server Error | Maybe | 500 (depends on idempotency) | 500, 502 |
| External Failure | Yes (with backoff) | Twilio down, CRM timeout | Varies |
| Data Integrity | No | Duplicate ID, constraint violation | 409, 422 |
| Authorization | No | Insufficient permissions, expired token | 401, 403 |
| Budget | No | Budget exhausted (Principle 10) | 402 |
```typescript
function isRetryable(error: APIError): boolean {
  // Transient errors: retry
  if ([429, 503, 504].includes(error.status)) {
    return true;
  }
  // Server errors: retry if idempotent
  if ([500, 502].includes(error.status)) {
    return true; // Assuming idempotency key used
  }
  // Client errors: don't retry
  if (error.status >= 400 && error.status < 500) {
    return false;
  }
  // Network errors: retry
  if (error.code === 'ETIMEDOUT' || error.code === 'ECONNRESET') {
    return true;
  }
  return false;
}
```

Per Principle 8, errors are recorded as Facts:

```typescript
interface ErrorFact {
  type: 'error';
  subtype: 'tool_failure' | 'timeout' | 'validation' | 'external' | 'rate_limit';
  timestamp: number;
  tenant_id: string;
  entity_id: string;
  tool_id?: string;
  data: {
    error_code: string;
    error_message: string;
    error_type: string; // From taxonomy
    retry_attempt: number;
    max_retries: number;
    resolved: boolean;
    resolution_fact_id?: string;
    trace_id?: string;
    // Economics (Principle 1: track costs even on failure)
    cost_incurred?: number;
    currency?: string;
  };
}
```

Example:

```typescript
await appendFact({
  type: 'error',
  subtype: 'external',
  timestamp: Date.now(),
  tenant_id: 'ten_abc123',
  entity_id: 'asset_xyz789',
  tool_id: 'tool_twilio_voice',
  data: {
    error_code: 'ETIMEDOUT',
    error_message: 'Twilio API timeout after 30s',
    error_type: 'transient',
    retry_attempt: 2,
    max_retries: 5,
    resolved: false,
    cost_incurred: 0.005, // Twilio charged us anyway
    currency: 'USD',
    trace_id: 'trace_abc123'
  }
});
```

Facts are immutable (Principle 2). Retrying a Fact append without idempotency creates duplicates:

```text
T1: Client sends Fact → timeout before response
T2: Client retries → Fact appended again
Result: Duplicate Facts, incorrect economics
```

Every write operation must include an idempotency key:

```typescript
interface IdempotencyKey {
  key: string;         // Unique per request
  expires_at: number;  // TTL for storage
  created_at: number;
  result?: any;        // Cached response
  status: 'pending' | 'completed' | 'failed';
}
```

Critical: Idempotency keys MUST be stored in SQLite, not just in-memory. In-memory Maps are lost on DO eviction.

```sql
-- Add to DO schema initialization
CREATE TABLE IF NOT EXISTS idempotency_keys (
  key TEXT PRIMARY KEY,
  status TEXT NOT NULL CHECK(status IN ('pending', 'completed', 'failed')),
  result TEXT, -- JSON blob of cached response
  created_at INTEGER NOT NULL,
  expires_at INTEGER NOT NULL
);
CREATE INDEX IF NOT EXISTS idx_idempotency_expiry
  ON idempotency_keys(expires_at);
```
```typescript
class DurableObjectWithIdempotency {
  // In-memory cache for hot path (backed by SQLite)
  private idempotencyHotCache: Map<string, IdempotencyKey> = new Map();

  async appendFact(fact: Fact, idempotencyKey: string): Promise<FactResult> {
    // 1. Check hot cache first (fast path)
    let existing = this.idempotencyHotCache.get(idempotencyKey);

    // 2. Fall back to SQLite (persistent storage)
    if (!existing) {
      const row = await this.sql.exec(
        `SELECT key, status, result, created_at, expires_at
         FROM idempotency_keys WHERE key = ?`,
        [idempotencyKey]
      ).first();
      if (row) {
        existing = {
          key: row.key,
          status: row.status,
          result: row.result ? JSON.parse(row.result) : undefined,
          created_at: row.created_at,
          expires_at: row.expires_at
        };
        // Populate hot cache
        this.idempotencyHotCache.set(idempotencyKey, existing);
      }
    }

    if (existing) {
      if (existing.status === 'pending') {
        // Request in progress - return 409 Conflict
        throw new IdempotencyConflictError(idempotencyKey);
      }
      // Return cached result (handles retry of completed request)
      return existing.result;
    }

    // 3. Mark as pending (persist to SQLite)
    const now = Date.now();
    const expiresAt = now + 86400000; // 24 hours
    await this.sql.exec(
      `INSERT INTO idempotency_keys (key, status, created_at, expires_at)
       VALUES (?, 'pending', ?, ?)`,
      [idempotencyKey, now, expiresAt]
    );
    this.idempotencyHotCache.set(idempotencyKey, {
      key: idempotencyKey,
      status: 'pending',
      created_at: now,
      expires_at: expiresAt
    });

    try {
      // 4. Append fact (immutable)
      const result = await this.doAppendFact(fact);

      // 5. Record success (persist to SQLite)
      await this.sql.exec(
        `UPDATE idempotency_keys
         SET status = 'completed', result = ?
         WHERE key = ?`,
        [JSON.stringify(result), idempotencyKey]
      );
      this.idempotencyHotCache.set(idempotencyKey, {
        key: idempotencyKey,
        status: 'completed',
        created_at: now,
        expires_at: expiresAt,
        result
      });
      return result;
    } catch (error) {
      // 6. Handle failure
      if (!isRetryable(error)) {
        // Terminal error - cache to prevent retries
        await this.sql.exec(
          `UPDATE idempotency_keys
           SET status = 'failed', result = ?
           WHERE key = ?`,
          [JSON.stringify({ error: error.message }), idempotencyKey]
        );
        this.idempotencyHotCache.set(idempotencyKey, {
          key: idempotencyKey,
          status: 'failed',
          created_at: now,
          expires_at: expiresAt,
          result: { error: error.message }
        });
      } else {
        // Retryable error - remove pending status to allow retry
        await this.sql.exec(
          `DELETE FROM idempotency_keys WHERE key = ?`,
          [idempotencyKey]
        );
        this.idempotencyHotCache.delete(idempotencyKey);
      }
      throw error;
    }
  }

  // Cleanup expired keys (called by reconciliation alarm)
  async cleanupIdempotencyCache(): Promise<void> {
    const now = Date.now();
    // Delete from SQLite
    await this.sql.exec(
      `DELETE FROM idempotency_keys WHERE expires_at < ?`,
      [now]
    );
    // Clear expired from hot cache
    for (const [key, value] of this.idempotencyHotCache.entries()) {
      if (value.expires_at < now) {
        this.idempotencyHotCache.delete(key);
      }
    }
  }
}
```
```typescript
async function createFactWithIdempotency(factData: Fact): Promise<FactResult> {
  // Generate idempotency key (deterministic per logical operation)
  const idempotencyKey = `fact_${factData.type}_${factData.timestamp}_${factData.entity_id}`;
  const response = await fetch('/v1/facts', {
    method: 'POST',
    headers: {
      'X-API-Key': apiKey,
      'Idempotency-Key': idempotencyKey,
      'Content-Type': 'application/json'
    },
    body: JSON.stringify(factData)
  });
  if (response.ok) {
    return await response.json();
  }
  const error = await response.json();
  // If conflict, request in progress - wait and retry with SAME key
  if (response.status === 409 && error.type.includes('idempotency-conflict')) {
    await sleep(2000);
    return createFactWithIdempotency(factData); // Retry with same key
  }
  throw new APIError(error);
}
```
| Operation | Key Format | Example |
| --- | --- | --- |
| Fact append | `fact_{type}_{timestamp}_{entity_id}` | `fact_invocation_1705500000000_ent_abc123` |
| Entity create | `entity_{type}_{identifier}_{timestamp}` | `entity_asset_+15550001_1705500000000` |
| Config update | `config_{id}_{version}_{timestamp}` | `config_cfg_123_5_1705500000000` |
| External API | `external_{tool}_{operation}_{nonce}` | `external_twilio_call_abc123xyz` |

Rules:

  1. Deterministic: Same logical operation always generates same key
  2. Unique: Different operations never collide
  3. Bounded: Include timestamp to enable expiration
  4. Human-readable: Debug-friendly format
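The four rules can be sketched as a small key generator. This is an illustrative helper, not part of the platform API; its name and input shape are assumptions that mirror the Fact-append format in the table above.

```typescript
// Hypothetical helper: builds a Fact-append idempotency key per the rules
// above — deterministic, unique per operation, bounded by timestamp.
interface FactKeyInput {
  type: string;
  timestamp: number;  // bounds the key's lifetime for expiration
  entity_id: string;
}

function factIdempotencyKey(fact: FactKeyInput): string {
  // Same logical operation → same key, so a retry hits the cached result
  return `fact_${fact.type}_${fact.timestamp}_${fact.entity_id}`;
}
```

Because the key is derived only from the operation's identity, a client retrying after a timeout regenerates the identical key, which is what lets the Durable Object deduplicate the write.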

```typescript
interface RetryConfig {
  baseDelayMs: number;        // Starting delay (e.g., 1000)
  maxDelayMs: number;         // Ceiling (e.g., 60000)
  maxAttempts: number;        // Give up after N attempts
  jitterRatio: number;        // Random variance (0-1)
  backoffMultiplier: number;  // Growth rate (typically 2)
}

function calculateBackoff(
  attempt: number,
  config: RetryConfig
): number {
  // Exponential: delay = base * multiplier^attempt
  const exponential = config.baseDelayMs * Math.pow(
    config.backoffMultiplier,
    attempt
  );
  // Cap at max delay
  const capped = Math.min(exponential, config.maxDelayMs);
  // Add jitter: random variance to prevent thundering herd
  const jitter = capped * config.jitterRatio * Math.random();
  return capped + jitter;
}
```

Example progression:

```text
Attempt 1: 1000ms + jitter (0-500ms)   = 1000-1500ms
Attempt 2: 2000ms + jitter (0-1000ms)  = 2000-3000ms
Attempt 3: 4000ms + jitter (0-2000ms)  = 4000-6000ms
Attempt 4: 8000ms + jitter (0-4000ms)  = 8000-12000ms
Attempt 5: 16000ms + jitter (0-8000ms) = 16000-24000ms
```
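The bounds of each range can be verified with a standalone restatement of the backoff formula. The injectable `rand` parameter is an assumption added here for deterministic testing; it is not part of the `calculateBackoff` signature.

```typescript
// Standalone sketch of the backoff math, with an injectable random source
// (an assumption for testability; the real version uses Math.random).
function backoffMs(
  attempt: number,
  base = 1000,
  max = 60000,
  multiplier = 2,
  jitterRatio = 0.5,
  rand: () => number = Math.random
): number {
  const exponential = base * Math.pow(multiplier, attempt);
  const capped = Math.min(exponential, max);
  return capped + capped * jitterRatio * rand();
}

// rand = () => 0 gives the floor of each range: 1000, 2000, 4000, 8000, ...
// rand = () => 1 gives the ceiling: 1500, 3000, 6000, 12000, 24000
```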
```typescript
async function withRetry<T>(
  operation: () => Promise<T>,
  config: RetryConfig,
  context: {
    operationName: string;
    entityId?: string;
    tenantId?: string;
  }
): Promise<T> {
  let lastError: Error | undefined;
  for (let attempt = 0; attempt < config.maxAttempts; attempt++) {
    try {
      // Attempt operation
      const result = await operation();
      // Success - record if this was a retry
      if (attempt > 0) {
        await recordRetrySuccess(context, attempt);
      }
      return result;
    } catch (error) {
      lastError = error;
      // Check if retryable
      if (!isRetryable(error)) {
        await recordTerminalError(context, error, attempt);
        throw error;
      }
      // Check if final attempt
      if (attempt === config.maxAttempts - 1) {
        await recordMaxRetriesExceeded(context, error, attempt);
        throw new MaxRetriesExceededError(
          `${context.operationName} failed after ${config.maxAttempts} attempts`,
          lastError
        );
      }
      // Calculate backoff
      const delayMs = calculateBackoff(attempt, config);
      // Record retry
      await recordRetryAttempt(context, error, attempt, delayMs);
      // Wait before retry
      await sleep(delayMs);
    }
  }
  throw lastError;
}
```
| Operation | Base Delay | Max Delay | Max Attempts | Rationale |
| --- | --- | --- | --- | --- |
| Fact append | 1000ms | 60000ms | 3 | Fast fail, Facts critical |
| Entity create | 1000ms | 60000ms | 3 | Low retry, avoid duplicates |
| Config update | 1000ms | 30000ms | 3 | Fast fail, conflicts likely |
| External API (Twilio) | 2000ms | 120000ms | 5 | Higher tolerance, expensive to lose |
| Webhook delivery | 5000ms | 300000ms | 5 | Eventual consistency OK |
| CRM sync | 10000ms | 600000ms | 10 | Very eventual, high value |

Different contexts need different retry behavior:

```typescript
function getRetryConfig(context: OperationContext): RetryConfig {
  // Real-time operations: fail fast
  if (context.priority === 'realtime') {
    return {
      baseDelayMs: 500,
      maxDelayMs: 5000,
      maxAttempts: 2,
      jitterRatio: 0.5,
      backoffMultiplier: 2
    };
  }
  // Background operations: tolerate more delay
  if (context.priority === 'background') {
    return {
      baseDelayMs: 5000,
      maxDelayMs: 300000,
      maxAttempts: 10,
      jitterRatio: 0.5,
      backoffMultiplier: 2
    };
  }
  // Default: balanced
  return {
    baseDelayMs: 1000,
    maxDelayMs: 60000,
    maxAttempts: 5,
    jitterRatio: 0.5,
    backoffMultiplier: 2
  };
}
```

Circuit breakers prevent cascading failures when external services are down.

```text
CLOSED → Normal operation, requests pass through
  ├─ Too many failures → OPEN
OPEN → Fast-fail all requests, don't call service
  ├─ Timeout elapsed → HALF-OPEN
HALF-OPEN → Allow one test request
  ├─ Success → CLOSED
  └─ Failure → OPEN
```
```typescript
interface CircuitBreakerConfig {
  failureThreshold: number;  // Open after N failures (e.g., 5)
  successThreshold: number;  // Close after N successes (e.g., 2)
  timeout: number;           // Stay open for N ms (e.g., 60000)
  volumeThreshold: number;   // Min requests before opening (e.g., 10)
}

class CircuitBreaker {
  private state: 'CLOSED' | 'OPEN' | 'HALF-OPEN' = 'CLOSED';
  private failureCount = 0;
  private successCount = 0;
  private lastFailureTime = 0;
  private requestCount = 0;

  constructor(
    private name: string,
    private config: CircuitBreakerConfig
  ) {}

  async execute<T>(operation: () => Promise<T>): Promise<T> {
    // Check circuit state
    if (this.state === 'OPEN') {
      // Check if timeout elapsed
      if (Date.now() - this.lastFailureTime > this.config.timeout) {
        this.state = 'HALF-OPEN';
        this.successCount = 0;
      } else {
        throw new CircuitOpenError(
          `Circuit breaker ${this.name} is OPEN`
        );
      }
    }
    try {
      // Execute operation
      const result = await operation();
      // Record success
      this.onSuccess();
      return result;
    } catch (error) {
      // Record failure
      this.onFailure();
      throw error;
    }
  }

  private onSuccess(): void {
    this.requestCount++;
    if (this.state === 'HALF-OPEN') {
      this.successCount++;
      if (this.successCount >= this.config.successThreshold) {
        // Enough successes - close circuit
        const previous = this.state;
        this.state = 'CLOSED';
        this.failureCount = 0;
        this.successCount = 0;
        this.recordStateChange(previous, 'CLOSED', 'threshold_met');
      }
    } else if (this.state === 'CLOSED') {
      // Reset failure count on success
      this.failureCount = 0;
    }
  }

  private onFailure(): void {
    this.requestCount++;
    this.failureCount++;
    this.lastFailureTime = Date.now();
    if (this.state === 'HALF-OPEN') {
      // Test request failed - reopen
      const previous = this.state;
      this.state = 'OPEN';
      this.successCount = 0;
      this.recordStateChange(previous, 'OPEN', 'half_open_failed');
    } else if (this.state === 'CLOSED') {
      // Check if should open
      if (
        this.requestCount >= this.config.volumeThreshold &&
        this.failureCount >= this.config.failureThreshold
      ) {
        const previous = this.state;
        this.state = 'OPEN';
        this.recordStateChange(previous, 'OPEN', 'threshold_exceeded');
      }
    }
  }

  // The previous state is captured at the call site: by the time this runs,
  // this.state already holds the new state, so reading it here would be wrong.
  private async recordStateChange(
    previousState: string,
    newState: string,
    reason: string
  ): Promise<void> {
    // Record circuit breaker state change as Fact
    await appendFact({
      type: 'lifecycle',
      subtype: 'circuit_breaker_state_changed',
      timestamp: Date.now(),
      data: {
        circuit_name: this.name,
        previous_state: previousState,
        new_state: newState,
        reason,
        failure_count: this.failureCount,
        success_count: this.successCount,
        request_count: this.requestCount
      }
    });
  }

  getState(): string {
    return this.state;
  }

  getMetrics() {
    return {
      state: this.state,
      failureCount: this.failureCount,
      successCount: this.successCount,
      requestCount: this.requestCount,
      lastFailureTime: this.lastFailureTime
    };
  }
}
```
```typescript
class CircuitBreakerRegistry {
  private breakers: Map<string, CircuitBreaker> = new Map();

  get(name: string, config?: CircuitBreakerConfig): CircuitBreaker {
    if (!this.breakers.has(name)) {
      const defaultConfig: CircuitBreakerConfig = {
        failureThreshold: 5,
        successThreshold: 2,
        timeout: 60000,
        volumeThreshold: 10
      };
      this.breakers.set(
        name,
        new CircuitBreaker(name, config || defaultConfig)
      );
    }
    return this.breakers.get(name)!;
  }

  getMetrics(): Record<string, any> {
    const metrics: Record<string, any> = {};
    for (const [name, breaker] of this.breakers.entries()) {
      metrics[name] = breaker.getMetrics();
    }
    return metrics;
  }
}

// Global registry
const circuitBreakers = new CircuitBreakerRegistry();
```
```typescript
async function callTwilioAPI(params: TwilioCallParams): Promise<CallResult> {
  const breaker = circuitBreakers.get('twilio_voice', {
    failureThreshold: 5,
    successThreshold: 2,
    timeout: 60000,
    volumeThreshold: 10
  });
  try {
    return await breaker.execute(async () => {
      // Wrapped with retry logic
      return await withRetry(
        () => twilioClient.calls.create(params),
        {
          baseDelayMs: 2000,
          maxDelayMs: 120000,
          maxAttempts: 5,
          jitterRatio: 0.5,
          backoffMultiplier: 2
        },
        {
          operationName: 'twilio_call_create',
          entityId: params.assetId,
          tenantId: params.tenantId
        }
      );
    });
  } catch (error) {
    if (error instanceof CircuitOpenError) {
      // Circuit breaker is open - record as error Fact
      await appendFact({
        type: 'error',
        subtype: 'external',
        timestamp: Date.now(),
        tool_id: 'tool_twilio_voice',
        data: {
          error_code: 'CIRCUIT_OPEN',
          error_message: 'Twilio circuit breaker is open',
          error_type: 'circuit_breaker',
          circuit_state: 'OPEN',
          retry_attempt: 0,
          max_retries: 0,
          resolved: false
        }
      });
      // Use fallback behavior
      return handleTwilioUnavailable(params);
    }
    throw error;
  }
}
```

When all retries fail, messages go to the dead letter queue for manual intervention.

```typescript
interface DeadLetterMessage {
  id: string;
  original_queue: string;
  message: any;
  error: {
    code: string;
    message: string;
    stack?: string;
  };
  attempts: number;
  first_attempt_at: number;
  last_attempt_at: number;
  sent_to_dlq_at: number;
  metadata: {
    tenant_id?: string;
    entity_id?: string;
    trace_id?: string;
  };
}
```
```typescript
async function handleMaxRetriesExceeded(
  message: QueueMessage,
  error: Error,
  attempts: number
): Promise<void> {
  // 1. Record failure as Fact (Principle 8: Errors Are First-Class)
  await appendFact({
    type: 'error',
    subtype: 'max_retries_exceeded',
    timestamp: Date.now(),
    tenant_id: message.tenantId,
    entity_id: message.entityId,
    data: {
      error_code: error.code || 'UNKNOWN',
      error_message: error.message,
      error_type: 'terminal',
      retry_attempt: attempts,
      max_retries: attempts,
      resolved: false,
      queue_name: message.queue,
      message_id: message.id
    }
  });

  // 2. Send to dead letter queue
  const dlqMessage: DeadLetterMessage = {
    id: `dlq_${message.id}`,
    original_queue: message.queue,
    message: message.body,
    error: {
      code: error.code || 'UNKNOWN',
      message: error.message,
      stack: error.stack
    },
    attempts,
    first_attempt_at: message.timestamp,
    last_attempt_at: Date.now(),
    sent_to_dlq_at: Date.now(),
    metadata: {
      tenant_id: message.tenantId,
      entity_id: message.entityId,
      trace_id: message.traceId
    }
  };
  await env.DLQ.send(dlqMessage);

  // 3. Alert operations team
  await alertOps({
    severity: 'warning',
    title: 'Message sent to DLQ',
    message: `Queue ${message.queue} message ${message.id} failed after ${attempts} attempts`,
    metadata: dlqMessage.metadata
  });
}
```
```typescript
export default {
  async queue(batch: MessageBatch<DeadLetterMessage>, env: Env): Promise<void> {
    for (const message of batch.messages) {
      try {
        // Analyze failure pattern
        const pattern = await analyzeFailurePattern(message);
        if (pattern.type === 'transient_resolved') {
          // Service recovered - retry
          await retryFromDLQ(message, env);
        } else if (pattern.type === 'configuration_error') {
          // Alert for manual fix
          await alertForManualIntervention(message, pattern);
        } else if (pattern.type === 'data_corruption') {
          // Archive for analysis
          await archiveCorruptedMessage(message);
        }
        message.ack();
      } catch (error) {
        // DLQ processing failed - log and continue
        console.error('DLQ processing error:', error);
        message.retry();
      }
    }
  }
};
```
```typescript
async function retryFromDLQ(
  dlqMessage: DeadLetterMessage,
  env: Env
): Promise<void> {
  // Record retry decision
  await appendFact({
    type: 'lifecycle',
    subtype: 'dlq_retry',
    timestamp: Date.now(),
    data: {
      dlq_message_id: dlqMessage.id,
      original_queue: dlqMessage.original_queue,
      reason: 'manual_retry'
    }
  });

  // Re-enqueue to original queue
  await env.QUEUES[dlqMessage.original_queue].send(
    dlqMessage.message,
    {
      contentType: 'json',
      headers: {
        'X-Retry-From-DLQ': 'true',
        'X-Original-Message-ID': dlqMessage.id,
        'X-DLQ-Attempts': dlqMessage.attempts.toString()
      }
    }
  );
}
```

Per Principle 1 (Economics Must Close the Loop), failures with economic impact need Facts.

```typescript
async function recordFailureWithCost(
  operation: string,
  error: Error,
  context: {
    tool_id: string;
    asset_id: string;
    tenant_id: string;
    config_id: string;
    config_version: number;
  },
  cost: {
    amount: number;
    currency: string;
  }
): Promise<void> {
  // 1. Record error Fact
  const errorFactId = await appendFact({
    type: 'error',
    subtype: 'tool_failure',
    timestamp: Date.now(),
    tenant_id: context.tenant_id,
    tool_id: context.tool_id,
    asset_id: context.asset_id,
    config_id: context.config_id,
    config_version: context.config_version,
    data: {
      error_code: error.code || 'UNKNOWN',
      error_message: error.message,
      error_type: determineErrorType(error),
      operation,
      resolved: false
    }
  });

  // 2. Record cost Fact (even though operation failed)
  await appendFact({
    type: 'cost',
    subtype: 'tool_usage',
    timestamp: Date.now(),
    tenant_id: context.tenant_id,
    tool_id: context.tool_id,
    asset_id: context.asset_id,
    from_entity: context.tenant_id,
    to_entity: 'vendor_twilio',
    amount: cost.amount,
    currency: cost.currency,
    config_id: context.config_id,
    config_version: context.config_version,
    data: {
      operation,
      failed: true,
      error_fact_id: errorFactId
    }
  });
}
```
```typescript
async function handleTwilioCallFailure(
  callParams: TwilioCallParams,
  error: TwilioError
): Promise<void> {
  // Twilio charges us for failed calls
  const cost = calculateTwilioCost(callParams.duration || 0);
  await recordFailureWithCost(
    'twilio_call_create',
    error,
    {
      tool_id: 'tool_twilio_voice',
      asset_id: callParams.assetId,
      tenant_id: callParams.tenantId,
      config_id: callParams.configId,
      config_version: callParams.configVersion
    },
    {
      amount: cost,
      currency: 'USD'
    }
  );
  // Economic loop still closes - we track the cost even though call failed
}
```

Per Principle 6 (Configs Are Versioned), retries must use the original config_version.

```text
T1: Client sends request using pricing Config v3
T2: Request times out
T3: Pricing Config updated to v4
T4: Retry uses v4 (WRONG)
Result: Inconsistent pricing applied to same operation
```
```typescript
interface RetryContext {
  operation: string;
  idempotencyKey: string;
  config_id: string;
  config_version: number; // Lock to original version
  tenant_id: string;
  entity_id: string;
  created_at: number;
}
```
```typescript
async function executeWithConfigVersion<T>(
  operation: () => Promise<T>,
  context: RetryContext,
  retryConfig: RetryConfig
): Promise<T> {
  return await withRetry(
    async () => {
      // Always use original config_version
      const config = await getConfigVersion(
        context.config_id,
        context.config_version
      );
      // Execute with locked config
      return await operation();
    },
    retryConfig,
    {
      operationName: context.operation,
      entityId: context.entity_id,
      tenantId: context.tenant_id
    }
  );
}
```
```typescript
async function appendFactWithRetry(
  fact: Fact,
  idempotencyKey: string
): Promise<FactResult> {
  // Lock config version at operation start
  const context: RetryContext = {
    operation: 'append_fact',
    idempotencyKey,
    config_id: fact.config_id,
    config_version: fact.config_version, // Lock to this version
    tenant_id: fact.tenant_id,
    entity_id: fact.entity_id,
    created_at: Date.now()
  };
  return await executeWithConfigVersion(
    () => durableObject.appendFact(fact, idempotencyKey),
    context,
    {
      baseDelayMs: 1000,
      maxDelayMs: 60000,
      maxAttempts: 3,
      jitterRatio: 0.5,
      backoffMultiplier: 2
    }
  );
}
```

Track error patterns for operational insight:

```typescript
interface ErrorMetrics {
  // Counters
  errors_total: Counter;
  errors_by_type: Counter;
  errors_by_tool: Counter;
  retries_total: Counter;
  retries_successful: Counter;
  circuit_breaker_opens: Counter;
  dlq_messages: Counter;
  // Histograms
  retry_duration: Histogram;
  retry_attempts: Histogram;
  // Gauges
  circuit_breaker_state: Gauge;
  dlq_depth: Gauge;
}

// Record error
metrics.errors_total.inc();
metrics.errors_by_type.inc({ type: error.type });
metrics.errors_by_tool.inc({ tool_id: error.tool_id });

// Record retry
metrics.retries_total.inc();
metrics.retry_duration.observe(duration);
metrics.retry_attempts.observe(attempts);
if (success) {
  metrics.retries_successful.inc();
}
```

Configure alerts for error patterns:

```typescript
const alertRules = [
  {
    name: 'high_error_rate',
    condition: 'errors_total rate > 0.01', // 1% error rate
    severity: 'warning',
    message: 'Error rate exceeds threshold'
  },
  {
    name: 'circuit_breaker_open',
    condition: 'circuit_breaker_state == OPEN',
    severity: 'critical',
    message: 'Circuit breaker open for {circuit_name}'
  },
  {
    name: 'dlq_growing',
    condition: 'dlq_depth > 100 AND rate(dlq_depth) > 0',
    severity: 'warning',
    message: 'Dead letter queue growing'
  },
  {
    name: 'retry_exhaustion',
    condition: 'rate(dlq_messages) > 0.001', // Messages hitting DLQ
    severity: 'warning',
    message: 'Messages failing all retries'
  }
];
```

Link errors across retries with trace IDs:

```typescript
interface TraceContext {
  trace_id: string;
  span_id: string;
  parent_span_id?: string;
  operation: string;
  start_time: number;
}

async function tracedOperation<T>(
  operation: () => Promise<T>,
  context: TraceContext
): Promise<T> {
  const start = Date.now();
  try {
    const result = await operation();
    // Record successful span
    await recordSpan({
      ...context,
      duration: Date.now() - start,
      status: 'success'
    });
    return result;
  } catch (error) {
    // Record error span
    await recordSpan({
      ...context,
      duration: Date.now() - start,
      status: 'error',
      error: {
        code: error.code,
        message: error.message
      }
    });
    throw error;
  }
}
```
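Spans are linked by sharing a trace_id and chaining parent_span_id across retries. A minimal sketch of building such contexts follows; the factory function and ad-hoc ID format are illustrative assumptions (a production system would likely use W3C Trace Context identifiers), but the field names match the TraceContext interface above.

```typescript
// Illustrative TraceContext factory (hypothetical helper, not platform API).
interface TraceContext {
  trace_id: string;
  span_id: string;
  parent_span_id?: string;
  operation: string;
  start_time: number;
}

function childContext(
  operation: string,
  parent?: Pick<TraceContext, 'trace_id' | 'span_id'>
): TraceContext {
  const id = () => Math.random().toString(16).slice(2, 10);
  return {
    trace_id: parent ? parent.trace_id : `trace_${id()}`, // shared by all retries
    span_id: `span_${id()}`,
    parent_span_id: parent?.span_id, // links this span to its caller
    operation,
    start_time: Date.now(),
  };
}
```

Every retry attempt gets its own span but inherits the root's trace_id, so all attempts for one logical operation group together in the trace viewer.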

Wrong:

```typescript
// DON'T: Retry validation errors
async function createEntity(data) {
  return await withRetry(
    () => fetch('/v1/entities', { method: 'POST', body: data }),
    { maxAttempts: 5 } // Will retry 400 errors
  );
}
```

Right:

```typescript
// DO: Only retry transient errors
async function createEntity(data) {
  const response = await fetch('/v1/entities', {
    method: 'POST',
    body: data
  });
  if (!response.ok) {
    const error = await response.json();
    // Don't retry client errors
    if (response.status >= 400 && response.status < 500) {
      throw new ValidationError(error);
    }
    // Retry server errors
    if (response.status >= 500) {
      return await withRetry(
        () => fetch('/v1/entities', { method: 'POST', body: data }),
        { maxAttempts: 3 }
      );
    }
  }
  return await response.json();
}
```

Wrong:

```typescript
// DON'T: Retry without idempotency
async function appendFact(fact) {
  return await withRetry(
    () => durableObject.appendFact(fact), // No idempotency key
    { maxAttempts: 3 }
  );
  // Can create duplicate Facts on timeout
}
```

Right:

```typescript
// DO: Always use idempotency keys
async function appendFact(fact) {
  const idempotencyKey = generateIdempotencyKey(fact);
  return await withRetry(
    () => durableObject.appendFact(fact, idempotencyKey),
    { maxAttempts: 3 }
  );
}
```

Wrong:

```typescript
// DON'T: Fetch latest config on retry
async function processWithConfig(entity) {
  return await withRetry(async () => {
    const config = await getLatestConfig(entity.id); // Might change between retries
    return await process(entity, config);
  });
}
```

Right:

```typescript
// DO: Lock config version at operation start
async function processWithConfig(entity) {
  const config = await getLatestConfig(entity.id);
  const configVersion = config.version; // Lock version
  return await withRetry(async () => {
    // Always use original version
    const lockedConfig = await getConfigVersion(config.id, configVersion);
    return await process(entity, lockedConfig);
  });
}
```

Wrong:

```typescript
// DON'T: Swallow errors without recording
async function callExternalAPI() {
  try {
    return await externalClient.call();
  } catch (error) {
    console.error('API call failed:', error); // Only logged
    return null; // Silent failure
  }
}
```

Right:

```typescript
// DO: Record errors as Facts (Principle 8)
async function callExternalAPI() {
  try {
    return await externalClient.call();
  } catch (error) {
    // Record error Fact
    await appendFact({
      type: 'error',
      subtype: 'external',
      timestamp: Date.now(),
      data: {
        error_code: error.code,
        error_message: error.message,
        error_type: determineErrorType(error)
      }
    });
    throw error; // Don't swallow
  }
}
```

Wrong:

```typescript
// DON'T: Retry forever
async function sendWebhook(url, data) {
  while (true) {
    try {
      return await fetch(url, { method: 'POST', body: data });
    } catch (error) {
      await sleep(1000); // Retry forever
    }
  }
}
```

Right:

```typescript
// DO: Set max attempts, use DLQ
async function sendWebhook(url, data) {
  try {
    return await withRetry(
      () => fetch(url, { method: 'POST', body: data }),
      { maxAttempts: 5 }
    );
  } catch (error) {
    // Send to DLQ after max retries
    await sendToDLQ({ url, data, error });
    throw error;
  }
}
```

| Concept | Implementation |
| --- | --- |
| Error Taxonomy | Transient (retry), Client (don't retry), Server (maybe), External (retry with backoff) |
| Idempotency | Unique keys per operation, cached results, 24-hour TTL |
| Retry Strategy | Exponential backoff with jitter, max attempts, context-aware policies |
| Circuit Breaker | CLOSED/OPEN/HALF-OPEN states, failure thresholds, timeout recovery |
| Dead Letter Queue | Max retries exceeded → DLQ → manual intervention |
| Failure Facts | Errors are Facts (Principle 8), economic impact tracked (Principle 1) |
| Config Versioning | Lock to original version on retries (Principle 6) |
| Observability | Metrics, alerts, tracing, failure pattern analysis |

Error handling in z0 respects the principles: Facts are immutable (idempotency prevents duplicates), Errors are First-Class (tracked as Facts), Economics Must Close (costs recorded even on failure), and Configs Are Versioned (retries use original version). The system fails gracefully, recovers automatically where possible, and never loses data.