Error Recovery Patterns - Virtuous API Docs

In production, errors are not exceptional; they’re expected. Networks fail. APIs return 5xx. Rate limits get hit. Validation errors surface from edge-case data. The question isn’t whether errors will happen but how your integration handles them when they do. A well-designed error-recovery system distinguishes transient from permanent failures, retries the right things at the right times, isolates failures so one bad record doesn’t poison the queue, and preserves enough of an audit trail to debug after the fact. This page covers the recovery patterns: error classification, retry strategies, circuit breakers, dead-letter queue management, and the Volunteer-specific concerns (no idempotency keys, limited transactional semantics) that shape the right approach.

Principle 1: classify before you react

Not all errors are equal. Before deciding what to do about one, classify it:

Class	Examples	Treatment
Transient	`503 Service Unavailable`, timeouts, network errors	Retry with backoff
Rate-limited	`429 Too Many Requests`	Back off; respect `Retry-After` header
Authentication	`401 Unauthorized`	Stop; surface to operator; don’t retry
Authorization	`403 Forbidden`	Stop; the token lacks the needed scope
Not found	`404 Not Found`	Often legitimate; retry after a delay only where eventual consistency is plausible
Validation	`422 Unprocessable Entity`	Don’t retry; data is bad
Permanent server error	`500 Internal Server Error` (sustained)	Retry a few times; then surface
Conflict	`409 Conflict`	Resource was modified concurrently; fetch + retry

The wrong classification leads to the wrong action. Retrying a 422 indefinitely just wastes API budget; not retrying a 503 produces unnecessary sync gaps.

A classification helper

JavaScript

function classifyError(response, error) {
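  // Node's built-in fetch wraps network failures in a TypeError and puts the
  // system error on `cause`; other HTTP clients set `code` on the error itself.
  const code = error?.code ?? error?.cause?.code;
  if (code === 'ETIMEDOUT' || code === 'ECONNRESET') {
    return { class: 'transient_network', retryable: true };
  }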

  if (!response) {
    return { class: 'unknown', retryable: false };
  }

  switch (response.status) {
    case 401:
      return { class: 'auth', retryable: false, surface: true };
    case 403:
      return { class: 'authorization', retryable: false, surface: true };
    case 404:
      return { class: 'not_found', retryable: false };
    case 409:
      return { class: 'conflict', retryable: true, maxRetries: 3 };
    case 422:
      return { class: 'validation', retryable: false, surface: true };
    case 429:
      return { class: 'rate_limited', retryable: true, useRetryAfter: true };
    case 500:
    case 502:
    case 503:
    case 504:
      return { class: 'server_error', retryable: true, maxRetries: 5 };
    default:
      return { class: 'unknown', retryable: false };
  }
}

The classifier is the foundation of every retry / surface decision downstream.

Principle 2: retry with exponential backoff + jitter

For retryable errors, don’t retry immediately. Wait, retry, wait longer, retry again. The pattern:

JavaScript

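// Small helper: resolve after `ms` milliseconds
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function fetchWithRetry(url, options, maxRetries = 5) {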
  let lastError;
  let lastResponse;

  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      const response = await fetch(url, options);

      const classification = classifyError(response);

      if (response.ok || !classification.retryable) {
        return response; // success or permanent failure
      }

      // Compute backoff
      let backoffMs;
      if (classification.useRetryAfter) {
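        const retryAfter = response.headers.get('Retry-After');
        const seconds = parseInt(retryAfter, 10);
        // Fall back to 60s when the header is missing or is not a number of seconds
        backoffMs = Number.isNaN(seconds) ? 60000 : seconds * 1000;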
      } else {
        // Exponential backoff with jitter
        const base = Math.pow(2, attempt) * 1000; // 1s, 2s, 4s, 8s, 16s
        const jitter = Math.random() * 0.3 * base; // adds 0-30% on top of base
        backoffMs = base + jitter;
      }

      if (attempt < maxRetries) {
        await sleep(backoffMs);
      } else {
        lastResponse = response;
      }
    } catch (err) {
      lastError = err;

      const classification = classifyError(null, err);
      if (!classification.retryable || attempt === maxRetries) {
        throw err;
      }

      const backoffMs = Math.pow(2, attempt) * 1000 + Math.random() * 1000;
      await sleep(backoffMs);
    }
  }

  if (lastResponse) return lastResponse;
  throw lastError ?? new Error('Retry exhausted');
}

Why exponential

Linear backoff (1s, 2s, 3s) is too aggressive on persistent failures. Exponential (1s, 2s, 4s, 8s, 16s) gives the system time to recover between attempts while not waiting forever on transient blips.
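
To make the difference concrete, the sketch below computes both delay schedules. The names `exponentialSchedule` and `linearSchedule` are illustrative and not part of any VOMO SDK; the 1-second base matches the example above.

```javascript
// Illustrative helpers (hypothetical names): compute the wait before each retry.
function exponentialSchedule(maxRetries, baseMs = 1000) {
  // attempt 0 -> 1s, attempt 1 -> 2s, attempt 2 -> 4s, ...
  return Array.from({ length: maxRetries }, (_, attempt) => 2 ** attempt * baseMs);
}

function linearSchedule(maxRetries, stepMs = 1000) {
  // attempt 0 -> 1s, attempt 1 -> 2s, attempt 2 -> 3s, ...
  return Array.from({ length: maxRetries }, (_, attempt) => (attempt + 1) * stepMs);
}

const total = (delays) => delays.reduce((sum, ms) => sum + ms, 0);

const exponential = exponentialSchedule(5); // [1000, 2000, 4000, 8000, 16000]
const linear = linearSchedule(5);           // [1000, 2000, 3000, 4000, 5000]

console.log(total(exponential)); // 31000
console.log(total(linear));      // 15000
```

With five retries, the exponential schedule spreads the same number of requests over roughly twice the time, which is the breathing room a struggling API needs.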

Why jitter

Without jitter, multiple workers experiencing the same failure all retry at exactly the same intervals — producing a thundering herd that overwhelms the recovering API. Random jitter spreads out the retry attempts.
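
A minimal sketch of a jittered delay, with the random source injected so the behavior is easy to test. `jitteredDelay` and its options are hypothetical names; the 30% spread mirrors `fetchWithRetry` above, which only ever adds jitter on top of the base delay.

```javascript
// Hypothetical helper: exponential delay plus up to `spread` extra, chosen at random.
// `random` defaults to Math.random and can be replaced in tests.
function jitteredDelay(attempt, { baseMs = 1000, spread = 0.3, random = Math.random } = {}) {
  const base = 2 ** attempt * baseMs;
  const jitter = random() * spread * base; // somewhere in 0 .. spread * base
  return base + jitter;
}

// Three workers that failed at the same moment (attempt 2, base delay 4s)
// draw different random values and so retry at different times:
// approximately 4120 ms, 4600 ms, and 5080 ms.
const delays = [0.1, 0.5, 0.9].map((r) => jitteredDelay(2, { random: () => r }));
console.log(delays.map((ms) => Math.round(ms)));
```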

Why `Retry-After`

For 429, the API explicitly tells you when to retry. Respect it. Don’t compute your own backoff; the API knows when it’ll be ready.
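
The `Retry-After` header can carry either a number of seconds or an HTTP date, so a parser should handle both forms. A minimal sketch: `parseRetryAfter` is a hypothetical helper, the 60-second fallback mirrors `fetchWithRetry` above, and which form VOMO actually sends is not stated on this page.

```javascript
// Hypothetical helper: convert a Retry-After header value into a wait in milliseconds.
function parseRetryAfter(headerValue, nowMs = Date.now(), fallbackMs = 60000) {
  if (headerValue == null || headerValue.trim() === '') return fallbackMs;

  // Form 1: delay in seconds, e.g. "120"
  if (/^\d+$/.test(headerValue.trim())) {
    return parseInt(headerValue, 10) * 1000;
  }

  // Form 2: HTTP date, e.g. "Wed, 21 Oct 2015 07:28:00 GMT"
  const retryAtMs = Date.parse(headerValue);
  if (Number.isNaN(retryAtMs)) return fallbackMs;

  // Never return a negative wait if the date is already in the past
  return Math.max(retryAtMs - nowMs, 0);
}

const now = Date.parse('Wed, 21 Oct 2015 07:27:00 GMT');
console.log(parseRetryAfter('120', now));                           // 120000
console.log(parseRetryAfter('Wed, 21 Oct 2015 07:28:00 GMT', now)); // 60000
console.log(parseRetryAfter('soon', now));                          // 60000 (fallback)
```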

Bounded retries

After max retries, stop and surface. Indefinite retries hide systemic problems.
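
The classifier above attaches a `maxRetries` hint to some classes (3 for conflicts, 5 for server errors). One way to honor it is a small guard that decides whether another attempt is allowed. `shouldRetry` and `defaultMaxRetries` are hypothetical names used only for this sketch.

```javascript
// Hypothetical guard: combine the classifier's verdict with a retry cap.
// `attempt` is zero-based: 0 is the first try, 1 is the first retry, and so on.
function shouldRetry(classification, attempt, defaultMaxRetries = 5) {
  if (!classification.retryable) return false;
  const cap = classification.maxRetries ?? defaultMaxRetries;
  return attempt < cap;
}

const conflict = { class: 'conflict', retryable: true, maxRetries: 3 };
const validation = { class: 'validation', retryable: false, surface: true };

console.log(shouldRetry(conflict, 2));   // true  (third try failed, one retry left)
console.log(shouldRetry(conflict, 3));   // false (cap of 3 retries reached)
console.log(shouldRetry(validation, 0)); // false (never retry bad data)
```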

Principle 3: per-record isolation in batches

When processing a batch (a poll cycle, a backfill page), don’t let one bad record fail the whole batch:

JavaScript

async function processBatch(users, customerId) {
  const successes = [];
  const failures = [];

  for (const user of users) {
    try {
      await processUser(user);
      successes.push(user);
    } catch (err) {
      failures.push({ user, error: err });
    }
  }

  // Continue with what succeeded
  await advanceCheckpoint(customerId, successes);

  // Queue failures for separate handling
  if (failures.length > 0) {
    await deadLetterQueue.publishBatch(failures);
  }

  return { successCount: successes.length, failureCount: failures.length };
}

Per-record try/catch means a single 422 on user #47 doesn’t stop processing of users #48-#150.

Checkpoint advancement with partial failures

This is subtle: if some records in a batch succeed and others fail, advancing the checkpoint past the failed ones means they’re “skipped” forever (the next poll won’t re-see them). Two approaches:

Approach A: advance the checkpoint only up to the first failure:

JavaScript

// Assumes `users` is sorted by updated_at ascending
let checkpoint = currentCheckpoint;
let sawFailure = false;
for (const user of users) {
  try {
    await processUser(user);
    if (!sawFailure) {
      const u = new Date(user.updated_at);
      if (u > checkpoint) checkpoint = u;
    }
  } catch (err) {
    // Freeze the checkpoint so the failed record stays in the next poll's range
    sawFailure = true;
    failures.push({ user, error: err });
  }
}
await setCheckpoint(checkpoint);

This works only if you sort by updated_at ascending. Records processed after the first failure are seen again on the next poll, as are records that share the checkpoint’s updated_at (rare), so processing must be safe to repeat. A record that keeps failing also holds the checkpoint back until it is fixed.

Approach B: advance fully, but capture failures in the DLQ:

JavaScript

let latestSeen = currentCheckpoint;
for (const user of users) {
  try {
    await processUser(user);
  } catch (err) {
    await deadLetterQueue.publish({ user, error: err });
  }
  const u = new Date(user.updated_at);
  if (u > latestSeen) latestSeen = u;
}
await setCheckpoint(latestSeen);

// Periodically process the DLQ for retry

Approach B is more common. The DLQ becomes the authoritative source of “things that failed and need re-processing.”

Principle 4: dead-letter queue management

Failures go to a DLQ. But the DLQ isn’t self-cleaning — it needs its own processing.

DLQ schema

SQL

CREATE TABLE dead_letter_queue (
  id              SERIAL PRIMARY KEY,
  customer_id     VARCHAR NOT NULL,
  resource_type   VARCHAR NOT NULL,  -- 'user', 'project', etc.
  record_id       VARCHAR NOT NULL,
  payload         JSONB NOT NULL,
  error_class     VARCHAR,
  error_message   TEXT,
  attempt_count   INTEGER DEFAULT 1,
  first_failed_at TIMESTAMP DEFAULT NOW(),
  last_attempt_at TIMESTAMP DEFAULT NOW(),
  next_retry_at   TIMESTAMP DEFAULT NOW(),
  resolved_at     TIMESTAMP,
  requires_manual_review BOOLEAN NOT NULL DEFAULT false,  -- set once automatic retries are exhausted

  -- For investigation
  trace_id        VARCHAR
);

CREATE INDEX dlq_next_retry ON dead_letter_queue (next_retry_at)
  WHERE resolved_at IS NULL;
CREATE INDEX dlq_by_customer ON dead_letter_queue (customer_id, first_failed_at DESC)
  WHERE resolved_at IS NULL;

DLQ processor

JavaScript

async function processDeadLetterQueue() {
  const now = new Date();
  const items = await db.query(`
    SELECT * FROM dead_letter_queue
    WHERE resolved_at IS NULL
      AND next_retry_at <= $1
      AND attempt_count < 10
    ORDER BY next_retry_at
    LIMIT 100
  `, [now]);

  for (const item of items.rows) {
    try {
      // Re-process the record
      await reprocessFromDeadLetter(item);

      // Success — mark resolved
      await db.query(`
        UPDATE dead_letter_queue SET resolved_at = NOW() WHERE id = $1
      `, [item.id]);
    } catch (err) {
      // Still failing — schedule another retry with exponential backoff
      const nextDelay = Math.min(
        Math.pow(2, item.attempt_count) * 60 * 1000, // exponential
        24 * 60 * 60 * 1000 // cap at 24 hours
      );

      // Pass the delay as a bound parameter instead of interpolating it into the SQL
      await db.query(`
        UPDATE dead_letter_queue
        SET attempt_count = attempt_count + 1,
            last_attempt_at = NOW(),
            next_retry_at = NOW() + ($1::double precision * INTERVAL '1 millisecond'),
            error_message = $2
        WHERE id = $3
      `, [nextDelay, err.message, item.id]);
    }
  }
}

Run this on a schedule (every 5 minutes or so). The DLQ self-heals over time as transient errors clear up.
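
To see what that backoff means in practice, the sketch below reproduces the processor’s delay formula and prints the resulting schedule. `dlqDelayMs` is a hypothetical helper name; the formula, the 24-hour cap, and the limit of 10 attempts are taken from the code above.

```javascript
// Same formula as the processor above: 2^attempt_count minutes, capped at 24 hours.
function dlqDelayMs(attemptCount) {
  const exponentialMs = 2 ** attemptCount * 60 * 1000;
  const capMs = 24 * 60 * 60 * 1000;
  return Math.min(exponentialMs, capMs);
}

// Rows start at attempt_count = 1 and are no longer selected once it reaches 10,
// so the waits that actually separate two attempts use attempt_count 1 through 8.
let totalMinutes = 0;
for (let attemptCount = 1; attemptCount <= 8; attemptCount++) {
  const minutes = dlqDelayMs(attemptCount) / 60000;
  totalMinutes += minutes;
  console.log(`attempt_count ${attemptCount}: next retry in ${minutes} min`);
}
console.log(`total: ${totalMinutes} min`); // total: 510 min
```

Under these constants the processor’s nine attempts span about 8.5 hours, and the 24-hour cap would only take effect from attempt_count 11 onward. If retries should stretch over days, raise the base delay or the attempt limit.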

Permanent failures

After 10 retries spread over hours/days, give up. Mark for manual review:

JavaScript

async function escalatePermanentFailures() {
  await db.query(`
    UPDATE dead_letter_queue
    SET requires_manual_review = true
    WHERE resolved_at IS NULL
      AND attempt_count >= 10
      AND requires_manual_review = false
  `);

  // Alert ops
  const counts = await db.query(`
    SELECT customer_id, error_class, COUNT(*)
    FROM dead_letter_queue
    WHERE requires_manual_review = true
      AND resolved_at IS NULL
    GROUP BY customer_id, error_class
  `);

  for (const row of counts.rows) {
    await alertOps({
      severity: 'medium',
      customerId: row.customer_id,
      type: 'dlq_permanent_failures',
      errorClass: row.error_class,
      count: row.count,
    });
  }
}

Operators investigate; usually the resolution is to fix the data, fix the integration code, or accept the loss.

Principle 5: circuit breakers for cascade prevention

When the API or a downstream system is broken, retrying makes things worse — wasting requests on a system that can’t respond. A circuit breaker detects sustained failure and pauses operations:

JavaScript

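// Thrown when a call is rejected because the breaker is open
class CircuitOpenError extends Error {}

class CircuitBreaker {
  constructor({ threshold = 5, timeoutMs = 60000 } = {}) {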
    this.threshold = threshold;
    this.timeoutMs = timeoutMs;
    this.state = 'closed'; // 'closed' | 'open' | 'half_open'
    this.failureCount = 0;
    this.openedAt = null;
  }

  async execute(fn) {
    if (this.state === 'open') {
      const sinceOpen = Date.now() - this.openedAt;
      if (sinceOpen < this.timeoutMs) {
        throw new CircuitOpenError('Circuit is open');
      }
      this.state = 'half_open';
    }

    try {
      const result = await fn();
      this._onSuccess();
      return result;
    } catch (err) {
      this._onFailure();
      throw err;
    }
  }

  _onSuccess() {
    this.failureCount = 0;
    if (this.state === 'half_open') {
      this.state = 'closed';
    }
  }

  _onFailure() {
    this.failureCount++;
    if (this.failureCount >= this.threshold) {
      this.state = 'open';
      this.openedAt = Date.now();
    }
  }
}

const vomoBreaker = new CircuitBreaker({ threshold: 5, timeoutMs: 60000 });

async function callVomo(fn) {
  return vomoBreaker.execute(fn);
}

How it works

State	Behavior
Closed (normal)	All requests pass through; failures increment count
Open (broken)	All requests immediately fail without hitting the API; wait `timeoutMs`
Half-open (recovering)	After timeout, allow one request through; if it succeeds, close; if it fails, re-open

The pattern prevents thundering-herd retries against a broken system. When VOMO is having issues, the integration pauses, then probes carefully, then resumes.
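
The table says half-open lets one request through. With concurrent callers that needs an explicit guard, because several calls can arrive while the probe is still in flight. The sketch below is one way to add it, with an injectable clock so the transitions can be tested; `ProbingBreaker`, `probeInFlight`, and `now` are hypothetical names and not part of the class above.

```javascript
// Hypothetical variant of the breaker above that admits a single probe in half-open.
class ProbingBreaker {
  constructor({ threshold = 5, timeoutMs = 60000, now = Date.now } = {}) {
    this.threshold = threshold;
    this.timeoutMs = timeoutMs;
    this.now = now; // injectable clock, for tests
    this.state = 'closed';
    this.failureCount = 0;
    this.openedAt = null;
    this.probeInFlight = false;
  }

  async execute(fn) {
    if (this.state === 'open') {
      if (this.now() - this.openedAt < this.timeoutMs) {
        throw new Error('Circuit is open');
      }
      this.state = 'half_open';
    }

    const isProbe = this.state === 'half_open';
    if (isProbe) {
      // Only one caller may test the waters; the rest fail fast
      if (this.probeInFlight) throw new Error('Probe already in flight');
      this.probeInFlight = true;
    }

    try {
      const result = await fn();
      this.failureCount = 0;
      if (isProbe) this.state = 'closed';
      return result;
    } catch (err) {
      this.failureCount++;
      // A failed probe re-opens immediately; otherwise wait for the threshold
      if (isProbe || this.failureCount >= this.threshold) {
        this.state = 'open';
        this.openedAt = this.now();
      }
      throw err;
    } finally {
      if (isProbe) this.probeInFlight = false;
    }
  }
}
```

The trade-off is that callers arriving during the probe are rejected even though the API may already be healthy; they succeed on their next attempt once the probe closes the circuit.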

Per-customer vs. global breakers

For multi-tenant integrations:

Global breaker on the VOMO API: opens when VOMO itself is down — affects all customers
Per-customer breaker on the external destination: opens when a specific customer’s destination is failing — isolates the impact

JavaScript

class PerCustomerBreaker {
  constructor() {
    this.breakers = new Map();
  }

  getBreaker(customerId) {
    if (!this.breakers.has(customerId)) {
      this.breakers.set(customerId, new CircuitBreaker({ threshold: 5, timeoutMs: 60000 }));
    }
    return this.breakers.get(customerId);
  }
}

A customer with a broken Salesforce destination doesn’t block another customer with a working HubSpot.

Principle 6: idempotency in the absence of API support

VOMO’s API doesn’t support idempotency keys (no Idempotency-Key header). This means:

A retried POST /users with the same payload upserts (safe; the email match handles idempotency)
A retried POST /groups creates a new Group (not safe — produces duplicates)
A retried DELETE /groups/{id} is safe (already deleted = no-op)

For the unsafe cases, your integration must provide idempotency at the partner-side state level:

JavaScript

async function createGroupIdempotent(customerId, specId, groupData, token) {
  // Check if we've already created this Group from this spec
  const existing = await db.getGroupForSpec(customerId, specId);
  if (existing) {
    // Already created — return the existing ID
    return existing;
  }

  // Create
  const response = await fetch('https://api.vomo.org/v1/groups', {
    method: 'POST',
    headers: { Authorization: `Bearer ${token}`, 'Content-Type': 'application/json' },
    body: JSON.stringify(groupData),
  });

  if (!response.ok) throw new Error(`Create failed: ${response.status}`);
  const created = await response.json();

  // Persist the mapping BEFORE returning
  await db.recordGroupForSpec(customerId, specId, created.data.id);

  return created.data;
}

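The pattern: check for a recorded result before calling the API, and persist the result as soon as the call succeeds. Any retry that happens after the mapping is recorded finds it and skips the create. It does not cover the case where the call succeeded but the response was lost before anything could be recorded.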

When this isn’t enough

For operations that can fail between “API succeeded” and “we recorded the result,” you have a brief window where:

The API created the Group
The network response was lost
The retry creates a new Group

Without idempotency keys on the API, this race is fundamental. Mitigations:

After failures, search for the just-created resource before retrying. For Groups, list recently-created Groups; if one matches your intent, use it instead of creating a new one
For non-Group resources (Users via upsert, Project via PUT), the upsert/PUT semantics are naturally idempotent: the worst case is “second attempt finds it already in the desired state”

For VOMO, this is mostly an issue with Group creates. Most other operations are naturally idempotent.
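
The search-before-retry mitigation can be sketched as a pure matching step that runs before any retried create. How recent Groups are listed is left to the caller. The `name` and `created_at` fields, the `findExistingGroup` helper, the sample IDs, and the five-minute window are all assumptions made for illustration, so check the actual Group response shape before relying on them.

```javascript
// Hypothetical helper: given Groups already fetched from the API, find one that
// looks like the result of an earlier, unconfirmed create attempt.
function findExistingGroup(recentGroups, intent, { attemptedAtMs, windowMs = 5 * 60 * 1000 }) {
  const match = recentGroups.find((group) => {
    const createdAtMs = Date.parse(group.created_at);
    const sameName = group.name === intent.name;
    // Symmetric window tolerates some clock skew between client and server
    const inWindow = Math.abs(createdAtMs - attemptedAtMs) <= windowMs;
    return sameName && inWindow;
  });
  return match ?? null;
}

// Sample data (made up for this sketch)
const attemptedAtMs = Date.parse('2024-05-01T12:00:00Z');
const recentGroups = [
  { id: 'g1', name: 'Food Drive Team', created_at: '2024-04-20T09:00:00Z' },
  { id: 'g2', name: 'Food Drive Team', created_at: '2024-05-01T12:00:02Z' },
];

const existing = findExistingGroup(recentGroups, { name: 'Food Drive Team' }, { attemptedAtMs });
console.log(existing ? existing.id : 'no match'); // g2
```

If a match is found, record its ID against the spec and skip the create. Name matching alone can produce false positives when Group names are reused, which is why the time window is part of the check.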

Principle 7: graceful degradation

When some part of the integration is broken, what continues working? Design for partial functionality:

Tiered functionality

JavaScript

async function getCustomerDashboard(customerId) {
  const sections = await Promise.allSettled([
    getRecentParticipations(customerId),
    getVolunteerCount(customerId),
    getProjectSummary(customerId),
    getCertificateStatus(customerId),
  ]);

  return {
    participations: sections[0].status === 'fulfilled' ? sections[0].value : null,
    volunteerCount: sections[1].status === 'fulfilled' ? sections[1].value : null,
    projectSummary: sections[2].status === 'fulfilled' ? sections[2].value : null,
    certificates: sections[3].status === 'fulfilled' ? sections[3].value : null,
    // Each section can fail independently; dashboard shows what's available
  };
}

Using Promise.allSettled (vs. Promise.all) means one failed fetch doesn’t fail the whole dashboard. The UI shows “data temporarily unavailable” for the broken section, but the rest is visible.

Cache fallback

JavaScript

async function getResilient(key, fetchFn, cache) {
  try {
    const fresh = await fetchFn();
    await cache.set(key, fresh);
    return fresh;
  } catch (err) {
    const stale = await cache.get(key, { allowStale: true });
    if (stale) {
      return { ...stale, isStale: true };
    }
    throw err;
  }
}

When the fresh fetch fails, fall back to cached data, even if it’s expired. Mark it as stale so the UI can show “data is from N minutes ago.” This is essential for customer-facing UIs where “the page is completely broken” is worse than “the data is slightly out of date.”
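
`getResilient` assumes a cache whose `get` accepts an `allowStale` option. Not every cache library offers one, so here is a minimal in-memory sketch of that contract. `StaleTolerantCache`, `ttlMs`, and the injectable `now` are hypothetical names; a production cache would also need eviction.

```javascript
// Hypothetical in-memory cache that keeps expired entries around for fallback reads.
class StaleTolerantCache {
  constructor({ ttlMs = 5 * 60 * 1000, now = Date.now } = {}) {
    this.ttlMs = ttlMs;
    this.now = now; // injectable clock, for tests
    this.entries = new Map();
  }

  async set(key, value) {
    this.entries.set(key, { value, storedAt: this.now() });
  }

  // Returns the value if fresh; returns an expired value only when allowStale is set.
  async get(key, { allowStale = false } = {}) {
    const entry = this.entries.get(key);
    if (!entry) return undefined;
    const expired = this.now() - entry.storedAt > this.ttlMs;
    if (expired && !allowStale) return undefined;
    return entry.value;
  }
}
```

Expired entries are deliberately not deleted on read, since the fallback path is the one that needs them.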

Principle 8: comprehensive error logging

When something breaks, the team needs to debug it. Structured error logging is the foundation:

JavaScript

async function processWithLogging(customerId, user, traceId) {
  const context = { customerId, userId: user.id, traceId };

  try {
    logger.info('Processing user', context);
    await processUser(user);
    logger.info('User processed successfully', context);
  } catch (err) {
    logger.error('User processing failed', {
      ...context,
      error: {
        message: err.message,
        stack: err.stack,
        class: err.constructor.name,
        status: err.response?.status,
        body: err.response?.body,
      },
    });
    throw err;
  }
}

The structured log includes:

The trace ID (correlates across requests)
The specific record being processed
The full error including HTTP context (status, body)
The error class for grouping

In an aggregator (DataDog, ELK, etc.), you can filter for “all failures with status 422 for customer X” and surface patterns.

Principle 9: alert on patterns, not individual failures

A single failure isn’t alarming. A pattern of failures is. Alert thresholds should reflect this:

Pattern	Alert
One 5xx in an hour	Don’t alert (probably transient)
5+ 5xx in 5 minutes	Alert (sustained issue)
Same customer has many failures, others normal	Alert (customer-specific issue)
Same error class concentrates after a deployment	Alert (likely regression)
DLQ growth rate accelerates	Alert (something systemic)
Polling lag exceeds 2x normal cadence	Alert (worker stalled?)

Build alerts on patterns. Suppress noise from one-off transients. The team should only be paged when intervention is genuinely needed.
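
The “5+ 5xx in 5 minutes” rule from the table amounts to a sliding-window counter. A minimal in-memory sketch follows; `SlidingWindowCounter` is a hypothetical name, and in many deployments the metrics or logging platform would evaluate this rule rather than application code.

```javascript
// Hypothetical counter: alert when `threshold` failures land inside `windowMs`.
class SlidingWindowCounter {
  constructor({ windowMs = 5 * 60 * 1000, threshold = 5 } = {}) {
    this.windowMs = windowMs;
    this.threshold = threshold;
    this.timestamps = [];
  }

  // Record one failure at `nowMs`; returns true when the window holds
  // at least `threshold` failures.
  record(nowMs = Date.now()) {
    this.timestamps.push(nowMs);
    const cutoff = nowMs - this.windowMs;
    this.timestamps = this.timestamps.filter((t) => t > cutoff);
    return this.timestamps.length >= this.threshold;
  }
}

const minute = 60 * 1000;

// Five failures spread over an hour never trip the alert
const sparse = new SlidingWindowCounter();
console.log([0, 12, 24, 36, 48].map((m) => sparse.record(m * minute)));
// [ false, false, false, false, false ]

// Five failures inside five minutes trip it on the fifth
const burst = new SlidingWindowCounter();
console.log([0, 1, 2, 3, 4].map((m) => burst.record(m * minute)));
// [ false, false, false, false, true ]
```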

Principle 10: error budget thinking

For long-running integrations, embrace the idea of an error budget: a defined “acceptable” error rate, below which alerts don’t fire.

JavaScript

async function getErrorBudgetStatus(customerId) {
  const past24h = await db.query(`
    SELECT
      COUNT(*) AS total_operations,
      COUNT(*) FILTER (WHERE succeeded = false) AS failed_operations
    FROM sync_audit
    WHERE customer_id = $1
      AND processed_at > NOW() - INTERVAL '24 hours'
  `, [customerId]);

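  // The aggregate comes back as a single row; COUNT(*) may be returned as a
  // string (bigint) depending on the driver, so convert explicitly
  const { total_operations, failed_operations } = past24h.rows[0];
  const errorRate = Number(failed_operations) / Math.max(Number(total_operations), 1);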
  const budget = 0.005; // 0.5% error budget

  return {
    errorRate,
    budget,
    burnedBudget: errorRate / budget,
    status: errorRate < budget ? 'healthy' : 'over_budget',
  };
}

Budget burn	Action
0-100%	No action needed
100-200%	Heightened monitoring; investigation
>200%	Alert; consider pausing the integration

The pattern accepts that some level of failure is normal while triggering action when failure rates exceed expectations.
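
The table maps directly onto a small helper that turns `burnedBudget` into an action. `actionForBurn` and the action labels are hypothetical; the thresholds come from the table above, with exactly 100% treated as over budget to match `getErrorBudgetStatus`.

```javascript
// burnedBudget is errorRate / budget, so 1.0 means exactly 100% of the budget is used.
function actionForBurn(burnedBudget) {
  if (burnedBudget > 2) return 'alert';        // more than 200%
  if (burnedBudget >= 1) return 'investigate'; // 100-200%
  return 'none';                               // below 100%
}

console.log(actionForBurn(0.4)); // none
console.log(actionForBurn(1.5)); // investigate
console.log(actionForBurn(3));   // alert
```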

A reference resilient integration

The patterns combined:

JavaScript

class ResilientVomoIntegration {
  constructor({ customerId, token }) {
    this.customerId = customerId;
    this.token = token;
    this.client = new ThrottledVomoClient({ token, requestsPerSecond: 3 });
    this.breaker = new CircuitBreaker({ threshold: 5, timeoutMs: 60000 });
  }

  async fetchUserSafely(userId, traceId) {
    return this.breaker.execute(async () => {
      const response = await fetchWithRetry(
        `https://api.vomo.org/v1/users/${userId}`,
        { headers: { Authorization: `Bearer ${this.token}` } },
        5 // max retries
      );

      if (!response.ok) {
        const classification = classifyError(response);
        if (classification.surface) {
          await alertOps({
            severity: 'medium',
            customerId: this.customerId,
            type: 'api_failure',
            status: response.status,
            traceId,
          });
        }
        throw new ApiError(response.status, await response.text());
      }

      return response.json();
    });
  }

  async pollWithRecovery() {
    const traceId = generateTraceId();
    try {
      return await this._poll(traceId);
    } catch (err) {
      logger.error('Poll cycle failed', {
        customerId: this.customerId,
        traceId,
        error: err.message,
      });

      // Don't advance checkpoint on systemic failures
      // Next cycle will re-attempt from the same point
      throw err;
    }
  }

  async processRecordWithDlq(user, traceId) {
    try {
      await processUser(user);
    } catch (err) {
      const classification = classifyError(err.response, err);

      if (classification.surface) {
        // Don't retry validation errors etc.; surface immediately
        await deadLetterQueue.publish({
          customerId: this.customerId,
          resourceType: 'user',
          recordId: user.id,
          payload: user,
          errorClass: classification.class,
          errorMessage: err.message,
          traceId,
          requiresReview: true,
        });
      } else {
        // Transient — DLQ for retry
        await deadLetterQueue.publish({
          customerId: this.customerId,
          resourceType: 'user',
          recordId: user.id,
          payload: user,
          errorClass: classification.class,
          errorMessage: err.message,
          traceId,
        });
      }
    }
  }
}

The patterns layered: retry → circuit breaker → DLQ → alerting. Each layer handles failures the previous can’t, and together they produce an integration that runs continuously through transient and persistent issues alike.

Where to go next

Sync Architecture Patterns

The broader architectural patterns these recovery practices fit into.

API Performance Tips

The performance patterns these resilience patterns coexist with.

Data Modeling

The data model that supports the DLQ and audit log patterns.

Change Detection Best Practices

The change-detection reliability practices this builds on.
