In production, errors are not exceptional; they’re expected. Networks fail. APIs return 5xx. Rate limits get hit. Validation errors surface from edge-case data. The question isn’t whether errors will happen but how your integration handles them when they do. A well-designed error-recovery system distinguishes transient from permanent failures, retries the right things at the right times, isolates failures so one bad record doesn’t poison the queue, and preserves enough of an audit trail to debug after the fact. This page covers the recovery patterns: error classification, retry strategies, circuit breakers, dead-letter queue management, and the VOMO-specific concerns (no idempotency keys, limited transactional semantics) that shape the right approach.

Principle 1: classify before you react

Not all errors are equal. Before deciding what to do about one, classify it:
| Class | Examples | Treatment |
| --- | --- | --- |
| Transient | 503 Service Unavailable, timeouts, network errors | Retry with backoff |
| Rate-limited | 429 Too Many Requests | Back off; respect Retry-After header |
| Authentication | 401 Unauthorized | Stop; surface to operator; don’t retry |
| Authorization | 403 Forbidden | Stop; the token lacks the needed scope |
| Not found | 404 Not Found | Sometimes retry-after-delay (eventual consistency); often legitimate |
| Validation | 422 Unprocessable Entity | Don’t retry; data is bad |
| Permanent server error | 500 Internal Server Error (sustained) | Retry a few times; then surface |
| Conflict | 409 Conflict | Resource was modified concurrently; fetch + retry |
The wrong classification leads to the wrong action. Retrying a 422 indefinitely just wastes API budget; not retrying a 503 produces unnecessary sync gaps.

A classification helper

JavaScript
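function classifyError(response, error) {
  // Node's built-in fetch wraps the low-level network error in `error.cause`,
  // so check both places for the error code
  const code = error?.code ?? error?.cause?.code;
  if (code === 'ETIMEDOUT' || code === 'ECONNRESET') {
    return { class: 'transient_network', retryable: true };
  }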

  if (!response) {
    return { class: 'unknown', retryable: false };
  }

  switch (response.status) {
    case 401:
      return { class: 'auth', retryable: false, surface: true };
    case 403:
      return { class: 'authorization', retryable: false, surface: true };
    case 404:
      return { class: 'not_found', retryable: false };
    case 409:
      return { class: 'conflict', retryable: true, maxRetries: 3 };
    case 422:
      return { class: 'validation', retryable: false, surface: true };
    case 429:
      return { class: 'rate_limited', retryable: true, useRetryAfter: true };
    case 500:
    case 502:
    case 503:
    case 504:
      return { class: 'server_error', retryable: true, maxRetries: 5 };
    default:
      return { class: 'unknown', retryable: false };
  }
}
The classifier is the foundation of every retry / surface decision downstream.

Principle 2: retry with exponential backoff + jitter

For retryable errors, don’t retry immediately. Wait, retry, wait longer, retry again. The pattern:
JavaScript
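// Small helper used by the retry loop below
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function fetchWithRetry(url, options, maxRetries = 5) {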
  let lastError;
  let lastResponse;

  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      const response = await fetch(url, options);

      const classification = classifyError(response);

      if (response.ok || !classification.retryable) {
        return response; // success or permanent failure
      }

      // Compute backoff
      let backoffMs;
      if (classification.useRetryAfter) {
        // Retry-After is usually a number of seconds; it may also be an HTTP date,
        // in which case parseInt yields NaN and we fall back to 60s
        const seconds = Number.parseInt(response.headers.get('Retry-After') ?? '', 10);
        backoffMs = Number.isNaN(seconds) ? 60000 : seconds * 1000;
      } else {
        // Exponential backoff with jitter
        const base = Math.pow(2, attempt) * 1000; // 1s, 2s, 4s, 8s, 16s
        const jitter = Math.random() * 0.3 * base; // adds 0 to 30% on top of base
        backoffMs = base + jitter;
      }

      if (attempt < maxRetries) {
        await sleep(backoffMs);
      } else {
        lastResponse = response;
      }
    } catch (err) {
      lastError = err;

      const classification = classifyError(null, err);
      if (!classification.retryable || attempt === maxRetries) {
        throw err;
      }

      const backoffMs = Math.pow(2, attempt) * 1000 + Math.random() * 1000;
      await sleep(backoffMs);
    }
  }

  if (lastResponse) return lastResponse;
  throw lastError ?? new Error('Retry exhausted');
}

Why exponential

Linear backoff (1s, 2s, 3s) is too aggressive on persistent failures. Exponential (1s, 2s, 4s, 8s, 16s) gives the system time to recover between attempts while not waiting forever on transient blips.

Why jitter

Without jitter, multiple workers experiencing the same failure all retry at exactly the same intervals — producing a thundering herd that overwhelms the recovering API. Random jitter spreads out the retry attempts.
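To make the backoff arithmetic concrete, here is a minimal sketch of a delay calculator. The name `backoffDelayMs`, the `capMs` ceiling, and the injectable `random` parameter are assumptions added for illustration; they are not part of fetchWithRetry above.

```javascript
// Sketch: exponential backoff with a ceiling and additive jitter.
// `random` is injectable so the result is reproducible in tests.
function backoffDelayMs(
  attempt,
  { baseMs = 1000, capMs = 30000, jitterRatio = 0.3, random = Math.random } = {}
) {
  // Double the delay on every attempt, but never exceed the cap
  const exponential = Math.min(baseMs * 2 ** attempt, capMs);
  // Add between 0% and jitterRatio * 100% on top, so workers drift apart
  const jitter = random() * jitterRatio * exponential;
  return Math.round(exponential + jitter);
}

// With jitter disabled, the schedule is the plain exponential series up to the cap
const schedule = [0, 1, 2, 3, 4, 5].map((n) => backoffDelayMs(n, { random: () => 0 }));
console.log(schedule); // [ 1000, 2000, 4000, 8000, 16000, 30000 ]
```

Injecting the random source keeps the function deterministic under test while production callers get `Math.random` by default.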

Why Retry-After

For 429, the API explicitly tells you when to retry. Respect it. Don’t compute your own backoff; the API knows when it’ll be ready.
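The HTTP specification allows Retry-After to carry either a number of seconds or an HTTP date, so a parser should handle both. The sketch below is one way to do that; `parseRetryAfterMs` and its 60-second fallback are illustrative assumptions, and which form VOMO actually sends is not confirmed here.

```javascript
// Sketch: convert a Retry-After header value into a wait time in milliseconds.
// Handles delta-seconds ("120") and HTTP dates ("Wed, 21 Oct 2015 07:28:00 GMT").
function parseRetryAfterMs(headerValue, nowMs = Date.now(), fallbackMs = 60000) {
  if (headerValue == null || headerValue.trim() === '') return fallbackMs;
  const value = headerValue.trim();

  // Form 1: a non-negative integer number of seconds
  if (/^\d+$/.test(value)) return Number(value) * 1000;

  // Form 2: an HTTP date; wait until that moment (never a negative delay)
  const dateMs = Date.parse(value);
  if (Number.isNaN(dateMs)) return fallbackMs;
  return Math.max(0, dateMs - nowMs);
}

console.log(parseRetryAfterMs('120')); // 120000
```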

Bounded retries

After max retries, stop and surface. Indefinite retries hide systemic problems.

Principle 3: per-record isolation in batches

When processing a batch (a poll cycle, a backfill page), don’t let one bad record fail the whole batch:
JavaScript
async function processBatch(users, customerId) {
  const successes = [];
  const failures = [];

  for (const user of users) {
    try {
      await processUser(user);
      successes.push(user);
    } catch (err) {
      failures.push({ user, error: err });
    }
  }

  // Continue with what succeeded
  await advanceCheckpoint(customerId, successes);

  // Queue failures for separate handling
  if (failures.length > 0) {
    await deadLetterQueue.publishBatch(failures);
  }

  return { successCount: successes.length, failureCount: failures.length };
}
Per-record try/catch means a single 422 on user #47 doesn’t stop processing of users #48-#150.

Checkpoint advancement with partial failures

This is subtle: if some records in a batch succeed and others fail, advancing the checkpoint past the failed ones means they’re “skipped” forever (the next poll won’t re-see them). Two approaches:

Approach A: advance the checkpoint only up to the first failure:
JavaScript
let latestSuccessfulUpdate = currentCheckpoint;
let sawFailure = false;
for (const user of users) { // users must be sorted by updated_at ascending
  try {
    await processUser(user);
    const u = new Date(user.updated_at);
    // Stop advancing after the first failure so the failed record
    // stays inside the next poll's range
    if (!sawFailure && u > latestSuccessfulUpdate) latestSuccessfulUpdate = u;
  } catch (err) {
    sawFailure = true;
    failures.push({ user, error: err });
  }
}
await setCheckpoint(latestSuccessfulUpdate);
This works if you sort by updated_at ascending. Records processed after the first failure will be seen again on the next poll, and records with the same updated_at (rare) might be re-processed, so processing has to tolerate repeats. A record that keeps failing also holds the checkpoint back.

Approach B: advance fully, but capture failures in the DLQ:
JavaScript
let latestSeen = currentCheckpoint;
for (const user of users) {
  try {
    await processUser(user);
  } catch (err) {
    await deadLetterQueue.publish({ user, error: err });
  }
  const u = new Date(user.updated_at);
  if (u > latestSeen) latestSeen = u;
}
await setCheckpoint(latestSeen);

// Periodically process the DLQ for retry
Approach B is more common. The DLQ becomes the authoritative source of “things that failed and need re-processing.”

Principle 4: dead-letter queue management

Failures go to a DLQ. But the DLQ isn’t self-cleaning — it needs its own processing.

DLQ schema

CREATE TABLE dead_letter_queue (
  id              SERIAL PRIMARY KEY,
  customer_id     VARCHAR NOT NULL,
  resource_type   VARCHAR NOT NULL,  -- 'user', 'project', etc.
  record_id       VARCHAR NOT NULL,
  payload         JSONB NOT NULL,
  error_class     VARCHAR,
  error_message   TEXT,
  attempt_count   INTEGER DEFAULT 1,
  first_failed_at TIMESTAMP DEFAULT NOW(),
  last_attempt_at TIMESTAMP DEFAULT NOW(),
  next_retry_at   TIMESTAMP DEFAULT NOW(),
  resolved_at     TIMESTAMP,
  requires_manual_review BOOLEAN NOT NULL DEFAULT FALSE,  -- set once retries are exhausted

  -- For investigation
  trace_id        VARCHAR
);

CREATE INDEX dlq_next_retry ON dead_letter_queue (next_retry_at)
  WHERE resolved_at IS NULL;
CREATE INDEX dlq_by_customer ON dead_letter_queue (customer_id, first_failed_at DESC)
  WHERE resolved_at IS NULL;

DLQ processor

JavaScript
async function processDeadLetterQueue() {
  const now = new Date();
  const items = await db.query(`
    SELECT * FROM dead_letter_queue
    WHERE resolved_at IS NULL
      AND next_retry_at <= $1
      AND attempt_count < 10
    ORDER BY next_retry_at
    LIMIT 100
  `, [now]);

  for (const item of items.rows) {
    try {
      // Re-process the record
      await reprocessFromDeadLetter(item);

      // Success — mark resolved
      await db.query(`
        UPDATE dead_letter_queue SET resolved_at = NOW() WHERE id = $1
      `, [item.id]);
    } catch (err) {
      // Still failing — schedule another retry with exponential backoff
      const nextDelay = Math.min(
        Math.pow(2, item.attempt_count) * 60 * 1000, // exponential
        24 * 60 * 60 * 1000 // cap at 24 hours
      );

      await db.query(`
        UPDATE dead_letter_queue
        SET attempt_count = attempt_count + 1,
            last_attempt_at = NOW(),
            next_retry_at = NOW() + interval '${nextDelay} milliseconds',
            error_message = $1
        WHERE id = $2
      `, [err.message, item.id]);
    }
  }
}
Run this on a schedule (every 5 minutes or so). The DLQ self-heals over time as transient errors clear up.
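To see what that retry schedule looks like in practice, the delay calculation can be pulled out and tabulated. This is only a worked sketch of the arithmetic in the processor above; the helper name `dlqRetryDelayMs` is an assumption.

```javascript
const MINUTE_MS = 60 * 1000;
const DAY_MS = 24 * 60 * MINUTE_MS;

// Same formula as the processor: 2^attemptCount minutes, capped at 24 hours
function dlqRetryDelayMs(attemptCount) {
  return Math.min(2 ** attemptCount * MINUTE_MS, DAY_MS);
}

// Prints waits of 2, 4, 8, ... 512 minutes for attempt counts 1 through 9
for (let attempt = 1; attempt < 10; attempt++) {
  console.log(`attempt_count ${attempt}: wait ${dlqRetryDelayMs(attempt) / MINUTE_MS} min`);
}
```

Because the processor only selects rows with attempt_count below 10, the 24-hour cap is never reached with these settings: 2^10 minutes is about 17 hours, and the cap first applies at an attempt count of 11. Raise the attempt limit or lower the cap if you want the ceiling to matter.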

Permanent failures

After 10 retries spread over hours/days, give up. Mark for manual review:
JavaScript
async function escalatePermanentFailures() {
  await db.query(`
    UPDATE dead_letter_queue
    SET requires_manual_review = true
    WHERE resolved_at IS NULL
      AND attempt_count >= 10
      AND requires_manual_review = false
  `);

  // Alert ops
  const counts = await db.query(`
    SELECT customer_id, error_class, COUNT(*)
    FROM dead_letter_queue
    WHERE requires_manual_review = true
      AND resolved_at IS NULL
    GROUP BY customer_id, error_class
  `);

  for (const row of counts.rows) {
    await alertOps({
      severity: 'medium',
      customerId: row.customer_id,
      type: 'dlq_permanent_failures',
      errorClass: row.error_class,
      count: row.count,
    });
  }
}
Operators investigate; the resolution is usually to fix the data, fix the integration code, or accept the loss.

Principle 5: circuit breakers for cascade prevention

When the API or a downstream system is broken, retrying makes things worse — wasting requests on a system that can’t respond. A circuit breaker detects sustained failure and pauses operations:
JavaScript
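// Thrown when a call is rejected because the circuit is open
class CircuitOpenError extends Error {}

class CircuitBreaker {
  constructor({ threshold = 5, timeoutMs = 60000 } = {}) {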
    this.threshold = threshold;
    this.timeoutMs = timeoutMs;
    this.state = 'closed'; // 'closed' | 'open' | 'half_open'
    this.failureCount = 0;
    this.openedAt = null;
  }

  async execute(fn) {
    if (this.state === 'open') {
      const sinceOpen = Date.now() - this.openedAt;
      if (sinceOpen < this.timeoutMs) {
        throw new CircuitOpenError('Circuit is open');
      }
      this.state = 'half_open';
    }

    try {
      const result = await fn();
      this._onSuccess();
      return result;
    } catch (err) {
      this._onFailure();
      throw err;
    }
  }

  _onSuccess() {
    this.failureCount = 0;
    if (this.state === 'half_open') {
      this.state = 'closed';
    }
  }

  _onFailure() {
    this.failureCount++;
    if (this.failureCount >= this.threshold) {
      this.state = 'open';
      this.openedAt = Date.now();
    }
  }
}

const vomoBreaker = new CircuitBreaker({ threshold: 5, timeoutMs: 60000 });

async function callVomo(fn) {
  return vomoBreaker.execute(fn);
}

How it works

| State | Behavior |
| --- | --- |
| Closed (normal) | All requests pass through; failures increment count |
| Open (broken) | All requests immediately fail without hitting the API; wait timeoutMs |
| Half-open (recovering) | After timeout, allow one request through; if it succeeds, close; if it fails, re-open |
The pattern prevents thundering-herd retries against a broken system. When VOMO is having issues, the integration pauses, then probes carefully, then resumes.
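The three states can also be written as a small synchronous state machine, which makes the transitions easy to check without waiting on real timers. The sketch below is an illustrative variant, not the CircuitBreaker class above: the `BreakerState` name, the injectable `now` clock, and the `probeInFlight` flag (which admits only one probe at a time while half-open) are all assumptions.

```javascript
class BreakerState {
  constructor({ threshold = 5, timeoutMs = 60000, now = Date.now } = {}) {
    this.threshold = threshold;
    this.timeoutMs = timeoutMs;
    this.now = now; // injectable clock, so tests can control time
    this.state = 'closed';
    this.failureCount = 0;
    this.openedAt = null;
    this.probeInFlight = false;
  }

  // Returns true if the caller may send a request right now
  allowRequest() {
    if (this.state === 'closed') return true;
    if (this.state === 'open') {
      if (this.now() - this.openedAt < this.timeoutMs) return false;
      this.state = 'half_open'; // timeout elapsed: start probing
      this.probeInFlight = false;
    }
    // half_open: admit a single probe until its outcome is recorded
    if (this.probeInFlight) return false;
    this.probeInFlight = true;
    return true;
  }

  recordSuccess() {
    this.failureCount = 0;
    this.probeInFlight = false;
    this.state = 'closed';
  }

  recordFailure() {
    this.failureCount++;
    this.probeInFlight = false;
    // A failed probe re-opens immediately; otherwise open at the threshold
    if (this.state === 'half_open' || this.failureCount >= this.threshold) {
      this.state = 'open';
      this.openedAt = this.now();
    }
  }
}

let clock = 0;
const demo = new BreakerState({ threshold: 2, timeoutMs: 1000, now: () => clock });
demo.recordFailure();
demo.recordFailure();             // threshold reached: circuit opens
console.log(demo.allowRequest()); // false (still open)
clock = 1000;                     // timeout has elapsed
console.log(demo.allowRequest()); // true  (the single probe)
console.log(demo.allowRequest()); // false (probe still in flight)
demo.recordSuccess();
console.log(demo.state);          // closed
```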

Per-customer vs. global breakers

For multi-tenant integrations:
  • Global breaker on the VOMO API: opens when VOMO itself is down — affects all customers
  • Per-customer breaker on the external destination: opens when a specific customer’s destination is failing — isolates the impact
JavaScript
class PerCustomerBreaker {
  constructor() {
    this.breakers = new Map();
  }

  getBreaker(customerId) {
    if (!this.breakers.has(customerId)) {
      this.breakers.set(customerId, new CircuitBreaker({ threshold: 5, timeoutMs: 60000 }));
    }
    return this.breakers.get(customerId);
  }
}
A customer with a broken Salesforce destination doesn’t block another customer with a working HubSpot.

Principle 6: idempotency in the absence of API support

VOMO’s API doesn’t support idempotency keys (no Idempotency-Key header). This means:
  • A retried POST /users with the same payload upserts (safe; the email match handles idempotency)
  • A retried POST /groups creates a new Group (not safe — produces duplicates)
  • A retried DELETE /groups/{id} is safe (already deleted = no-op)
For the unsafe cases, your integration must provide idempotency at the partner-side state level:
JavaScript
async function createGroupIdempotent(customerId, specId, groupData) {
  // Check if we've already created this Group from this spec
  const existing = await db.getGroupForSpec(customerId, specId);
  if (existing) {
    // Already created — return the existing ID
    return existing;
  }

  // Create
  const response = await fetch('https://api.vomo.org/v1/groups', {
    method: 'POST',
    headers: { Authorization: `Bearer ${token}`, 'Content-Type': 'application/json' },
    body: JSON.stringify(groupData),
  });

  if (!response.ok) throw new Error(`Create failed: ${response.status}`);
  const created = await response.json();

  // Persist the mapping BEFORE returning
  await db.recordGroupForSpec(customerId, specId, created.data.id);

  return created.data;
}
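The pattern: check for a recorded result before calling the API, and persist the mapping as soon as the call succeeds. If a later step fails and the whole operation is retried, the next attempt finds the mapping and skips the create. This does not cover the case where the create succeeded but the response was lost (network failure); that gap is covered next.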

When this isn’t enough

For operations that can fail between “API succeeded” and “we recorded the result,” you have a brief window where:
  • The API created the Group
  • The network response was lost
  • The retry creates a new Group
Without idempotency keys on the API, this race is fundamental. Mitigations:
  • After failures, search for the just-created resource before retrying. For Groups, list recently-created Groups; if one matches your intent, use it instead of creating a new one
  • For non-Group resources (Users via upsert, Project via PUT), the upsert/PUT semantics are naturally idempotent: the worst case is “second attempt finds it already in the desired state”
For VOMO, this is mostly an issue with Group creates. Most other operations are naturally idempotent.
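The "search before retrying" mitigation can be sketched as a pure matching function that runs over whatever list of recent Groups you fetch. Everything about the data shape here is an assumption: the `name` and `created_at` fields, the sample records, and the `findRecentMatch` helper are hypothetical, and how you obtain the list depends on what the VOMO API exposes.

```javascript
// Sketch: given recently created groups, find one that matches what we
// were trying to create, so a retry can reuse it instead of creating a duplicate.
// Assumes each group has `id`, `name`, and an ISO-8601 `created_at` (hypothetical shape).
function findRecentMatch(recentGroups, intent, attemptStartedAt, slackMs = 5000) {
  // Allow a little clock skew between our host and the API
  const earliest = attemptStartedAt.getTime() - slackMs;
  const matches = recentGroups
    .filter((g) => g.name === intent.name && Date.parse(g.created_at) >= earliest)
    .sort((a, b) => Date.parse(a.created_at) - Date.parse(b.created_at));
  // Prefer the oldest match; null means "nothing found, safe to create"
  return matches[0] ?? null;
}

const recent = [
  { id: 'g1', name: 'Food Drive', created_at: '2026-01-10T12:00:00Z' },
  { id: 'g2', name: 'Spring Cleanup', created_at: '2026-01-10T12:00:03Z' },
];
const startedAt = new Date('2026-01-10T12:00:02Z');
console.log(findRecentMatch(recent, { name: 'Spring Cleanup' }, startedAt)?.id); // g2
console.log(findRecentMatch(recent, { name: 'Beach Day' }, startedAt));          // null
```

Matching on name plus a creation-time window is a heuristic. If two callers can legitimately create Groups with the same name at the same time, you would need a stronger correlation field.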

Principle 7: graceful degradation

When some part of the integration is broken, what continues working? Design for partial functionality:

Tiered functionality

JavaScript
async function getCustomerDashboard(customerId) {
  const sections = await Promise.allSettled([
    getRecentParticipations(customerId),
    getVolunteerCount(customerId),
    getProjectSummary(customerId),
    getCertificateStatus(customerId),
  ]);

  return {
    participations: sections[0].status === 'fulfilled' ? sections[0].value : null,
    volunteerCount: sections[1].status === 'fulfilled' ? sections[1].value : null,
    projectSummary: sections[2].status === 'fulfilled' ? sections[2].value : null,
    certificates: sections[3].status === 'fulfilled' ? sections[3].value : null,
    // Each section can fail independently; dashboard shows what's available
  };
}
Using Promise.allSettled (vs. Promise.all) means one failed fetch doesn’t fail the whole dashboard. The UI shows “data temporarily unavailable” for the broken section, but the rest is visible.

Cache fallback

JavaScript
async function getResilient(key, fetchFn, cache) {
  try {
    const fresh = await fetchFn();
    await cache.set(key, fresh);
    return fresh;
  } catch (err) {
    const stale = await cache.get(key, { allowStale: true });
    if (stale) {
      return { ...stale, isStale: true };
    }
    throw err;
  }
}
When the fresh fetch fails, fall back to cached data — even if it’s expired. Mark it as stale so UI can show “data is from N minutes ago.” This is essential for customer-facing UIs where “the page is completely broken” is worse than “the data is slightly out of date.”
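The `cache.get(key, { allowStale: true })` call above assumes a cache that keeps expired entries around instead of evicting them. A minimal in-memory sketch of such a cache follows; `StaleTolerantCache`, its `ttlMs` option, and the `ageMs` helper are illustrative assumptions, and a production system would more likely use Redis or a similar store.

```javascript
class StaleTolerantCache {
  constructor({ ttlMs = 60000, now = Date.now } = {}) {
    this.ttlMs = ttlMs;
    this.now = now; // injectable clock for testing
    this.entries = new Map();
  }

  set(key, value) {
    this.entries.set(key, { value, storedAt: this.now() });
  }

  // Returns the cached value, or undefined on a miss. Expired entries are
  // only returned when the caller opts in with allowStale.
  get(key, { allowStale = false } = {}) {
    const entry = this.entries.get(key);
    if (!entry) return undefined;
    const expired = this.now() - entry.storedAt > this.ttlMs;
    return expired && !allowStale ? undefined : entry.value;
  }

  // Age of the entry in milliseconds, for "data is from N minutes ago" labels
  ageMs(key) {
    const entry = this.entries.get(key);
    return entry ? this.now() - entry.storedAt : undefined;
  }
}

let fakeNow = 0;
const cache = new StaleTolerantCache({ ttlMs: 1000, now: () => fakeNow });
cache.set('dashboard:42', { volunteers: 17 });
fakeNow = 5000; // well past the TTL
console.log(cache.get('dashboard:42'));                       // undefined
console.log(cache.get('dashboard:42', { allowStale: true })); // { volunteers: 17 }
```

The methods are synchronous here; `await` on a non-promise value still works, so this class can be passed to getResilient unchanged.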

Principle 8: comprehensive error logging

When something breaks, the team needs to debug it. Structured error logging is the foundation:
JavaScript
async function processWithLogging(customerId, user, traceId) {
  const context = { customerId, userId: user.id, traceId };

  try {
    logger.info('Processing user', context);
    await processUser(user);
    logger.info('User processed successfully', context);
  } catch (err) {
    logger.error('User processing failed', {
      ...context,
      error: {
        message: err.message,
        stack: err.stack,
        class: err.constructor.name,
        status: err.response?.status,
        body: err.response?.body,
      },
    });
    throw err;
  }
}
The structured log includes:
  • The trace ID (correlates across requests)
  • The specific record being processed
  • The full error including HTTP context (status, body)
  • The error class for grouping
In an aggregator (DataDog, ELK, etc.), you can filter for “all failures with status 422 for customer X” and surface patterns.

Principle 9: alert on patterns, not individual failures

A single failure isn’t alarming. A pattern of failures is. Alert thresholds should reflect this:
| Pattern | Alert |
| --- | --- |
| One 5xx in an hour | Don’t alert (probably transient) |
| 5+ 5xx in 5 minutes | Alert (sustained issue) |
| Same customer has many failures, others normal | Alert (customer-specific issue) |
| Same error class concentrates after a deployment | Alert (likely regression) |
| DLQ growth rate accelerates | Alert (something systemic) |
| Polling lag exceeds 2x normal cadence | Alert (worker stalled?) |
Build alerts on patterns. Suppress noise from one-off transients. The team should only be paged when intervention is genuinely needed.
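As one example, the "5+ 5xx in 5 minutes" rule can be implemented with a sliding window of failure timestamps. This is a minimal in-process sketch with a hypothetical `FailureWindow` class; in practice this kind of rule usually lives in your monitoring system rather than in application code.

```javascript
// Sketch: fire when `threshold` failures fall inside a trailing time window.
// Assumes timestamps are recorded in non-decreasing order.
class FailureWindow {
  constructor({ windowMs = 5 * 60 * 1000, threshold = 5 } = {}) {
    this.windowMs = windowMs;
    this.threshold = threshold;
    this.timestamps = [];
  }

  // Record a failure; returns true if the alert condition is now met
  record(timestampMs) {
    this.timestamps.push(timestampMs);
    const cutoff = timestampMs - this.windowMs;
    // Drop failures that have aged out of the window
    while (this.timestamps.length > 0 && this.timestamps[0] <= cutoff) {
      this.timestamps.shift();
    }
    return this.timestamps.length >= this.threshold;
  }
}

const MIN = 60 * 1000;
const burst = new FailureWindow();
const results = [0, 1, 2, 3, 4].map((m) => burst.record(m * MIN));
console.log(results); // [ false, false, false, false, true ]
```

One failure per hour never trips the rule, because each new failure pushes the previous one out of the window before the count can build.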

Principle 10: error budget thinking

For long-running integrations, embrace the idea of an error budget: a defined “acceptable” error rate, below which alerts don’t fire.
JavaScript
async function getErrorBudgetStatus(customerId) {
  const past24h = await db.query(`
    SELECT
      COUNT(*) AS total_operations,
      COUNT(*) FILTER (WHERE succeeded = false) AS failed_operations
    FROM sync_audit
    WHERE customer_id = $1
      AND processed_at > NOW() - INTERVAL '24 hours'
  `, [customerId]);

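  // db.query resolves to a result object; the aggregate is its first row.
  // Some drivers (node-postgres, for example) return COUNT(*) as a string, so convert.
  const { total_operations, failed_operations } = past24h.rows[0];
  const errorRate = Number(failed_operations) / Math.max(Number(total_operations), 1);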
  const budget = 0.005; // 0.5% error budget

  return {
    errorRate,
    budget,
    burnedBudget: errorRate / budget,
    status: errorRate < budget ? 'healthy' : 'over_budget',
  };
}
| Budget burn | Action |
| --- | --- |
| 0-100% | No action needed |
| 100-200% | Heightened monitoring; investigation |
| >200% | Alert; consider pausing the integration |
The pattern accepts that some level of failure is normal while triggering action when failure rates exceed expectations.
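The burn thresholds translate directly into a small decision function. The sketch below is illustrative: the `budgetAction` name and the action labels are assumptions, and the boundary handling (exactly 100% counts as within budget, exactly 200% as investigate) is one reasonable reading of the tiers.

```javascript
// Sketch: map an observed error rate to one of the three budget-burn tiers
function budgetAction(errorRate, budget = 0.005) {
  const burn = errorRate / budget; // 1.0 means the budget is exactly used up
  if (burn > 2) return { burn, action: 'alert' };
  if (burn > 1) return { burn, action: 'investigate' };
  return { burn, action: 'none' };
}

console.log(budgetAction(0.002).action); // none        (40% of budget)
console.log(budgetAction(0.008).action); // investigate (160% of budget)
console.log(budgetAction(0.02).action);  // alert       (400% of budget)
```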

A reference resilient integration

The patterns combined:
JavaScript
class ResilientVomoIntegration {
  constructor({ customerId, token }) {
    this.customerId = customerId;
    this.token = token;
    this.client = new ThrottledVomoClient({ token, requestsPerSecond: 3 });
    this.breaker = new CircuitBreaker({ threshold: 5, timeoutMs: 60000 });
  }

  async fetchUserSafely(userId, traceId) {
    return this.breaker.execute(async () => {
      const response = await fetchWithRetry(
        `https://api.vomo.org/v1/users/${userId}`,
        { headers: { Authorization: `Bearer ${this.token}` } },
        5 // max retries
      );

      if (!response.ok) {
        const classification = classifyError(response);
        if (classification.surface) {
          await alertOps({
            severity: 'medium',
            customerId: this.customerId,
            type: 'api_failure',
            status: response.status,
            traceId,
          });
        }
        throw new ApiError(response.status, await response.text());
      }

      return response.json();
    });
  }

  async pollWithRecovery() {
    const traceId = generateTraceId();
    try {
      return await this._poll(traceId);
    } catch (err) {
      logger.error('Poll cycle failed', {
        customerId: this.customerId,
        traceId,
        error: err.message,
      });

      // Don't advance checkpoint on systemic failures
      // Next cycle will re-attempt from the same point
      throw err;
    }
  }

  async processRecordWithDlq(user, traceId) {
    try {
      await processUser(user);
    } catch (err) {
      const classification = classifyError(err.response, err);

      if (classification.surface) {
        // Don't retry validation errors etc.; surface immediately
        await deadLetterQueue.publish({
          customerId: this.customerId,
          resourceType: 'user',
          recordId: user.id,
          payload: user,
          errorClass: classification.class,
          errorMessage: err.message,
          traceId,
          requiresReview: true,
        });
      } else {
        // Transient — DLQ for retry
        await deadLetterQueue.publish({
          customerId: this.customerId,
          resourceType: 'user',
          recordId: user.id,
          payload: user,
          errorClass: classification.class,
          errorMessage: err.message,
          traceId,
        });
      }
    }
  }
}
The patterns layered: retry → circuit breaker → DLQ → alerting. Each layer handles failures the previous can’t, and together they produce an integration that runs continuously through transient and persistent issues alike.

Where to go next

Sync Architecture Patterns

The broader architectural patterns these recovery practices fit into.

API Performance Tips

The performance patterns these resilience patterns coexist with.

Data Modeling

The data model that supports the DLQ and audit log patterns.

Change Detection Best Practices

The change-detection reliability practices this builds on.
Last modified on May 22, 2026