Error Recovery Patterns - Virtuous API Docs

A production partner integration will encounter errors continuously. Some are momentary network blips that resolve on retry. Some are transient platform issues that resolve in minutes. Some are persistent problems that require human intervention. A well-designed integration handles each class correctly without operator attention; a poorly-designed one either gives up too quickly (losing data) or retries forever (burying real problems in noise). This page covers the patterns for getting error recovery right: classifying failures, retrying intelligently, breaking circuits to prevent cascade, and surfacing the right signals to humans.

The recovery mental model

Three categories of failure matter:

Category	Symptom	Recovery
Transient	A momentary network glitch, a temporary platform hiccup, a brief rate-limit spike	Retry — usually succeeds within a few attempts
Persistent-recoverable	An expired credential, a temporarily-unreachable endpoint, sustained rate limiting	Pause and alert — succeeds after human intervention
Permanent	A malformed request, a missing required field, a deleted resource	Stop and surface — never resolves on retry alone

The most common integration failure mode isn’t lacking retry logic — it’s treating all three categories the same. Retrying a permanent failure produces no recovery and just adds noise. Stopping on a transient failure loses data unnecessarily. Get the classification right and the rest of recovery falls into place.
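As a minimal sketch of that idea, the three categories can be mapped to one recovery action each, so no call site can quietly treat them alike. The action names below are illustrative placeholders, not part of the Virtuous API.

```javascript
// Illustrative only: the action names are assumptions made for this sketch.
const RECOVERY_ACTIONS = new Map([
  ['transient', 'retry'],
  ['persistent_recoverable', 'pause_and_alert'],
  ['permanent', 'stop_and_surface'],
]);

// Returns the recovery action for a failure category, and fails loudly on an
// unknown category rather than silently defaulting to a retry.
function recoveryActionFor(category) {
  const action = RECOVERY_ACTIONS.get(category);
  if (!action) throw new Error(`Unknown failure category: ${category}`);
  return action;
}
```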

Classifying errors

The HTTP status code is the primary classification signal:

Status	Category	Why
`2xx`	Success	No error.
`400 Bad Request`	Permanent	The request is malformed; retrying without changes won’t help.
`401 Unauthorized`	Persistent-recoverable	Credentials are invalid or expired. Human intervention needed to refresh.
`403 Forbidden`	Permanent	Permissions issue. Retrying won’t help; the API key needs different permissions.
`404 Not Found`	Context-dependent	If the resource never existed: permanent. If it was just created and indexing is delayed: transient.
`409 Conflict`	Context-dependent	Concurrent update or unique-constraint violation. Sometimes retryable, sometimes permanent.
`422 Unprocessable Entity`	Permanent	Validation failed. Same request will fail again.
`429 Too Many Requests`	Transient	Rate limit exceeded. Back off (honoring `Retry-After` when present), then retry.
`500 Internal Server Error`	Transient	Server-side issue. Usually resolves on retry.
`502 Bad Gateway`	Transient	Network or proxy issue. Resolves on retry.
`503 Service Unavailable`	Transient	Server overloaded or maintenance. Resolves on retry.
`504 Gateway Timeout`	Transient	Upstream timeout. Resolves on retry.
Network error (DNS, connection refused, timeout)	Transient	Resolves on retry.

The 404 and 409 edge cases

The two ambiguous codes deserve explicit handling:

404 after creation. If you just created a resource and look it up moments later, you may briefly get 404 while indexing catches up. Treat it as transient with a small retry budget (3–5 attempts over 30 seconds). After that, treat it as permanent: the resource probably wasn’t created successfully.

409 Conflict on uniqueness. A unique-constraint violation (e.g., trying to create a Contact with an email that’s already on another Contact) is permanent; retrying produces the same error. A 409 from concurrent modification (e.g., two writers updating the same record at the same instant) is transient, and a retry on a fresh GET-then-PUT cycle usually succeeds.

JavaScript

function classifyError(status, error, attemptCount) {
  if (status >= 200 && status < 300) return 'success';

  if (status === 401) return 'persistent_recoverable';   // credential issue
  if (status === 403) return 'permanent';                // permission issue
  if (status === 404) {
    return attemptCount < 3 ? 'transient' : 'permanent';  // indexing delay vs. real miss
  }
  if (status === 409) {
    return error?.code === 'concurrent_modification' ? 'transient' : 'permanent';
  }

  if (status === 429) return 'transient';                 // rate limit; must be checked before the 4xx default
  if (status >= 400 && status < 500) return 'permanent'; // remaining 4xx
  if (status >= 500) return 'transient';                  // 5xx default

  // Network errors
  if (error?.code === 'ECONNREFUSED' || error?.code === 'ETIMEDOUT') return 'transient';

  return 'transient';                                     // default to retry
}
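The classifier reads a `status` and a parsed error body, but `fetch` does not throw on non-2xx responses. One way to bridge the gap is a small wrapper like the sketch below. `fetchOrThrow` and its injectable `fetchImpl` parameter are hypothetical names introduced here, and the error-body shape is whatever the API returns.

```javascript
// Hypothetical helper: converts a non-2xx response into a thrown error that
// carries `status` and `body`, the two fields classifyError() inspects.
async function fetchOrThrow(url, options = {}, fetchImpl = globalThis.fetch) {
  const response = await fetchImpl(url, options);
  if (response.ok) return response;

  // The error body is not always JSON (a proxy may return an HTML 502 page).
  let body = null;
  try {
    body = await response.json();
  } catch {
    body = null;
  }

  const err = new Error(`HTTP ${response.status} from ${url}`);
  err.status = response.status;
  err.body = body;
  throw err;
}
```

Passing `fetchImpl` explicitly makes the wrapper easy to exercise against a stub in tests.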

Retry with exponential backoff and jitter

For transient errors, retry with a delay that grows on each attempt. Three properties matter:

Property	Purpose
Exponential growth	Quick first retry (resolves momentary glitches), longer subsequent retries (allow real outages to recover)
Jitter	Avoid synchronized retries from many clients hitting the same endpoint simultaneously
Max attempts	Bound the work — eventually classify as persistent-recoverable and escalate

JavaScript

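// Promise-based delay used between attempts.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function retryWithBackoff(fn, options = {}) {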
  const {
    maxAttempts = 5,
    baseDelayMs = 1000,
    maxDelayMs = 60000,
    jitterFactor = 0.5,
  } = options;

  let lastError;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      err.attemptCount = attempt;                          // lets callers record how many attempts were made
      const category = classifyError(err.status, err.body, attempt);

      if (category === 'permanent') throw err;             // Don't retry
      if (category === 'persistent_recoverable') throw err; // Escalate, don't retry
      if (attempt === maxAttempts) throw err;

      // Exponential backoff with jitter
      const exponentialDelay = Math.min(baseDelayMs * Math.pow(2, attempt - 1), maxDelayMs);
      const jitter = exponentialDelay * jitterFactor * Math.random();
      const delay = exponentialDelay + jitter;

      console.warn(`Attempt ${attempt} failed, retrying in ${Math.round(delay)}ms`, { error: err.message });
      await sleep(delay);
    }
  }
  throw lastError;
}

A typical schedule (base 1s, max 60s, factor 0.5, 5 attempts):

Attempt	Delay before retry
1 → 2	1–1.5s
2 → 3	2–3s
3 → 4	4–6s
4 → 5	8–12s

Total wait before giving up: 15–22.5 seconds (the sum of the four delays, not counting the time each request takes). Good for inline operations; for background work, increase maxAttempts and maxDelayMs for a longer total budget.
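To check a schedule before changing the options, the delay bounds can be computed without sleeping. The sketch below mirrors the arithmetic in `retryWithBackoff`. `backoffSchedule` is a helper name introduced here for illustration.

```javascript
// Computes the minimum and maximum delay before each retry, using the same
// formula as retryWithBackoff: min(base * 2^(attempt - 1), max), plus up to
// jitterFactor times that value of random jitter.
function backoffSchedule({
  maxAttempts = 5,
  baseDelayMs = 1000,
  maxDelayMs = 60000,
  jitterFactor = 0.5,
} = {}) {
  const rows = [];
  // There is no delay after the final attempt, so stop one short.
  for (let attempt = 1; attempt < maxAttempts; attempt++) {
    const minMs = Math.min(baseDelayMs * 2 ** (attempt - 1), maxDelayMs);
    const maxMs = minMs * (1 + jitterFactor); // upper bound, approached but not reached
    rows.push({ fromAttempt: attempt, toAttempt: attempt + 1, minMs, maxMs });
  }
  const totalMinMs = rows.reduce((sum, row) => sum + row.minMs, 0);
  const totalMaxMs = rows.reduce((sum, row) => sum + row.maxMs, 0);
  return { rows, totalMinMs, totalMaxMs };
}
```

With the defaults, this yields four delays and a total wait between 15,000 ms and just under 22,500 ms.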

Honor `Retry-After` for 429

When the server tells you when to retry, listen:

JavaScript

async function makeRequest(url, options) {
  const response = await fetch(url, options);
  if (response.status === 429) {
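    // Retry-After may be a number of seconds or an HTTP-date. This handles the
    // seconds form and falls back to 60s for anything it cannot parse.
    const parsed = parseInt(response.headers.get('Retry-After') ?? '', 10);
    const retryAfterSeconds = Number.isNaN(parsed) ? 60 : parsed;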
    const err = new Error('Rate limited');
    err.status = 429;
    err.retryAfterMs = retryAfterSeconds * 1000;
    throw err;
  }
  return response;
}

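async function retryHonoringRetryAfter(fn, maxWaits = 5) {
  // Bound the loop so sustained rate limiting escalates instead of waiting forever.
  for (let waits = 0; ; waits++) {
    try {
      return await fn();
    } catch (err) {
      if (err.status === 429 && Number.isFinite(err.retryAfterMs) && waits < maxWaits) {
        await sleep(err.retryAfterMs);
        continue;
      }
      throw err;
    }
  }
}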

The Retry-After header is more accurate than exponential backoff for the rate-limit case — the server knows exactly when the window resets.
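The HTTP specification allows `Retry-After` to carry either a delay in seconds or an HTTP-date, so a parser that covers both forms is slightly more robust. The following is a sketch under that assumption. `parseRetryAfterMs` is a hypothetical helper, and the 60-second fallback simply mirrors the default used above.

```javascript
// Converts a Retry-After header value into a wait in milliseconds.
// Accepts delay-seconds ("120") or an HTTP-date ("Wed, 21 Oct 2015 07:28:00 GMT").
function parseRetryAfterMs(headerValue, nowMs = Date.now(), fallbackMs = 60000) {
  if (headerValue == null || headerValue.trim() === '') return fallbackMs;
  const value = headerValue.trim();

  // Delay-seconds form: a non-negative integer.
  if (/^\d+$/.test(value)) return Number(value) * 1000;

  // HTTP-date form: wait until that instant, never a negative duration.
  const dateMs = Date.parse(value);
  if (Number.isNaN(dateMs)) return fallbackMs;
  return Math.max(0, dateMs - nowMs);
}
```

Inside `makeRequest`, the result could be assigned to `err.retryAfterMs` directly.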

Differentiate idempotent from non-idempotent operations

Retrying a GET is safe — the same response comes back. Retrying a POST that may have already partially succeeded is risky — you could create duplicate resources. The dividing line:

Operation	Idempotent if…
`GET`	Always
`POST /api/v2/Gift/Transaction`	The submission carries a stable `transactionSource` + `transactionId`
`POST /api/Contact/Transaction`	The submission carries a stable `referenceSource` + `referenceId`
`POST /api/Contact` (direct)	Never — each call creates a new Contact
`PUT /api/Contact/{id}`	Always — same body produces same final state
`DELETE /api/Gift/{id}`	After first success, subsequent calls 404 — typically safe

For non-idempotent operations, the safer pattern is: don’t retry blindly. Instead, on a network error or timeout, check whether the operation actually completed before retrying.

JavaScript

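// Assumes `token` and `findContactByReference` are defined elsewhere in your integration.
async function createContactDirectWithRecovery(donor) {
  try {
    const response = await fetch('https://api.virtuoussoftware.com/api/Contact', {
      method: 'POST',
      headers: { Authorization: `Bearer ${token}`, 'Content-Type': 'application/json' },
      body: JSON.stringify(donor),
    });
    if (!response.ok) {
      // fetch does not throw on HTTP errors; attach the status so the caller can classify it.
      const httpError = new Error(`Contact create failed with HTTP ${response.status}`);
      httpError.status = response.status;
      throw httpError;
    }
    return await response.json();
  } catch (err) {
    // Some runtimes (Node's built-in fetch, for example) put the socket error code on err.cause.
    const code = err.code ?? err.cause?.code;
    if (code === 'ETIMEDOUT' || code === 'ECONNRESET') {
      // The request may have completed server-side even though we didn't get a response.
      // Check before retrying.
      const existing = await findContactByReference(donor.referenceSource, donor.referenceId);
      if (existing) return existing;
    }
    throw err;
  }
}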

This pattern prevents the most common partner-integration duplicate cause: a timeout-then-retry sequence creating two records when the first request actually succeeded.

Dead letter queue for permanent failures

Records that exhaust retries and are classified as permanent failures should not silently disappear. Route them to a dead letter queue — a separate store for inspection and possible manual replay.

CREATE TABLE virtuous_dead_letter_queue (
  id BIGSERIAL PRIMARY KEY,
  customer_id TEXT NOT NULL,
  original_record_id TEXT NOT NULL,
  operation_type TEXT NOT NULL,                -- 'create_contact' | 'create_gift' | etc.
  original_payload JSONB NOT NULL,
  failed_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
  status_code INTEGER,
  error_message TEXT,
  attempt_count INTEGER NOT NULL,
  human_reviewed_at TIMESTAMPTZ,
  resolution TEXT,                              -- 'replayed' | 'discarded' | 'fixed_and_replayed'
  resolution_notes TEXT
);

The dead letter queue serves three purposes:

Visibility. Failures don’t disappear into a log file; they have a structured record.
Replay. After fixing the underlying problem, the original payload can be re-submitted.
Audit. A persistent record of what failed and why supports later investigation.

Producing entries

JavaScript

async function processWithDeadLetter(record, processFn) {
  try {
    await retryWithBackoff(() => processFn(record));
  } catch (err) {
    // Everything that reaches this point needs human attention: permanent and
    // persistent-recoverable errors are thrown without retrying, and a transient
    // error only escapes retryWithBackoff after exhausting its attempts.
    await db.virtuous_dead_letter_queue.insert({
      customer_id: record.customer_id,
      original_record_id: record.id,
      operation_type: record.operation_type,
      original_payload: record.payload,
      status_code: err.status,
      error_message: err.message,
      attempt_count: err.attemptCount ?? 1,   // 1 if the retry wrapper did not record a count
    });
    throw err;
  }
}

Reviewing and replaying

Dead-letter entries need human review. Surface them in your integration’s admin UI with replay actions:

Replay as-is — re-submit the original payload. Useful after a transient infrastructure fix.
Fix and replay — edit the payload (correct a field, change a Project code) and resubmit.
Discard — mark the record as a known-bad case that shouldn’t be retried.

A growing dead-letter queue is a signal to investigate. Set up a daily review process or alerting threshold so it doesn’t grow unboundedly.
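The three review actions can be sketched as functions that return the updated entry for your store to persist. This is an illustration, not a prescribed implementation: `submitFn` is a hypothetical function that re-submits a payload and throws on failure, and the field names follow the table definition above.

```javascript
// Replays an entry as-is, or with a corrected payload when `fixedPayload` is given.
// If the replay throws, nothing is marked resolved and the entry stays in the queue.
async function replayDeadLetter(entry, submitFn, { fixedPayload, notes = null } = {}) {
  const payload = fixedPayload ?? entry.original_payload;
  await submitFn(entry.operation_type, payload);
  return {
    ...entry,
    human_reviewed_at: new Date().toISOString(),
    resolution: fixedPayload ? 'fixed_and_replayed' : 'replayed',
    resolution_notes: notes,
  };
}

// Marks an entry as a known-bad case that should not be retried.
function discardDeadLetter(entry, notes) {
  return {
    ...entry,
    human_reviewed_at: new Date().toISOString(),
    resolution: 'discarded',
    resolution_notes: notes,
  };
}
```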

Circuit breakers

When a downstream system is failing repeatedly, continuing to send requests just produces more failures and consumes resources. A circuit breaker stops sending requests after sustained failure and lets the downstream recover.

The three states

State	Behavior
Closed	Normal operation. Requests pass through. Track failures.
Open	Too many recent failures. Requests fail immediately without calling the downstream.
Half-open	Cool-down has elapsed. Allow one test request through to see if downstream is healthy.

Implementation

JavaScript

class CircuitBreaker {
  constructor({ failureThreshold = 5, openDurationMs = 60000 } = {}) {
    this.failureThreshold = failureThreshold;
    this.openDurationMs = openDurationMs;
    this.state = 'closed';
    this.failureCount = 0;
    this.openedAt = null;
  }

  async execute(fn) {
    if (this.state === 'open') {
      if (Date.now() - this.openedAt > this.openDurationMs) {
        this.state = 'half_open';
      } else {
        throw new Error('Circuit breaker open');
      }
    }

    try {
      const result = await fn();
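      // Any success closes the breaker and resets the count, so the threshold
      // measures consecutive failures rather than failures accumulated over time.
      this.state = 'closed';
      this.failureCount = 0;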
      return result;
    } catch (err) {
      if (this.state === 'half_open') {
        this.state = 'open';
        this.openedAt = Date.now();
        throw err;
      }
      this.failureCount++;
      if (this.failureCount >= this.failureThreshold) {
        this.state = 'open';
        this.openedAt = Date.now();
      }
      throw err;
    }
  }
}

Use one circuit breaker per logical downstream — one for Virtuous overall, separate ones per third-party integration. A failing Virtuous endpoint shouldn’t trigger a circuit breaker that blocks your Mailchimp sync.
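One way to keep breakers separated is a small registry keyed by downstream name. The sketch below assumes nothing about the breaker itself: `createBreakerRegistry` is a name introduced here, and it takes a factory so that any breaker implementation (such as the `CircuitBreaker` class above) can be plugged in.

```javascript
// Returns a lookup function that lazily creates one breaker per downstream
// name and hands back the same instance on every later call.
function createBreakerRegistry(createBreaker) {
  const breakers = new Map();
  return function breakerFor(downstream) {
    if (!breakers.has(downstream)) {
      breakers.set(downstream, createBreaker(downstream));
    }
    return breakers.get(downstream);
  };
}

// Example wiring (assumes the CircuitBreaker class from this page is in scope):
//   const breakerFor = createBreakerRegistry(() => new CircuitBreaker({ failureThreshold: 5 }));
//   await breakerFor('virtuous').execute(() => submitGift(gift));
//   await breakerFor('mailchimp').execute(() => syncAudience(list));
```

Here `submitGift` and `syncAudience` are placeholders for your own calls.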

Tuning

The right threshold depends on the integration’s traffic:

High-traffic (hundreds of requests/minute): higher threshold (e.g., 20 failures), shorter open duration (30 seconds).
Low-traffic (a few requests/minute): lower threshold (e.g., 5 failures), longer open duration (5 minutes).

Too sensitive a circuit breaker (low threshold, short window) flaps under normal transient noise. Too insensitive (high threshold, long window) lets cascading failures continue.

Idempotency as the recovery enabler

Idempotency isn’t optional in a recovery-aware integration — it’s the precondition that makes retries safe. For partner integrations writing to Virtuous, the idempotency mechanism is:

Contacts: referenceSource + referenceId. The matching algorithm uses these to find existing records on a retry.
Gifts: transactionSource + transactionId. Same pattern.
RecurringGifts: transactionSource + transactionId. Same pattern.

Use stable, deterministic values for these — never random per-attempt UUIDs. See Idempotency and Safe Reprocessing for the underlying mechanism and Handle Duplicate Records — Use stable transactionId for Gifts for the prevention pattern.
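A minimal sketch of deriving stable values, assuming the partner system has its own durable record ID. The source name `ExamplePartner`, the `order-` prefix, and the `order` object shape are placeholders for this illustration.

```javascript
// Placeholder source name; use the value agreed for your integration.
const TRANSACTION_SOURCE = 'ExamplePartner';

// Builds the idempotency fields for a Gift from a partner-side order.
// The transactionId depends only on the partner's own record ID, so every
// retry of the same order sends the same value.
function giftIdempotencyFields(order) {
  if (order.id === undefined || order.id === null || order.id === '') {
    throw new Error('Cannot derive a stable transactionId without a source record ID');
  }
  return {
    transactionSource: TRANSACTION_SOURCE,
    transactionId: `order-${order.id}`,
  };
}
```

Failing when the source ID is missing is deliberate: falling back to a generated value would reintroduce the per-attempt randomness this section warns against.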

Reconciliation as the safety net

Retries handle transient failures. Dead-letter queues handle permanent failures. But there’s a third category: failures that surfaced no actionable error yet still left the system in an inconsistent state. Examples include a webhook delivery that exhausted its retry budget, a request that timed out but actually succeeded, and a record updated on one side but not the other.

Reconciliation is the safety net for these. Periodically compare the partner-side and Virtuous-side states for resources you sync, and surface discrepancies for action. See Reconcile Failed Syncs for the full pattern.

The key insight for this page: assume your retry logic is imperfect, and design a reconciliation pass that doesn’t depend on it being correct.
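The comparison step can be sketched as a pure function over two record lists. The record shape (`transactionId`, `amount`) and the default equality check are assumptions for illustration; fetching the two lists is out of scope here.

```javascript
// Compares partner-side and Virtuous-side records keyed by transactionId and
// reports what is missing, mismatched, or unexpected.
function reconcile(partnerRecords, virtuousRecords, isSame = (a, b) => a.amount === b.amount) {
  const virtuousById = new Map(virtuousRecords.map((r) => [r.transactionId, r]));
  const partnerIds = new Set(partnerRecords.map((r) => r.transactionId));

  const missingInVirtuous = [];
  const mismatched = [];
  for (const partnerRecord of partnerRecords) {
    const virtuousRecord = virtuousById.get(partnerRecord.transactionId);
    if (!virtuousRecord) {
      missingInVirtuous.push(partnerRecord.transactionId);
    } else if (!isSame(partnerRecord, virtuousRecord)) {
      mismatched.push(partnerRecord.transactionId);
    }
  }

  // Records present in Virtuous that the partner side does not know about.
  const unexpectedInVirtuous = virtuousRecords
    .filter((r) => !partnerIds.has(r.transactionId))
    .map((r) => r.transactionId);

  return { missingInVirtuous, mismatched, unexpectedInVirtuous };
}
```

Because it depends only on the two final states, this pass works regardless of what the retry logic did or failed to do.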

Observability for recovery debugging

Errors that aren’t observed are errors that can’t be fixed. Three observability practices make recovery investigations tractable:

Structured logging

Every error log should include enough context to investigate later:

JavaScript

console.error('Gift submission failed', {
  customer_id: customer.id,
  transaction_id: gift.transactionId,
  status_code: err.status,
  error_message: err.message,
  attempt: attemptCount,
  request_id: response?.headers.get('X-Request-Id'),   // response is undefined if the request never completed
});

The key fields:

The customer identifier (for multi-tenant integrations).
The record identifier (for traceability).
The Virtuous request ID if returned in response headers (lets Virtuous engineering correlate against their logs).
The attempt count (for retry analysis).

Metrics on failure categories

Track separate counters for each failure category:

Counter	What it tells you
`virtuous_request_total`	Overall request volume
`virtuous_request_failures{category="transient"}`	Retry-driven noise; expected to be > 0
`virtuous_request_failures{category="persistent_recoverable"}`	Credentials and infrastructure issues; should be near zero
`virtuous_request_failures{category="permanent"}`	Data quality and bug issues; investigation target
`virtuous_dead_letter_queue_depth`	Backlog of unresolved permanent failures
`virtuous_circuit_breaker_state{breaker="virtuous_main"}`	Per-breaker state for alerting

Alert on persistent-recoverable failures and on dead-letter-queue depth growth. Transient failures shouldn’t page anyone — they’re expected.

Tracing where supported

For complex sync workflows that span multiple services, distributed tracing (OpenTelemetry, etc.) makes failure investigation dramatically easier. A trace shows the path of a single record through your queue, submitter, Virtuous API, and back through the webhook receiver — surfacing where in the path the failure occurred.

Operational practices

Three practices keep a recovery system healthy over time:

Periodic recovery testing

In a staging environment, deliberately inject failures and confirm the recovery paths work:

Block your integration’s network for a minute and confirm requests are retried successfully.
Return 429 from a mocked endpoint and confirm the rate-limit pause works.
Send malformed payloads and confirm dead-letter routing works.

Production behavior under failure is predictable only if you’ve tested the failure paths.
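For the retry path, a fault-injecting wrapper is often enough to simulate a brief outage in a test. The sketch below is one possible shape. `failFirst` is a hypothetical helper, and the error it throws is whatever the supplied `makeError` returns.

```javascript
// Wraps `fn` so that its first `n` calls throw an injected error and every
// later call runs normally. Useful for confirming that retries recover.
function failFirst(n, makeError, fn) {
  let calls = 0;
  return async (...args) => {
    calls++;
    if (calls <= n) throw makeError(calls);
    return fn(...args);
  };
}

// Example: simulate two 503 responses before success.
//   const flaky = failFirst(2, () => Object.assign(new Error('unavailable'), { status: 503 }), submitGift);
//   await retryWithBackoff(() => flaky(gift));   // expected to succeed on the third attempt
```

`submitGift` in the usage comment stands in for your own submission function.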

Dead-letter queue review cadence

Schedule a regular (weekly or daily) review of dead-letter entries. Decide for each: replay, fix, or discard. An ignored dead-letter queue silently accumulates unresolved problems that compound over time.

Failure-rate baseline

Establish what “normal” looks like for each error category. A transient failure rate of 0.5% is probably fine; a sudden jump to 5% needs investigation. Without a baseline, you can’t tell normal from anomalous.
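A baseline check can be as simple as comparing the current rate against a multiple of the normal rate. The sketch below assumes you already collect per-category counts. The `tolerance` multiplier and `minSamples` cutoff are illustrative defaults, not recommended values.

```javascript
// Flags a failure rate as anomalous when it exceeds `tolerance` times the
// baseline. Windows with too few requests are skipped to avoid noisy alerts.
function checkFailureRate({ failures, total, baselineRate, tolerance = 3, minSamples = 100 }) {
  if (total < minSamples) return { rate: null, anomalous: false };
  const rate = failures / total;
  return { rate, anomalous: rate > baselineRate * tolerance };
}
```

With a 0.5% baseline and the default tolerance, a window at 0.5% passes and a window at 5% is flagged.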

Where to go next

API Performance Tips

The performance practices that complement these recovery patterns.

Reconcile Failed Syncs

The reconciliation safety net that catches what retries miss.

Idempotency and Safe Reprocessing

The precondition that makes retries safe.

Error Handling

The error-envelope reference that the classification on this page builds on.
