A production partner integration will encounter errors continuously. Some are momentary network blips that resolve on retry. Some are transient platform issues that resolve in minutes. Some are persistent problems that require human intervention. A well-designed integration handles each class correctly without operator attention; a poorly-designed one either gives up too quickly (losing data) or retries forever (burying real problems in noise). This page covers the patterns for getting error recovery right: classifying failures, retrying intelligently, breaking circuits to prevent cascade, and surfacing the right signals to humans.

The recovery mental model

Three categories of failure matter:
| Category | Symptom | Recovery |
|---|---|---|
| Transient | A momentary network glitch, a temporary platform hiccup, a brief rate-limit spike | Retry; usually succeeds within a few attempts |
| Persistent-recoverable | An expired credential, a temporarily unreachable endpoint, sustained rate limiting | Pause and alert; succeeds after human intervention |
| Permanent | A malformed request, a missing required field, a deleted resource | Stop and surface; never resolves on retry alone |
The most common integration failure mode isn’t lacking retry logic — it’s treating all three categories the same. Retrying a permanent failure produces no recovery and just adds noise. Stopping on a transient failure loses data unnecessarily. Get the classification right and the rest of recovery falls into place.
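As a minimal sketch of the table above, the three categories can be mapped to a recovery action in one place so that callers never improvise. The name `recoveryActionFor` and the action strings are illustrative assumptions, not part of any Virtuous SDK.

```javascript
// Hypothetical helper: maps a failure category to the recovery action from the table above.
const RECOVERY_ACTIONS = new Map([
  ['transient', 'retry'],
  ['persistent_recoverable', 'pause_and_alert'],
  ['permanent', 'stop_and_surface'],
]);

function recoveryActionFor(category) {
  const action = RECOVERY_ACTIONS.get(category);
  // Fail loudly on an unknown category rather than silently picking a default.
  if (!action) throw new Error(`Unknown failure category: ${category}`);
  return action;
}
```

Throwing on an unknown category is a deliberate choice in this sketch: a misclassified failure is exactly the problem this section warns about, so it should not pass unnoticed.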

Classifying errors

The HTTP status code is the primary classification signal:
| Status | Category | Why |
|---|---|---|
| 2xx | Success | No error. |
| 400 Bad Request | Permanent | The request is malformed; retrying without changes won’t help. |
| 401 Unauthorized | Persistent-recoverable | Credentials are invalid or expired. Human intervention needed to refresh. |
| 403 Forbidden | Permanent | Permissions issue. Retrying won’t help; the API key needs different permissions. |
| 404 Not Found | Context-dependent | If the resource never existed: permanent. If it was just created and indexing is delayed: transient. |
| 409 Conflict | Context-dependent | Concurrent update or unique-constraint violation. Sometimes retryable, sometimes permanent. |
| 422 Unprocessable Entity | Permanent | Validation failed. Same request will fail again. |
| 429 Too Many Requests | Transient | Rate limit exceeded. Back off, then retry. |
| 500 Internal Server Error | Transient | Server-side issue. Usually resolves on retry. |
| 502 Bad Gateway | Transient | Network or proxy issue. Resolves on retry. |
| 503 Service Unavailable | Transient | Server overloaded or maintenance. Resolves on retry. |
| 504 Gateway Timeout | Transient | Upstream timeout. Resolves on retry. |
| Network error (DNS, connection refused, timeout) | Transient | Resolves on retry. |

The 404 and 409 edge cases

The two ambiguous codes deserve explicit handling:
  • 404 after creation. If you just created a resource and look it up moments later, you may briefly get 404 while indexing catches up. Treat as transient with a small retry budget (3–5 attempts over 30 seconds). After that, treat as permanent: the resource probably wasn’t created successfully.
  • 409 Conflict on uniqueness. A unique-constraint violation (e.g., trying to create a Contact with an email that’s already on another Contact) is permanent: retrying produces the same error. A 409 from concurrent modification (e.g., two writers updating the same record at the same instant) is transient, and a retry on a fresh GET-then-PUT cycle usually succeeds.
JavaScript
function classifyError(status, error, attemptCount) {
  if (status >= 200 && status < 300) return 'success';

  if (status === 401) return 'persistent_recoverable';   // credential issue
  if (status === 403) return 'permanent';                // permission issue
  if (status === 404) {
    return attemptCount < 3 ? 'transient' : 'permanent';  // indexing delay vs. real miss
  }
  if (status === 409) {
    return error?.code === 'concurrent_modification' ? 'transient' : 'permanent';
  }

  if (status === 429) return 'transient';                 // rate limit; must be checked before the 4xx default
  if (status >= 400 && status < 500) return 'permanent'; // 4xx default
  if (status >= 500) return 'transient';                  // 5xx default

  // Network errors
  if (error?.code === 'ECONNREFUSED' || error?.code === 'ETIMEDOUT') return 'transient';

  return 'transient';                                     // default to retry
}

Retry with exponential backoff and jitter

For transient errors, retry with a delay that grows on each attempt. Three properties matter:
| Property | Purpose |
|---|---|
| Exponential growth | Quick first retry (resolves momentary glitches), longer subsequent retries (allow real outages to recover) |
| Jitter | Avoid synchronized retries from many clients hitting the same endpoint simultaneously |
| Max attempts | Bound the work; eventually classify as persistent-recoverable and escalate |
JavaScript
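// Promise-based delay helper used by the retry functions on this page.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function retryWithBackoff(fn, options = {}) {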
  const {
    maxAttempts = 5,
    baseDelayMs = 1000,
    maxDelayMs = 60000,
    jitterFactor = 0.5,
  } = options;

  let lastError;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      err.attemptCount = attempt;                          // recorded so dead-letter entries can report it
      const category = classifyError(err.status, err.body, attempt);

      if (category === 'permanent') throw err;             // Don't retry
      if (category === 'persistent_recoverable') throw err; // Escalate, don't retry
      if (attempt === maxAttempts) throw err;

      // Exponential backoff with jitter
      const exponentialDelay = Math.min(baseDelayMs * Math.pow(2, attempt - 1), maxDelayMs);
      const jitter = exponentialDelay * jitterFactor * Math.random();
      const delay = exponentialDelay + jitter;

      console.warn(`Attempt ${attempt} failed, retrying in ${Math.round(delay)}ms`, { error: err.message });
      await sleep(delay);
    }
  }
  throw lastError;
}
A typical schedule (base 1s, max 60s, factor 0.5, 5 attempts):
| Attempt | Delay before retry |
|---|---|
| 1 → 2 | 1–1.5s |
| 2 → 3 | 2–3s |
| 3 → 4 | 4–6s |
| 4 → 5 | 8–12s |

Total backoff before giving up: 15–22.5 seconds, plus the time spent on the requests themselves. Good for inline operations; for background work, increase maxAttempts and maxDelayMs for a longer total budget.
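To sanity-check a retry budget before changing the options, the schedule can be computed rather than worked out by hand. The sketch below applies the same formula as `retryWithBackoff`; the function name `backoffSchedule` is an illustrative assumption.

```javascript
// Computes the lower and upper bound of the wait before each retry, using the
// same formula as above: delay = min(base * 2^(attempt - 1), max), plus up to
// jitterFactor * delay of random jitter.
function backoffSchedule({ maxAttempts = 5, baseDelayMs = 1000, maxDelayMs = 60000, jitterFactor = 0.5 } = {}) {
  const steps = [];
  let totalMinMs = 0;
  let totalMaxMs = 0;
  // The final attempt is not followed by a wait, so there are maxAttempts - 1 delays.
  for (let attempt = 1; attempt < maxAttempts; attempt++) {
    const minMs = Math.min(baseDelayMs * 2 ** (attempt - 1), maxDelayMs);
    const maxMs = minMs * (1 + jitterFactor);
    steps.push({ afterAttempt: attempt, minMs, maxMs });
    totalMinMs += minMs;
    totalMaxMs += maxMs;
  }
  return { steps, totalMinMs, totalMaxMs };
}

const schedule = backoffSchedule();
console.log(schedule.totalMinMs, schedule.totalMaxMs); // 15000 22500
```

Calling it with larger `maxAttempts` and `maxDelayMs` shows how quickly the total budget grows for background work once the delay reaches the cap.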

Honor Retry-After for 429

When the server tells you when to retry, listen:
JavaScript
async function makeRequest(url, options) {
  const response = await fetch(url, options);
  if (response.status === 429) {
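    // Retry-After may be delta-seconds or an HTTP-date; fall back to 60s when it is not a plain number.
    const parsed = parseInt(response.headers.get('Retry-After') ?? '', 10);
    const retryAfterSeconds = Number.isNaN(parsed) ? 60 : parsed;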
    const err = new Error('Rate limited');
    err.status = 429;
    err.retryAfterMs = retryAfterSeconds * 1000;
    throw err;
  }
  return response;
}

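// maxAttempts is an assumed default; it bounds the loop so that sustained rate
// limiting escalates instead of retrying forever.
async function retryHonoringRetryAfter(fn, maxAttempts = 5) {
  for (let attempt = 1; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      if (err.status === 429 && err.retryAfterMs != null && attempt < maxAttempts) {
        await sleep(err.retryAfterMs);
        continue;
      }
      throw err;
    }
  }
}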
The Retry-After header is more accurate than exponential backoff for the rate-limit case — the server knows exactly when the window resets.
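The HTTP specification allows Retry-After to carry either a number of seconds or an HTTP-date, so a parser that only handles integers can misread the header. Below is a minimal sketch that accepts both forms; `parseRetryAfterMs` and its 60-second fallback are assumptions for illustration, and which form the Virtuous API actually sends is not stated here.

```javascript
// Hypothetical helper: converts a Retry-After header value to milliseconds.
// Accepts delta-seconds ("120") or an HTTP-date ("Wed, 21 Oct 2015 07:28:00 GMT").
function parseRetryAfterMs(headerValue, nowMs = Date.now(), fallbackMs = 60000) {
  if (headerValue == null) return fallbackMs;
  const value = String(headerValue).trim();
  if (/^\d+$/.test(value)) return Number(value) * 1000;
  const dateMs = Date.parse(value);
  if (Number.isNaN(dateMs)) return fallbackMs;
  // A date in the past means the window has already reset, so do not wait.
  return Math.max(0, dateMs - nowMs);
}
```

Passing `nowMs` explicitly keeps the function deterministic in tests; production callers can rely on the default.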

Differentiate idempotent from non-idempotent operations

Retrying a GET is safe — the same response comes back. Retrying a POST that may have already partially succeeded is risky — you could create duplicate resources. The dividing line:
| Operation | Idempotent if… |
|---|---|
| GET | Always |
| POST /api/v2/Gift/Transaction | The submission carries a stable transactionSource + transactionId |
| POST /api/Contact/Transaction | The submission carries a stable referenceSource + referenceId |
| POST /api/Contact (direct) | Never: each call creates a new Contact |
| PUT /api/Contact/{id} | Always: same body produces same final state |
| DELETE /api/Gift/{id} | After first success, subsequent calls 404; typically safe |
For non-idempotent operations, the safer pattern is: don’t retry blindly. Instead, on a network error or timeout, check whether the operation actually completed before retrying.
JavaScript
async function createContactDirectWithRecovery(donor) {
  try {
    const response = await fetch('https://api.virtuoussoftware.com/api/Contact', {
      method: 'POST',
      headers: { Authorization: `Bearer ${token}`, 'Content-Type': 'application/json' },
      body: JSON.stringify(donor),
    });
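    if (!response.ok) {
      // fetch resolves on HTTP error statuses, so surface them for classification.
      const httpErr = new Error(`Contact create failed with status ${response.status}`);
      httpErr.status = response.status;
      throw httpErr;
    }
    return await response.json();
  } catch (err) {
    // Node's built-in fetch reports socket errors on err.cause, so check both places.
    const code = err.code ?? err.cause?.code;
    if (code === 'ETIMEDOUT' || code === 'ECONNRESET') {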
      // The request may have completed server-side even though we didn't get a response.
      // Check before retrying.
      const existing = await findContactByReference(donor.referenceSource, donor.referenceId);
      if (existing) return existing;
    }
    throw err;
  }
}
This pattern prevents the most common partner-integration duplicate cause: a timeout-then-retry sequence creating two records when the first request actually succeeded.

Dead letter queue for permanent failures

Records that exhaust their retry budget, or that are classified as permanent failures on the first attempt, should not silently disappear. Route them to a dead letter queue: a separate store for inspection and possible manual replay.
CREATE TABLE virtuous_dead_letter_queue (
  id BIGSERIAL PRIMARY KEY,
  customer_id TEXT NOT NULL,
  original_record_id TEXT NOT NULL,
  operation_type TEXT NOT NULL,                -- 'create_contact' | 'create_gift' | etc.
  original_payload JSONB NOT NULL,
  failed_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
  status_code INTEGER,
  error_message TEXT,
  attempt_count INTEGER NOT NULL,
  human_reviewed_at TIMESTAMPTZ,
  resolution TEXT,                              -- 'replayed' | 'discarded' | 'fixed_and_replayed'
  resolution_notes TEXT
);
The dead letter queue serves three purposes:
  • Visibility. Failures don’t disappear into a log file; they have a structured record.
  • Replay. After fixing the underlying problem, the original payload can be re-submitted.
  • Audit. A persistent record of what failed and why supports later investigation.

Producing entries

JavaScript
async function processWithDeadLetter(record, processFn) {
  try {
    await retryWithBackoff(() => processFn(record));
  } catch (err) {
    // Anything that reaches this point was either non-retryable or exhausted its
    // retry budget, so record it whatever its category; a transient error that
    // never recovered needs human review too.
    await db.virtuous_dead_letter_queue.insert({
      customer_id: record.customer_id,
      original_record_id: record.id,
      operation_type: record.operation_type,
      original_payload: record.payload,
      status_code: err.status,
      error_message: err.message,
      attempt_count: err.attemptCount ?? 1,      // column is NOT NULL
    });
    throw err;
  }
}

Reviewing and replaying

Dead-letter entries need human review. Surface them in your integration’s admin UI with replay actions:
  • Replay as-is — re-submit the original payload. Useful after a transient infrastructure fix.
  • Fix and replay — edit the payload (correct a field, change a Project code) and resubmit.
  • Discard — mark the record as a known-bad case that shouldn’t be retried.
A growing dead-letter queue is a signal to investigate. Set up a daily review process or alerting threshold so it doesn’t grow unboundedly.
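The three review actions can be sketched as small functions over a dead-letter entry. This assumes entries shaped like the `virtuous_dead_letter_queue` table; `replayDeadLetter`, `discardDeadLetter`, and the injected `submitFn` are hypothetical names, and persisting the returned entry is left to the caller.

```javascript
// Hypothetical review actions over a dead-letter entry.
// `submitFn(operationType, payload)` is whatever re-submits a payload to the API;
// it is injected so the sketch stays independent of any particular client.
async function replayDeadLetter(entry, submitFn, { fixedPayload, notes = '' } = {}) {
  const payload = fixedPayload ?? entry.original_payload;
  // If the replay throws, nothing is returned and the entry stays unresolved.
  await submitFn(entry.operation_type, payload);
  return {
    ...entry,
    human_reviewed_at: new Date().toISOString(),
    resolution: fixedPayload ? 'fixed_and_replayed' : 'replayed',
    resolution_notes: notes,
  };
}

function discardDeadLetter(entry, notes) {
  return {
    ...entry,
    human_reviewed_at: new Date().toISOString(),
    resolution: 'discarded',
    resolution_notes: notes,
  };
}
```

Returning a new entry instead of mutating the input keeps the original record intact until the caller has written the resolution back to the store.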

Circuit breakers

When a downstream system is failing repeatedly, continuing to send requests just produces more failures and consumes resources. A circuit breaker stops sending requests after sustained failure and lets the downstream recover.

The three states

| State | Behavior |
|---|---|
| Closed | Normal operation. Requests pass through. Track failures. |
| Open | Too many recent failures. Requests fail immediately without calling the downstream. |
| Half-open | Cool-down has elapsed. Allow one test request through to see if downstream is healthy. |

Implementation

JavaScript
class CircuitBreaker {
  constructor({ failureThreshold = 5, openDurationMs = 60000 } = {}) {
    this.failureThreshold = failureThreshold;
    this.openDurationMs = openDurationMs;
    this.state = 'closed';
    this.failureCount = 0;
    this.openedAt = null;
  }

  async execute(fn) {
    if (this.state === 'open') {
      if (Date.now() - this.openedAt > this.openDurationMs) {
        this.state = 'half_open';
      } else {
        throw new Error('Circuit breaker open');
      }
    }

    try {
      const result = await fn();
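      // Any success closes the circuit and clears the count, so only consecutive
      // failures trip the breaker, not failures accumulated over its lifetime.
      this.state = 'closed';
      this.failureCount = 0;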
      return result;
    } catch (err) {
      if (this.state === 'half_open') {
        this.state = 'open';
        this.openedAt = Date.now();
        throw err;
      }
      this.failureCount++;
      if (this.failureCount >= this.failureThreshold) {
        this.state = 'open';
        this.openedAt = Date.now();
      }
      throw err;
    }
  }
}
Use one circuit breaker per logical downstream — one for Virtuous overall, separate ones per third-party integration. A failing Virtuous endpoint shouldn’t trigger a circuit breaker that blocks your Mailchimp sync.
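One way to keep breakers separate per downstream is a small registry that lazily creates and caches one instance per name. This is a sketch under assumptions: `createBreakerRegistry` is a hypothetical helper, and the injected `createBreaker` factory would in practice return an instance of the `CircuitBreaker` class above.

```javascript
// Hypothetical registry: one breaker per logical downstream, created on first use.
// `createBreaker(name)` is injected so each downstream can get its own options.
function createBreakerRegistry(createBreaker) {
  const breakers = new Map();
  return {
    get(downstream) {
      if (!breakers.has(downstream)) {
        breakers.set(downstream, createBreaker(downstream));
      }
      return breakers.get(downstream);
    },
  };
}
```

A caller would then wrap each request with the breaker for its own target, for example `registry.get('virtuous').execute(fn)`, so an open circuit for one downstream never blocks calls to another.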

Tuning

The right threshold depends on the integration’s traffic:
  • High-traffic (hundreds of requests/minute): higher threshold (e.g., 20 failures), shorter open duration (30 seconds).
  • Low-traffic (a few requests/minute): lower threshold (e.g., 5 failures), longer open duration (5 minutes).
Too sensitive a circuit breaker (low threshold, short window) flaps under normal transient noise. Too insensitive (high threshold, long window) lets cascading failures continue.

Idempotency as the recovery enabler

Idempotency isn’t optional in a recovery-aware integration — it’s the precondition that makes retries safe. For partner integrations writing to Virtuous, the idempotency mechanism is:
  • Contacts: referenceSource + referenceId. The matching algorithm uses these to find existing records on a retry.
  • Gifts: transactionSource + transactionId. Same pattern.
  • RecurringGifts: transactionSource + transactionId. Same pattern.
Use stable, deterministic values for these — never random per-attempt UUIDs. See Idempotency and Safe Reprocessing for the underlying mechanism and Handle Duplicate Records — Use stable transactionId for Gifts for the prevention pattern.
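As an illustration of "stable and deterministic", the identifier can be derived from the partner system's own primary key rather than generated at submission time. In this sketch `SOURCE_NAME`, `giftIdentityFor`, and the shape of `payment` are all assumptions.

```javascript
// Hypothetical mapping from a partner-side payment record to the Gift idempotency fields.
const SOURCE_NAME = 'ExamplePartner'; // assumed label for this integration

function giftIdentityFor(payment) {
  if (payment?.id == null) throw new Error('A stable partner-side payment id is required');
  return {
    transactionSource: SOURCE_NAME,
    // Derived from the partner's own key, so every retry of this payment sends the same value.
    transactionId: String(payment.id),
  };
}
```

Because the output depends only on the input record, a retry after a timeout, a dead-letter replay, and a reconciliation resubmission all carry the same pair.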

Reconciliation as the safety net

Retries handle transient failures. Dead-letter queues handle permanent failures. But there’s a third category: failures that produced no error but still left the system in an inconsistent state, such as a webhook delivery that exhausted its retry budget, a request that timed out but actually succeeded, or a record updated on one side but not the other.

Reconciliation is the safety net for these. Periodically compare the partner-side and Virtuous-side states for resources you sync. Surface discrepancies for action. See Reconcile Failed Syncs for the full pattern. The key insight for this page: assume your retry logic is imperfect, and design a reconciliation pass that doesn’t depend on it being correct.
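The comparison step can be sketched as a pure function over two snapshots. This assumes each side has already been loaded into a Map keyed by a shared record identifier, with a comparable summary value per record; `findDiscrepancies` and the `kind` labels are illustrative names, and fetching the snapshots is out of scope here.

```javascript
// Hypothetical reconciliation pass over two Maps keyed by a shared record id.
// Values are whatever comparable summary you sync (for example, an amount or a hash).
function findDiscrepancies(partnerState, virtuousState) {
  const discrepancies = [];
  for (const [id, partnerValue] of partnerState) {
    if (!virtuousState.has(id)) {
      discrepancies.push({ id, kind: 'missing_in_virtuous' });
    } else if (virtuousState.get(id) !== partnerValue) {
      discrepancies.push({ id, kind: 'mismatch' });
    }
  }
  for (const id of virtuousState.keys()) {
    if (!partnerState.has(id)) discrepancies.push({ id, kind: 'missing_in_partner' });
  }
  return discrepancies;
}
```

The function reads only current state and never consults retry logs, which is what lets it catch the cases the retry path got wrong.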

Observability for recovery debugging

Errors that aren’t observed are errors that can’t be fixed. Three observability practices make recovery investigations tractable:

Structured logging

Every error log should include enough context to investigate later:
JavaScript
console.error('Gift submission failed', {
  customer_id: customer.id,
  transaction_id: gift.transactionId,
  status_code: err.status,
  error_message: err.message,
  attempt: attemptCount,
  request_id: response.headers.get('X-Request-Id'),
});
The key fields:
  • The customer identifier (for multi-tenant integrations).
  • The record identifier (for traceability).
  • The Virtuous request ID if returned in response headers (lets Virtuous engineering correlate against their logs).
  • The attempt count (for retry analysis).

Metrics on failure categories

Track separate counters for each failure category:
| Counter | What it tells you |
|---|---|
| virtuous_request_total | Overall request volume |
| virtuous_request_failures{category="transient"} | Retry-driven noise; expected to be > 0 |
| virtuous_request_failures{category="persistent_recoverable"} | Credentials and infrastructure issues; should be near zero |
| virtuous_request_failures{category="permanent"} | Data quality and bug issues; investigation target |
| virtuous_dead_letter_queue_depth | Backlog of unresolved permanent failures |
| virtuous_circuit_breaker_state{breaker="virtuous_main"} | Per-breaker state for alerting |
Alert on persistent-recoverable failures and on dead-letter-queue depth growth. Transient failures shouldn’t page anyone — they’re expected.
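The alerting rule above can be expressed as a comparison between two metric snapshots taken one interval apart. This is only a sketch: `alertsFor` and the snapshot field names are assumptions, and a real deployment would more likely express the same rule in its metrics backend's own alerting language.

```javascript
// Hypothetical alert rule over two metric snapshots taken one interval apart.
function alertsFor(previous, current) {
  const alerts = [];
  if (current.persistentRecoverableFailures > previous.persistentRecoverableFailures) {
    alerts.push('persistent_recoverable_failures');
  }
  if (current.deadLetterQueueDepth > previous.deadLetterQueueDepth) {
    alerts.push('dead_letter_queue_growth');
  }
  // Transient failures are deliberately not inspected: they are expected noise.
  return alerts;
}
```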

Tracing where supported

For complex sync workflows that span multiple services, distributed tracing (OpenTelemetry, etc.) makes failure investigation dramatically easier. A trace shows the path of a single record through your queue, submitter, Virtuous API, and back through the webhook receiver — surfacing where in the path the failure occurred.

Operational practices

Three practices keep a recovery system healthy over time:

Periodic recovery testing

In a staging environment, deliberately inject failures and confirm the recovery paths work:
  • Block your integration’s network for a minute and confirm requests are retried successfully.
  • Return 429 from a mocked endpoint and confirm the rate-limit pause works.
  • Send malformed payloads and confirm dead-letter routing works.
Production behavior under failure is predictable only if you’ve tested the failure paths.
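For the retry and rate-limit checks, a small test double that fails a fixed number of times and then succeeds is often enough. The sketch below is one such helper; `failNTimes` is a hypothetical name, and the `status` property on the thrown error follows the convention used by the retry code on this page.

```javascript
// Hypothetical test double: rejects with the given HTTP status a fixed number of
// times, then resolves. Exercises retry paths without needing a real outage.
function failNTimes(failures, status, result = { ok: true }) {
  let calls = 0;
  return async function flaky() {
    calls++;
    if (calls <= failures) {
      const err = new Error(`Injected failure ${calls} of ${failures}`);
      err.status = status;
      throw err;
    }
    return result;
  };
}
```

Passing `failNTimes(2, 503)` to a retry wrapper should produce a success on the third attempt; passing `failNTimes(1, 400)` should stop after the first, since a 400 is permanent.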

Dead-letter queue review cadence

Schedule a regular (weekly or daily) review of dead-letter entries. Decide for each: replay, fix, or discard. An ignored dead-letter queue silently accumulates unresolved problems that compound over time.

Failure-rate baseline

Establish what “normal” looks like for each error category. A transient failure rate of 0.5% is probably fine; a sudden jump to 5% needs investigation. Without a baseline, you can’t tell normal from anomalous.
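A baseline check can be as simple as comparing the current failure rate against a multiple of the normal rate. In the sketch below, `isAnomalousFailureRate` is a hypothetical helper and the default `factor` of 3 is an assumed sensitivity setting, not a recommended value.

```javascript
// Hypothetical baseline check: flags a failure rate well above the established normal.
function isAnomalousFailureRate(failures, total, baselineRate, factor = 3) {
  if (total === 0) return false; // no traffic, nothing to judge
  return failures / total > baselineRate * factor;
}

console.log(isAnomalousFailureRate(5, 1000, 0.005));  // false: 0.5% matches the baseline
console.log(isAnomalousFailureRate(50, 1000, 0.005)); // true: 5% is ten times the baseline
```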

Where to go next

API Performance Tips

The performance practices that complement these recovery patterns.

Reconcile Failed Syncs

The reconciliation safety net that catches what retries miss.

Idempotency and Safe Reprocessing

The precondition that makes retries safe.

Error Handling

The error-envelope reference that the classification on this page builds on.
Last modified on May 21, 2026