Error Recovery Patterns - Virtuous API Docs

Every Raise integration eventually encounters errors. Some are transient (gateway timeouts, network blips, brief rate-limit windows) and recover on retry. Some are permanent (validation failures, deleted records, revoked credentials) and require human intervention. The integrations that handle both cases well — distinguishing between them, retrying transient errors appropriately, surfacing permanent ones for review without spamming alerts — are the ones that stay up under production load. This page covers the classification framework, the retry patterns with backoff, circuit breakers for cascading failures, dead-letter queues for permanent failures, and the special considerations for POST /api/Raise/give (which charges payment methods and can’t be retried naively).

Classifying errors

The first and most important decision: is this error transient or permanent? The right response is wildly different.

Error class	Examples	Right response
Permanent client	`400` validation, `404` not found, `403` forbidden	Don’t retry. The request itself needs to change.
Persistent-recoverable	`401` unauthorized	Don’t retry. The credential needs to be refreshed.
Transient	`429`, `500`, `502`, `503`, `504`, network errors, TLS errors	Retry with exponential backoff.
Ambiguous	Some `4xx` codes, occasional weird responses	Treat conservatively — assume permanent unless evidence suggests otherwise.

The classifier is the foundation of all retry logic. Get it right, and everything else falls into place. Get it wrong, and you either hammer the API with futile retries or fail to retry transient errors that would succeed.

A reference classifier

JavaScript

function classifyError(status, problem) {
  // Network or TLS errors (status undefined) — always transient
  if (!status) return 'transient';

  // Success
  if (status >= 200 && status < 300) return 'success';

  // 3xx — typically not seen in API responses; treat as transient
  if (status >= 300 && status < 400) return 'transient';

  // 4xx — client errors
  if (status === 401) return 'persistent_recoverable'; // Credential
  if (status === 403) return 'permanent_client';
  if (status === 404) return 'permanent_client';
  if (status === 408) return 'transient'; // Request timeout
  if (status === 409) return 'permanent_client'; // Conflict
  if (status === 422) return 'permanent_client'; // Unprocessable entity
  if (status === 429) return 'transient'; // Rate limited
  if (status >= 400 && status < 500) {
    // 400 and other 4xx: typically validation failures. These are permanent
    // whether or not the body carries per-field details in problem.errors.
    return 'permanent_client';
  }

  // 5xx — server errors, all transient
  if (status >= 500) return 'transient';

  // Unknown — be cautious
  return 'permanent_client';
}

This is a starting point; tune it based on what your integration actually encounters. Log unexpected classifications so you can refine the rules.
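One way to follow that advice is to count every classification and flag status codes that fall outside the set you expect. The sketch below is a minimal in-memory version; `createClassificationLog`, `EXPECTED_STATUSES`, and the injected `logger` are hypothetical names chosen for illustration, not part of the Raise API.

```javascript
// Status codes this page discusses explicitly. (Assumption: extend or trim
// this set to match what your integration actually sees.)
const EXPECTED_STATUSES = new Set([400, 401, 403, 404, 408, 409, 422, 429, 500, 502, 503, 504]);

function createClassificationLog(logger = console) {
  const counts = new Map();

  return {
    // Record one classified response. `status` is undefined for network errors.
    record(status, classification) {
      const key = `${status ?? 'network'}:${classification}`;
      counts.set(key, (counts.get(key) ?? 0) + 1);

      const expected = status === undefined || EXPECTED_STATUSES.has(status);
      const success = status >= 200 && status < 300;
      if (!expected && !success) {
        logger.warn(`Unexpected status ${status} was classified as ${classification}`);
      }
    },

    // Plain-object snapshot, suitable for metrics or a periodic log line.
    snapshot() {
      return Object.fromEntries(counts);
    },
  };
}
```

Calling `record(response.status, classification)` next to each `classifyError` call keeps the counts current; the warnings point at the rules worth refining first.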

Retry patterns for transient errors

Transient errors are the bread-and-butter case for retry logic. The patterns:

Exponential backoff

The basic pattern: each retry waits longer than the previous, eventually giving up.

JavaScript

async function callWithRetry(url, options, maxAttempts = 5) {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
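    let response;
    try {
      response = await fetch(url, options);
    } catch (networkErr) {
      // fetch rejects on network or TLS failures; treat them as transient
      if (attempt === maxAttempts) throw networkErr;
      await sleep(Math.pow(2, attempt - 1) * 1000 + Math.random() * 1000);
      continue;
    }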

    if (response.ok) return response;

    const classification = classifyError(response.status, await parseProblem(response));

    if (classification !== 'transient') {
      // Non-retryable — fail fast
      throw makeError(response, classification);
    }

    if (attempt === maxAttempts) {
      throw new Error(`Failed after ${maxAttempts} attempts`);
    }

    // Compute backoff: 1s, 2s, 4s, 8s (the final attempt throws instead of sleeping)
    const baseDelay = Math.pow(2, attempt - 1) * 1000;

    // Add jitter to avoid thundering-herd retries
    const jitter = Math.random() * 1000;

    await sleep(baseDelay + jitter);
  }
}

Three things this pattern gets right:

Exponential growth means the first retries happen quickly, but persistent failures don’t loop tightly.
Jitter prevents synchronized retries across many integrations from hammering the API simultaneously.
Bounded attempts ensure the retry loop terminates rather than retrying forever.
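The retry examples on this page call `sleep`, `parseProblem`, and `makeError` without defining them. A minimal sketch of what they might look like follows. The properties attached to the error (`status`, `classification`, `problem`) and the optional third argument to `makeError` are assumptions made for illustration.

```javascript
// Resolve after `ms` milliseconds.
function sleep(ms) {
  return new Promise((resolve) => setTimeout(resolve, ms));
}

// Read the error body as JSON. Returns null when the body is empty or is not
// JSON (for example, an HTML error page produced by a gateway).
async function parseProblem(response) {
  try {
    return await response.json();
  } catch {
    return null;
  }
}

// Build an Error that carries the status and classification for callers.
function makeError(response, classification, problem = null) {
  const message =
    problem?.detail ?? problem?.title ?? `Request failed with status ${response.status}`;
  const err = new Error(message);
  err.status = response.status;
  err.classification = classification;
  err.problem = problem;
  return err;
}
```

Because a response body can only be read once, parse the problem a single time and pass the result along rather than calling `response.json()` again later.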

Honor `Retry-After` when present

For rate-limit responses, the server may include a Retry-After header indicating how long to wait. Honor it instead of computing a backoff:

JavaScript

async function callWithRetry(url, options, maxAttempts = 5) {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
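    let response;
    try {
      response = await fetch(url, options);
    } catch (networkErr) {
      // fetch rejects on network or TLS failures; treat them as transient
      if (attempt === maxAttempts) throw networkErr;
      await sleep(Math.pow(2, attempt - 1) * 1000 + Math.random() * 1000);
      continue;
    }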

    if (response.ok) return response;

    const classification = classifyError(response.status, await parseProblem(response));

    if (classification !== 'transient') {
      throw makeError(response, classification);
    }

    if (attempt === maxAttempts) {
      throw new Error(`Failed after ${maxAttempts} attempts`);
    }

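    // Honor Retry-After if the server provided a usable value
    const retryAfter = response.headers.get('Retry-After');
    const retryAfterMs = parseInt(retryAfter, 10) * 1000;
    let delayMs;
    if (Number.isFinite(retryAfterMs)) {
      delayMs = retryAfterMs;
    } else {
      // Header missing, or an HTTP-date rather than a number of seconds
      delayMs = Math.pow(2, attempt - 1) * 1000 + Math.random() * 1000;
    }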

    await sleep(delayMs);
  }
}

The Raise OpenAPI spec doesn’t explicitly document the Retry-After header on rate-limit responses. The pattern uses the header when it’s present (following common HTTP convention) and falls back to exponential backoff when it isn’t. See Rate Limits for what’s known.
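In HTTP generally, `Retry-After` can carry either a number of seconds or an HTTP-date, so a parser that only handles integers will misread the second form. The helper below is a sketch that accepts both; `parseRetryAfterMs` and its `maxMs` cap are hypothetical, and nothing here confirms which form (if any) Raise sends.

```javascript
// Convert a Retry-After header value to milliseconds.
// Returns null when the header is absent or unparseable, so the caller can
// fall back to exponential backoff. `maxMs` is an illustrative safety cap.
function parseRetryAfterMs(headerValue, now = Date.now(), maxMs = 60000) {
  if (headerValue == null) return null;
  const trimmed = String(headerValue).trim();
  if (trimmed === '') return null;

  let ms;
  if (/^\d+$/.test(trimmed)) {
    // delay-seconds form, e.g. "120"
    ms = Number(trimmed) * 1000;
  } else {
    // HTTP-date form, e.g. "Wed, 21 Oct 2015 07:28:00 GMT"
    const when = Date.parse(trimmed);
    if (Number.isNaN(when)) return null;
    ms = when - now;
  }

  // Never wait a negative amount, and never longer than the cap
  return Math.min(Math.max(ms, 0), maxMs);
}
```

A caller could then write `const delayMs = parseRetryAfterMs(response.headers.get('Retry-After')) ?? backoffDelay;` where `backoffDelay` is the computed exponential value.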

Don’t retry forever

A 5xx that persists for hours isn’t transient — it’s a sustained issue worth surfacing for human review. Bound retries at a reasonable number (typically 5 attempts) and move to a different handling strategy after that. See Dead-letter queues below.
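To choose a bound, it helps to know how long a full retry run can take. The sketch below computes the sleep schedule for the backoff used on this page (1 second base, doubling, up to 1 second of jitter); `backoffSchedule` is a hypothetical helper, and the totals exclude the time spent waiting on each request.

```javascript
// Base delays (without jitter) between attempts: 1s, 2s, 4s, ...
// A run of `maxAttempts` attempts sleeps `maxAttempts - 1` times, because the
// final attempt gives up instead of sleeping.
function backoffSchedule(maxAttempts, baseMs = 1000, maxJitterMs = 1000) {
  const delays = [];
  for (let attempt = 1; attempt < maxAttempts; attempt++) {
    delays.push(Math.pow(2, attempt - 1) * baseMs);
  }
  const baseTotalMs = delays.reduce((sum, delay) => sum + delay, 0);
  return {
    delays,
    baseTotalMs,
    // Upper bound on total sleep if every jitter draw were at its maximum
    worstCaseMs: baseTotalMs + delays.length * maxJitterMs,
  };
}

// For 5 attempts the base delays are 1s, 2s, 4s, and 8s: 15s in total,
// and under 19s once jitter is added.
const schedule = backoffSchedule(5);
```

Each extra attempt roughly doubles the total wait, which is why small bounds are usually enough before handing off to a different strategy.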

Errors that should never be retried

Three classes of errors that retrying only makes worse:

`401 Unauthorized`: the credential is bad

A 401 means the token is invalid, expired, or revoked. Retrying produces the same 401 indefinitely. The fix isn’t a retry — it’s a credential refresh.

JavaScript

if (response.status === 401) {
  await alertOps({
    severity: 'critical',
    customerId,
    message: 'Raise API token is invalid — credential needs refresh',
  });
  await pauseSyncForCustomer(customerId);
  throw new AuthError('Token invalid');
}

Pause the customer’s sync work until a human refreshes the token. Continuing to attempt requests with a bad token just generates noise.
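The pause itself can be as simple as a registry that sync workers consult before doing any work. The following is an in-memory sketch under that assumption; `createPauseRegistry` and `assertNotPaused` are hypothetical names, and a real integration would likely persist the state so a restart does not silently resume paused customers.

```javascript
function createPauseRegistry() {
  const paused = new Map(); // customerId -> { reason, pausedAt }

  return {
    // Pausing twice keeps the original reason and timestamp
    pause(customerId, reason) {
      if (!paused.has(customerId)) {
        paused.set(customerId, { reason, pausedAt: new Date() });
      }
    },
    // Returns true if the customer was paused, false otherwise
    resume(customerId) {
      return paused.delete(customerId);
    },
    isPaused(customerId) {
      return paused.has(customerId);
    },
  };
}

// Guard to call before starting any sync work for a customer.
function assertNotPaused(registry, customerId) {
  if (registry.isPaused(customerId)) {
    throw new Error(`Sync paused for customer ${customerId}`);
  }
}
```

Once a human has refreshed the token, calling `resume(customerId)` lets the next scheduled sync proceed.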

`400` validation failures: the request is bad

A 400 typically means the request body has a validation issue — a missing required field, an invalid value, a malformed structure. Retrying with the same body produces the same 400. The fix is to correct the request:

JavaScript

if (response.status === 400) {
  const problem = await response.json();
  if (problem.errors) {
    // Per-field validation errors
    throw new ValidationError('Request validation failed', problem.errors);
  }
  // Other 400 — payment failure, etc.
  throw new ClientError(problem.detail || problem.title);
}

For partner integrations submitting donations, a 400 from POST /api/Raise/give may also indicate a payment failure (card declined, gateway rejection). These also should not be retried with the same payment method — surface them to the donor for a different card.
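When surfacing validation failures to a donor or an operator, a flat list of messages is often easier to render than a nested object. The sketch below assumes `problem.errors` maps each field name to a message or an array of messages; that shape is an assumption, so check it against real responses. `flattenValidationErrors` is a hypothetical helper.

```javascript
// Turn a problem body into a flat list of { field, message } pairs.
// Assumption: `errors` looks like { amount: ['Required'], 'donor.email': 'Invalid' }.
function flattenValidationErrors(problem) {
  const errors = problem?.errors;
  if (!errors || typeof errors !== 'object') return [];

  const flat = [];
  for (const [field, messages] of Object.entries(errors)) {
    // [].concat() accepts both a single message and an array of messages
    for (const message of [].concat(messages)) {
      flat.push({ field, message: String(message) });
    }
  }
  return flat;
}
```

An empty result means the `400` carried no per-field details, which is the branch where `problem.detail` or `problem.title` is the only explanation available.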

`404 Not Found`: the resource doesn’t exist

A 404 from GET /api/Donor/12345 means donor 12345 doesn’t exist (or was deleted). No retry will make it appear. The fix is to handle the absence gracefully:

JavaScript

if (response.status === 404) {
  return null; // Let the caller decide what to do
}

Sometimes a 404 is expected (the integration was checking for existence). Sometimes it indicates a deeper issue (the donor was deleted between the integration learning about them and the lookup). Don’t retry; the right path depends on context.

Special case: `POST /api/Raise/give` retries

Donation submissions deserve special attention because they charge payment methods. A naive retry on a network error can produce double charges if the original request succeeded but the response didn’t reach the integration.

The double-charge risk

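The sequence: the integration submits the donation, Raise charges the card, and the response is lost to a network error. The integration retries, Raise charges the card again, and this time the response arrives. The integration sees one successful response. The donor sees two charges. The customer has to issue a refund for one of them. Avoid this.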
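The failure mode is easy to reproduce with a stand-in for the API. In the sketch below, `createLossyGateway` is a hypothetical fake that performs the charge but drops its first response, and `naiveRetry` is the kind of loop to avoid; neither reflects how Raise is implemented.

```javascript
// A fake gateway that performs the charge but loses the first response.
function createLossyGateway() {
  const charges = [];
  let dropNext = true;
  return {
    charges,
    async give(amount) {
      charges.push(amount); // the charge happens on the server side
      if (dropNext) {
        dropNext = false;
        throw new TypeError('fetch failed'); // the response never arrives
      }
      return { id: charges.length, amount };
    },
  };
}

// Naive retry: treats every thrown error as "the request did not happen".
async function naiveRetry(fn, maxAttempts = 2) {
  let lastError;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
    }
  }
  throw lastError;
}

async function demo() {
  const gateway = createLossyGateway();
  const gift = await naiveRetry(() => gateway.give(50));
  // The caller received one gift, but the donor was charged twice
  return { giftId: gift.id, chargesMade: gateway.charges.length };
}
```

Running `demo()` resolves with `chargesMade` equal to 2 even though the caller only ever saw one successful response.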

The defensive pattern

For POST /api/Raise/give specifically, never retry network errors blindly. Instead:

JavaScript

async function submitDonationSafely(donationRequest, customerId) {
  // Generate a client-side tracking ID
  const trackingId = `${customerId}-${donationRequest.donor.email}-${Date.now()}`;

  // Record the intent before submitting
  await donationAttemptStore.recordIntent({
    trackingId,
    customerId,
    amount: donationRequest.amount,
    donorEmail: donationRequest.donor.email,
    submittedAt: new Date(),
    status: 'submitting',
  });

  try {
    const response = await fetch(
      'https://prod-api.raisedonors.com/api/Raise/give',
      {
        method: 'POST',
        headers: { /* ... */ },
        body: JSON.stringify(donationRequest),
      }
    );

    if (response.ok) {
      const gift = await response.json();
      await donationAttemptStore.recordSuccess(trackingId, gift.id);
      return gift;
    }

    // Non-OK response: classify. The body may not be JSON (e.g. a gateway error page).
    const problem = await response.json().catch(() => ({}));
    const classification = classifyError(response.status, problem);

    if (classification !== 'transient') {
      // Card declined, validation failed, bad credential, etc.: don't retry
      await donationAttemptStore.recordFailure(trackingId, 'permanent', problem);
      throw new DonationError(problem.detail, response.status, problem);
    }

    // Transient, but don't retry blindly
    await donationAttemptStore.recordFailure(trackingId, 'transient', problem);
    throw new DonationError(`Donation submission failed: ${problem.detail}`, response.status);

  } catch (err) {
    if (err instanceof DonationError) throw err;

    // Network error — uncertain whether the donation went through
    await donationAttemptStore.recordFailure(trackingId, 'uncertain', { error: err.message });
    throw new UncertainDonationError(
      'Donation may or may not have been processed — requires reconciliation',
      err
    );
  }
}

Reconciling uncertain donations

When a network error leaves the outcome uncertain, the integration shouldn’t auto-retry. Instead, surface the uncertain donation for reconciliation:

JavaScript

async function reconcileUncertainDonations() {
  const uncertain = await donationAttemptStore.findUncertain();

  for (const attempt of uncertain) {
    // Look for matching gifts in Raise from around the attempt time
    const candidates = await fetch(
      'https://prod-api.raisedonors.com/api/Gift/query',
      {
        method: 'POST',
        headers: { /* ... */ },
        body: JSON.stringify({
          skip: 0,
          take: 10,
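          // EQUALS, GT_OPERATOR, and AND_CONJUNCT are placeholders; substitute the
          // operator and conjunction values defined by the Gift query spec.
          groups: [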
            {
              conditions: [
                { parameter: 'donorEmail', operator: EQUALS, value: attempt.donorEmail },
                { parameter: 'amount', operator: EQUALS, value: attempt.amount.toString() },
                { parameter: 'date', operator: GT_OPERATOR, value: attempt.submittedAt },
              ],
              conjunct: AND_CONJUNCT,
            },
          ],
        }),
      }
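    ).then((r) => {
      // A failed lookup proves nothing; stop rather than mark the attempt retryable
      if (!r.ok) throw new Error(`Gift query failed with status ${r.status}`);
      return r.json();
    });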

    if (candidates.items.length > 0) {
      // Match found — the donation did go through
      await donationAttemptStore.recordSuccess(attempt.trackingId, candidates.items[0].id);
    } else {
      // No match — the donation did not go through; safe to retry
      await donationAttemptStore.markRetryable(attempt.trackingId);
    }
  }
}

Run this on a short cadence (every few minutes) to resolve uncertain attempts. Only after confirming the original attempt didn’t go through is it safe to retry.
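Both examples above depend on a `donationAttemptStore` that is never defined. An in-memory sketch of the interface they use is shown below; the status values (`'succeeded'`, `'retryable'`) and the extra `get` method are assumptions, and a production store would need to be durable so that uncertain attempts survive a restart.

```javascript
function createDonationAttemptStore() {
  const attempts = new Map(); // trackingId -> attempt record

  const update = (trackingId, changes) => {
    const attempt = attempts.get(trackingId);
    if (!attempt) throw new Error(`Unknown trackingId: ${trackingId}`);
    Object.assign(attempt, changes, { updatedAt: new Date() });
  };

  return {
    async recordIntent(attempt) {
      attempts.set(attempt.trackingId, { ...attempt });
    },
    async recordSuccess(trackingId, giftId) {
      update(trackingId, { status: 'succeeded', giftId });
    },
    // `kind` is 'permanent', 'transient', or 'uncertain'
    async recordFailure(trackingId, kind, details) {
      update(trackingId, { status: kind, details });
    },
    async findUncertain() {
      return [...attempts.values()].filter((a) => a.status === 'uncertain');
    },
    async markRetryable(trackingId) {
      update(trackingId, { status: 'retryable' });
    },
    async get(trackingId) {
      return attempts.get(trackingId);
    },
  };
}
```

Recording the intent before the request is sent is what makes reconciliation possible: an attempt that is still `'submitting'` or `'uncertain'` after a crash is a signal to check Raise before doing anything else.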

This is the recommended pattern only because the Raise spec doesn’t currently document an idempotency-key header that would solve the problem more elegantly. When such a header becomes available, use it instead; it’s a more robust solution than client-side reconciliation.

⚠️ Spec gap: No Idempotency-Key header is documented for POST /api/Raise/give. Confirm whether the platform supports one before relying on the client-side reconciliation pattern.

Circuit breakers

For workloads that touch many records, a sustained failure can produce a cascade — many in-flight requests all hitting the same issue, all retrying, all eventually failing. A circuit breaker stops the cascade by short-circuiting requests after a threshold of failures.

A basic circuit breaker

JavaScript

class CircuitBreaker {
  constructor({ failureThreshold = 10, resetTimeoutMs = 60000 }) {
    this.failureThreshold = failureThreshold;
    this.resetTimeoutMs = resetTimeoutMs;
    this.state = 'closed';        // 'closed' | 'open' | 'half-open'
    this.failureCount = 0;
    this.openedAt = null;
  }

  async call(fn) {
    if (this.state === 'open') {
      if (Date.now() - this.openedAt > this.resetTimeoutMs) {
        this.state = 'half-open';
      } else {
        throw new CircuitBreakerOpenError('Circuit breaker is open');
      }
    }

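    try {
      const result = await fn();
      if (this.state === 'half-open') {
        // Success after half-open: close the circuit
        this.state = 'closed';
      }
      // Reset on every success so only consecutive failures trip the breaker
      this.failureCount = 0;
      return result;
    } catch (err) {
      this.failureCount++;
      // A failed half-open probe reopens immediately; otherwise wait for the threshold
      if (this.state === 'half-open' || this.failureCount >= this.failureThreshold) {
        this.state = 'open';
        this.openedAt = Date.now();
      }
      throw err;
    }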
  }
}

Use one circuit breaker per logical operation (per-endpoint, per-customer, or per-destination) so a failure in one doesn’t disrupt others:

JavaScript

const breakers = new Map();

function getBreaker(key) {
  if (!breakers.has(key)) {
    breakers.set(key, new CircuitBreaker({ failureThreshold: 10, resetTimeoutMs: 60000 }));
  }
  return breakers.get(key);
}

async function callRaise(url, options, breakerKey) {
  return getBreaker(breakerKey).call(() => callWithRetry(url, options));
}

When the breaker opens, new requests fail fast rather than producing further retries. After the reset timeout, the breaker tentatively allows requests through again (“half-open”). If they succeed, the breaker closes; if not, it reopens.
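The choice of breaker key decides how far a failure spreads. One possible scheme, sketched below, scopes each breaker to a customer, an HTTP method, and a path template; `breakerKeyFor` is a hypothetical helper, and replacing numeric path segments with `:id` is an assumption about how resource IDs appear in URLs.

```javascript
// Build a breaker key of the form "customerId:METHOD:/path/template".
// Numeric path segments become ":id" so that /api/Donor/12345 and
// /api/Donor/67890 share one breaker instead of creating one per record.
function breakerKeyFor(customerId, method, url) {
  const path = new URL(url).pathname
    .split('/')
    .map((segment) => (/^\d+$/.test(segment) ? ':id' : segment))
    .join('/');
  return `${customerId}:${method.toUpperCase()}:${path}`;
}
```

With this scheme, a run of failed donor lookups for one customer opens only that customer's donor-lookup breaker; other customers and other endpoints keep flowing.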

When to use circuit breakers

Use a breaker	Don’t bother
Operations that produce many concurrent requests (bulk syncs, parallel reads)	Single-request workflows (one-off API calls)
Operations that hit shared downstream resources	Operations against many independent endpoints
Operations that are expensive to retry (payment processing)	Operations that are cheap and idempotent

For partner integrations operating at scale (hundreds of customers, thousands of requests per minute), circuit breakers prevent localized issues from cascading into widespread degradation.

Dead-letter queues

When all retries fail, the operation can’t continue. Two options: drop it silently (bad — lost work) or move it to a dead-letter queue for human review (good).

A dead-letter pattern

SQL

CREATE TABLE dead_letter_queue (
  id BIGSERIAL PRIMARY KEY,
  customer_id TEXT NOT NULL,
  operation_type TEXT NOT NULL,
  operation_payload JSONB NOT NULL,
  last_error TEXT NOT NULL,
  last_error_status INTEGER,
  attempts INTEGER NOT NULL,
  first_attempted_at TIMESTAMPTZ NOT NULL,
  last_attempted_at TIMESTAMPTZ NOT NULL,
  resolved_at TIMESTAMPTZ,
  resolution TEXT
);

The flow:

Operation fails permanently or exhausts retries

A 400 validation error, a sustained 5xx, or a network error that doesn’t recover.

Move the operation to the dead-letter queue

Capture the full operation payload, the last error, and the attempt history.

Continue processing other operations

One bad operation doesn’t block the queue.

Surface the dead-letter entry for review

Alert or daily digest to ops; expose in a UI for support staff.

Investigate and resolve

Either fix the underlying issue and replay, or mark the operation as permanently lost.

Replaying from the dead-letter queue

For operations that failed due to a transient issue that’s now resolved, replay them:

JavaScript

async function replayDeadLetter(dlqId) {
  const entry = await dlq.findById(dlqId);

  try {
    await performOperation(entry.operation_payload);
    await dlq.markResolved(dlqId, 'replay_succeeded');
  } catch (err) {
    // Failed again — update the attempt count, leave in DLQ
    await dlq.recordAdditionalFailure(dlqId, err);
    throw err;
  }
}

A reasonable UI: an ops dashboard showing dead-letter entries with “replay” and “mark resolved” buttons. Most entries are resolved by replay once the underlying issue is fixed (credential renewed, downstream system back up, etc.).
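The replay function assumes a `dlq` object with `findById`, `markResolved`, and `recordAdditionalFailure`. Below is an in-memory sketch of such an object whose entries mirror the table columns above; `createDeadLetterQueue` and `findUnresolved` are hypothetical names, and a real implementation would issue SQL against the table instead of using a `Map`.

```javascript
function createDeadLetterQueue() {
  const entries = new Map();
  let nextId = 1;

  const get = (id) => {
    const entry = entries.get(id);
    if (!entry) throw new Error(`No dead-letter entry with id ${id}`);
    return entry;
  };

  return {
    async add({ customerId, operationType, operationPayload, lastError,
                lastErrorStatus = null, attempts, firstAttemptedAt, lastAttemptedAt }) {
      const entry = {
        id: nextId++,
        customer_id: customerId,
        operation_type: operationType,
        operation_payload: operationPayload,
        last_error: lastError,
        last_error_status: lastErrorStatus,
        attempts,
        first_attempted_at: firstAttemptedAt,
        last_attempted_at: lastAttemptedAt,
        resolved_at: null,
        resolution: null,
      };
      entries.set(entry.id, entry);
      return entry;
    },
    async findById(id) {
      return get(id);
    },
    async markResolved(id, resolution) {
      Object.assign(get(id), { resolved_at: new Date(), resolution });
    },
    // A failed replay counts as one more attempt and replaces the last error
    async recordAdditionalFailure(id, err) {
      const entry = get(id);
      entry.attempts += 1;
      entry.last_error = err.message;
      entry.last_error_status = err.status ?? null;
      entry.last_attempted_at = new Date();
    },
    async findUnresolved() {
      return [...entries.values()].filter((e) => e.resolved_at === null);
    },
  };
}
```

The `add` signature matches the camelCase fields passed by the complete pipeline later on this page, while stored entries use the snake_case column names that the alerting code reads.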

Surfacing errors to humans

Not every error needs to wake someone up. A reasonable severity model:

Severity	What triggers it	Response
Page	Sustained failures affecting many customers; widespread integration outage	On-call engineer woken up
High alert	Single-customer issue blocking critical workflow (sync paused, donations failing)	Same-day investigation
Warning	Elevated error rate, dead-letter accumulation, expired credentials	Next-business-day review
Info / log only	Per-request retries, expected `404`s, dedup decisions	No alert; available in logs

The right thresholds depend on the integration’s SLA. For a major-donor-focused integration, even a single failed POST /api/Raise/give may warrant a same-day investigation. For a low-priority analytics sync, the same failure might be a warning aggregated into a daily digest.

Useful alert content

A useful error alert includes:

The customer affected
The operation that failed
The error classification (transient, permanent, uncertain)
The number of attempts made
The last error message
A link to the dead-letter entry (or wherever the operation can be inspected and replayed)
Suggested next steps based on the error type

JavaScript

async function alertOnDeadLetter(entry) {
  await alerter.send({
    severity: classifyDeadLetterSeverity(entry),
    title: `Operation ${entry.operation_type} failed permanently`,
    fields: {
      customer: entry.customer_id,
      attempts: entry.attempts,
      lastError: entry.last_error,
      lastStatus: entry.last_error_status,
    },
    links: [
      { label: 'View in dashboard', url: `https://ops.example.com/dlq/${entry.id}` },
      { label: 'Replay', url: `https://ops.example.com/dlq/${entry.id}/replay` },
    ],
    suggestedActions: suggestActions(entry),
  });
}

function suggestActions(entry) {
  if (entry.last_error_status === 401) {
    return ['Verify the customer\'s API token', 'Contact customer to issue new token'];
  }
  if (entry.last_error_status === 400) {
    return ['Inspect the payload', 'Update integration logic if validation rule changed'];
  }
  if (!entry.last_error_status) {
    return ['Network issue — try replaying', 'Check Raise API status if recurring'];
  }
  return ['Investigate via dashboard'];
}
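The alert code calls `classifyDeadLetterSeverity` without defining it. One possible mapping onto the severity model above is sketched here. The `affectedCustomers` input, the `pageThreshold` of 5, the `'submit_donation'` operation name, and the returned labels are all hypothetical; tune them to the integration's SLA.

```javascript
// Map a dead-letter entry to 'page', 'high', or 'warning'.
// `affectedCustomers` is the number of distinct customers with recent
// dead-letter entries, supplied by the caller.
function classifyDeadLetterSeverity(entry, { affectedCustomers = 1, pageThreshold = 5 } = {}) {
  // Many customers failing at once suggests a widespread outage
  if (affectedCustomers >= pageThreshold) return 'page';

  const blocksCriticalWorkflow =
    entry.last_error_status === 401 ||          // sync is paused until the token is refreshed
    entry.operation_type === 'submit_donation'; // hypothetical name for donation submissions
  if (blocksCriticalWorkflow) return 'high';

  // Anything else that reached the queue is worth a next-business-day look
  return 'warning';
}
```

There is no `'info'` branch because an operation that reaches the dead-letter queue has already failed for good; log-only events are the ones that never get this far.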

Idempotency: the underlying defense

Most error-recovery patterns rely on idempotency — the property that repeating an operation produces the same outcome as running it once. Build idempotency into operations from the start.

Webhook handlers

See Idempotency and Safe Reprocessing for the full pattern. Summary: every event has a unique key (typically contextId + eventType + modifiedDate), and the dedup store records processed events. Retried deliveries skip re-processing.
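As a compact illustration of that summary, the sketch below builds the key from `contextId`, `eventType`, and `modifiedDate` and skips events it has already handled. `eventKey`, `createDedupStore`, and `handleWebhook` are hypothetical names, and the in-memory `Set` stands in for whatever durable store the integration uses.

```javascript
// Build the dedup key from the three fields named above.
function eventKey(event) {
  return [event.contextId, event.eventType, event.modifiedDate].join('|');
}

function createDedupStore() {
  const seen = new Set();
  return {
    // Returns true the first time a key is claimed, false on every repeat
    claim(key) {
      if (seen.has(key)) return false;
      seen.add(key);
      return true;
    },
    // Forget a key so a later redelivery can be processed
    release(key) {
      seen.delete(key);
    },
  };
}

async function handleWebhook(event, store, processEvent) {
  const key = eventKey(event);
  if (!store.claim(key)) return 'skipped';
  try {
    await processEvent(event);
    return 'processed';
  } catch (err) {
    // Processing failed: release the claim so the retried delivery is not skipped
    store.release(key);
    throw err;
  }
}
```

Releasing the claim on failure matters as much as claiming it: without that step, a delivery that failed halfway would be treated as done when it is retried.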

Downstream writes

For partner integrations writing to external systems, use the external system’s idempotency mechanisms:

Database upserts keyed by Raise resource IDs.
Idempotency keys on third-party API calls (Stripe, Slack, many modern APIs support them).
Conditional logic that checks for existing records before creating new ones.

The combination of webhook-level dedup and downstream-write idempotency ensures retries are safe to perform.

`POST /api/Raise/give` — the special case

Donation submissions are the most challenging case because the operation is genuinely not idempotent at the API level (no documented idempotency-key header). The client-side reconciliation pattern (see The defensive pattern above) is the workaround. When an idempotency-key header becomes available, switch to it.

A complete error-handling pipeline

Putting the patterns together:

JavaScript

class RaiseClient {
  constructor({ token, customerId, breaker, dlq, attemptStore }) {
    this.token = token;
    this.customerId = customerId;
    this.breaker = breaker;
    this.dlq = dlq;
    this.attemptStore = attemptStore;
  }

  async call(url, options = {}, opName = 'unknown') {
    return this.breaker.call(async () => {
      try {
        return await this._callWithRetry(url, options);
      } catch (err) {
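        // Every classified error that reaches this point has either failed
        // permanently or exhausted its retries, so it goes to the dead-letter queue
        if (err.classification) {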
          await this.dlq.add({
            customerId: this.customerId,
            operationType: opName,
            operationPayload: { url, options },
            lastError: err.message,
            lastErrorStatus: err.status,
            attempts: err.attempts,
            firstAttemptedAt: err.firstAttemptedAt,
            lastAttemptedAt: new Date(),
          });
          await alertOnError(err, this.customerId, opName);
        }
        throw err;
      }
    });
  }

  async _callWithRetry(url, options, maxAttempts = 5) {
    const firstAttemptedAt = new Date();
    for (let attempt = 1; attempt <= maxAttempts; attempt++) {
      let response;
      try {
        response = await fetch(url, {
          ...options,
          headers: {
            Authorization: `Bearer ${this.token}`,
            'Content-Type': 'application/json',
            ...(options.headers ?? {}),
          },
        });
      } catch (err) {
        // Network error
        if (attempt === maxAttempts) {
          const finalErr = new Error(err.message);
          finalErr.classification = 'transient';
          finalErr.attempts = attempt;
          finalErr.firstAttemptedAt = firstAttemptedAt;
          throw finalErr;
        }
        await sleep(this._backoff(attempt));
        continue;
      }

      if (response.ok) return response.json();

      const problem = await parseProblem(response);
      const classification = classifyError(response.status, problem);

      if (classification !== 'transient') {
        const err = new Error(problem?.detail ?? problem?.title ?? 'Request failed');
        err.classification = classification;
        err.status = response.status;
        err.problem = problem;
        err.attempts = attempt;
        err.firstAttemptedAt = firstAttemptedAt;
        throw err;
      }

      if (attempt === maxAttempts) {
        const err = new Error(`Transient failure after ${maxAttempts} attempts`);
        err.classification = 'transient';
        err.status = response.status;
        err.attempts = attempt;
        err.firstAttemptedAt = firstAttemptedAt;
        throw err;
      }

      // Honor Retry-After if present and numeric; otherwise back off
      const retryAfterMs = parseInt(response.headers.get('Retry-After'), 10) * 1000;
      const delay = Number.isFinite(retryAfterMs) ? retryAfterMs : this._backoff(attempt);
      await sleep(delay);
    }
  }

  _backoff(attempt) {
    return Math.pow(2, attempt - 1) * 1000 + Math.random() * 1000;
  }
}

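Use this pattern for every Raise API call except POST /api/Raise/give, which should go through the reconciliation pattern described earlier rather than automatic retries. The cost of writing the pattern once is small; the cost of not having it is paid at every incident.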

A recovery checklist

When designing a Raise integration, confirm each of the following:

Every API call goes through a function that classifies and retries appropriately
POST /api/Raise/give uses the client-side reconciliation pattern, not naive retry
Webhook handlers are idempotent — retries don’t produce duplicate side effects
Bulk operations use circuit breakers to prevent cascade failures
Permanently-failed operations go to a dead-letter queue
Dead-letter entries are surfaced to ops with enough context to investigate
401 responses pause the affected customer’s work and alert ops, rather than retrying
Network errors and 5xx responses retry with exponential backoff + jitter
Retry-After headers are honored when present
Rate-limit 429 responses are visible in metrics

Most of these are small individually. Together, they make the difference between an integration that requires constant manual intervention and one that recovers from most failures on its own.

Where to go next

Sync Architecture Patterns

The architectural patterns these error-recovery patterns plug into.

API Performance Tips

Performance patterns complementary to recovery — fewer requests means fewer chances to fail.

Rate Limits

The 429 patterns referenced throughout this page.

Idempotency and Safe Reprocessing

Webhook-specific idempotency that pairs with API error recovery.
