The recovery mental model
Three categories of failure matter:

| Category | Symptom | Recovery |
|---|---|---|
| Transient | A momentary network glitch, a temporary platform hiccup, a brief rate-limit spike | Retry — usually succeeds within a few attempts |
| Persistent-recoverable | An expired credential, a temporarily-unreachable endpoint, sustained rate limiting | Pause and alert — succeeds after human intervention |
| Permanent | A malformed request, a missing required field, a deleted resource | Stop and surface — never resolves on retry alone |
Classifying errors
The HTTP status code is the primary classification signal:

| Status | Category | Why |
|---|---|---|
| 2xx | Success | No error. |
| 400 Bad Request | Permanent | The request is malformed; retrying without changes won’t help. |
| 401 Unauthorized | Persistent-recoverable | Credentials are invalid or expired. Human intervention is needed to refresh them. |
| 403 Forbidden | Permanent | Permissions issue. Retrying won’t help; the API key needs different permissions. |
| 404 Not Found | Context-dependent | If the resource never existed: permanent. If it was just created and indexing is delayed: transient. |
| 409 Conflict | Context-dependent | Concurrent update or unique-constraint violation. Sometimes retryable, sometimes permanent. |
| 422 Unprocessable Entity | Permanent | Validation failed. The same request will fail again. |
| 429 Too Many Requests | Transient | Rate limit exceeded. Back off and retry; the request succeeds once the window resets. |
| 500 Internal Server Error | Transient | Server-side issue. Usually resolves on retry. |
| 502 Bad Gateway | Transient | Network or proxy issue. Resolves on retry. |
| 503 Service Unavailable | Transient | Server overloaded or under maintenance. Resolves on retry. |
| 504 Gateway Timeout | Transient | Upstream timeout. Resolves on retry. |
| Network error (DNS, connection refused, timeout) | Transient | Resolves on retry. |
The 404 and 409 edge cases
The two ambiguous codes deserve explicit handling:

404 after creation. If you just created a resource and look it up moments later, you may briefly get 404 while indexing catches up. Treat as transient with a small retry budget (3–5 attempts over 30 seconds). After that, treat as permanent: the resource probably wasn’t created successfully.

409 Conflict on uniqueness. A unique-constraint violation (e.g., trying to create a Contact with an email that’s already on another Contact) is permanent: retrying produces the same error. A 409 from concurrent modification (e.g., two writers updating the same record at the same instant) is transient, and a retry on a fresh GET-then-PUT cycle usually succeeds.
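A minimal sketch of the classification table and the two edge cases, not a definitive implementation. The function name `classifyError` and the `context` flags `justCreated` and `concurrentModification` are hypothetical; how you detect those two situations depends on your integration. Defaulting statuses that the table does not list to permanent is also an assumption.

```javascript
// Hypothetical classifier based on the status table above.
// Returns 'transient', 'persistent_recoverable', 'permanent', or null for success.
const TRANSIENT_STATUSES = new Set([429, 500, 502, 503, 504]);

function classifyError(status, context = {}) {
  // No HTTP status at all means a network-level failure (DNS, refused, timeout).
  if (status === undefined || status === null) return 'transient';

  // 2xx is not an error.
  if (status >= 200 && status < 300) return null;

  // 404 is transient only while a just-created resource may still be indexing.
  if (status === 404) return context.justCreated ? 'transient' : 'permanent';

  // 409 is transient for concurrent modification, permanent for uniqueness violations.
  if (status === 409) return context.concurrentModification ? 'transient' : 'permanent';

  if (status === 401) return 'persistent_recoverable';
  if (TRANSIENT_STATUSES.has(status)) return 'transient';

  // Assumption: anything else (400, 403, 422, unlisted codes) is treated as permanent.
  return 'permanent';
}
```

For the 404-after-creation case, the caller would pass `justCreated: true` only while the small retry budget lasts, then drop the flag so the same 404 is classified as permanent.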
Retry with exponential backoff and jitter
For transient errors, retry with a delay that grows on each attempt. Three properties matter:

| Property | Purpose |
|---|---|
| Exponential growth | Quick first retry (resolves momentary glitches), longer subsequent retries (allow real outages to recover) |
| Jitter | Avoid synchronized retries from many clients hitting the same endpoint simultaneously |
| Max attempts | Bound the work — eventually classify as persistent-recoverable and escalate |
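A minimal sketch of retry with exponential backoff and jitter, under stated assumptions. The helper names `backoffDelayMs` and `retryWithBackoff`, the `baseDelayMs`, `isTransient`, `sleep`, and `random` options, and all default values are hypothetical. The 1-second base and 0–50% jitter are chosen so the delays fall in the ranges this section lists.

```javascript
// Hypothetical helpers; names and defaults are illustrative assumptions.
const defaultSleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Delay before the retry that follows a failed attempt (attempt is 1-based).
function backoffDelayMs(attempt, options = {}) {
  const { baseDelayMs = 1000, maxDelayMs = 30000, random = Math.random } = options;
  // Exponential growth: 1s, 2s, 4s, 8s, ...
  const exponential = baseDelayMs * 2 ** (attempt - 1);
  // Jitter adds 0-50% so many clients do not retry at the same instant.
  const jittered = exponential * (1 + random() * 0.5);
  return Math.min(jittered, maxDelayMs);
}

async function retryWithBackoff(operation, options = {}) {
  const { maxAttempts = 5, isTransient = () => true, sleep = defaultSleep } = options;
  for (let attempt = 1; ; attempt += 1) {
    try {
      return await operation(attempt);
    } catch (error) {
      // Stop when the error is not transient or the attempt budget is spent;
      // the caller can then escalate it as persistent-recoverable.
      if (attempt >= maxAttempts || !isTransient(error)) throw error;
      await sleep(backoffDelayMs(attempt, options));
    }
  }
}
```

Making `sleep` and `random` injectable is a design choice for testability: a test can record the computed delays without actually waiting.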
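With a 1-second base delay that doubles on each attempt, plus up to 50% jitter, the retry schedule is:

| Attempt | Delay before retry |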
|---|---|
| 1 → 2 | 1–1.5s |
| 2 → 3 | 2–3s |
| 3 → 4 | 4–6s |
| 4 → 5 | 8–12s |
Increase maxAttempts and maxDelayMs for a longer total budget.
Honor Retry-After for 429
When the server tells you when to retry, listen:
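A minimal sketch of honoring `Retry-After`, assuming the response is a Fetch API `Response` (or anything exposing `headers.get`). The function names `parseRetryAfterMs` and `delayFor429` are hypothetical. HTTP allows the header to carry either a number of seconds or an HTTP date, so the sketch handles both and falls back to a caller-supplied delay when the header is missing or unparseable.

```javascript
// Hypothetical helper: converts a Retry-After value to milliseconds, or null.
function parseRetryAfterMs(headerValue, now = Date.now()) {
  if (headerValue === null || headerValue === undefined) return null;
  const value = String(headerValue).trim();

  // Form 1: delay in whole seconds, e.g. "120".
  if (/^\d+$/.test(value)) return Number(value) * 1000;

  // Form 2: an HTTP date, e.g. "Wed, 21 Oct 2015 07:28:00 GMT".
  const dateMs = Date.parse(value);
  if (Number.isNaN(dateMs)) return null;
  return Math.max(0, dateMs - now);
}

// Picks the server-provided delay for a 429, else the fallback (e.g. backoff).
function delayFor429(response, fallbackMs) {
  const fromHeader = parseRetryAfterMs(response.headers.get('Retry-After'));
  return fromHeader !== null ? fromHeader : fallbackMs;
}
```

In a retry loop, `fallbackMs` would typically be the exponential backoff delay for the current attempt.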
The Retry-After header is more accurate than exponential backoff for the rate-limit case: the server knows exactly when the window resets.
Differentiate idempotent from non-idempotent operations
Retrying a GET is safe: the same response comes back. Retrying a POST that may have already partially succeeded is risky: you could create duplicate resources.
The dividing line:

| Operation | Idempotent if… |
|---|---|
| GET | Always |
| POST /api/v2/Gift/Transaction | The submission carries a stable transactionSource + transactionId |
| POST /api/Contact/Transaction | The submission carries a stable referenceSource + referenceId |
| POST /api/Contact (direct) | Never. Each call creates a new Contact |
| PUT /api/Contact/{id} | Always. The same body produces the same final state |
| DELETE /api/Gift/{id} | Typically safe. After the first success, subsequent calls return 404 |
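The table above can be sketched as a guard that a retry loop consults before resubmitting. This is an illustration under assumptions: the function name `isSafeToRetry` is hypothetical, the identifier fields are assumed to sit at the top level of the request body, and treating every PUT and DELETE as retryable generalizes from the two rows shown.

```javascript
// Hypothetical guard: true when resubmitting the request cannot create duplicates.
const hasStableIds = (body, ...fields) =>
  fields.every((field) => body?.[field] != null && body[field] !== '');

function isSafeToRetry(method, path, body = {}) {
  const verb = method.toUpperCase();

  // Reads, full replacements, and deletes converge on the same final state.
  if (verb === 'GET' || verb === 'PUT' || verb === 'DELETE') return true;

  if (verb === 'POST') {
    // Transaction endpoints are idempotent only with a stable source + id pair.
    if (path === '/api/v2/Gift/Transaction') {
      return hasStableIds(body, 'transactionSource', 'transactionId');
    }
    if (path === '/api/Contact/Transaction') {
      return hasStableIds(body, 'referenceSource', 'referenceId');
    }
  }

  // Everything else, including POST /api/Contact, may duplicate on retry.
  return false;
}
```

The identifiers must be generated once, before the first attempt, and reused on every retry; generating a fresh ID per attempt defeats the purpose.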
Dead letter queue for permanent failures
Records that exhaust retries and are classified as permanent failures should not silently disappear. Route them to a dead letter queue, a separate store for inspection and possible manual replay.

- Visibility. Failures don’t disappear into a log file; they have a structured record.
- Replay. After fixing the underlying problem, the original payload can be re-submitted.
- Audit. A persistent record of what failed and why supports later investigation.
Producing entries
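A minimal sketch of producing a dead-letter entry. The `DeadLetterQueue` class and every field name are hypothetical, and the in-memory array stands in for whatever durable store (database table, queue service) your integration uses.

```javascript
// Hypothetical in-memory dead letter queue; swap the array for a durable store.
class DeadLetterQueue {
  constructor() {
    this.entries = [];
  }

  // Records one failed submission with enough context to inspect and replay it.
  add({ endpoint, payload, status, errorBody, attempts, category }) {
    const entry = {
      id: this.entries.length + 1,
      failedAt: new Date().toISOString(),
      endpoint,
      status,
      category,
      attempts,
      errorBody,
      // Deep copy so later mutation of the caller's object cannot alter the record.
      payload: JSON.parse(JSON.stringify(payload)),
      // Set during review: 'replayed', 'fixed_and_replayed', or 'discarded'.
      resolution: null,
    };
    this.entries.push(entry);
    return entry;
  }

  // Entries still waiting for human review.
  pending() {
    return this.entries.filter((entry) => entry.resolution === null);
  }
}
```

Storing the original payload verbatim is what makes replay possible; storing the status and error body alongside it is what makes the failure explainable later.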
Reviewing and replaying
Dead-letter entries need human review. Surface them in your integration’s admin UI with replay actions:

- Replay as-is: re-submit the original payload. Useful after a transient infrastructure fix.
- Fix and replay: edit the payload (correct a field, change a Project code) and resubmit.
- Discard: mark the record as a known-bad case that shouldn’t be retried.
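The three review actions can be sketched as a single handler. The function name `resolveDeadLetter`, the action strings, the `submit` callback, and the shape of `entry` (a `payload` plus a `resolution` field) are all assumptions for illustration; `submit` stands for whatever function sends a payload to the API.

```javascript
// Hypothetical review handler: returns an updated copy of the entry.
async function resolveDeadLetter(entry, action, { submit, patch = {} } = {}) {
  switch (action) {
    case 'replay':
      // Re-submit the original payload unchanged.
      await submit(entry.payload);
      return { ...entry, resolution: 'replayed' };

    case 'fix_and_replay': {
      // Apply the reviewer's corrections on top of the original payload.
      const fixed = { ...entry.payload, ...patch };
      await submit(fixed);
      return { ...entry, payload: fixed, resolution: 'fixed_and_replayed' };
    }

    case 'discard':
      // Known-bad record: mark it so it is never retried.
      return { ...entry, resolution: 'discarded' };

    default:
      throw new Error(`Unknown action: ${action}`);
  }
}
```

If `submit` throws, the error propagates and no resolved copy is returned, so the entry stays pending for another review.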
Circuit breakers
When a downstream system is failing repeatedly, continuing to send requests just produces more failures and consumes resources. A circuit breaker stops sending requests after sustained failure and lets the downstream recover.

The three states
| State | Behavior |
|---|---|
| Closed | Normal operation. Requests pass through. Track failures. |
| Open | Too many recent failures. Requests fail immediately without calling the downstream. |
| Half-open | Cool-down has elapsed. Allow one test request through to see if downstream is healthy. |
Implementation
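A minimal sketch of the three-state breaker, not a production implementation. The class name, option names (`failureThreshold`, `openDurationMs`, `now`), and defaults are hypothetical. As a simplification, it counts consecutive failures rather than failures within a sliding time window.

```javascript
// Hypothetical circuit breaker with closed, open, and half-open states.
class CircuitBreaker {
  constructor({ failureThreshold = 5, openDurationMs = 30000, now = Date.now } = {}) {
    this.failureThreshold = failureThreshold;
    this.openDurationMs = openDurationMs;
    this.now = now; // injectable clock, useful in tests
    this.state = 'closed';
    this.failures = 0; // consecutive failures since the last success
    this.openedAt = 0;
  }

  async call(operation) {
    // Half-open admits exactly one test request; reject others while it runs.
    if (this.state === 'half-open') {
      throw new Error('Circuit half-open: test request in flight');
    }
    if (this.state === 'open') {
      if (this.now() - this.openedAt < this.openDurationMs) {
        // Fail fast without calling the downstream.
        throw new Error('Circuit open: request skipped');
      }
      this.state = 'half-open'; // cool-down elapsed, let this request through
    }

    try {
      const result = await operation();
      // Any success closes the circuit and resets the failure count.
      this.state = 'closed';
      this.failures = 0;
      return result;
    } catch (error) {
      this.failures += 1;
      // A failed test request, or too many failures, opens the circuit.
      if (this.state === 'half-open' || this.failures >= this.failureThreshold) {
        this.state = 'open';
        this.openedAt = this.now();
      }
      throw error;
    }
  }
}
```

Callers wrap each outbound request, for example `breaker.call(() => submitGift(payload))`, where `submitGift` is a placeholder for your own request function.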
Tuning
The right threshold depends on the integration’s traffic:

- High-traffic (hundreds of requests/minute): higher threshold (e.g., 20 failures), shorter open duration (30 seconds).
- Low-traffic (a few requests/minute): lower threshold (e.g., 5 failures), longer open duration (5 minutes).
Idempotency as the recovery enabler
Idempotency isn’t optional in a recovery-aware integration; it’s the precondition that makes retries safe. For partner integrations writing to Virtuous, the idempotency mechanism is:

- Contacts: referenceSource + referenceId. The matching algorithm uses these to find existing records on a retry.
- Gifts: transactionSource + transactionId. Same pattern.
- RecurringGifts: transactionSource + transactionId. Same pattern.
See transactionId for Gifts for the prevention pattern.
Reconciliation as the safety net
Retries handle transient failures. Dead-letter queues handle permanent failures. But there’s a third category: failures that produced no error but still left the system in an inconsistent state, such as a webhook delivery that exhausted its retry budget, a request that timed out but actually succeeded, or a record updated on one side but not the other.

Reconciliation is the safety net for these. Periodically compare the partner-side and Virtuous-side states for resources you sync. Surface discrepancies for action. See Reconcile Failed Syncs for the full pattern.

The key insight for this page: assume your retry logic is imperfect, and design a reconciliation pass that doesn’t depend on it being correct.

Observability for recovery debugging
Errors that aren’t observed are errors that can’t be fixed. Three observability practices make recovery investigations tractable:

Structured logging

Every error log should include enough context to investigate later:
- The customer identifier (for multi-tenant integrations).
- The record identifier (for traceability).
- The Virtuous request ID if returned in response headers (lets Virtuous engineering correlate against their logs).
- The attempt count (for retry analysis).
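A minimal sketch of a structured error log carrying the fields listed above. The function names and field names are hypothetical. The Virtuous request ID is passed in by the caller because the name of the response header that carries it is not specified here.

```javascript
// Hypothetical builder for one structured error log record.
function buildErrorLog({ customerId, recordId, requestId, attempt, status, category, message }) {
  return {
    level: 'error',
    timestamp: new Date().toISOString(),
    message,
    customerId, // which tenant was affected
    recordId, // which record failed
    virtuousRequestId: requestId ?? null, // null when the response carried none
    attempt, // 1-based attempt count, for retry analysis
    status,
    category,
  };
}

// Emits the record as a single JSON line; `write` defaults to console.error.
function logError(fields, write = console.error) {
  write(JSON.stringify(buildErrorLog(fields)));
}
```

One JSON object per line keeps the output easy to filter by `customerId` or `recordId` in most log aggregation tools.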
Metrics on failure categories
Track separate counters for each failure category:

| Counter | What it tells you |
|---|---|
| virtuous_request_total | Overall request volume |
| virtuous_request_failures{category="transient"} | Retry-driven noise; expected to be > 0 |
| virtuous_request_failures{category="persistent_recoverable"} | Credentials and infrastructure issues; should be near zero |
| virtuous_request_failures{category="permanent"} | Data quality and bug issues; investigation target |
| virtuous_dead_letter_queue_depth | Backlog of unresolved permanent failures |
| virtuous_circuit_breaker_state{breaker="virtuous_main"} | Per-breaker state for alerting |
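The request and failure counters can be sketched with a small in-memory registry. This is illustrative only: the `FailureMetrics` class and `recordRequest` helper are hypothetical, and a real integration would more likely use its metrics library of choice with the same counter names and labels.

```javascript
// Hypothetical in-memory counter registry keyed by metric name plus labels.
class FailureMetrics {
  constructor() {
    this.counters = new Map();
  }

  // Builds a key such as virtuous_request_failures{category="transient"}.
  static key(name, labels = {}) {
    const parts = Object.keys(labels)
      .sort()
      .map((label) => `${label}="${labels[label]}"`);
    return parts.length > 0 ? `${name}{${parts.join(',')}}` : name;
  }

  increment(name, labels = {}) {
    const key = FailureMetrics.key(name, labels);
    this.counters.set(key, (this.counters.get(key) ?? 0) + 1);
  }

  get(name, labels = {}) {
    return this.counters.get(FailureMetrics.key(name, labels)) ?? 0;
  }
}

// Counts every request, and counts failures separately per category.
// Pass a null category for a successful request.
function recordRequest(metrics, category = null) {
  metrics.increment('virtuous_request_total');
  if (category !== null) {
    metrics.increment('virtuous_request_failures', { category });
  }
}
```

Dividing a category's failure count by `virtuous_request_total` gives the failure rate to compare against your baseline.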
Tracing where supported
For complex sync workflows that span multiple services, distributed tracing (OpenTelemetry, etc.) makes failure investigation dramatically easier. A trace shows the path of a single record through your queue, submitter, Virtuous API, and back through the webhook receiver, surfacing where in the path the failure occurred.

Operational practices
Three practices keep a recovery system healthy over time:

Periodic recovery testing

In a staging environment, deliberately inject failures and confirm the recovery paths work:

- Block your integration’s network for a minute and confirm requests are retried successfully.
- Return 429 from a mocked endpoint and confirm the rate-limit pause works.
- Send malformed payloads and confirm dead-letter routing works.
Dead-letter queue review cadence
Schedule a regular (weekly or daily) review of dead-letter entries. Decide for each: replay, fix, or discard. An ignored dead-letter queue silently accumulates unresolved problems that compound over time.

Failure-rate baseline
Establish what “normal” looks like for each error category. A transient failure rate of 0.5% is probably fine; a sudden jump to 5% needs investigation. Without a baseline, you can’t tell normal from anomalous.

Where to go next
API Performance Tips
The performance practices that complement these recovery patterns.
Reconcile Failed Syncs
The reconciliation safety net that catches what retries miss.
Idempotency and Safe Reprocessing
The precondition that makes retries safe.
Error Handling
The error-envelope reference that the classification on this page builds on.