Principle 1: classify before you react
Not all errors are equal. Before deciding what to do about one, classify it:

| Class | Examples | Treatment |
|---|---|---|
| Transient | 503 Service Unavailable, timeouts, network errors | Retry with backoff |
| Rate-limited | 429 Too Many Requests | Back off; respect Retry-After header |
| Authentication | 401 Unauthorized | Stop; surface to operator; don’t retry |
| Authorization | 403 Forbidden | Stop; the token lacks the needed scope |
| Not found | 404 Not Found | Sometimes retry-after-delay (eventual consistency); often legitimate |
| Validation | 422 Unprocessable Entity | Don’t retry; data is bad |
| Permanent server error | 500 Internal Server Error (sustained) | Retry a few times; then surface |
| Conflict | 409 Conflict | Resource was modified concurrently; fetch + retry |
Retrying a 422 indefinitely just wastes API budget; not retrying a 503 produces unnecessary sync gaps.
A classification helper
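A minimal sketch of such a helper, following the classification table above. The function name, the returned shape, and the assumption that errors carry a numeric `status` property are illustrative, not part of any VOMO client library.

```javascript
// Hypothetical lookup table mirroring the classification table above.
const ERROR_CLASSES = {
  401: { errorClass: 'authentication', retryable: false },
  403: { errorClass: 'authorization', retryable: false },
  // 404 can be retried after a delay under eventual consistency; this
  // sketch takes the conservative default and does not retry it.
  404: { errorClass: 'not_found', retryable: false },
  // 409 is retryable only after re-fetching the resource.
  409: { errorClass: 'conflict', retryable: true },
  422: { errorClass: 'validation', retryable: false },
  429: { errorClass: 'rate_limited', retryable: true },
  500: { errorClass: 'server_error', retryable: true },
  503: { errorClass: 'transient', retryable: true },
};

function classifyError(error) {
  // Assumption: HTTP errors carry a numeric `status`. Anything without one
  // (timeouts, DNS failures, connection resets) is treated as transient.
  if (typeof error.status !== 'number') {
    return { errorClass: 'transient', retryable: true };
  }
  const known = ERROR_CLASSES[error.status];
  if (known) return known;
  // Unlisted 5xx responses are treated as transient, unlisted 4xx as permanent.
  return error.status >= 500
    ? { errorClass: 'transient', retryable: true }
    : { errorClass: 'client_error', retryable: false };
}
```

Callers can branch on `retryable` first and use `errorClass` for logging and alert grouping.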
Principle 2: retry with exponential backoff + jitter
For retryable errors, don’t retry immediately. Wait, retry, wait longer, retry again. The pattern:
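One way to sketch this pattern is shown here. It uses "full jitter" (a random delay between zero and the exponential ceiling), which is one of several reasonable jitter strategies. The function names, option names, and the `retryAfterMs` property (assumed to hold an already-parsed Retry-After value) are hypothetical.

```javascript
// Sketch only: names and defaults are assumptions, not a VOMO API.
const defaultSleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

function backoffDelayMs(attempt, { baseMs = 1000, maxMs = 60000, random = Math.random } = {}) {
  // The ceiling doubles per attempt (1s, 2s, 4s, ...) and is capped at maxMs.
  // Full jitter picks a uniformly random delay below that ceiling.
  const ceiling = Math.min(maxMs, baseMs * 2 ** attempt);
  return Math.floor(random() * ceiling);
}

async function retryWithBackoff(operation, options = {}) {
  const {
    maxRetries = 5,
    isRetryable = (err) => err.status === undefined || err.status === 429 || err.status >= 500,
    sleep = defaultSleep,
  } = options;

  for (let attempt = 0; ; attempt++) {
    try {
      return await operation(attempt);
    } catch (err) {
      // Bounded retries: surface the error once the retry budget is spent
      // or when the error is not worth retrying at all.
      if (attempt >= maxRetries || !isRetryable(err)) throw err;
      // Assumption: the caller parsed Retry-After into err.retryAfterMs.
      // When present, it takes precedence over the computed backoff.
      const delay = err.retryAfterMs ?? backoffDelayMs(attempt, options);
      await sleep(delay);
    }
  }
}
```

Making `sleep` and `random` injectable keeps the helper testable without real waits.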
Why exponential
Linear backoff (1s, 2s, 3s) is too aggressive on persistent failures. Exponential (1s, 2s, 4s, 8s, 16s) gives the system time to recover between attempts while not waiting forever on transient blips.
Why jitter
Without jitter, multiple workers experiencing the same failure all retry at exactly the same intervals, producing a thundering herd that overwhelms the recovering API. Random jitter spreads out the retry attempts.
Why Retry-After
For 429, the API explicitly tells you when to retry. Respect it. Don’t compute your own backoff; the API knows when it’ll be ready.
Bounded retries
After max retries, stop and surface. Indefinite retries hide systemic problems.
Principle 3: per-record isolation in batches
When processing a batch (a poll cycle, a backfill page), don’t let one bad record fail the whole batch:
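A minimal sketch of per-record isolation, assuming a caller-supplied `processRecord` function (hypothetical) that throws on failure:

```javascript
// Processes each record independently and reports both outcomes.
async function processBatch(records, processRecord) {
  const succeeded = [];
  const failed = [];
  for (const record of records) {
    try {
      await processRecord(record);
      succeeded.push(record);
    } catch (error) {
      // One bad record is captured here and the loop continues.
      failed.push({ record, error });
    }
  }
  return { succeeded, failed };
}
```

The `failed` list keeps the original error alongside each record so the caller can classify it or route it to a DLQ.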
Checkpoint advancement with partial failures
This is subtle: if some records in a batch succeed and others fail, advancing the checkpoint past the failed ones means they’re “skipped” forever (the next poll won’t re-see them). Two approaches:
Approach A: advance the checkpoint only as far as the last record that succeeded before the first failure:
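A sketch of Approach A, read here as "stop at the first failure" so that a later success cannot carry the checkpoint past an earlier failed record. The function name and the `id` / `updated_at` record fields are assumptions for illustration.

```javascript
// `records` must be sorted by updated_at ascending; `failedIds` is a Set of
// ids that failed in this batch.
function checkpointBeforeFirstFailure(records, failedIds, previousCheckpoint) {
  let checkpoint = previousCheckpoint;
  for (const record of records) {
    // Stop at the first failure: everything after it is re-polled next cycle.
    if (failedIds.has(record.id)) break;
    checkpoint = record.updated_at;
  }
  return checkpoint;
}
```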
This works when records are processed in updated_at ascending order. But records with the same updated_at (rare) might be re-processed.
Approach B: advance fully, but capture failures in DLQ:
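A sketch of Approach B under stated assumptions: `failed` is a list of `{ record, error }` pairs, `dlq` is anything with a `push` method (an array here, a durable store in practice), and the entry field names are hypothetical.

```javascript
// Writes every failure to the DLQ first, then returns the new checkpoint.
// Assumes `records` is sorted by updated_at ascending.
function advanceWithDlq(records, failed, dlq, previousCheckpoint) {
  for (const { record, error } of failed) {
    dlq.push({
      recordId: record.id,
      payload: record,
      errorMessage: error.message,
      attempts: 1,
      failedAt: new Date().toISOString(),
    });
  }
  // Failures are now captured, so the checkpoint can move to the newest
  // record in the batch without losing anything.
  return records.length > 0
    ? records[records.length - 1].updated_at
    : previousCheckpoint;
}
```

The ordering matters: if the DLQ write can fail, the checkpoint should only advance after that write has succeeded.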
Principle 4: dead-letter queue management
Failures go to a DLQ. But the DLQ isn’t self-cleaning; it needs its own processing.
DLQ schema
DLQ processor
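A sketch of a DLQ processor, operating on an in-memory list for illustration. The entry fields (`payload`, `attempts`, `nextAttemptAt`, `status`, `lastError`) are assumed names, not a prescribed schema, and `handler` stands in for whatever re-processes a record.

```javascript
// Retries every pending entry that is due. On failure, the entry is
// rescheduled with a wait that doubles with each failed attempt.
async function processDlq(entries, handler, { now = Date.now(), baseDelayMs = 60000 } = {}) {
  for (const entry of entries) {
    // Skip entries that are resolved, under review, or not yet due.
    if (entry.status !== 'pending' || entry.nextAttemptAt > now) continue;
    try {
      await handler(entry.payload);
      entry.status = 'resolved';
    } catch (error) {
      entry.attempts += 1;
      entry.lastError = error.message;
      entry.nextAttemptAt = now + baseDelayMs * 2 ** (entry.attempts - 1);
    }
  }
  return entries;
}
```

In a real deployment the entries would live in a database table and this function would run on a schedule.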
Permanent failures
After 10 retries spread over hours/days, give up. Mark for manual review:
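A small sketch of the give-up step. The `needs_review` status value and the `notifyOperator` callback are hypothetical; the limit of 10 comes from the text above.

```javascript
const MAX_DLQ_ATTEMPTS = 10;

// Marks an exhausted entry for manual review. Returns true when the entry
// was moved out of the retry loop.
function markIfExhausted(entry, notifyOperator = () => {}) {
  if (entry.status === 'pending' && entry.attempts >= MAX_DLQ_ATTEMPTS) {
    entry.status = 'needs_review';
    notifyOperator(entry);
    return true;
  }
  return false;
}
```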
Principle 5: circuit breakers for cascade prevention
When the API or a downstream system is broken, retrying makes things worse, wasting requests on a system that can’t respond. A circuit breaker detects sustained failure and pauses operations:
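A minimal circuit breaker sketch. The class name, the `failureThreshold` default, and the injectable `now` clock are assumptions; `timeoutMs` matches the name used in this section. This sketch counts consecutive failures and does not gate concurrent callers while half-open.

```javascript
class CircuitBreaker {
  constructor({ failureThreshold = 5, timeoutMs = 30000, now = Date.now } = {}) {
    this.failureThreshold = failureThreshold;
    this.timeoutMs = timeoutMs;
    this.now = now;
    this.state = 'closed';
    this.failures = 0;
    this.openedAt = 0;
  }

  async call(operation) {
    if (this.state === 'open') {
      if (this.now() - this.openedAt < this.timeoutMs) {
        // Fail fast without touching the API.
        throw new Error('Circuit open: request skipped');
      }
      // Timeout elapsed: let a trial request through.
      this.state = 'half-open';
    }
    try {
      const result = await operation();
      this.state = 'closed';
      this.failures = 0;
      return result;
    } catch (error) {
      this.failures += 1;
      // A failed trial request re-opens immediately; otherwise open once
      // the consecutive-failure threshold is reached.
      if (this.state === 'half-open' || this.failures >= this.failureThreshold) {
        this.state = 'open';
        this.openedAt = this.now();
      }
      throw error;
    }
  }
}
```

Wrapping each API call as `breaker.call(() => fetchSomething())` keeps the breaker logic out of the calling code.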
How it works
| State | Behavior |
|---|---|
| Closed (normal) | All requests pass through; failures increment count |
| Open (broken) | All requests immediately fail without hitting the API; wait timeoutMs |
| Half-open (recovering) | After timeout, allow one request through; if it succeeds, close; if it fails, re-open |
Per-customer vs. global breakers
For multi-tenant integrations:
- Global breaker on the VOMO API: opens when VOMO itself is down; affects all customers
- Per-customer breaker on the external destination: opens when a specific customer’s destination is failing; isolates the impact
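One way to keep per-customer breakers is a small registry keyed by customer ID. The `BreakerRegistry` name and the `createBreaker` factory argument are hypothetical; the factory would return whatever breaker implementation the integration uses.

```javascript
// Lazily creates one breaker per key and reuses it on later lookups.
class BreakerRegistry {
  constructor(createBreaker) {
    this.createBreaker = createBreaker;
    this.breakers = new Map();
  }

  get(key) {
    if (!this.breakers.has(key)) {
      this.breakers.set(key, this.createBreaker(key));
    }
    return this.breakers.get(key);
  }
}
```

A single shared breaker instance then guards VOMO API calls, while `registry.get(customerId)` guards writes to each customer's destination.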
Principle 6: idempotency in the absence of API support
VOMO’s API doesn’t support idempotency keys (no Idempotency-Key header). This means:
- A retried POST /users with the same payload upserts (safe; the email match handles idempotency)
- A retried POST /groups creates a new Group (not safe; produces duplicates)
- A retried DELETE /groups/{id} is safe (already deleted = no-op)
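The three cases above can be encoded as a retry-safety check that a retry wrapper consults before re-sending a request. This is a sketch: the rule table and function name are assumptions, and the table only covers the endpoints named in this section.

```javascript
// Hypothetical rule table for the endpoints discussed above.
const RETRY_SAFETY = [
  { method: 'POST', pattern: /^\/users$/, safe: true },             // upsert by email
  { method: 'POST', pattern: /^\/groups$/, safe: false },           // would duplicate
  { method: 'DELETE', pattern: /^\/groups\/[^\/]+$/, safe: true },  // already deleted = no-op
];

function isSafeToRetry(method, path) {
  const upper = method.toUpperCase();
  // Reads do not change state, so repeating them is safe.
  if (upper === 'GET') return true;
  const rule = RETRY_SAFETY.find((r) => r.method === upper && r.pattern.test(path));
  // Unknown write operations default to "not safe" so that a blind retry
  // never creates duplicates by accident.
  return rule ? rule.safe : false;
}
```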
When this isn’t enough
For operations that can fail between “API succeeded” and “we recorded the result,” you have a brief window where:
- The API created the Group
- The network response was lost
- The retry creates a new Group

Mitigations:
- After failures, search for the just-created resource before retrying. For Groups, list recently created Groups; if one matches your intent, use it instead of creating a new one
- For non-Group resources (Users via upsert, Projects via PUT), the upsert/PUT semantics are naturally idempotent; the worst case is “second attempt finds it already in the desired state”
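The search-before-retry step for Groups can be sketched as follows. `createGroup` and `listRecentGroups` are hypothetical callbacks wrapping the relevant API calls, and matching on `name` is an assumption; a real integration would match on whatever uniquely identifies the intended Group.

```javascript
// Creates a Group, but checks for an existing match before retrying.
async function createGroupOnce(intent, { createGroup, listRecentGroups }) {
  try {
    return await createGroup(intent);
  } catch (error) {
    // The create may have succeeded even though the response was lost, so
    // look for a matching Group before trying again.
    const recent = await listRecentGroups();
    const existing = recent.find((group) => group.name === intent.name);
    if (existing) return existing;
    return createGroup(intent);
  }
}
```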
Principle 7: graceful degradation
When some part of the integration is broken, what continues working? Design for partial functionality:
Tiered functionality
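A sketch of a dashboard loader that tolerates partial failure. The `loadDashboard` name and the shape of `fetchers` (an object mapping section names to async functions) are assumptions for illustration.

```javascript
// Loads every section independently; a failed section is reported as
// unavailable instead of failing the whole dashboard.
async function loadDashboard(fetchers) {
  const names = Object.keys(fetchers);
  // The async wrapper turns synchronous throws into rejections as well.
  const results = await Promise.allSettled(
    names.map(async (name) => fetchers[name]())
  );
  const dashboard = {};
  results.forEach((result, index) => {
    dashboard[names[index]] = result.status === 'fulfilled'
      ? { available: true, data: result.value }
      : { available: false, message: 'Data temporarily unavailable' };
  });
  return dashboard;
}
```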
Promise.allSettled (vs. Promise.all) means one failed fetch doesn’t fail the whole dashboard. The UI shows “data temporarily unavailable” for the broken section, but the rest is visible.
Cache fallback
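A minimal cache-fallback sketch, assuming `cache` is a `Map` (or anything with the same `get`/`set` interface) and `fetchFresh` is a caller-supplied async function. The `stale` flag is an illustrative way to let the UI show that the data may be out of date.

```javascript
// Returns fresh data when possible and the last cached copy when not.
async function fetchWithCacheFallback(key, fetchFresh, cache) {
  try {
    const data = await fetchFresh();
    cache.set(key, { data, cachedAt: Date.now() });
    return { data, stale: false };
  } catch (error) {
    const cached = cache.get(key);
    if (cached) {
      return { data: cached.data, stale: true, cachedAt: cached.cachedAt };
    }
    // Nothing cached for this key: surface the original failure.
    throw error;
  }
}
```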
Principle 8: comprehensive error logging
When something breaks, the team needs to debug it. Structured error logging is the foundation. Each log entry should capture:
- The trace ID (correlates across requests)
- The specific record being processed
- The full error including HTTP context (status, body)
- The error class for grouping
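The fields listed above can be assembled into a single JSON log line. This is a sketch: the field names and the assumption that HTTP errors carry `status` and `body` properties are illustrative, and `write` defaults to `console.error` only as a stand-in for a real log sink.

```javascript
// Builds one structured entry containing the trace ID, the record being
// processed, the error details, and the error class.
function buildErrorLog({ traceId, record, error, errorClass }) {
  return {
    level: 'error',
    timestamp: new Date().toISOString(),
    traceId,
    recordType: record.type,
    recordId: record.id,
    errorClass,
    message: error.message,
    stack: error.stack,
    // HTTP context is included only when the error carries a status.
    http: error.status !== undefined
      ? { status: error.status, body: error.body }
      : undefined,
  };
}

function logError(context, write = (line) => console.error(line)) {
  write(JSON.stringify(buildErrorLog(context)));
}
```

Emitting one JSON object per line keeps the entries easy to filter by `traceId` or group by `errorClass` in most log tooling.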
Principle 9: alert on patterns, not individual failures
A single failure isn’t alarming. A pattern of failures is. Alert thresholds should reflect this:

| Pattern | Alert |
|---|---|
| One 5xx in an hour | Don’t alert (probably transient) |
| 5+ 5xx in 5 minutes | Alert (sustained issue) |
| Same customer has many failures, others normal | Alert (customer-specific issue) |
| Same error class concentrates after a deployment | Alert (likely regression) |
| DLQ growth rate accelerates | Alert (something systemic) |
| Polling lag exceeds 2x normal cadence | Alert (worker stalled?) |
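The "5+ 5xx in 5 minutes" rule from the table can be sketched as a sliding-window counter. The `FailureWindow` class and its option names are hypothetical; the defaults mirror the threshold in the table.

```javascript
// Tracks failure timestamps and reports when the count inside the window
// reaches the alert threshold.
class FailureWindow {
  constructor({ threshold = 5, windowMs = 5 * 60 * 1000 } = {}) {
    this.threshold = threshold;
    this.windowMs = windowMs;
    this.timestamps = [];
  }

  // Records one failure; returns true when an alert should fire.
  record(now = Date.now()) {
    this.timestamps.push(now);
    const cutoff = now - this.windowMs;
    // Drop failures that have aged out of the window.
    this.timestamps = this.timestamps.filter((t) => t > cutoff);
    return this.timestamps.length >= this.threshold;
  }
}
```

Keeping one window per customer or per error class covers the customer-specific and regression rows of the table with the same mechanism.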
Principle 10: error budget thinking
For long-running integrations, embrace the idea of an error budget: a defined “acceptable” error rate, below which alerts don’t fire.

| Budget burn | Action |
|---|---|
| 0-100% | No action needed |
| 100-200% | Heightened monitoring; investigation |
| >200% | Alert; consider pausing the integration |
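Budget burn can be computed as the observed error rate divided by the budgeted rate. The sketch below assumes the budget is expressed as a fraction (for example, 0.01 for 1%) and treats exactly 100% and 200% as belonging to the lower tier; the function names and action labels are illustrative.

```javascript
// Observed error rate as a percentage of the budgeted error rate.
function budgetBurnPercent({ failed, total, budgetRate }) {
  if (total === 0) return 0;
  return ((failed / total) / budgetRate) * 100;
}

// Maps burn to the actions in the table above.
function budgetAction(burnPercent) {
  if (burnPercent > 200) return 'alert';
  if (burnPercent > 100) return 'investigate';
  return 'none';
}
```

For example, 5 failures in 1,000 operations against a 1% budget is a burn of about 50%, which needs no action.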
A reference resilient integration
The patterns combined:
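A compact sketch of how the pieces might fit together in one sync pass: a breaker check, bounded retries with jittered backoff, per-record isolation, and DLQ capture. Every name here (`syncBatch`, `withRetry`, `createBreaker`, the DLQ entry fields) is hypothetical, and the simplified breaker counts only failures that were worth retrying.

```javascript
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));
const isRetryable = (err) =>
  err.status === undefined || err.status === 429 || err.status >= 500;

// Bounded retries with exponential backoff and full jitter.
async function withRetry(operation, { maxRetries = 3, baseMs = 1000, wait = sleep } = {}) {
  for (let attempt = 0; ; attempt++) {
    try {
      return await operation();
    } catch (err) {
      if (attempt >= maxRetries || !isRetryable(err)) throw err;
      const ceiling = baseMs * 2 ** attempt;
      await wait(err.retryAfterMs ?? Math.floor(Math.random() * ceiling));
    }
  }
}

// Simplified breaker: opens after consecutive failures, allows traffic
// again once timeoutMs has elapsed, and resets on any success.
function createBreaker({ failureThreshold = 5, timeoutMs = 30000, now = Date.now } = {}) {
  let failures = 0;
  let openedAt = null;
  return {
    isOpen: () => openedAt !== null && now() - openedAt < timeoutMs,
    recordSuccess() { failures = 0; openedAt = null; },
    recordFailure() { failures += 1; if (failures >= failureThreshold) openedAt = now(); },
  };
}

async function syncBatch(records, sendRecord, { breaker, dlq, retry = {} }) {
  const toDlq = (record, message) =>
    dlq.push({ recordId: record.id, payload: record, errorMessage: message, attempts: 1 });
  let synced = 0;
  for (const record of records) {
    if (breaker.isOpen()) {
      // Destination is paused: park the record without calling out.
      toDlq(record, 'Circuit open: destination paused');
      continue;
    }
    try {
      await withRetry(() => sendRecord(record), retry);
      breaker.recordSuccess();
      synced += 1;
    } catch (error) {
      // Per-record isolation: capture the failure and keep going.
      if (isRetryable(error)) breaker.recordFailure();
      toDlq(record, error.message);
    }
  }
  // Approach B from Principle 3: failures are in the DLQ, so advance fully.
  const checkpoint = records.length ? records[records.length - 1].updated_at : null;
  return { synced, failed: records.length - synced, checkpoint };
}
```

Validation errors (such as a 422) go straight to the DLQ without tripping the breaker, since they say nothing about the health of the destination.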
Where to go next
Sync Architecture Patterns
The broader architectural patterns these recovery practices fit into.
API Performance Tips
The performance patterns these resilience patterns coexist with.
Data Modeling
The data model that supports the DLQ and audit log patterns.
Change Detection Best Practices
The change-detection reliability practices this builds on.