The cross-cutting practices for production-grade polling and reconciliation — durable checkpointing, idempotency at every layer, drift detection metrics, the interplay between polling and reconciliation, and the patterns for debugging stalled syncs.
This page covers the cross-cutting practices that turn a working polling integration into a production-grade one. The previous pages in this group cover the mechanics (polling, user-change detection, project-change detection, reconciliation). This one covers the practices that span them: checkpointing, idempotency, drift detection, cost/reliability trade-offs, and debugging.
The audience is integration architects designing a polling architecture that will run in production for years, not engineers prototyping for a demo.
The checkpoint — “we’ve processed everything up to time X” — is the most important piece of state in the integration. If it’s lost or corrupted, the integration’s understanding of “what’s been done” is wrong.
The critical pattern: never advance the checkpoint before processing is complete.
JavaScript
// ❌ Anti-pattern: advance early
await setCheckpoint(customerId, 'user_sync', latestSeen); // What if processing crashes?
for (const user of users) {
  await processUser(user); // Crash here → checkpoint is past these users
}

// ✅ Correct: advance after all processing succeeds
let latestSeen = currentCheckpoint;
for (const user of users) {
  await processUser(user);
  const u = new Date(user.updated_at);
  if (u > latestSeen) latestSeen = u;
}
await setCheckpoint(customerId, 'user_sync', latestSeen); // After processing
For workflows where processing can fail per-record (DLQ pattern), the principle still holds: advance to the latest successfully processed record’s updated_at, not to wall-clock time.
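The per-record case can be sketched as follows. This is a minimal illustration, not a definitive implementation: `processBatch`, `processUser`, and `deadLetter` are hypothetical names, and the sketch assumes a failed record is parked durably in a dead-letter queue before the loop moves on.

```javascript
// Hypothetical helper: processes a batch and returns the checkpoint value
// that is safe to persist. `processUser` and `deadLetter` are injected
// stand-ins for illustration, not part of any real API.
async function processBatch(users, currentCheckpoint, { processUser, deadLetter }) {
  // Oldest first, so records are handled in the order the source changed them.
  const ordered = [...users].sort(
    (a, b) => new Date(a.updated_at) - new Date(b.updated_at)
  );

  let latestSeen = currentCheckpoint;
  for (const user of ordered) {
    try {
      await processUser(user);
    } catch (err) {
      // Park the failure durably; if this throws, the whole batch aborts
      // and the checkpoint is not advanced.
      await deadLetter(user, err);
      continue;
    }
    const u = new Date(user.updated_at);
    if (u > latestSeen) latestSeen = u;
  }
  // Latest successfully processed updated_at, never wall-clock time.
  return latestSeen;
}
```

The caller persists the returned value with its own checkpoint store only after `processBatch` resolves. Failed records do not move the checkpoint themselves; they rely on the dead-letter queue for retry.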
Polling, reconciliation, and processing all retry. They retry on failures, on restarts, on operator-initiated re-runs. Operations must be safe to repeat.
Three patterns:
Pattern A: idempotent destination operations
Use upsert operations on the destination. The classic example:
JavaScript
// Destination has an upsert operation keyed by external ID
await externalSystem.upsertUser({
  external_id: `vomo-${user.id}`,
  email: user.email,
  first_name: user.first_name,
  last_name: user.last_name,
});
Whether this is the first sync or the hundredth, the destination ends in the same state.
Pattern B: deduplication keys
For operations that aren’t naturally idempotent (sending emails, creating records), use a deduplication key:
JavaScript
async function sendWelcomeEmailOnce(userId) {
  const key = `welcome_email:${userId}`;
  const alreadySent = await externalDb.dedupExists(key);
  if (alreadySent) return;
  await emailService.sendWelcome(userId);
  await externalDb.recordDedup(key);
}
The dedup record prevents repeated sends even if the polling cycle re-discovers the user.
Pattern C: optimistic concurrency
For operations that update existing records, use if-not-changed-since semantics:
JavaScript
async function updateUserIfNewer(externalUserId, sourceUpdatedAt, changes) {
  const current = await externalSystem.getUser(externalUserId);
  if (new Date(current.lastSyncedAt) >= new Date(sourceUpdatedAt)) {
    // Already up-to-date or newer: skip
    return { skipped: true };
  }
  await externalSystem.updateUser(externalUserId, { ...changes, lastSyncedAt: sourceUpdatedAt });
  return { updated: true };
}
A stale repeat write (e.g., from reconciliation re-discovering an already-processed record) is skipped without changing destination state.
For operations with one-time side effects (welcome emails, provisioning new accounts in third-party systems), dedup keys are essential. Without them, retries produce duplicated work.
For workflows where you genuinely can’t guarantee idempotency, ensure the operation only happens at well-defined moments, typically only when the integration knows the user is “new” (via persistent state lookup, not heuristics).
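A persistent-state lookup of this kind might look like the sketch below. `SeenUsers` and `provisionIfNew` are hypothetical names, and the in-memory `Set` stands in for a durable table with a unique key, which is what a real integration would need.

```javascript
// In-memory stand-in for a durable table of users already provisioned.
// Hypothetical: in production this would be a database with a unique key.
class SeenUsers {
  constructor() {
    this.ids = new Set();
  }

  // Returns true only the first time an ID is recorded.
  markIfNew(userId) {
    if (this.ids.has(userId)) return false;
    this.ids.add(userId);
    return true;
  }

  release(userId) {
    this.ids.delete(userId);
  }
}

// Runs the one-time side effect only for users not seen before.
async function provisionIfNew(seen, user, provision) {
  if (!seen.markIfNew(user.id)) return { skipped: true };
  try {
    await provision(user);
    return { provisioned: true };
  } catch (err) {
    // Release the claim so a later retry can attempt provisioning again.
    seen.release(user.id);
    throw err;
  }
}
```

Claiming before the side effect trades one risk for another: a crash between the claim and the provisioning call leaves the user marked but not provisioned, so this variant leans toward at-most-once rather than at-least-once. Which side to err on depends on the side effect.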
Knowing whether the integration is working is as important as making it work. Drift detection is the practice of continuously verifying that observable reality matches expectations.
The right alert thresholds depend on customer expectations and your operational maturity. Start conservative, erring toward more alerts: too few alerts is more dangerous than too many.
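As a rough illustration of threshold-based checks, the sketch below evaluates a few health signals for one customer. The metric names (`lastCheckpoint`, `deadLetterDepth`, `reconciliationGaps`) and the default thresholds are assumptions made for this example, not values prescribed by this page.

```javascript
// Hypothetical thresholds for illustration; real values depend on the customer.
const DEFAULT_THRESHOLDS = {
  maxCheckpointLagMinutes: 30,
  maxDeadLetterDepth: 10,
  maxReconciliationGaps: 5,
};

// `metrics` is an assumed shape:
// { lastCheckpoint, deadLetterDepth, reconciliationGaps }
function evaluateHealth(metrics, now = new Date(), thresholds = DEFAULT_THRESHOLDS) {
  const alerts = [];

  // Checkpoint lag: how far behind "now" the last durable checkpoint is.
  const lagMinutes = (now - new Date(metrics.lastCheckpoint)) / 60000;
  if (lagMinutes > thresholds.maxCheckpointLagMinutes) {
    alerts.push(`checkpoint lag ${Math.round(lagMinutes)}m exceeds ${thresholds.maxCheckpointLagMinutes}m`);
  }
  if (metrics.deadLetterDepth > thresholds.maxDeadLetterDepth) {
    alerts.push(`dead-letter depth ${metrics.deadLetterDepth} exceeds ${thresholds.maxDeadLetterDepth}`);
  }
  if (metrics.reconciliationGaps > thresholds.maxReconciliationGaps) {
    alerts.push(`reconciliation gaps ${metrics.reconciliationGaps} exceed ${thresholds.maxReconciliationGaps}`);
  }

  return { healthy: alerts.length === 0, alerts };
}
```

Passing `thresholds` as a parameter keeps the limits adjustable per customer, so they can start tight and be loosened as alert noise becomes clear.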
Aggregate view: total customers, % healthy, anomalies
Resource-by-resource trends: how active is each resource type across customers?
Error landscape: top error types over time
For partner integrations serving many customers, the per-customer dashboard is the most useful — it answers “is this specific customer’s integration healthy?” in seconds.
Every polling and reconciliation operation has a cost in API request budget. Additional reliability has a cost too, and beyond a certain point each further increment costs more than it is worth.
The first 90% of reliability is cheap (basic polling + daily reconciliation). The next 9% is moderately expensive (weekly full reconciliation, sample auditing). The last 1% (real-time drift detection, per-record verification, multi-region failover) is very expensive.
For most partner integrations, target ~99% reliability. The remaining 1% is handled by:
Customer-visible audit trails (so issues are visible when they occur)
Operator escalation paths (so the team can intervene when needed)
Per-customer support tooling (so issues can be debugged efficiently)
Investing in perfect reliability beyond this is usually worse ROI than investing in better debuggability.
Often 5-minute polling is fine; “real-time” is rarely a hard requirement
Customer needs daily reporting: hourly polling + nightly reconciliation
Customer needs analytics dashboards: daily sync is often sufficient
Customer’s compliance team requires audit trail: the reconciliation infrastructure provides this
Push back on “real-time” requirements — most aren’t truly real-time needs, just comfort goals. A clear conversation about what business problem the freshness solves often reveals that hourly is fine.
The trace ID lets you reconstruct everything that happened in a specific poll cycle later. When a customer says “data for user X is missing,” searching logs by user X plus a date range pulls up the exact cycle that should have processed it.
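One way to attach a trace ID to every log line of a cycle is sketched below. `createCycleLogger`, `cyclesForUser`, and the log field names are hypothetical, and `sink` stands in for whatever structured-log store the integration uses.

```javascript
const { randomUUID } = require('crypto');

// Creates a logger bound to one poll cycle. Every line it emits carries the
// same traceId. `sink` is a hypothetical stand-in for a structured-log store.
function createCycleLogger(customerId, sink = (line) => console.log(line)) {
  const traceId = randomUUID();
  const log = (event, fields = {}) => {
    sink(JSON.stringify({
      ts: new Date().toISOString(),
      traceId,
      customerId,
      event,
      ...fields,
    }));
  };
  return { traceId, log };
}

// Finds the trace IDs of every cycle that logged an event for a given user.
function cyclesForUser(lines, userId) {
  const ids = lines
    .map((line) => JSON.parse(line))
    .filter((entry) => entry.userId === userId)
    .map((entry) => entry.traceId);
  return [...new Set(ids)];
}
```

A poll cycle would call `createCycleLogger` once at the start and pass `log` to everything it invokes, for example `log('user_processed', { userId })`. Filtering the stored lines by the resulting trace ID then yields the full history of that cycle.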
Show me records processed for user Y in customer X: is this user’s data being synced?
Show me the dead-letter queue for customer X: what’s failing?
Show me the reconciliation gaps from yesterday: what did reconciliation catch?
Show me API errors in the last hour: what’s broken right now?
The tooling doesn’t need to be fancy — even simple CLI scripts that hit the structured-log store and per-record database are sufficient for most debugging needs.
For investigating issues, the ability to re-run a polling cycle against historical state is valuable:
JavaScript
async function replayPollCycle(customerId, fromCheckpoint, toCheckpoint) {
  // Run the polling logic against the time window, but emit to a dry-run destination
  const dryRun = new DryRunDestination();
  const result = await runPollLogic({
    customerId,
    fromCheckpoint,
    toCheckpoint,
    destination: dryRun,
  });
  return {
    result,
    operationsThatWouldHaveRun: dryRun.recordedOperations,
  };
}
Replay against past time windows in a dry-run mode helps investigate “why didn’t this record get processed?” questions.
Records that failed processing; retried separately
Reconciliation worker: finds gaps; re-processes from DLQ; detects deletions
Per-record audit log: records every processing attempt with outcome
Drift detection metrics: aggregates audit log into health metrics
Alerts + dashboards: surfaces problems to operators
The complexity is real, but each component does one well-defined job. The combination produces production-grade reliability that scales across many customers.
Anti-pattern: “the polling worker IS the integration”
Some integrations are built around a single all-in-one polling worker that does everything. When it breaks, everything breaks.
Better: separate polling from processing. Polling enqueues changes; a separate worker processes them. Each can fail and recover independently.
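The separation can be sketched with a queue between the two halves. Everything here is illustrative: `ChangeQueue` is an in-memory stand-in for a durable queue, and `pollOnce` and `drain` are hypothetical names.

```javascript
// In-memory stand-in for a durable queue (hypothetical).
class ChangeQueue {
  constructor() {
    this.items = [];
  }
  enqueue(change) {
    this.items.push(change);
  }
  dequeue() {
    return this.items.shift(); // undefined when the queue is empty
  }
  get size() {
    return this.items.length;
  }
}

// Poller: only discovers changes and enqueues them. It never processes.
async function pollOnce(fetchChanges, queue) {
  const changes = await fetchChanges();
  for (const change of changes) queue.enqueue(change);
  return changes.length;
}

// Processor: drains the queue. A failed item is put back for a later pass,
// so one bad record does not block the rest.
async function drain(queue, handle) {
  const failed = [];
  let processed = 0;
  let change;
  while ((change = queue.dequeue()) !== undefined) {
    try {
      await handle(change);
      processed += 1;
    } catch {
      failed.push(change);
    }
  }
  failed.forEach((c) => queue.enqueue(c));
  return { processed, remaining: queue.size };
}
```

With this split, an outage in the destination only stalls `drain`; `pollOnce` keeps capturing changes, and processing resumes from the queue once the destination recovers.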
Deletion detection is hard, so it’s often deferred. Then a customer asks why deleted volunteers still appear in their reports, and the answer is “we don’t sync deletions.”
Better: build deletion detection from the start, even if it’s just weekly. The infrastructure is the same as full reconciliation; you’re doing it anyway.
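At its core, deletion detection during a full reconciliation pass is a set difference between the IDs the source still reports and the IDs the destination holds. The sketch below assumes both listings are complete; `findDeletedIds` and the `maxDeletionRatio` guard are hypothetical additions for illustration.

```javascript
// Returns destination IDs that no longer exist in the source.
// Both arguments are assumed to be complete ID listings from a full pass.
function findDeletedIds(sourceIds, destinationIds, { maxDeletionRatio = 0.5 } = {}) {
  const live = new Set(sourceIds);
  const deleted = destinationIds.filter((id) => !live.has(id));

  // Hypothetical safety guard: a very large deletion set more likely means a
  // truncated source listing than a real mass deletion.
  if (
    destinationIds.length > 0 &&
    deleted.length / destinationIds.length > maxDeletionRatio
  ) {
    throw new Error(
      `refusing to delete ${deleted.length} of ${destinationIds.length} records`
    );
  }
  return deleted;
}
```

For example, `findDeletedIds([1, 2, 3], [1, 2, 3, 4])` returns `[4]`, while an empty source listing against a populated destination throws instead of flagging every record as deleted.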
Anti-pattern: “we’ll just re-sync everything when there’s a problem”
For small customers this works. For large customers, “re-sync everything” is hours of API calls and processing. Plan for incremental recovery, not just full reset.
Sometimes reconciliation finds gaps because polling has a bug. Auto-fixing hides the bug behind reconciliation’s automatic correction. Treat sustained reconciliation gaps as a signal to investigate, not just a checklist item to clear.
Different customers have different scales and needs. Hardcoded “every 15 minutes for everything” works initially but scales poorly. Make cadence per-customer-configurable from the start.
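A per-customer cadence can be as simple as a default table plus optional overrides. In this sketch the `cadenceMinutes` config field, the default values, and the function names are all assumptions made for illustration.

```javascript
// Hypothetical defaults, in minutes, used when a customer has no override.
const DEFAULT_CADENCE_MINUTES = { users: 15, projects: 60 };

// Resolves the polling interval for one resource type of one customer.
function cadenceFor(customerConfig, resource) {
  return customerConfig.cadenceMinutes?.[resource] ?? DEFAULT_CADENCE_MINUTES[resource];
}

// Decides whether a resource is due for polling.
function isDue(lastRunAt, intervalMinutes, now = new Date()) {
  if (!lastRunAt) return true; // never polled yet
  return now - new Date(lastRunAt) >= intervalMinutes * 60000;
}
```

A scheduler loop would call `isDue(lastRunAt, cadenceFor(config, 'users'))` per customer, so changing one customer's cadence is a config edit rather than a code change.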
Daily reconciliation catches gaps; deletion detection works
4: Audited. Per-record audit trail; sample-based drift detection; alerts on metrics
5: Self-healing. Reconciliation auto-corrects; failures trigger graduated responses; customer-facing health visibility
Most production integrations land between Level 3 and Level 4. Level 5 is reserved for the highest-stakes integrations (compliance-critical workflows, financial reporting, etc.).
For each level, the previous level’s practices are foundational: you can’t skip from Level 1 to Level 4.