This page covers the cross-cutting practices that turn a working polling integration into a production-grade one. The previous pages in this group cover the mechanics (polling, user-change detection, project-change detection, reconciliation). This one covers the practices that span them — checkpointing, idempotency, drift detection, cost/reliability trade-offs, and debugging. The audience is integration architects designing a polling architecture that will run in production for years, not engineers prototyping for a demo.

The five core practices

| Practice | Why it matters |
| --- | --- |
| Durable checkpointing | The checkpoint is the integration’s state; losing it loses sync correctness |
| Idempotency at every layer | Polling, reconciliation, and processing all retry; operations must tolerate it |
| Drift detection | Knowing the integration is working is as important as making it work |
| Cost/reliability balance | Every poll has a cost; reliability has limits; optimize the trade-off |
| Debuggability | When sync breaks, the team needs to be able to find out why quickly |
Each of these is a multi-page topic in itself. This page covers the essentials.

Practice 1: Durable checkpointing

The checkpoint — “we’ve processed everything up to time X” — is the most important piece of state in the integration. If it’s lost or corrupted, the integration’s understanding of “what’s been done” is wrong.

Where to store checkpoints

| Storage | Suited for |
| --- | --- |
| Relational database (Postgres, MySQL) | Strongly consistent, transactional updates with other state changes |
| Key-value store (Redis, DynamoDB) | Fast reads, simple write semantics; pair with backup |
| Cloud secrets manager | Overkill: secrets managers are for credentials, not high-frequency reads |
| Filesystem | Don’t: survives crashes poorly, doesn’t scale across instances |
| In-memory only | Don’t: lost on every restart |
For most partner integrations, a relational database table works:
CREATE TABLE sync_checkpoints (
  customer_id VARCHAR NOT NULL,
  resource    VARCHAR NOT NULL,  -- 'user_sync', 'project_sync', etc.
  checkpoint  TIMESTAMP NOT NULL,
  updated_at  TIMESTAMP DEFAULT NOW(),
  PRIMARY KEY (customer_id, resource)
);
The composite primary key gives one checkpoint row per customer per resource. Update the row at the end of each successful poll cycle.
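As a minimal sketch of the read and write semantics this table implies, the in-memory store below mirrors the (customer_id, resource) composite key. The `CheckpointStore` class and its forward-only guard are illustrative assumptions, not part of any SDK; a real implementation would issue an upsert against the database instead.

```javascript
// Hypothetical in-memory stand-in for the sync_checkpoints table.
class CheckpointStore {
  constructor() {
    this.rows = new Map(); // key: "customerId|resource" -> Date
  }

  static key(customerId, resource) {
    return `${customerId}|${resource}`;
  }

  // Returns the stored checkpoint, or null if none exists yet.
  get(customerId, resource) {
    return this.rows.get(CheckpointStore.key(customerId, resource)) ?? null;
  }

  // Upsert that only moves forward (an assumed guard): a retried or
  // out-of-order cycle cannot rewind the checkpoint.
  advance(customerId, resource, candidate) {
    const current = this.get(customerId, resource);
    if (current !== null && candidate <= current) return current;
    this.rows.set(CheckpointStore.key(customerId, resource), candidate);
    return candidate;
  }
}
```

Returning `null` for a missing row, rather than a default date, forces the caller to make an explicit decision about where to start.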

Atomic checkpoint advancement

The critical pattern: never advance the checkpoint before processing is complete.
JavaScript
// ❌ Anti-pattern: advance early
await setCheckpoint(customerId, 'user_sync', latestSeen); // What if processing crashes?
for (const user of users) {
  await processUser(user); // Crash here → checkpoint is past these users
}

// ✅ Correct: advance after all processing succeeds
let latestSeen = currentCheckpoint;
for (const user of users) {
  await processUser(user);
  const u = new Date(user.updated_at);
  if (u > latestSeen) latestSeen = u;
}
await setCheckpoint(customerId, 'user_sync', latestSeen); // After processing
For workflows where processing can fail per-record (DLQ pattern), the principle still holds: advance to the latest successfully processed record’s updated_at, not to wall-clock time.
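The per-record failure case can be sketched as follows. This is one possible shape, assuming a caller-supplied async `processRecord` function and an in-memory dead-letter list; both are hypothetical stand-ins for real processing and a durable DLQ.

```javascript
// Processes a batch, routing failures to a dead-letter list.
// `processRecord` is a caller-supplied async function (hypothetical).
async function processBatch(records, currentCheckpoint, processRecord) {
  let latestProcessed = currentCheckpoint;
  const deadLetters = [];

  for (const record of records) {
    try {
      await processRecord(record);
      const updatedAt = new Date(record.updated_at);
      if (updatedAt > latestProcessed) latestProcessed = updatedAt;
    } catch (err) {
      // Failed records are parked for separate retry, not dropped.
      deadLetters.push({ record, error: err.message });
    }
  }

  // Persist this checkpoint only after the dead letters are durably stored.
  return { checkpoint: latestProcessed, deadLetters };
}
```

The returned checkpoint reflects record timestamps only, so a slow or partially failed cycle never advances it to the current wall-clock time.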

Checkpoint backup

For high-stakes integrations, back up checkpoints separately from the primary store:
JavaScript
async function setCheckpointWithBackup(customerId, resource, checkpoint) {
  // Primary store
  await db.upsert('sync_checkpoints', {
    customer_id: customerId,
    resource,
    checkpoint,
  });

  // Backup store (different region, different provider)
  await backupStore.set(`checkpoint:${customerId}:${resource}`, checkpoint);
}
If the primary store is lost or corrupted (region outage, accidental DELETE, etc.), the backup allows recovery without a full re-sync.

Recovery from a lost checkpoint

If a checkpoint is missing for a customer:
| Path | When to use |
| --- | --- |
| Restore from backup | If a backup exists and is reasonably current |
| Reset to “all time” (new Date(0)) | Will re-process every record; expensive but safe |
| Reset to “1 week ago” | If you can tolerate possible misses older than a week |
| Manual operator decision | For high-value customers; document the rationale |
Don’t silently “guess” — pick a strategy explicitly per customer. Logging the recovery decision gives an audit trail when questions arise later.
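One way to make the recovery choice explicit is to route it through a single function that accepts only named strategies. The strategy names and the function below are illustrative assumptions; the point of the sketch is that an unknown or missing strategy fails loudly instead of falling back to a guess.

```javascript
// Picks a recovery checkpoint from an explicit, named strategy.
// Strategy names are illustrative; `now` is injectable for testing.
function chooseRecoveryCheckpoint({ strategy, backup = null, now = new Date() }) {
  const WEEK_MS = 7 * 24 * 60 * 60 * 1000;
  switch (strategy) {
    case 'restore_backup':
      if (backup === null) throw new Error('No backup available');
      return { checkpoint: backup, strategy };
    case 'full_resync':
      return { checkpoint: new Date(0), strategy };
    case 'one_week':
      return { checkpoint: new Date(now.getTime() - WEEK_MS), strategy };
    default:
      // Refuse to guess: anything else needs an operator decision.
      throw new Error(`Unknown recovery strategy: ${strategy}`);
  }
}
```

Logging the returned object together with the customer ID and the operator who chose the strategy gives the audit trail described above.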

Practice 2: Idempotency at every layer

Polling, reconciliation, and processing all retry. They retry on failures, on restarts, on operator-initiated re-runs. Operations must be safe to repeat.

What makes an operation idempotent

An operation is idempotent if performing it N times has the same effect as performing it once. For sync workloads:
| Operation | Idempotent? |
| --- | --- |
| “Set user X’s email to bruce@wayne.example” | ✓ Yes: repeated sets are no-ops |
| “Add user X to group Y” | ✓ Yes if the operation checks for existing membership first |
| “Send a welcome email to user X” | ✗ No: N runs send N emails |
| “Append a participation record” | ✗ No: N runs create N records |
| “Increment a counter” | ✗ No: N runs add N |
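The difference is easy to see with plain in-memory data. The toy example below (function names are illustrative) simulates three retries of the same change: the set-based group membership ends in the same state each time, while the append keeps growing.

```javascript
// Idempotent: adding to a Set is a no-op when the member already exists.
function addToGroup(groupMembers, userId) {
  groupMembers.add(userId);
  return groupMembers.size;
}

// Not idempotent: every call appends another record.
function appendParticipation(records, userId) {
  records.push({ userId });
  return records.length;
}

const members = new Set();
const participation = [];
for (let attempt = 0; attempt < 3; attempt++) {
  addToGroup(members, 'user-42');
  appendParticipation(participation, 'user-42');
}
console.log(members.size);         // 1
console.log(participation.length); // 3
```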

Building idempotency into processing

Three patterns:

Pattern A: idempotent destination operations

Use upsert operations on the destination. The classic example:
JavaScript
// Destination has an upsert operation keyed by external ID
await externalSystem.upsertUser({
  external_id: `vomo-${user.id}`,
  email: user.email,
  first_name: user.first_name,
  last_name: user.last_name,
});
Whether this is the first sync or the hundredth, the destination ends in the same state.

Pattern B: deduplication keys

For operations that aren’t naturally idempotent (sending emails, creating records), use a deduplication key:
JavaScript
async function sendWelcomeEmailOnce(userId) {
  const key = `welcome_email:${userId}`;
  const alreadySent = await externalDb.dedupExists(key);
  if (alreadySent) return;

  await emailService.sendWelcome(userId);
  await externalDb.recordDedup(key);
}
The dedup record prevents repeated sends even if the polling cycle re-discovers the user.

Pattern C: optimistic concurrency

For operations that update existing records, use if-not-changed-since semantics:
JavaScript
async function updateUserIfNewer(externalUserId, sourceUpdatedAt, changes) {
  const current = await externalSystem.getUser(externalUserId);
  if (new Date(current.lastSyncedAt) >= new Date(sourceUpdatedAt)) {
    // Already up-to-date or newer — skip
    return { skipped: true };
  }

  await externalSystem.updateUser(externalUserId, { ...changes, lastSyncedAt: sourceUpdatedAt });
  return { updated: true };
}
A stale repeat write (e.g., from reconciliation re-discovering an already-processed record) is skipped without changing destination state.

When idempotency is impossible

For operations with one-time side effects (welcome emails, provisioning new accounts in third-party systems), dedup keys are essential. Without them, retries produce duplicated work. For workflows where you genuinely can’t guarantee idempotency, ensure the operation only happens at well-defined moments — typically only when the integration knows the user is “new” (via persistent state lookup, not heuristics).
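A check-then-act dedup like the one in Pattern B leaves a small window in which two concurrent runs can both see "not sent yet". One way to narrow it, sketched here under assumptions, is to claim the key before performing the side effect. `DedupStore` and `runOnce` are hypothetical names; the in-memory `Set` stands in for a store that can make the claim atomic across workers, for example through a unique constraint or a conditional write.

```javascript
// Hypothetical dedup store. A real one would rely on a unique constraint
// or a conditional write so that claim() is atomic across workers.
class DedupStore {
  constructor() {
    this.keys = new Set();
  }

  // Returns true only for the first caller to claim the key.
  claim(key) {
    if (this.keys.has(key)) return false;
    this.keys.add(key);
    return true;
  }

  release(key) {
    this.keys.delete(key);
  }
}

// Runs sideEffect at most once per key; releases the claim if it fails.
async function runOnce(store, key, sideEffect) {
  if (!store.claim(key)) return { skipped: true };
  try {
    await sideEffect();
    return { skipped: false };
  } catch (err) {
    store.release(key); // let a later retry attempt the side effect again
    throw err;
  }
}
```

The trade-off is deliberate: claiming first favors "never send twice" over "always send", so a crash between the claim and the send leaves the key claimed and the side effect undone. Whether that is acceptable depends on the operation.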

Practice 3: Drift detection

Knowing whether the integration is working is as important as making it work. Drift detection is the practice of continuously verifying that observable reality matches expectations.

What to measure

| Metric | What it tells you |
| --- | --- |
| Poll cycle success rate | Are polls completing without errors? |
| Records-per-cycle counts | What’s the activity rate? Sudden change = signal |
| Checkpoint lag (now - latest checkpoint) | How stale is the integration’s view of VOMO? |
| Dead-letter queue depth | How many records are unprocessed? |
| Reconciliation gap rate | What % of records does reconciliation find as gaps? |
| Field-level drift (sample) | Are field values diverging between VOMO and external? |
| API error rates by status code | What kind of failures are happening? |
| API latency percentiles | Is VOMO slowing down? |
| Per-customer breakdown of all the above | Which customers are healthy vs. struggling? |
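Several of these metrics fall out of data the integration already has. As a rough sketch, the function below derives success rate, average records per cycle, and checkpoint lag from a list of recent poll cycles. The cycle shape (`outcome`, `recordsSeen`) is an assumed log format, not a defined schema.

```javascript
// Summarizes recent poll cycles for one customer and resource.
// The cycle shape ({ outcome, recordsSeen }) is an assumed log format.
function summarizeCycles(cycles, latestCheckpoint, now = new Date()) {
  const total = cycles.length;
  const succeeded = cycles.filter((c) => c.outcome === 'success').length;
  const recordsSeen = cycles.reduce((sum, c) => sum + c.recordsSeen, 0);
  return {
    successRate: total === 0 ? null : succeeded / total,
    avgRecordsPerCycle: total === 0 ? null : recordsSeen / total,
    checkpointLagMs: now.getTime() - latestCheckpoint.getTime(),
  };
}
```

Returning `null` instead of `0` when there are no cycles keeps "no data" distinguishable from "everything failed".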

Setting up alerts

| Alert | Threshold |
| --- | --- |
| Polling has been silent | No checkpoint advance in 2x the poll interval |
| Dead-letter queue growing | DLQ depth >100 OR growing daily |
| Reconciliation finding many gaps | Daily gap rate >1% of expected records |
| 429 rate elevated | More than one 429 per polling cycle |
| 5xx rate elevated | Any sustained 5xx > 0.5% |
| Per-customer drift | Specific customer’s metrics deviating from peers |
The right alert thresholds depend on customer expectations and your operational maturity. Start with sensitive thresholds and loosen them as you learn what is noise: too few alerts is more dangerous than too many.
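The numeric thresholds in the table translate directly into a small evaluation function. The sketch below assumes a hypothetical per-customer `metrics` object; the field names and alert names are illustrative, and the peer-comparison alert is omitted because it needs cross-customer data.

```javascript
// Thresholds mirror the alert table; the metrics shape is an assumption.
function evaluateAlerts(metrics, pollIntervalMs) {
  const alerts = [];
  if (metrics.msSinceCheckpointAdvance > 2 * pollIntervalMs) alerts.push('polling_silent');
  if (metrics.dlqDepth > 100 || metrics.dlqGrowingDaily) alerts.push('dlq_growing');
  if (metrics.dailyGapRate > 0.01) alerts.push('reconciliation_gaps');
  if (metrics.http429PerCycle > 1) alerts.push('rate_limited');
  if (metrics.http5xxRate > 0.005) alerts.push('server_errors');
  return alerts;
}
```

Keeping the thresholds in one function makes them easy to review and to override per customer later.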

Dashboards over alerts

Alerts catch acute problems. Dashboards catch slow drift:
| Dashboard | Purpose |
| --- | --- |
| Per-customer sync health | Each customer’s polling success, checkpoint lag, DLQ depth, recent reconciliation results |
| Cross-customer summary | Aggregate view: total customers, % healthy, anomalies |
| Resource-by-resource trends | How active is each resource type across customers? |
| Error landscape | Top error types over time |
For partner integrations serving many customers, the per-customer dashboard is the most useful — it answers “is this specific customer’s integration healthy?” in seconds.

Practice 4: Cost vs. reliability balance

Every polling and reconciliation operation consumes API request budget. Reliability has a cost too, and each additional increment of reliability costs more than the last while delivering less value.

The cost curve

Cost
 ^
 |               *
 |             *
 |          *       <-- Diminishing returns
 |       *
 |    *
 |  *
 |*
 +--------------------> Reliability
The first 90% of reliability is cheap (basic polling + daily reconciliation). The next 9% is moderately expensive (weekly full reconciliation, sample auditing). The last 1% (real-time drift detection, per-record verification, multi-region failover) is very expensive.

For most partner integrations, target ~99% reliability. The remaining 1% is handled by:
  • Customer-visible audit trails (so issues are visible when they occur)
  • Operator escalation paths (so the team can intervene when needed)
  • Per-customer support tooling (so issues can be debugged efficiently)
Investing in perfect reliability beyond this is usually worse ROI than investing in better debuggability.

Right-sizing cadences

| Workload | Right-sized cadence |
| --- | --- |
| Customer asks for “real-time” sync | Often 5-minute polling is fine; “real-time” is rarely a hard requirement |
| Customer needs daily reporting | Hourly polling + nightly reconciliation |
| Customer needs analytics dashboards | Daily sync is often sufficient |
| Customer’s compliance team requires audit trail | The reconciliation infrastructure provides this |
Push back on “real-time” requirements — most aren’t truly real-time needs, just comfort goals. A clear conversation about what business problem the freshness solves often reveals that hourly is fine.

Per-resource cadence tuning

Within an integration:
| Resource | Cadence sensitivity |
| --- | --- |
| Users | Medium: typically affects downstream business workflows |
| Projects | Low: schedules change infrequently |
| Groups | Low: membership changes infrequently |
| Form Completions | Medium-high: often drives onboarding workflows |
| Certificates | Low: change rarely |
| Campaigns | Very low: change rarely |
Using different cadences across resources can cut total request volume by 50-80% versus a uniform “every 15 min for everything” cadence.
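A quick back-of-the-envelope calculation shows where the savings come from. The intervals below are illustrative assumptions, not recommendations, and the count is polls per day for one customer; actual request volume also depends on pagination.

```javascript
const MINUTES_PER_DAY = 24 * 60;

// Polls per day for a map of resource -> polling interval in minutes.
function pollsPerDay(cadences) {
  return Object.values(cadences)
    .reduce((sum, minutes) => sum + MINUTES_PER_DAY / minutes, 0);
}

// Uniform: every resource polled every 15 minutes.
const uniform = {
  users: 15, formCompletions: 15, projects: 15,
  groups: 15, certificates: 15, campaigns: 15,
};

// Tuned: intervals follow the sensitivity table (values are assumptions).
const tuned = {
  users: 15, formCompletions: 15, projects: 60,
  groups: 60, certificates: 360, campaigns: 1440,
};

const uniformPolls = pollsPerDay(uniform); // 576
const tunedPolls = pollsPerDay(tuned);     // 245
const reduction = 1 - tunedPolls / uniformPolls;
console.log(`${Math.round(reduction * 100)}% fewer polls`); // 57% fewer polls
```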

Practice 5: Debuggability

When sync breaks, the team needs to be able to find out why quickly. The practices that enable this:

Structured logging

Every polling and reconciliation operation should log:
| Field | Why |
| --- | --- |
| customer_id | For per-customer drill-down |
| resource | Which resource was being processed |
| operation | Poll? Reconciliation? Process? |
| checkpoint_before / checkpoint_after | Did the checkpoint advance? |
| records_seen / records_processed | Activity counts |
| duration_ms | Performance signal |
| outcome | success / partial / failed |
| error (if applicable) | Error details with stack trace |
Structured logs (JSON, not strings) make filtering and aggregation possible.
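As a sketch of what one such log line might look like, the helper below assembles the fields from the table into a single JSON string. The function name, its parameter names, and the rule for deriving `outcome` are assumptions made for illustration.

```javascript
// Builds one JSON log line containing the fields from the table.
// Parameter names and the outcome rule are illustrative assumptions.
function buildSyncLogEntry({
  customerId, resource, operation, checkpointBefore, checkpointAfter,
  recordsSeen, recordsProcessed, startedAt, finishedAt, error = null,
}) {
  let outcome = 'success';
  if (error) outcome = 'failed';
  else if (recordsProcessed < recordsSeen) outcome = 'partial';

  const entry = {
    customer_id: customerId,
    resource,
    operation,
    checkpoint_before: checkpointBefore.toISOString(),
    checkpoint_after: checkpointAfter.toISOString(),
    records_seen: recordsSeen,
    records_processed: recordsProcessed,
    duration_ms: finishedAt.getTime() - startedAt.getTime(),
    outcome,
  };
  if (error) entry.error = { message: error.message, stack: error.stack };
  return JSON.stringify(entry);
}
```

Emitting one object per operation, with stable snake_case keys, keeps the log store queryable by any of these fields.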

Trace IDs across the pipeline

For each polling cycle, generate a trace ID and propagate it through every operation:
JavaScript
async function pollUserChanges(customerId) {
  const traceId = generateTraceId();
  logger.info('Poll started', { customerId, traceId, resource: 'users' });

  try {
    const result = await doPoll(customerId, traceId);
    logger.info('Poll completed', { customerId, traceId, result });
    return result;
  } catch (err) {
    logger.error('Poll failed', { customerId, traceId, error: err.message, stack: err.stack });
    throw err;
  }
}
The trace ID lets you reconstruct everything that happened in a specific poll cycle later. When a customer says “data for user X is missing,” find the cycles that covered the relevant date range, then search by each cycle’s trace ID to see exactly which records it saw and what happened to them.

Per-record audit trail

For each record processed, record:
JavaScript
async function processUserChange(customerId, user, traceId) {
  await db.recordProcessing({
    customer_id: customerId,
    resource: 'user',
    record_id: user.id,
    record_updated_at: user.updated_at,
    processed_at: new Date(),
    trace_id: traceId,
    outcome: 'success', // or 'failed' on error path
  });
}
The per-record audit is what answers “when did we last process user X?” — essential for both reconciliation and customer support.
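The "when did we last process user X?" lookup can be sketched as a filter over audit rows. Here `auditRows` is an in-memory array standing in for a query against the audit table; the row shape follows the fields written above, with `processed_at` as a `Date`.

```javascript
// Returns the most recent audit row for a record, or null if it was never processed.
// `auditRows` stands in for a query against the audit table (assumed shape).
function lastProcessed(auditRows, customerId, recordId) {
  const matches = auditRows.filter(
    (row) => row.customer_id === customerId && row.record_id === recordId
  );
  if (matches.length === 0) return null;
  return matches.reduce((latest, row) =>
    (row.processed_at > latest.processed_at ? row : latest)
  );
}
```

The returned row carries the `trace_id`, which links the answer back to the full log of the cycle that processed the record.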

Inspection tooling

Build operator tooling that exposes:
| Query | What it answers |
| --- | --- |
| Show me the recent poll cycles for customer X | Is polling healthy for this customer? |
| Show me records processed for user Y in customer X | Is this user’s data being synced? |
| Show me the dead-letter queue for customer X | What’s failing? |
| Show me the reconciliation gaps from yesterday | What did reconciliation catch? |
| Show me API errors in the last hour | What’s broken right now? |
The tooling doesn’t need to be fancy — even simple CLI scripts that hit the structured-log store and per-record database are sufficient for most debugging needs.

Reproduction without production

For investigating issues, the ability to re-run a polling cycle against historical state is valuable:
JavaScript
async function replayPollCycle(customerId, fromCheckpoint, toCheckpoint) {
  // Run the polling logic against the time window, but emit to a dry-run destination
  const dryRun = new DryRunDestination();

  const result = await runPollLogic({
    customerId,
    fromCheckpoint,
    toCheckpoint,
    destination: dryRun,
  });

  return {
    result,
    operationsThatWouldHaveRun: dryRun.recordedOperations,
  };
}
Replay against past time windows in a dry-run mode helps investigate “why didn’t this record get processed?” questions.
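A dry-run destination can be as simple as a class that records each call instead of performing it. The sketch below assumes the destination interface consists of the `upsertUser` and `updateUser` calls used earlier on this page; a real destination interface may expose different or additional methods.

```javascript
// Records operations instead of performing them. Method names mirror the
// destination calls used earlier on this page; the real interface may differ.
class DryRunDestination {
  constructor() {
    this.recordedOperations = [];
  }

  async upsertUser(user) {
    this.recordedOperations.push({ op: 'upsertUser', payload: user });
    return { dryRun: true };
  }

  async updateUser(externalUserId, changes) {
    this.recordedOperations.push({ op: 'updateUser', externalUserId, payload: changes });
    return { dryRun: true };
  }
}
```

Because the polling logic receives the destination as a parameter, the same code path runs in production and in replay; only the object behind it changes.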

Putting it together: a polling reliability blueprint

A reference blueprint for production polling reliability is built from the following components:
| Component | Role |
| --- | --- |
| Polling worker | Calls VOMO; advances checkpoints; processes records |
| Durable checkpoint store | Per-customer-per-resource last-processed timestamps |
| Per-record processing | Idempotent; writes to destination + audit log |
| Dead-letter queue | Records that failed processing; retried separately |
| Reconciliation worker | Finds gaps; re-processes from DLQ; detects deletions |
| Per-record audit log | Records every processing attempt with outcome |
| Drift detection metrics | Aggregates audit log into health metrics |
| Alerts + dashboards | Surfaces problems to operators |
The complexity is real, but each component does one well-defined job. The combination produces production-grade reliability that scales across many customers.

Common anti-patterns

A few patterns that look reasonable but cause production issues:

Anti-pattern: “the polling worker IS the integration”

Some integrations are built around a single all-in-one polling worker that does everything. When it breaks, everything breaks. Better: separate polling from processing. Polling enqueues changes; a separate worker processes them. Each can fail and recover independently.
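The separation can be sketched with a queue between the two halves. The array below is an in-memory stand-in for a durable queue, and `fetchChanges` and `processChange` are hypothetical caller-supplied functions; a production setup would use a real message queue or a database table.

```javascript
// In-memory stand-in for a durable queue (assumption: production would use
// a real message queue or a database table).
const changeQueue = [];

// Poller: fetches changes and enqueues them; it does no processing itself.
async function pollAndEnqueue(fetchChanges, since) {
  const changes = await fetchChanges(since);
  for (const change of changes) changeQueue.push(change);
  return changes.length;
}

// Processor: drains the queue independently of polling.
async function drainQueue(processChange) {
  let processed = 0;
  while (changeQueue.length > 0) {
    const change = changeQueue[0];
    await processChange(change); // if this throws, the change stays at the head
    changeQueue.shift();
    processed++;
  }
  return processed;
}
```

A processing failure leaves the remaining changes queued, so the poller can keep running and the processor can resume where it stopped.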

Anti-pattern: “we’ll catch deletions someday”

Deletion detection is hard, so it’s often deferred. Then a customer asks why deleted volunteers still appear in their reports — and the answer is “we don’t sync deletions.” Better: build deletion detection from the start, even if it’s just weekly. The infrastructure is the same as full reconciliation; you’re doing it anyway.

Anti-pattern: “we’ll just re-sync everything when there’s a problem”

For small customers this works. For large customers, “re-sync everything” is hours of API calls and processing. Plan for incremental recovery, not just full reset.

Anti-pattern: “the audit log is just for compliance”

Audit logs become invaluable for debugging. Make them queryable, filterable, and indexed — not just write-only.

Anti-pattern: “if reconciliation finds gaps, auto-fix them”

Sometimes reconciliation finds gaps because polling has a bug. Auto-fixing hides the bug behind reconciliation’s automatic correction. Treat sustained reconciliation gaps as a signal to investigate, not just a checklist item to clear.
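One simple way to separate a one-off gap from a sustained pattern is to look at the gap rate over several consecutive days. The function below is a sketch; the 1% threshold and three-day window are illustrative defaults, not recommended values.

```javascript
// True when the gap rate exceeded the threshold on each of the last `days` days.
// `dailyGapRates` is ordered oldest to newest; defaults are illustrative.
function hasSustainedGaps(dailyGapRates, threshold = 0.01, days = 3) {
  if (dailyGapRates.length < days) return false;
  return dailyGapRates.slice(-days).every((rate) => rate > threshold);
}
```

A `true` result is a prompt to investigate the polling path itself, even if reconciliation has already repaired the individual records.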

Anti-pattern: hardcoded cadences across customers

Different customers have different scales and needs. Hardcoded “every 15 minutes for everything” works initially but scales poorly. Make cadence per-customer-configurable from the start.
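Per-customer configurability can start as a lookup with defaults. In the sketch below, the default values and the `resolveCadenceMinutes` helper are assumptions for illustration; the idea is that a customer override wins, then a per-resource default, then a global fallback.

```javascript
// Default cadences in minutes (illustrative values, not recommendations).
const DEFAULT_CADENCE_MINUTES = { users: 15, projects: 60, groups: 60 };

// Customer overrides win; unknown resources fall back to a global default.
function resolveCadenceMinutes(customerOverrides, resource, fallback = 60) {
  return customerOverrides?.[resource] ?? DEFAULT_CADENCE_MINUTES[resource] ?? fallback;
}
```

For example, `resolveCadenceMinutes({ users: 5 }, 'users')` returns the customer's 5-minute override, while a customer with no overrides gets the 15-minute default.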

A maturity model

Where is your integration on the polling reliability maturity model?
| Level | Characteristics |
| --- | --- |
| 1: Working | Polling runs; data flows; occasional gaps tolerated |
| 2: Monitored | Per-customer dashboards exist; obvious failures detected |
| 3: Reconciled | Daily reconciliation catches gaps; deletion detection works |
| 4: Audited | Per-record audit trail; sample-based drift detection; alerts on metrics |
| 5: Self-healing | Reconciliation auto-corrects; failures trigger graduated responses; customer-facing health visibility |
Most production integrations land between Level 3 and Level 4. Level 5 is reserved for the highest-stakes integrations (compliance-critical workflows, financial reporting, etc.). For each level, the previous level’s practices are foundational — you can’t skip from Level 1 to Level 4.

Production checklist

For a polling integration at Level 3+:
  • Checkpoints stored in durable, transactional storage
  • Checkpoints advanced only after successful per-record processing
  • Per-record processing operations are idempotent
  • Dedup keys protect non-idempotent operations
  • Per-resource cadences tuned to actual freshness needs
  • Daily incremental reconciliation runs for each resource
  • Weekly full reconciliation including deletion detection
  • Per-record audit trail with trace IDs
  • Per-customer dashboards exposing sync health
  • Alerts on stalled checkpoints, growing DLQ, elevated error rates, reconciliation gap rates
  • Cadences are per-customer-configurable
  • Replay/dry-run tooling exists for debugging
  • Documented runbook for common failure modes
  • On-call playbook for the most common alerts

Where to go next

Reconciliation Patterns

The companion page on the reconciliation patterns this page builds on.

Sync Architecture Patterns

The broader architectural patterns these practices fit into.

Error Recovery Patterns

The error-handling patterns that support this reliability model.

API Performance Tips

The performance patterns that keep polling efficient at scale.
Last modified on May 22, 2026