Change Detection Best Practices

This page covers the cross-cutting practices that turn a working polling integration into a production-grade one. The previous pages in this group cover the mechanics (polling, user-change detection, project-change detection, reconciliation). This one covers the practices that span them — checkpointing, idempotency, drift detection, cost/reliability trade-offs, and debugging. The audience is integration architects designing a polling architecture that will run in production for years, not engineers prototyping for a demo.

The five core practices

Practice	Why it matters
Durable checkpointing	The checkpoint is the integration’s state — losing it loses sync correctness
Idempotency at every layer	Polling, reconciliation, and processing all retry — operations must tolerate it
Drift detection	Knowing the integration is working is as important as making it work
Cost/reliability balance	Every poll has a cost; reliability has limits — optimize the trade-off
Debuggability	When sync breaks, the team needs to be able to find out why quickly

Each of these is a multi-page topic in itself. This page covers the essentials.

Practice 1: Durable checkpointing

The checkpoint — “we’ve processed everything up to time X” — is the most important piece of state in the integration. If it’s lost or corrupted, the integration’s understanding of “what’s been done” is wrong.

Where to store checkpoints

Storage	Assessment
Relational database (Postgres, MySQL)	Recommended: strongly consistent, and checkpoint updates can share a transaction with other state changes
Key-value store (Redis, DynamoDB)	Workable: fast reads and simple write semantics; pair with a backup
Cloud secrets manager	Poor fit: secrets managers are built for credentials, not for state that is read and written on every cycle
Filesystem	Avoid: survives crashes poorly and doesn’t scale across instances
In-memory only	Avoid: lost on every restart

For most partner integrations, a relational database table works:

CREATE TABLE sync_checkpoints (
  customer_id VARCHAR NOT NULL,
  resource    VARCHAR NOT NULL,  -- 'user_sync', 'project_sync', etc.
  checkpoint  TIMESTAMP NOT NULL,
  updated_at  TIMESTAMP DEFAULT NOW(),
  PRIMARY KEY (customer_id, resource)
);

The composite primary key gives one row per customer per resource. Update the row at the end of each successful poll cycle.
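One way to write that end-of-cycle update is a single upsert statement. The sketch below assumes Postgres and a `db` object exposing `query(text, values)` (the call shape used by node-postgres). `saveCheckpoint` is a hypothetical helper name, and the `WHERE` guard that refuses to move a checkpoint backwards is an added safety assumption, not something the schema requires.

```javascript
// Hypothetical helper: upsert one checkpoint row per (customer, resource).
// Assumes Postgres syntax and a `db` object with query(text, values).
const UPSERT_CHECKPOINT_SQL = `
  INSERT INTO sync_checkpoints (customer_id, resource, checkpoint, updated_at)
  VALUES ($1, $2, $3, NOW())
  ON CONFLICT (customer_id, resource)
  DO UPDATE SET checkpoint = EXCLUDED.checkpoint, updated_at = NOW()
  WHERE sync_checkpoints.checkpoint < EXCLUDED.checkpoint
`;

async function saveCheckpoint(db, customerId, resource, checkpoint) {
  if (!(checkpoint instanceof Date) || Number.isNaN(checkpoint.getTime())) {
    throw new TypeError('checkpoint must be a valid Date');
  }
  // Parameterized values keep customer-supplied strings out of the SQL text.
  return db.query(UPSERT_CHECKPOINT_SQL, [customerId, resource, checkpoint]);
}
```

Because the statement is an upsert, the first poll for a new customer and every later poll go through the same code path.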

Atomic checkpoint advancement

The critical pattern: never advance the checkpoint before processing is complete.

JavaScript

// ❌ Anti-pattern: advance early
await setCheckpoint(customerId, 'user_sync', latestSeen); // What if processing crashes?
for (const user of users) {
  await processUser(user); // Crash here → checkpoint is past these users
}

// ✅ Correct: advance after all processing succeeds
let latestSeen = currentCheckpoint;
for (const user of users) {
  await processUser(user);
  const u = new Date(user.updated_at);
  if (u > latestSeen) latestSeen = u;
}
await setCheckpoint(customerId, 'user_sync', latestSeen); // After processing

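For workflows where processing can fail per record (the DLQ pattern), the principle still holds: advance only past records that are accounted for, meaning processed successfully or durably parked in the dead-letter queue. Advance to the latest such record’s updated_at, never to wall-clock time.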
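The per-record failure case can be sketched as follows. `processBatch`, `processRecord`, and `deadLetter` are hypothetical names, and the sketch assumes the next poll queries from the checkpoint inclusively, so that a record sharing a timestamp with the checkpoint is seen again.

```javascript
// Sketch: advance the checkpoint only past records that are accounted for,
// meaning processed successfully or durably parked in a dead-letter queue.
// `processRecord` and `deadLetter` are hypothetical async callbacks.
async function processBatch(records, currentCheckpoint, processRecord, deadLetter) {
  // Oldest first, so the checkpoint never jumps over an unhandled record.
  const ordered = [...records].sort(
    (a, b) => new Date(a.updated_at) - new Date(b.updated_at)
  );
  let checkpoint = currentCheckpoint;
  let failed = 0;

  for (const record of ordered) {
    try {
      await processRecord(record);
    } catch (err) {
      try {
        await deadLetter(record, err); // retried separately, later
        failed += 1;
      } catch (dlqErr) {
        // Could not even park the record: stop here and leave the
        // checkpoint behind it so the next cycle sees it again.
        break;
      }
    }
    const updatedAt = new Date(record.updated_at);
    if (updatedAt > checkpoint) checkpoint = updatedAt;
  }
  return { checkpoint, failed };
}
```

The caller persists the returned `checkpoint` only after this function resolves, which keeps the "process first, advance second" ordering intact.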

Checkpoint backup

For high-stakes integrations, back up checkpoints separately from the primary store:

JavaScript

async function setCheckpointWithBackup(customerId, resource, checkpoint) {
  // Primary store
  await db.upsert('sync_checkpoints', {
    customer_id: customerId,
    resource,
    checkpoint,
  });

  // Backup store (different region, different provider)
  await backupStore.set(`checkpoint:${customerId}:${resource}`, checkpoint);
}

If the primary store is lost or corrupted (region outage, accidental DELETE, etc.), the backup allows recovery without a full re-sync.

Recovery from a lost checkpoint

If a checkpoint is missing for a customer:

Path	When to use
Restore from backup	If a backup exists and is reasonably current
Reset to “all time” (`new Date(0)`)	Will re-process every record; expensive but safe
Reset to “1 week ago”	If you can tolerate possible misses older than a week
Manual operator decision	For high-value customers; document the rationale

Don’t silently “guess” — pick a strategy explicitly per customer. Logging the recovery decision gives an audit trail when questions arise later.
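A minimal sketch of making that choice explicit and logged is shown below. The strategy names, the `resolveMissingCheckpoint` helper, and the `log` callback are all hypothetical; the point is only that an unconfigured customer raises an error instead of falling back to a guess.

```javascript
// Sketch: resolve a missing checkpoint from an explicitly chosen strategy
// and log the decision. Strategy names are illustrative assumptions.
const ONE_WEEK_MS = 7 * 24 * 60 * 60 * 1000;

function resolveMissingCheckpoint({
  customerId, resource, strategy, backup, now = new Date(), log = console.log,
}) {
  let checkpoint;
  switch (strategy) {
    case 'restore_backup':
      if (!(backup instanceof Date)) {
        throw new Error('restore_backup chosen but no backup checkpoint supplied');
      }
      checkpoint = backup;
      break;
    case 'all_time':
      checkpoint = new Date(0); // re-process every record
      break;
    case 'one_week':
      checkpoint = new Date(now.getTime() - ONE_WEEK_MS);
      break;
    default:
      // No silent guessing: a missing strategy goes to an operator.
      throw new Error(`No recovery strategy configured for ${customerId}/${resource}`);
  }
  log(JSON.stringify({
    event: 'checkpoint_recovery',
    customer_id: customerId,
    resource,
    strategy,
    checkpoint: checkpoint.toISOString(),
  }));
  return checkpoint;
}
```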

Practice 2: Idempotency at every layer

Polling, reconciliation, and processing all retry. They retry on failures, on restarts, on operator-initiated re-runs. Operations must be safe to repeat.

What makes an operation idempotent

An operation is idempotent if performing it N times has the same effect as performing it once. For sync workloads:

Operation	Idempotent?
“Set user X’s email to bruce@wayne.example”	✓ Yes: repeated sets are no-ops
“Add user X to group Y”	✓ Yes, if the operation checks for existing membership first
“Send a welcome email to user X”	✗ No: N runs send N emails
“Append a participation record”	✗ No: N runs create N records
“Increment a counter”	✗ No: N runs add N
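The distinction in the table can be checked directly by applying each kind of operation twice to a small in-memory state. The state shape and function names below are illustrative only.

```javascript
// Applying each operation twice shows which ones are safe to retry.
const state = { email: null, groupMembers: new Set(), participation: [], counter: 0 };

const setEmail = (s) => { s.email = 'bruce@wayne.example'; };    // idempotent
const addToGroup = (s) => { s.groupMembers.add('user-x'); };      // idempotent: a Set ignores repeats
const appendRecord = (s) => { s.participation.push('user-x'); };  // not idempotent
const increment = (s) => { s.counter += 1; };                     // not idempotent

for (const op of [setEmail, addToGroup, appendRecord, increment]) {
  op(state);
  op(state); // simulated retry
}

console.log(state.email, state.groupMembers.size, state.participation.length, state.counter);
// prints: bruce@wayne.example 1 2 2
```

The first two values are what a single run would have produced; the last two have doubled, which is exactly the failure a retry would cause in production.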

Building idempotency into processing

Three patterns:

Pattern A: idempotent destination operations

Use upsert operations on the destination. The classic example:

JavaScript

// Destination has an upsert operation keyed by external ID
await externalSystem.upsertUser({
  external_id: `vomo-${user.id}`,
  email: user.email,
  first_name: user.first_name,
  last_name: user.last_name,
});

Whether this is the first sync or the hundredth, the destination ends in the same state.

Pattern B: deduplication keys

For operations that aren’t naturally idempotent (sending emails, creating records), use a deduplication key:

JavaScript

async function sendWelcomeEmailOnce(userId) {
  const key = `welcome_email:${userId}`;
  const alreadySent = await externalDb.dedupExists(key);
  if (alreadySent) return;

  await emailService.sendWelcome(userId);
  await externalDb.recordDedup(key);
}

The dedup record prevents repeated sends even if the polling cycle re-discovers the user.

Pattern C: optimistic concurrency

For operations that update existing records, use if-not-changed-since semantics:

JavaScript

async function updateUserIfNewer(externalUserId, sourceUpdatedAt, changes) {
  const current = await externalSystem.getUser(externalUserId);
  if (new Date(current.lastSyncedAt) >= new Date(sourceUpdatedAt)) {
    // Already up-to-date or newer — skip
    return { skipped: true };
  }

  await externalSystem.updateUser(externalUserId, { ...changes, lastSyncedAt: sourceUpdatedAt });
  return { updated: true };
}

A stale repeat write (e.g., from reconciliation re-discovering an already-processed record) is skipped without changing destination state.

When idempotency is impossible

For operations with one-time side effects (welcome emails, provisioning new accounts in third-party systems), dedup keys are essential. Without them, retries produce duplicated work. For workflows where you genuinely can’t guarantee idempotency, ensure the operation only happens at well-defined moments — typically only when the integration knows the user is “new” (via persistent state lookup, not heuristics).
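When two workers can reach the same record at once, a check-then-send sequence leaves a window in which both see "not sent yet". One hedged variation is to claim the dedup key atomically before the side effect. In the sketch below, `runOnce` and `claimStore` are hypothetical; a real store would implement `claim` with something like an insert against a unique key or a set-if-absent operation.

```javascript
// Sketch: claim the dedup key atomically *before* the side effect, so two
// concurrent workers cannot both run it. `claimStore.claim(key)` is assumed
// to resolve to true only for the first caller.
async function runOnce(claimStore, key, sideEffect) {
  const claimed = await claimStore.claim(key);
  if (!claimed) return { ran: false };

  try {
    await sideEffect();
    return { ran: true };
  } catch (err) {
    // Release the claim so a later retry can try again. If the side effect
    // may have partly happened, leaving the claim and alerting is safer.
    await claimStore.release(key);
    throw err;
  }
}

// In-memory stand-in for the store, for illustration only.
function createMemoryClaimStore() {
  const keys = new Set();
  return {
    async claim(key) {
      if (keys.has(key)) return false;
      keys.add(key);
      return true;
    },
    async release(key) {
      keys.delete(key);
    },
  };
}
```

The trade-off is deliberate: claiming first means a crash between the claim and the send results in no email rather than two, which is usually the better failure for one-time side effects.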

Practice 3: Drift detection

Knowing whether the integration is working is as important as making it work. Drift detection is the practice of continuously verifying that observable reality matches expectations.

What to measure

Metric	What it tells you
Poll cycle success rate	Are polls completing without errors?
Records-per-cycle counts	What’s the activity rate? Sudden change = signal
Checkpoint lag (now - latest checkpoint)	How stale is the integration’s view of VOMO?
Dead-letter queue depth	How many records are unprocessed?
Reconciliation gap rate	What % of records does reconciliation find as gaps?
Field-level drift (sample)	Are field values diverging between VOMO and the external system?
API error rates by status code	What kind of failures are happening?
API latency percentiles	Is VOMO slowing down?
Per-customer breakdown of all the above	Which customers are healthy vs. struggling?

Setting up alerts

Alert	Threshold
Polling has been silent	No checkpoint advance in 2x the poll interval
Dead-letter queue growing	DLQ depth >100 OR growing daily
Reconciliation finding many gaps	Daily gap rate >1% of expected records
429 rate elevated	More than one 429 per polling cycle
5xx rate elevated	Sustained 5xx rate above 0.5%
Per-customer drift	Specific customer’s metrics deviating from peers

The right alert thresholds depend on customer expectations and your operational maturity. Start with sensitive thresholds and loosen them as you learn what is noise: too few alerts is more dangerous than too many.
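The threshold table can be expressed as a small evaluation function. This is a sketch under stated assumptions: the field names on the metrics object are hypothetical, "growing daily" is approximated by comparing today's DLQ depth with yesterday's, and the per-customer peer comparison is left out because it needs cross-customer data.

```javascript
// Sketch: evaluate the alert table against one customer's metrics.
// Field names on `m` are hypothetical; thresholds mirror the table above.
function evaluateAlerts(m, now = new Date()) {
  const alerts = [];
  const lagMs = now.getTime() - m.lastCheckpointAdvance.getTime();

  if (lagMs > 2 * m.pollIntervalMs) alerts.push('polling_silent');
  if (m.dlqDepth > 100 || m.dlqDepth > m.dlqDepthYesterday) alerts.push('dlq_growing');
  if (m.expectedRecords > 0 && m.reconciliationGaps / m.expectedRecords > 0.01) {
    alerts.push('reconciliation_gaps');
  }
  // More than one 429 per polling cycle, averaged over the window.
  if (m.http429Count / Math.max(m.pollCycles, 1) > 1) alerts.push('rate_limited');
  if (m.totalRequests > 0 && m.http5xxCount / m.totalRequests > 0.005) {
    alerts.push('server_errors');
  }
  return alerts;
}
```

Keeping the thresholds in one function makes them easy to review and, later, to turn into per-customer configuration.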

Dashboards over alerts

Alerts catch acute problems. Dashboards catch slow drift:

Dashboard	Purpose
Per-customer sync health	Each customer’s polling success, checkpoint lag, DLQ depth, recent reconciliation results
Cross-customer summary	Aggregate view — total customers, % healthy, anomalies
Resource-by-resource trends	How active is each resource type across customers?
Error landscape	Top error types over time

For partner integrations serving many customers, the per-customer dashboard is the most useful — it answers “is this specific customer’s integration healthy?” in seconds.

Practice 4: Cost vs. reliability balance

Every polling and reconciliation operation consumes API request budget. Higher reliability consumes more of it, and beyond a certain point each additional increment of reliability costs more than it is worth.

The cost curve

Cost
 ^
 |               *
 |             *
 |          *       <-- Diminishing returns
 |       *
 |    *
 |  *
 |*
 +--------------------> Reliability

The first 90% of reliability is cheap (basic polling + daily reconciliation). The next 9% is moderately expensive (weekly full reconciliation, sample auditing). The last 1% (real-time drift detection, per-record verification, multi-region failover) is very expensive. For most partner integrations, target ~99% reliability. The remaining 1% is handled by:

Customer-visible audit trails (so issues are visible when they occur)
Operator escalation paths (so the team can intervene when needed)
Per-customer support tooling (so issues can be debugged efficiently)

Beyond this point, investing in perfect reliability usually yields a worse return than investing in better debuggability.

Right-sizing cadences

Workload	Right-sized cadence
Customer asks for “real-time” sync	Often 5-minute polling is fine; “real-time” is rarely a hard requirement
Customer needs daily reporting	Hourly polling + nightly reconciliation
Customer needs analytics dashboards	Daily sync is often sufficient
Customer’s compliance team requires audit trail	The reconciliation infrastructure provides this

Push back on “real-time” requirements — most aren’t truly real-time needs, just comfort goals. A clear conversation about what business problem the freshness solves often reveals that hourly is fine.

Per-resource cadence tuning

Within an integration:

Resource	Cadence sensitivity
Users	Medium — typically affects downstream business workflows
Projects	Low — schedules change infrequently
Groups	Low — membership changes infrequently
Form Completions	Medium-high — often drives onboarding workflows
Certificates	Low — change rarely
Campaigns	Very low — change rarely

Tuning cadences per resource can cut total request volume by 50-80% compared with a uniform “every 15 min for everything” cadence; the exact saving depends on which resources can tolerate slower polling.
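A quick worked calculation shows where a saving of that size comes from. The tuned intervals below are illustrative assumptions, not recommendations, and each poll is counted as a single request (pagination would scale both sides).

```javascript
// Worked arithmetic: polls per day under a uniform cadence versus tuned ones.
const MINUTES_PER_DAY = 24 * 60;

// Illustrative intervals, in minutes, for the six resources in the table.
const tunedIntervalMinutes = {
  users: 15,
  form_completions: 15,
  projects: 60,
  groups: 60,
  certificates: 1440,
  campaigns: 1440,
};

function pollsPerDay(intervals) {
  return Object.values(intervals).reduce(
    (total, minutes) => total + MINUTES_PER_DAY / minutes, 0);
}

const uniformIntervalMinutes = Object.fromEntries(
  Object.keys(tunedIntervalMinutes).map((resource) => [resource, 15]));

const uniformPolls = pollsPerDay(uniformIntervalMinutes); // 6 * 96 = 576
const tunedPolls = pollsPerDay(tunedIntervalMinutes);     // 96 + 96 + 24 + 24 + 1 + 1 = 242
const reduction = 1 - tunedPolls / uniformPolls;

console.log(uniformPolls, tunedPolls, `${Math.round(reduction * 100)}%`);
// prints: 576 242 58%
```

Multiply both totals by the number of customers to see the effect on the overall request budget.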

Practice 5: Debuggability

When sync breaks, the team needs to be able to find out why quickly. The practices that enable this:

Structured logging

Every polling and reconciliation operation should log:

Field	Why
`customer_id`	For per-customer drill-down
`resource`	Which resource was being processed
`operation`	Poll? Reconciliation? Process?
`checkpoint_before` / `checkpoint_after`	Did the checkpoint advance?
`records_seen` / `records_processed`	Activity counts
`duration_ms`	Performance signal
`outcome`	success / partial / failed
`error` (if applicable)	Error details with stack trace

Structured logs (JSON, not strings) make filtering and aggregation possible.
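As a sketch of what one such entry could look like, the helper below builds a single JSON line from the fields in the table. `buildSyncLogEntry` is a hypothetical name, not part of any logging library, and the rule for deriving `outcome` is an assumption.

```javascript
// Sketch: build one JSON log line per operation with the fields listed above.
function buildSyncLogEntry({
  customerId, resource, operation, checkpointBefore, checkpointAfter,
  recordsSeen, recordsProcessed, startedAt, finishedAt, error,
}) {
  // Assumed rule: any error is a failure; fewer processed than seen is partial.
  let outcome = 'success';
  if (error) outcome = 'failed';
  else if (recordsProcessed < recordsSeen) outcome = 'partial';

  const entry = {
    customer_id: customerId,
    resource,
    operation,
    checkpoint_before: checkpointBefore.toISOString(),
    checkpoint_after: checkpointAfter.toISOString(),
    records_seen: recordsSeen,
    records_processed: recordsProcessed,
    duration_ms: finishedAt - startedAt, // Date subtraction yields milliseconds
    outcome,
  };
  if (error) entry.error = { message: error.message, stack: error.stack };
  return JSON.stringify(entry);
}
```

Because every entry has the same keys, a log store can filter on `customer_id` and aggregate on `outcome` without parsing free text.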

Trace IDs across the pipeline

For each polling cycle, generate a trace ID and propagate it through every operation:

JavaScript

async function pollUserChanges(customerId) {
  const traceId = generateTraceId();
  logger.info('Poll started', { customerId, traceId, resource: 'users' });

  try {
    const result = await doPoll(customerId, traceId);
    logger.info('Poll completed', { customerId, traceId, result });
    return result;
  } catch (err) {
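    // Include the stack so the log entry carries full error details
    logger.error('Poll failed', { customerId, traceId, resource: 'users', error: err.message, stack: err.stack });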
    throw err;
  }
}

The trace ID lets you reconstruct everything that happened in a specific poll cycle later. When a customer says “data for user X is missing,” searching logs by user X plus a date range pulls up the exact cycle that should have processed it.

Per-record audit trail

For each record processed, record:

JavaScript

async function processUserChange(customerId, user, traceId) {
  await db.recordProcessing({
    customer_id: customerId,
    resource: 'user',
    record_id: user.id,
    record_updated_at: user.updated_at,
    processed_at: new Date(),
    trace_id: traceId,
    outcome: 'success', // or 'failed' on error path
  });
}

The per-record audit is what answers “when did we last process user X?” — essential for both reconciliation and customer support.
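To record the outcome on the failure path as well as the success path, the audit write can sit in a `finally` block around the real processing step. In this sketch, `processWithAudit`, `handler`, and `audit.record` are hypothetical names.

```javascript
// Sketch: write an audit row whether processing succeeds or fails.
// `handler` is the real processing step; `audit.record` persists one row.
async function processWithAudit({ customerId, resource, record, traceId, handler, audit }) {
  let outcome = 'success';
  let errorMessage = null;
  try {
    await handler(record);
  } catch (err) {
    outcome = 'failed';
    errorMessage = err.message;
    throw err; // the caller decides about retries or dead-lettering
  } finally {
    await audit.record({
      customer_id: customerId,
      resource,
      record_id: record.id,
      record_updated_at: record.updated_at,
      processed_at: new Date(),
      trace_id: traceId,
      outcome,
      error: errorMessage,
    });
  }
}
```

Recording failed attempts is what lets support distinguish "we never saw this record" from "we saw it and processing failed".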

Inspection tooling

Build operator tooling that exposes:

Query	What it answers
Show me the recent poll cycles for customer X	Is polling healthy for this customer?
Show me records processed for user Y in customer X	Is this user’s data being synced?
Show me the dead-letter queue for customer X	What’s failing?
Show me the reconciliation gaps from yesterday	What did reconciliation catch?
Show me API errors in the last hour	What’s broken right now?

The tooling doesn’t need to be fancy — even simple CLI scripts that hit the structured-log store and per-record database are sufficient for most debugging needs.

Reproduction without production

For investigating issues, the ability to re-run a polling cycle against historical state is valuable:

JavaScript

async function replayPollCycle(customerId, fromCheckpoint, toCheckpoint) {
  // Run the polling logic against the time window, but emit to a dry-run destination
  const dryRun = new DryRunDestination();

  const result = await runPollLogic({
    customerId,
    fromCheckpoint,
    toCheckpoint,
    destination: dryRun,
  });

  return {
    result,
    operationsThatWouldHaveRun: dryRun.recordedOperations,
  };
}

Replay against past time windows in a dry-run mode helps investigate “why didn’t this record get processed?” questions.
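The replay function above relies on a `DryRunDestination` that is not shown. A minimal sketch follows; the method names (`upsertUser`, `deleteUser`) are assumptions about what the polling logic calls on a real destination and would need to match the actual interface.

```javascript
// Sketch of a dry-run destination: same method names as the (assumed) real
// destination, but each call is only recorded, never executed.
class DryRunDestination {
  constructor() {
    this.recordedOperations = [];
  }

  async upsertUser(user) {
    // Shallow copy so later mutation of `user` does not alter the record.
    this.recordedOperations.push({ op: 'upsertUser', args: { ...user } });
    return { dryRun: true };
  }

  async deleteUser(externalId) {
    this.recordedOperations.push({ op: 'deleteUser', args: externalId });
    return { dryRun: true };
  }
}
```

Because the class records operations in call order, comparing `recordedOperations` with the per-record audit log for the same window shows which writes should have happened but did not.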

Putting it together: a polling reliability blueprint

A reference blueprint for production polling reliability has the following components:

Component	Role
Polling worker	Calls VOMO; advances checkpoints; processes records
Durable checkpoint store	Per-customer-per-resource last-processed timestamps
Per-record processing	Idempotent; writes to destination + audit log
Dead-letter queue	Records that failed processing; retried separately
Reconciliation worker	Finds gaps; re-processes from DLQ; detects deletions
Per-record audit log	Records every processing attempt with outcome
Drift detection metrics	Aggregates audit log into health metrics
Alerts + dashboards	Surfaces problems to operators

The complexity is real, but each component does one well-defined job. The combination produces production-grade reliability that scales across many customers.

Common anti-patterns

A few patterns that look reasonable but cause production issues:

Anti-pattern: “the polling worker IS the integration”

Some integrations are built around a single all-in-one polling worker that does everything. When it breaks, everything breaks. Better: separate polling from processing. Polling enqueues changes; a separate worker processes them. Each can fail and recover independently.

Anti-pattern: “we’ll catch deletions someday”

Deletion detection is hard, so it’s often deferred. Then a customer asks why deleted volunteers still appear in their reports — and the answer is “we don’t sync deletions.” Better: build deletion detection from the start, even if it’s just weekly. The infrastructure is the same as full reconciliation; you’re doing it anyway.
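At its core, deletion detection is a set difference between what the source currently lists and what the destination holds. The sketch below assumes both ID lists are already in hand; `findDeletedIds` is a hypothetical helper.

```javascript
// Sketch: deletion detection as a set difference. `sourceIds` comes from a
// complete listing of the source; `destinationIds` from the destination.
function findDeletedIds(sourceIds, destinationIds) {
  const present = new Set(sourceIds);
  return destinationIds.filter((id) => !present.has(id));
}

console.log(findDeletedIds([1, 2, 4], [1, 2, 3, 4, 5])); // prints: [ 3, 5 ]
```

Only trust the result when the source listing completed without errors; a partial listing would make every record it missed look deleted.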

Anti-pattern: “we’ll just re-sync everything when there’s a problem”

For small customers this works. For large customers, “re-sync everything” is hours of API calls and processing. Plan for incremental recovery, not just full reset.

Anti-pattern: “the audit log is just for compliance”

Audit logs become invaluable for debugging. Make them queryable, filterable, and indexed — not just write-only.

Anti-pattern: “if reconciliation finds gaps, auto-fix them”

Sometimes reconciliation finds gaps because polling has a bug. Auto-fixing hides the bug behind reconciliation’s automatic correction. Treat sustained reconciliation gaps as a signal to investigate, not just a checklist item to clear.

Anti-pattern: hardcoded cadences across customers

Different customers have different scales and needs. Hardcoded “every 15 minutes for everything” works initially but scales poorly. Make cadence per-customer-configurable from the start.
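Per-customer configuration does not need to be elaborate. A minimal sketch, assuming defaults with optional overrides (all names and values here are illustrative):

```javascript
// Sketch: default cadences with per-customer overrides, in minutes.
const DEFAULT_CADENCE_MINUTES = { users: 15, projects: 60, groups: 60 };

const customerOverrides = {
  'customer-large': { users: 5 },
  'customer-small': { users: 60, projects: 1440 },
};

function cadenceFor(customerId, resource) {
  const override = customerOverrides[customerId]?.[resource];
  if (override !== undefined) return override;
  if (!(resource in DEFAULT_CADENCE_MINUTES)) {
    throw new Error(`No cadence configured for resource "${resource}"`);
  }
  return DEFAULT_CADENCE_MINUTES[resource];
}
```

For example, `cadenceFor('customer-large', 'users')` returns the override of 5, while a customer with no overrides falls back to the default of 15.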

A maturity model

Where is your integration on the polling reliability maturity model?

Level	Characteristics
1: Working	Polling runs; data flows; occasional gaps tolerated
2: Monitored	Per-customer dashboards exist; obvious failures detected
3: Reconciled	Daily reconciliation catches gaps; deletion detection works
4: Audited	Per-record audit trail; sample-based drift detection; alerts on metrics
5: Self-healing	Reconciliation auto-corrects; failures trigger graduated responses; customer-facing health visibility

Most production integrations land between Level 3 and Level 4. Level 5 is reserved for the highest-stakes integrations (compliance-critical workflows, financial reporting, etc.). For each level, the previous level’s practices are foundational — you can’t skip from Level 1 to Level 4.

Production checklist

For a polling integration at Level 3+:

Checkpoints stored in durable, transactional storage
Checkpoints advanced only after successful per-record processing
Per-record processing operations are idempotent
Dedup keys protect non-idempotent operations
Per-resource cadences tuned to actual freshness needs
Daily incremental reconciliation runs for each resource
Weekly full reconciliation including deletion detection
Per-record audit trail with trace IDs
Per-customer dashboards exposing sync health
Alerts on stalled checkpoints, growing DLQ, elevated error rates, reconciliation gap rates
Cadences are per-customer-configurable
Replay/dry-run tooling exists for debugging
Documented runbook for common failure modes
On-call playbook for the most common alerts

Where to go next

Reconciliation Patterns

The companion page on the reconciliation patterns this page builds on.

Sync Architecture Patterns

The broader architectural patterns these practices fit into.

Error Recovery Patterns

The error-handling patterns that support this reliability model.

API Performance Tips

The performance patterns that keep polling efficient at scale.

