Error Handling and Retry Logic for Production APIs

Published on February 26, 2026

Overview

Your enrichment pipeline processed 4,000 leads flawlessly on Tuesday. On Wednesday, it silently dropped 312. The CRM API returned a 503 at 2:47 AM, your code treated it the same as a 400 bad request, and those leads never got enriched, never got scored, and never made it into a sequence. Nobody noticed until Thursday afternoon when an AE asked why their pipeline had gone dry.

This is what happens when production APIs meet the real world. Networks are unreliable. Services go down for maintenance. Rate limits kick in during peak hours. Authentication tokens expire at 3 AM. Every integration point in your GTM stack is a potential point of failure, and how you handle those failures determines whether your automation pipelines run themselves or require constant babysitting.

This guide covers the error handling and retry patterns that separate production-grade GTM integrations from fragile prototypes. We will walk through error categorization, exponential backoff, circuit breakers, dead letter queues, idempotency keys, and the monitoring infrastructure that ties it all together. If you are building or maintaining workflows that touch CRMs, enrichment providers, sequencers, or AI endpoints, these are the patterns that keep data flowing when things inevitably break.

Error Categorization: The Foundation of Smart Retry Logic

The single most important decision in error handling is whether to retry. Retrying a transient network timeout makes sense. Retrying a malformed request with bad data is a waste of API quota and time. Yet most GTM integrations treat all errors the same: either retry everything or fail on everything. Both approaches are wrong.

Transient Errors: Retry These

Transient errors are temporary failures caused by conditions that will likely resolve on their own. Your code is fine. Your data is fine. The infrastructure between you and the API had a bad moment.

  • HTTP 429 (Too Many Requests): You hit a rate limit. Wait and try again. This is the most common transient error in GTM stacks, especially when coordinating multiple enrichment providers and API quotas.
  • HTTP 500 (Internal Server Error): The server failed to process your valid request. Usually a bug on their side or a temporary overload.
  • HTTP 502/503/504 (Gateway/Unavailable/Timeout): The service is temporarily down or overloaded. These typically resolve within minutes.
  • Connection timeouts and resets: Network-level failures that have nothing to do with your request or the API's business logic.
  • DNS resolution failures: Temporary DNS issues that resolve on retry.

Permanent Errors: Do Not Retry These

Permanent errors indicate a fundamental problem with the request itself. Retrying will produce the same failure every time, wasting your API quota and delaying processing of records that would succeed.

  • HTTP 400 (Bad Request): Your request payload is malformed. Fix the data, do not retry the same broken request.
  • HTTP 401 (Unauthorized): Your credentials are invalid or expired. Retrying with the same token will fail forever. You need to refresh the token or rotate credentials first.
  • HTTP 403 (Forbidden): You lack permission for this resource. This is a configuration issue, not a transient one.
  • HTTP 404 (Not Found): The resource does not exist. Retrying will not create it.
  • HTTP 422 (Unprocessable Entity): The request structure is valid but the data fails business validation. A duplicate email, a required field missing, a value outside accepted range.

The Gray Zone: Context-Dependent Errors

Some errors require judgment based on context:

| Error | Retry? | Reasoning |
| --- | --- | --- |
| HTTP 401 Unauthorized | Once, after token refresh | If you have a refresh token, attempt one token refresh, then retry. If the refresh also fails, it is a permanent error. |
| HTTP 409 Conflict | Depends | A CRM duplicate-detection error is permanent (the record exists). A concurrent modification conflict may resolve on retry. |
| HTTP 500 with specific error body | Read the body | Some APIs return 500 for data validation failures. If the error message says "invalid field value," it is functionally a 400. |
| Timeout after 30 seconds | Yes, with caution | The request may have been processed. You need idempotency (covered below) to prevent duplicates. |

Build an Error Classification Map

For each API in your stack, document the specific error codes and messages you encounter and classify each as transient or permanent. HubSpot's 429 errors include a Retry-After header. Salesforce returns different 500 errors for server issues versus data problems. Outreach has specific error codes for prospect deduplication. This map becomes your retry logic's decision engine and saves hours of debugging when something new surfaces.
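
To make the map concrete, it can start as a small lookup table. Here is a minimal Python sketch; the provider names and per-code overrides are illustrative examples of the kind of rules you would document, not taken from vendor documentation:

```python
# Default classification by HTTP status, per the categories above.
TRANSIENT_STATUSES = {429, 500, 502, 503, 504}
PERMANENT_STATUSES = {400, 401, 403, 404, 422}

# Per-API overrides for codes that need context to classify correctly.
# These entries are illustrative, not real vendor behavior specs.
API_OVERRIDES = {
    "salesforce": {500: "read_body"},  # a 500 may mask a data error
}

def classify(api: str, status: int) -> str:
    """Return 'transient', 'permanent', or 'read_body' for an error status."""
    override = API_OVERRIDES.get(api, {})
    if status in override:
        return override[status]
    if status in TRANSIENT_STATUSES:
        return "transient"
    if status in PERMANENT_STATUSES:
        return "permanent"
    return "read_body"  # unknown codes need a human decision
```

The "read_body" outcome is the gray zone from the table above: the status alone is not enough, and the response body decides the classification.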

Exponential Backoff: Retry Without Making Things Worse

Once you have decided an error is retryable, the question becomes when to retry. The naive approach is to retry immediately. The result is predictable: if the API is overloaded and returning 503s, hammering it with immediate retries makes the problem worse for everyone, including you.

The Core Pattern

Exponential backoff increases the wait time between each retry attempt. The standard formula is:

wait_time = min(base_delay * (2 ^ attempt) + random_jitter, max_delay)

With a 1-second base delay, this produces:

| Attempt | Base Wait | With Jitter (typical) | Cumulative Time |
| --- | --- | --- | --- |
| 1st retry | 2s | 2.0-3.0s | ~2.5s |
| 2nd retry | 4s | 4.0-5.0s | ~7s |
| 3rd retry | 8s | 8.0-9.0s | ~15s |
| 4th retry | 16s | 16.0-17.0s | ~32s |
| 5th retry | 32s | 32.0-33.0s | ~65s |
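
The formula translates to a few lines of code. Here is a minimal Python sketch using the full-jitter variant discussed below; the exception-based signaling and the zero-argument `call` wrapper are assumptions for the sketch, not a specific client library:

```python
import random
import time

def backoff_delay(attempt: int, base: float = 1.0, max_delay: float = 60.0) -> float:
    """Full-jitter backoff: a random wait in [0, min(base * 2^attempt, max_delay)]."""
    return random.uniform(0.0, min(base * (2 ** attempt), max_delay))

def retry_with_backoff(call, max_attempts: int = 5, base: float = 1.0):
    """Run `call`, retrying on exceptions with exponential backoff.

    In real code you would only retry exceptions classified as transient;
    this sketch retries everything for brevity.
    """
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # retries exhausted: caller routes the request to the DLQ
            time.sleep(backoff_delay(attempt, base))
```

Note the cap in `backoff_delay`: without `max_delay`, attempt 10 alone would wait up to 17 minutes with a 1-second base.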

Why Jitter Is Not Optional

Jitter (adding a random delay on top of the calculated wait) prevents the thundering herd problem. Imagine your enrichment pipeline processes 500 leads through a Clay table. They all hit the downstream API's rate limit at the same time and all get a 429. Without jitter, every single request retries at exactly the same moment, immediately triggering the rate limit again. With jitter, retries spread across a time window, and most succeed on the first retry attempt.

Use full jitter for best results: wait_time = random(0, base_delay * 2 ^ attempt). This produces a wider spread than adding jitter on top of the base calculation and works better under high concurrency.

Respect Retry-After Headers

Many APIs include a Retry-After header in their 429 and 503 responses. This tells you exactly when the server is ready to accept requests again. Always check for this header before falling back to your calculated backoff. HubSpot, Salesforce, and most major platforms provide this, and it is more accurate than any formula you can write because the server knows its own state.
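
Per the HTTP spec, Retry-After can be either delta-seconds or an HTTP-date, so a robust parser handles both and falls back to your calculated backoff. A Python sketch:

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

def retry_after_seconds(header_value, fallback: float) -> float:
    """Parse a Retry-After header (seconds or HTTP-date).

    Returns `fallback` (your calculated backoff) when the header is
    absent or unparseable.
    """
    if not header_value:
        return fallback
    try:
        return max(0.0, float(header_value))  # delta-seconds form, e.g. "30"
    except ValueError:
        pass
    try:
        when = parsedate_to_datetime(header_value)  # HTTP-date form
        return max(0.0, (when - datetime.now(timezone.utc)).total_seconds())
    except (TypeError, ValueError):
        return fallback
```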

Set Maximums

Always cap both the maximum number of retries (5-7 for most GTM use cases) and the maximum wait time (30-60 seconds). Without caps, exponential backoff can produce absurd wait times: the 10th retry would wait over 17 minutes with a 1-second base. If an API is still failing after 5 attempts with exponential backoff, the problem is not transient, and the request should move to a dead letter queue for investigation.

Per-Platform Tuning

Different APIs warrant different backoff configurations. The right settings depend on the platform's rate limit model and your pipeline's latency tolerance:

  • CRM APIs (HubSpot, Salesforce): Conservative backoff with 2-second base delay. CRM updates are rarely time-critical to the second, and burning through your daily quota on retries starves other workflows. See our guide on rate limiting strategies for GTM engineers for quota budgeting approaches.
  • Enrichment APIs (Clay providers, Apollo, ZoomInfo): Standard backoff with 1-second base. Enrichment can tolerate latency but watch your per-minute limits carefully.
  • Sequencer APIs (Outreach, Salesloft): Aggressive retry with 500ms base delay. Sequence enrollment timing matters. If a prospect should enter a sequence now, a 60-second delay is acceptable. A 10-minute delay means missed timing windows.
  • Real-time webhooks: Minimal backoff (200ms base) with very few retries (2-3 max). If a webhook-triggered outbound action cannot complete quickly, queue it for async processing rather than blocking the webhook response.

The Circuit Breaker Pattern: Stop Hitting a Dead API

Exponential backoff handles individual request failures. Circuit breakers handle systemic failures. When an API is genuinely down, not just slow, continuing to send requests (even with backoff) is wasteful. You burn API quota, block your processing queue, and delay everything else waiting behind the failing requests.

How Circuit Breakers Work

A circuit breaker tracks the health of each API endpoint and operates in three states:

1. Closed (Normal Operation): Requests flow through normally. The circuit breaker monitors the failure rate. If failures exceed a threshold (for example, 50% of the last 20 requests fail), the breaker trips open.
2. Open (API Down): All requests to this endpoint fail immediately without actually being sent. This is the key benefit: instead of waiting for a timeout on every request, the breaker short-circuits the call. Queued work for this endpoint is paused, and other endpoints continue processing normally.
3. Half-Open (Testing Recovery): After a configured cooldown period (typically 30-60 seconds), the breaker allows a single test request through. If it succeeds, the breaker closes and normal processing resumes. If it fails, the breaker reopens for another cooldown cycle.
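
The three states can be sketched as a small class. This is a minimal, single-threaded Python version; the injectable clock exists only so the cooldown can be tested without real waiting, and production code would need locking:

```python
import time

class CircuitBreaker:
    """Minimal per-endpoint circuit breaker sketch (not thread-safe)."""

    def __init__(self, failure_threshold: float = 0.5, window: int = 20,
                 cooldown: float = 30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.window = window        # number of recent requests to track
        self.cooldown = cooldown    # open-state duration in seconds
        self.clock = clock
        self.results = []           # True = success, False = failure
        self.state = "closed"
        self.opened_at = 0.0

    def allow_request(self) -> bool:
        """Call before sending; False means short-circuit to the retry queue."""
        if self.state == "open":
            if self.clock() - self.opened_at >= self.cooldown:
                self.state = "half_open"  # let a test request through
                return True
            return False
        return True  # closed or half_open

    def record(self, success: bool) -> None:
        """Call after every response to update breaker health."""
        if self.state == "half_open":
            # The test request decides: recover fully or reopen.
            self.state = "closed" if success else "open"
            if not success:
                self.opened_at = self.clock()
            self.results = []
            return
        self.results = (self.results + [success])[-self.window:]
        failures = self.results.count(False)
        if (len(self.results) >= self.window
                and failures / len(self.results) >= self.failure_threshold):
            self.state = "open"
            self.opened_at = self.clock()
```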

Why This Matters for GTM Pipelines

Consider a typical enrichment and routing workflow: Clay enriches a lead, then your pipeline scores it, writes it to HubSpot, and enrolls it in an Outreach sequence. If HubSpot is down, without a circuit breaker every lead sits in the pipeline waiting for the HubSpot write to eventually time out. With a circuit breaker, the HubSpot write fails instantly, the lead routes to a retry queue for later CRM sync, and the scoring and sequence enrollment continue on schedule.

This is especially important when coordinating Clay, CRM, and sequencer in one flow. A failure in one system should not block progress in the others.

Configuration Guidelines

| Parameter | Recommended Setting | Why |
| --- | --- | --- |
| Failure threshold | 50% over last 10-20 requests | Tolerates occasional errors without tripping on one-off failures |
| Open state duration | 30-60 seconds | Long enough for most transient outages to resolve; short enough to resume quickly |
| Half-open test count | 1-3 requests | Enough to confirm recovery without flooding a fragile service |
| Monitoring window | Last 60 seconds of requests | Recent enough to be relevant; long enough to smooth out noise |

Circuit Breakers Per Endpoint, Not Per Service

A service might have one endpoint down while others work fine. Salesforce's bulk API could be struggling while the REST API is responsive. Configure circuit breakers at the endpoint level, not the service level, to avoid unnecessarily blocking healthy operations.

Dead Letter Queues: Where Failed Records Go to Wait, Not Die

Exponential backoff and circuit breakers handle the happy path of transient failures that eventually resolve. Dead letter queues (DLQs) handle the unhappy path: requests that exhaust all retries and still have not succeeded. Without a DLQ, these records simply vanish. With one, they are preserved for investigation and reprocessing.

Why Every GTM Pipeline Needs a DLQ

In a GTM context, a "lost" record is not an abstract system concern. It is a real lead that never got enriched, a deal update that never synced, or a sequence enrollment that never happened. When your pipeline maintenance routine only catches failures you know about, the records that silently disappear are the ones that hurt the most.

A DLQ captures these records along with the metadata needed to diagnose and fix the problem:

  • The original request payload: Exactly what was sent to the API.
  • The sequence of errors: Every error response from every retry attempt.
  • Timestamps: When the first attempt was made and when retries were exhausted.
  • Context: Which workflow generated this request, which lead or account it relates to, and what downstream actions depend on its success.

DLQ Processing Workflow

A DLQ is only useful if someone actually processes it. Build a workflow around it:

1. Alert on new DLQ entries: Trigger a Slack notification or email when the DLQ depth exceeds zero. Do not wait for it to grow. A single DLQ entry might indicate a systematic problem that affects every subsequent request.
2. Categorize failures: Group DLQ entries by error type. Ten entries with "401 Unauthorized" means a credential problem. Ten entries with "422 field_required: company_name" means a data quality issue upstream.
3. Fix the root cause: Do not just replay the DLQ. Fix the underlying issue first. Refresh the expired token, add the missing field mapping, or adjust the payload format.
4. Replay and verify: Reprocess the DLQ entries through the fixed pipeline and verify each one succeeds. Most queue systems (SQS, BullMQ, even a Postgres table) support replaying messages from the DLQ back to the main queue.

Implementation Approaches

For most GTM engineering teams, a DLQ does not require complex infrastructure:

  • Postgres table: A simple dead_letter_queue table with columns for payload, error history, created_at, and status works for low-to-medium volume. Query it, fix problems, update the status to "requeued."
  • AWS SQS DLQ: If you are already using SQS for your processing queue, SQS has native DLQ support. Messages that exceed the maximum receive count automatically move to a designated DLQ.
  • Google Sheet (seriously): For teams not yet running custom infrastructure, a Google Sheet that captures failed records via a webhook endpoint can serve as a lightweight DLQ. It is searchable, shareable, and requires zero ops overhead.
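
To make the database-table option concrete, here is a Python sketch using the stdlib sqlite3 module so it runs anywhere; in production you would point equivalent SQL at Postgres via a driver like psycopg. The column and status names are illustrative:

```python
import json
import sqlite3
from datetime import datetime, timezone

def open_dlq(path: str = ":memory:") -> sqlite3.Connection:
    """Create (if needed) and open a dead_letter_queue table."""
    conn = sqlite3.connect(path)
    conn.execute("""
        CREATE TABLE IF NOT EXISTS dead_letter_queue (
            id INTEGER PRIMARY KEY,
            payload TEXT NOT NULL,        -- original request body (JSON)
            error_history TEXT NOT NULL,  -- JSON list of per-attempt errors
            created_at TEXT NOT NULL,
            status TEXT NOT NULL DEFAULT 'failed'  -- failed | requeued | resolved
        )""")
    return conn

def record_failure(conn, payload: dict, errors: list) -> int:
    """Capture a request that exhausted all retries; returns the entry id."""
    cur = conn.execute(
        "INSERT INTO dead_letter_queue (payload, error_history, created_at, status)"
        " VALUES (?, ?, ?, 'failed')",
        (json.dumps(payload), json.dumps(errors),
         datetime.now(timezone.utc).isoformat()))
    conn.commit()
    return cur.lastrowid

def requeue(conn, entry_id: int) -> dict:
    """Mark an entry requeued and return its payload for reprocessing."""
    row = conn.execute(
        "SELECT payload FROM dead_letter_queue WHERE id = ?",
        (entry_id,)).fetchone()
    conn.execute(
        "UPDATE dead_letter_queue SET status = 'requeued' WHERE id = ?",
        (entry_id,))
    conn.commit()
    return json.loads(row[0])
```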

Idempotency Keys: Retry Safely Without Creating Duplicates

Here is the nightmare scenario: your pipeline sends a request to create a contact in HubSpot. The request times out after 30 seconds. Did HubSpot create the contact before the connection dropped, or not? You do not know. If you retry and the contact was already created, you now have a duplicate. If you do not retry and it was not created, you have a lost lead.

What Idempotency Keys Do

An idempotency key is a unique identifier attached to a request that tells the API: "If you have already processed a request with this key, return the original result instead of processing it again." This makes retries safe. You can send the same request five times, and the API will only act on it once.

Most major APIs support some form of idempotency:

  • Stripe: The gold standard. Pass an Idempotency-Key header, and Stripe caches the response for 24 hours.
  • HubSpot: Batch operations include deduplication based on record identifiers. For single creates, use the objectId or a unique property as a natural idempotency key.
  • Salesforce: The External ID field on any object serves as a natural idempotency key for upsert operations.

Implementing Client-Side Idempotency

When an API does not natively support idempotency keys, you need to implement it on your side. The pattern is straightforward:

1. Generate a deterministic key: Before sending a request, generate a unique key based on the operation's essential properties. For a contact creation, this might be a hash of email + source + timestamp. For a deal update, it could be deal_id + field_name + new_value.
2. Check before sending: Query your processed-operations store (a database table or Redis set) for this key. If it exists, the operation was already completed. Skip it.
3. Record after success: After the API confirms the operation succeeded, record the idempotency key with a TTL (time-to-live) appropriate for your use case. 24-72 hours covers most retry windows.
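
The three steps can be sketched as a small store. This in-memory Python version stands in for the Redis set or database table mentioned above; the injectable clock is for testing, and all names are illustrative:

```python
import hashlib
import time

class IdempotencyStore:
    """In-memory stand-in for a Redis set or DB table; keys expire after a TTL."""

    def __init__(self, ttl_seconds: float = 72 * 3600, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._seen = {}  # key -> time it was recorded

    def key_for(self, operation: str, **fields) -> str:
        """Deterministic key from the operation's essential properties.

        Sorting the fields makes the key independent of argument order.
        """
        canonical = operation + "|" + "|".join(
            f"{k}={fields[k]}" for k in sorted(fields))
        return hashlib.sha256(canonical.encode()).hexdigest()

    def already_done(self, key: str) -> bool:
        """Step 2: check before sending."""
        recorded = self._seen.get(key)
        if recorded is None:
            return False
        if self.clock() - recorded > self.ttl:
            del self._seen[key]  # expired: outside the retry window
            return False
        return True

    def mark_done(self, key: str) -> None:
        """Step 3: record only after the API confirms success."""
        self._seen[key] = self.clock()
```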

Idempotency in Multi-Step Pipelines

GTM pipelines are rarely a single API call. A lead goes through enrichment, scoring, CRM write, and sequence enrollment. Each step needs its own idempotency consideration. The lead might successfully write to the CRM but fail on sequence enrollment. When you retry the pipeline for that lead, the CRM write needs to be idempotent (upsert, not insert) while the sequence enrollment needs to proceed.

Track idempotency at the step level, not the pipeline level. Each step gets its own key and its own completion record. This lets you resume a pipeline from the exact point of failure rather than reprocessing successfully completed steps.

The Webhook Duplicate Problem

Idempotency is especially critical for webhook processing. Webhook providers retry on timeout, network errors, or non-2xx responses. Your endpoint might receive the same event 3-5 times. Without idempotency, a "deal closed" webhook processed multiple times could trigger duplicate commission calculations, duplicate onboarding emails, and duplicate Slack notifications. Always deduplicate incoming webhooks using the event ID provided by the sender.
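
The dedup logic itself is small. A Python sketch; the `id` field name varies by provider, and in production the seen-set would live in Redis or a database with a TTL rather than process memory:

```python
_seen_events = set()  # in production: a Redis set with a TTL

def handle_webhook(event: dict, process) -> bool:
    """Process a webhook event exactly once, keyed on the sender's event ID.

    Returns True if processed, False if it was a duplicate delivery.
    `process` is your actual handler. The event is marked seen only
    after `process` succeeds, so a crash mid-handling still gets retried.
    """
    event_id = event["id"]
    if event_id in _seen_events:
        return False  # duplicate delivery: acknowledge, do nothing
    process(event)
    _seen_events.add(event_id)
    return True
```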

Monitoring and Alerting: See Failures Before They Compound

The patterns above prevent data loss and handle failures gracefully. Monitoring tells you that failures are happening in the first place. Without it, your beautifully engineered retry logic and DLQs operate in the dark, and you only discover problems when the business impact becomes visible.

Key Metrics for API Error Monitoring

Not every metric matters equally. Focus on these for GTM pipeline observability:

| Metric | What It Reveals | Warning Threshold | Critical Threshold |
| --- | --- | --- | --- |
| Error rate by endpoint | Which APIs are failing and how often | > 2% of requests | > 10% of requests |
| Retry rate | How often your backoff logic is activating | > 5% of requests need retries | > 20% of requests need retries |
| Circuit breaker state changes | API outages and recoveries | Any open event | Open for > 5 minutes |
| DLQ depth | Records that exhausted all retries | > 0 entries | > 50 entries |
| P95 response latency | API slowdowns before they become timeouts | > 3x normal | > 10x normal |
| Successful throughput | How many records are actually processing end-to-end | > 10% drop from baseline | > 30% drop from baseline |

Alerting That Does Not Create Noise

Alert fatigue kills monitoring effectiveness. If your Slack channel sends a message for every 429 response, the team mutes the channel within a week. Instead, design tiered alerts:

  • Informational (logged, not alerted): Individual retries, transient 429s that resolve on first backoff, circuit breaker state transitions that recover within 30 seconds.
  • Warning (Slack message): Error rates sustained above threshold for 5+ minutes, DLQ entries, circuit breaker open for more than 60 seconds.
  • Critical (page someone): DLQ depth growing rapidly, all retries exhausting on a critical endpoint, authentication failures that require manual credential rotation.

Build an Error Dashboard

Even a simple dashboard pays for itself immediately. Track error rates per API per hour, overlay them with throughput, and you will spot patterns that point alerting would miss: a gradual degradation that never crosses a threshold but indicates an impending failure, or a time-of-day correlation that suggests scheduling changes.

If you are running monitoring and alerting for AI-powered pipelines, extend your existing dashboards to include API health. The same Datadog, Grafana, or even Google Sheet infrastructure that tracks pipeline throughput can track error rates with minimal additional setup.

Correlate Errors with Business Impact

The most effective error dashboards translate technical failures into business metrics. "We had a 15% error rate on HubSpot writes between 2-4 AM" is a technical fact. "47 leads from yesterday's campaign were not added to CRM and missed their sequence enrollment window" is a business impact that drives prioritization. Connect your error monitoring to your analytics pipeline to close this gap.

Putting It All Together: A Production Error Handling Architecture

Individual patterns are useful. The real value comes from combining them into a cohesive architecture. Here is how the pieces fit together for a typical GTM integration pipeline.

Request Lifecycle

Every outbound API request in your pipeline should follow this flow:

1. Check circuit breaker state: Before sending the request, check if the circuit breaker for this endpoint is open. If it is, skip the request and route it to the retry queue immediately.
2. Check idempotency: Look up the idempotency key. If this exact operation was already completed successfully, return the cached result and move on.
3. Send the request: Make the API call with appropriate timeout settings.
4. Classify the response: Success, transient error, or permanent error.
5. On success: Record the idempotency key. Update the circuit breaker health metrics. Proceed to the next pipeline step.
6. On transient error: Apply exponential backoff with jitter. Respect Retry-After headers. Update circuit breaker failure count. After max retries, send to DLQ.
7. On permanent error: Log the error with full context. Send directly to DLQ (no retries). Alert if the error type is unexpected.
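
The steps above can be wired together in one function. In this Python sketch every dependency is injected as a callable so the lifecycle logic stays visible; all names and signatures are illustrative, not a real library:

```python
import time

def send_reliably(request: dict, *, breaker_open, idem_seen, idem_record,
                  send, classify, dlq, backoff, max_retries: int = 5,
                  sleep=time.sleep):
    """One request through the lifecycle. Injected dependencies (all assumed):
    breaker_open() -> bool; idem_seen(key) -> cached result or None;
    idem_record(key, result); send(request) -> response;
    classify(response) -> 'success' | 'transient' | 'permanent';
    dlq(request, errors); backoff(attempt) -> delay in seconds.
    """
    if breaker_open():                                  # 1. breaker check
        dlq(request, ["circuit_open"])                  #    (or a retry queue)
        return None
    cached = idem_seen(request["idempotency_key"])      # 2. idempotency check
    if cached is not None:
        return cached
    errors = []
    for attempt in range(max_retries):
        response = send(request)                        # 3. make the call
        outcome = classify(response)                    # 4. classify
        if outcome == "success":                        # 5. record, proceed
            idem_record(request["idempotency_key"], response)
            return response
        errors.append(response)
        if outcome == "permanent":                      # 7. straight to DLQ
            dlq(request, errors)
            return None
        sleep(backoff(attempt))                         # 6. back off, retry
    dlq(request, errors)                                # retries exhausted
    return None
```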

Common Architecture Mistakes

Even with all the right patterns, implementation details trip teams up:

  • Retry loops without deduplication: Your retry logic queues a failed request. The queue consumer picks it up and retries. The retry fails and re-queues. Without tracking retry count per request, you can create infinite retry loops that consume all your processing capacity.
  • Shared retry queues across unrelated pipelines: A flood of failures from one API buries retry attempts from another. Use separate retry queues per pipeline or per target API so failures in one system do not starve others.
  • DLQ without alerting: A DLQ that nobody checks is a records graveyard, not a recovery mechanism. Alert on every entry. Make DLQ review part of your daily pipeline maintenance.
  • Hardcoded retry configurations: Different APIs, different times of day, and different workflow priorities all warrant different retry behavior. Make backoff parameters, retry counts, and circuit breaker thresholds configurable per integration.

Beyond Individual Error Handlers

The patterns in this guide work well when you are managing error handling for two or three API integrations. But a real GTM stack does not have two or three integrations. It has Clay pulling from half a dozen enrichment providers, a CRM that every workflow touches, a sequencer that needs reliable enrollment, a data warehouse for analytics, and a growing list of AI endpoints for scoring and personalization. Each integration needs its own error classification, its own backoff tuning, its own circuit breaker, and its own DLQ processing.

At this scale, the error handling layer itself becomes the problem. You are not just writing business logic anymore. You are maintaining a distributed systems infrastructure across dozens of API connections, each with its own failure modes, its own rate limits, and its own retry semantics. Every new tool added to the stack multiplies the surface area. The team that was building pipeline automation is now spending half its time on plumbing.

What you need is a coordination layer that handles this complexity centrally: one system that understands the health and rate limits of every API in your stack, manages retry logic and circuit breakers across all of them, maintains idempotency for every operation, and routes failures to a unified DLQ with the context needed for fast diagnosis. Instead of building error handling into every individual integration, you build it once at the orchestration layer.

This is what platforms like Octave are designed to handle. Octave sits between your GTM workflows and the downstream APIs they depend on, providing a unified reliability layer that handles error classification, retry orchestration, and failure recovery across your entire stack. For teams running high-volume automated outbound, it means your enrichment data, CRM writes, and sequence enrollments all flow through infrastructure that already knows how to handle every failure mode, so your team can focus on the GTM logic that actually generates pipeline.

FAQ

How many retry attempts should I allow before sending a request to the dead letter queue?

Five to seven retries with exponential backoff covers most transient failures. With a 1-second base delay, five retries means you have waited roughly 60 seconds total. If the API is still failing after that, the issue is unlikely to resolve in the next few minutes. For rate-limit-specific errors where the Retry-After header indicates a longer wait, you might allow additional retries with longer delays, but cap the total retry window at 10-15 minutes for most GTM workflows.

Should I build error handling logic into my automation platform (Make, n8n) or in custom code?

Both platforms have built-in error handling, but it tends to be basic. Make's error handler routes and n8n's try/catch nodes work well for simple retry-on-failure logic. For production pipelines that need circuit breakers, idempotency, DLQs, and per-endpoint backoff tuning, you will likely need custom code, either as a wrapper around your API calls or as a dedicated middleware layer. Many teams use their automation platform for workflow orchestration and custom code for the API client layer.

How do I handle errors in batch API operations where some records succeed and others fail?

Partial batch failures require per-record error handling. Parse the batch response to identify which records succeeded and which failed. Log successful records, route failed records to your retry queue individually (not as a re-batch), and classify each failure independently. A batch of 100 records might have 97 successes, 2 rate limit errors (retry), and 1 validation error (DLQ). Treating them all the same wastes either time or data.
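
The per-record split can be sketched in a few lines of Python; the `classify` callable stands in for your error classification map:

```python
def split_batch_response(records, statuses, classify):
    """Route each record in a batch response by its own status code.

    `classify(status)` returns 'success', 'transient', or 'permanent'.
    Returns (succeeded, retry_individually, dead_letter) lists.
    """
    succeeded, retry, dead = [], [], []
    for record, status in zip(records, statuses):
        bucket = classify(status)
        if bucket == "success":
            succeeded.append(record)
        elif bucket == "transient":
            retry.append(record)  # re-queue individually, not as a re-batch
        else:
            dead.append(record)   # validation failure: straight to the DLQ
    return succeeded, retry, dead
```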

What is the difference between a retry queue and a dead letter queue?

A retry queue holds requests that failed but are expected to succeed on a subsequent attempt. Items move back to the main processing queue after a delay. A dead letter queue holds requests that have exhausted all retry attempts and require manual investigation. The retry queue is automated recovery. The DLQ is a last resort that preserves data that would otherwise be lost. Every production pipeline should have both.

How do I test error handling without waiting for real API failures?

Three approaches: First, use mock servers that return configurable error responses. Tools like WireMock or Mockoon let you simulate 429s, 500s, and timeouts on demand. Second, inject artificial delays and failures in your API client layer using feature flags (sometimes called chaos engineering lite). Third, test against sandbox environments during known maintenance windows. The goal is to verify that your retry logic, circuit breakers, and DLQs all function correctly before a real failure hits production at 3 AM.

Should I retry on HTTP 500 errors or treat them as permanent failures?

Retry, but with caution. Most 500 errors are transient: a brief server overload, a database timeout, a deployment in progress. However, some APIs return 500 for what should be 400-level errors (bad data, unsupported operations). Check the response body for clues. If the error message references your input data, treat it as permanent. If it is a generic server error, retry with exponential backoff. Track 500 error patterns per API over time to refine your classification.

Conclusion

Production error handling is not a feature you ship once. It is an operational discipline that evolves as your stack grows, your volume increases, and new failure modes emerge. The patterns covered here (error categorization, exponential backoff, circuit breakers, dead letter queues, idempotency keys, and monitoring) form a layered defense that keeps your GTM data flowing when the underlying infrastructure is anything but reliable.

Start with error categorization and exponential backoff. These two patterns alone eliminate the majority of silent data loss in GTM pipelines. Add idempotency keys to protect against duplicate processing, which matters most for CRM writes and sequence enrollments. Build out circuit breakers when you have multiple integrations competing for processing capacity. Set up monitoring from the beginning, even if it is just a Slack alert on DLQ entries.

The teams that invest in this infrastructure early are the ones whose automated pipelines actually run hands-off. Everyone else discovers the gaps the hard way: three days after the failure, when the damage has already compounded and recovery means manually reprocessing hundreds of records. Build the error handling layer now, and future you will appreciate the quiet mornings.
