The GTM Engineer's Guide to Record Matching

Published on March 17, 2026

Overview

Record matching is the foundation of every data operation in a GTM stack. Every time a lead comes in through a form, a record syncs from an enrichment platform, or a CSV gets imported, your system needs to answer a deceptively simple question: does this entity already exist in our database? Get it wrong in one direction and you create duplicates that fragment engagement history and inflate pipeline. Get it wrong in the other direction and you merge records that should have stayed separate, corrupting the data for two different people or companies.

GTM Engineers sit at the intersection of this problem. You are building the matching logic that determines whether a new HubSpot form submission updates an existing contact or creates a new one, whether a Clay enrichment sync matches to the right Salesforce account, and whether two records from different systems actually represent the same person. This guide covers the matching techniques, implementation patterns, and failure modes you need to build reliable record matching across your GTM stack.

Deterministic vs. Probabilistic Matching

Every matching approach falls somewhere on a spectrum between two paradigms. Understanding the tradeoffs is critical for choosing the right strategy for each use case in your stack.

Deterministic Matching

Deterministic matching requires exact agreement on one or more fields. If two records share the same email address, they match. Period. There is no scoring, no thresholds, and no ambiguity.

Strengths:

  • Zero false positives when the match key is truly unique (email, LinkedIn URL, phone number)
  • Fast to compute — simple equality checks
  • Easy to explain and audit — "these records match because they share email X"
  • No tuning required — the logic is binary

Weaknesses:

  • Misses duplicates with even minor variations ("john@acme.com" vs "john.doe@acme.com")
  • Requires a strong unique identifier to be present on both records, which is not always the case
  • Cannot match records across systems that do not share a common identifier

Probabilistic Matching

Probabilistic matching assigns a similarity score based on multiple fields and declares a match when the score exceeds a threshold. Two records might match on company domain (high signal), have similar first names (moderate signal), and share a city (low signal) — the combination produces a confidence score that determines whether the match is accepted.

Strengths:

  • Catches matches that deterministic methods miss
  • Handles data quality issues (typos, abbreviations, missing fields)
  • Can match across systems without shared identifiers

Weaknesses:

  • Introduces false positives — requires careful threshold tuning
  • Harder to explain and audit — "why did these records match?"
  • Computationally more expensive, especially at scale

Use Both in Layers

The best matching systems use deterministic matching first to handle the easy, high-confidence matches, then apply probabilistic matching to the remaining unmatched records. This layered approach gives you the speed and precision of deterministic matching where it works, with the recall of probabilistic matching where it is needed. Most GTM teams find that deterministic matching on email address alone resolves 60-70% of their matching needs, with probabilistic methods handling the remaining 30-40%.
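
As a concrete sketch, the layered approach might look like the following. The field names and the difflib-based similarity helper are illustrative stand-ins; production systems typically swap in a dedicated fuzzy-matching library.

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Rough string similarity in [0, 1]; a stand-in for a real fuzzy metric."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def match_record(incoming: dict, existing: list, threshold: float = 0.85):
    """Layered matching: exact email first, fuzzy name fallback within a domain."""
    # Layer 1: deterministic. Exact equality on the normalized email address.
    email = (incoming.get("email") or "").strip().lower()
    for record in existing:
        if email and (record.get("email") or "").strip().lower() == email:
            return record, "deterministic"
    # Layer 2: probabilistic. Score remaining candidates on name similarity,
    # restricted to the same email domain, and accept above the threshold.
    domain = email.split("@")[-1] if "@" in email else None
    best, best_score = None, 0.0
    for record in existing:
        rec_email = (record.get("email") or "").lower()
        rec_domain = rec_email.split("@")[-1] if "@" in rec_email else None
        if domain and rec_domain != domain:
            continue  # cheap blocking: only compare within the same domain
        score = similarity(incoming.get("name", ""), record.get("name", ""))
        if score > best_score:
            best, best_score = record, score
    if best is not None and best_score >= threshold:
        return best, "probabilistic"
    return None, "no-match"
```

The deterministic layer short-circuits before any scoring runs, which is what keeps the layered design cheap for the majority of records.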

Fuzzy Matching Techniques for GTM Data

When deterministic matching fails, fuzzy matching picks up the slack. The key is choosing the right algorithm for the right data type — a technique that works well for person names may perform poorly on company names, and vice versa.

String Similarity Algorithms

| Algorithm | Best For | Limitation | Example Match |
| --- | --- | --- | --- |
| Levenshtein distance | Catching typos, minor spelling variations | Struggles with reordered words or abbreviations | "Jon Smith" / "John Smith" (distance: 1) |
| Jaro-Winkler | Person names (prefix-weighted) | Less effective for company names with common prefixes | "Martha" / "Marhta" (score: 0.96) |
| Token sort ratio | Company names with word reordering | Ignores word importance | "Acme Corp Inc" / "Inc Acme Corp" (score: 100) |
| Token set ratio | Company names with extra/missing words | Can over-match when tokens are common | "Acme Corp" / "Acme Corporation Ltd" (score: 100) |
| Soundex / Metaphone | Name pronunciation variations | Only works for phonetically similar strings | "Stephen" / "Steven" (same code) |
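
For reference, the first algorithm in the table, Levenshtein distance, fits in a few lines of dynamic programming. In practice you would reach for a library such as rapidfuzz, which also provides Jaro-Winkler and the token ratios; this sketch just makes the mechanics concrete.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic edit distance: minimum inserts/deletes/substitutions to turn a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution (free if chars match)
            ))
        prev = curr
    return prev[-1]
```

Note the cost: the algorithm is O(len(a) × len(b)) per pair, which is one reason blocking (covered below in Implementation Patterns) matters at scale.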

Field-Specific Matching Rules

Different fields require different matching approaches. A generic similarity algorithm applied uniformly across all fields produces poor results.

  • Email addresses: Normalize by lowercasing, stripping plus-suffixes (user+tag@domain.com becomes user@domain.com), and comparing on the local part + domain separately. Two emails with the same domain and a Jaro-Winkler score above 0.85 on the local part are likely the same person.
  • Company names: Strip legal suffixes (Inc, LLC, Ltd, Corp, GmbH), normalize whitespace and punctuation, then use token set ratio. "The Acme Corporation, Inc." and "Acme Corp" should produce a high match score after normalization.
  • Person names: Handle name order variations (first-last vs last-first), honorifics (Mr., Dr.), and middle names/initials. "J. Robert Smith" and "Robert Smith" should match. Use Jaro-Winkler for the individual name components.
  • Phone numbers: Normalize to E.164 format before comparison. Strip formatting characters. Match on the last 10 digits if country code is ambiguous.
  • Addresses: Use a dedicated address parsing library to extract components (street, city, state, zip) and compare them individually. "123 Main St, Suite 200" and "123 Main Street #200" should match.
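
The email and company-name rules above can be sketched as normalization helpers. The legal-suffix list here is illustrative and should be extended for your own data.

```python
import re

# Illustrative suffix/stopword list; extend for your dataset (S.A., Pty, AG, ...).
LEGAL_SUFFIXES = {"inc", "llc", "ltd", "corp", "corporation", "gmbh", "co", "the"}

def normalize_email(email: str) -> str:
    """Lowercase and strip plus-suffixes from the local part."""
    local, _, domain = email.strip().lower().partition("@")
    local = local.split("+", 1)[0]  # user+tag@domain.com -> user@domain.com
    return f"{local}@{domain}"

def normalize_company(name: str) -> str:
    """Lowercase, drop punctuation, and strip common legal suffixes."""
    tokens = re.sub(r"[^\w\s]", " ", name.lower()).split()
    return " ".join(t for t in tokens if t not in LEGAL_SUFFIXES)
```

After normalization, "The Acme Corporation, Inc." and "Acme Corp" reduce to the same string, so even a deterministic comparison succeeds.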

Cross-System Identity Matching

Matching records within a single system is relatively straightforward because you have consistent field names and data formats. Cross-system matching is harder because each system has its own schema, its own data quality characteristics, and its own version of the truth.

The Cross-System Matching Problem

Consider a typical GTM stack: a contact exists in Salesforce (the CRM), in HubSpot (marketing automation), in Outreach (sequencer), in Clay (enrichment), and in Snowflake (analytics). Each system stores different attributes, updates at different cadences, and may have a different version of the contact's data. Matching across these systems requires a shared identity strategy.

Identity Key Strategies

1. Email as universal key. The simplest cross-system identity strategy is using email address as the primary key across all systems. When a record is created in any system, it is matched to existing records in other systems by email. This works well for contacts but fails for accounts and breaks when contacts have multiple email addresses or when email is not available.
2. CRM ID propagation. Assign the CRM record ID as the canonical identifier and propagate it to all connected systems. When Salesforce creates a contact with ID "003xxxx", that ID is synced to HubSpot, Outreach, and Clay as a custom field. All cross-system matching uses this ID. This is reliable but requires that the CRM is always the system of record, and it breaks for records that exist in downstream systems but have not yet been synced to the CRM.
3. Identity graph. Build or use a dedicated identity resolution system that maintains a mapping between all system-specific IDs and a universal ID. When a record is created or updated in any system, the identity graph matches it to a universal identity and propagates updates to all connected systems. This is the most robust approach but requires dedicated infrastructure. See our Identity Resolution guide for deep coverage of this pattern.
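
To make the third strategy concrete, here is a minimal in-memory sketch of the mapping an identity graph maintains. A real implementation adds persistence, merge/split operations, and propagation to connected systems; this only shows the core data structure.

```python
import itertools
from collections import defaultdict
from typing import Optional

class IdentityGraph:
    """Minimal identity map: (system, local_id) -> universal ID."""

    def __init__(self):
        self._by_key = {}                 # (system, local_id) -> universal ID
        self._members = defaultdict(set)  # universal ID -> linked system keys
        self._next = itertools.count(1)   # universal ID generator

    def link(self, system: str, local_id: str,
             universal_id: Optional[int] = None) -> int:
        """Attach a system-specific record to a universal identity."""
        key = (system, local_id)
        if key in self._by_key:
            return self._by_key[key]
        uid = next(self._next) if universal_id is None else universal_id
        self._by_key[key] = uid
        self._members[uid].add(key)
        return uid

    def resolve(self, system: str, local_id: str) -> Optional[int]:
        """Look up the universal ID for a system-specific record, if linked."""
        return self._by_key.get((system, local_id))
```

Once the Salesforce ID and the HubSpot ID are linked to the same universal ID, any system can resolve the other's records without a point-to-point mapping.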

Handling Conflicts

Cross-system matching inevitably surfaces conflicts — two systems have different values for the same field on the same record. Your matching system needs conflict resolution rules:

  • Source priority: Designate authoritative sources for each field. The CRM is authoritative for deal data, the enrichment platform for firmographics, and the marketing platform for engagement data.
  • Recency wins: For fields without a clear authoritative source, use the most recently updated value. This requires tracking update timestamps across systems.
  • Manual review queue: When conflicts cannot be resolved automatically (e.g., two systems show different job titles updated on the same day), route to a human reviewer. Build this queue into your ops team's workflow.
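
A minimal conflict resolver combining the first two rules might look like this. The field-to-source authority map is illustrative and would be configured per stack.

```python
from datetime import datetime

# Illustrative authority map: which system wins for each field on conflict.
FIELD_AUTHORITY = {
    "deal_stage": "salesforce",   # CRM owns deal data
    "employee_count": "clay",     # enrichment platform owns firmographics
    "last_engaged": "hubspot",    # marketing platform owns engagement
}

def resolve_field(field: str, candidates: list) -> dict:
    """Pick a winning value from [{'source', 'value', 'updated_at'}, ...]."""
    authority = FIELD_AUTHORITY.get(field)
    if authority:
        for c in candidates:
            if c["source"] == authority:
                return c  # source priority: the authoritative system wins
    # Recency wins when no authoritative source is defined for the field.
    return max(candidates, key=lambda c: c["updated_at"])
```

Fields that fall through both rules (for example, equal timestamps from non-authoritative sources) are the ones that belong in the manual review queue.
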

Match Confidence Matters

Not all matches deserve the same level of trust. A match based on exact email is near-certain. A match based on fuzzy company name + fuzzy first name + same city is probable but not certain. Attach a confidence score to every match and use it to determine the action: high-confidence matches auto-merge, medium-confidence matches update but flag for review, low-confidence matches queue for manual verification. Never auto-merge on a low-confidence match — the cost of a false merge (corrupting two good records) is far higher than the cost of a missed match (maintaining a duplicate).

Implementation Patterns

Building a reliable matching system requires more than choosing the right algorithms. The architecture of your matching pipeline determines its performance, accuracy, and maintainability.

Blocking for Performance

Pairwise comparison of every record against every other record is computationally impractical beyond a few thousand records. Blocking divides your dataset into groups (blocks) that share a common attribute and only compares records within the same block. Common blocking strategies for GTM data:

  • Domain blocking: Only compare records that share the same email domain or company domain. This dramatically reduces comparisons while preserving most true matches.
  • First-letter blocking: Only compare records whose last names start with the same letter. Less precise than domain blocking but useful when domain is not available.
  • Composite blocking: Use multiple blocking passes with different keys. First pass: block on domain. Second pass: block on normalized company name initial + state. This catches matches that any single blocking key would miss.
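
A domain-blocking pass can be sketched as follows: instead of generating all n·(n−1)/2 pairwise comparisons, only records that share an email domain are paired.

```python
from collections import defaultdict
from itertools import combinations
from typing import Optional

def candidate_pairs(records: list, block_key) -> list:
    """Group records by a blocking key, then only pair within each block."""
    blocks = defaultdict(list)
    for r in records:
        key = block_key(r)
        if key:  # records with no blocking key are never compared
            blocks[key].append(r)
    pairs = []
    for group in blocks.values():
        pairs.extend(combinations(group, 2))  # pairwise only inside the block
    return pairs

def email_domain(record: dict) -> Optional[str]:
    """Blocking key: the email domain, or None if no usable email."""
    email = record.get("email", "")
    return email.split("@")[-1].lower() if "@" in email else None
```

With four records split 3/1 across two domains, this yields 3 candidate pairs instead of the 6 a full pairwise comparison would produce; the savings grow quadratically with dataset size. A second pass with a different key (the composite strategy above) recovers matches the first key misses.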

Match-Review-Merge Pipeline

A production matching system should follow a clear pipeline:

  • Candidate generation: Use blocking to identify candidate pairs
  • Scoring: Apply matching algorithms to score each candidate pair
  • Classification: Categorize matches as auto-merge (score > 90), review (60-90), or non-match (< 60)
  • Human review: Route medium-confidence matches to a reviewer with context about why the match was flagged
  • Merge execution: Apply merge rules and execute in the target system
  • Audit logging: Record every match decision for debugging and threshold tuning
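
The classification and audit-logging steps reduce to a simple threshold gate. The thresholds mirror the ranges above and should be tuned against your own review outcomes.

```python
def classify(score: float, auto_merge: float = 90.0, review: float = 60.0) -> str:
    """Map a candidate pair's match score to a pipeline action."""
    if score > auto_merge:
        return "auto-merge"
    if score >= review:
        return "review"
    return "non-match"

def run_pipeline(scored_pairs, audit_log):
    """Route each scored pair and record every decision for later tuning."""
    for pair_id, score in scored_pairs:
        action = classify(score)
        audit_log.append({"pair": pair_id, "score": score, "action": action})
        yield pair_id, action
```

The audit log is what lets you tune thresholds later: sample the "review" band, label the outcomes, and move the boundaries until false merges disappear from the auto-merge band.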

FAQ

What is a good match rate to target?

Match rate depends on your use case. For ingestion-time dedup (new records against existing), a 20-40% match rate is typical — meaning 20-40% of incoming records match an existing record and should update rather than create. If your match rate is significantly lower, your matching logic may be too strict. If it is significantly higher, you may be over-matching. Monitor false positive and false negative rates alongside the raw match rate.

How do we handle matching when records have very little data?

Sparse records are the hardest matching challenge. If a record has only a first name and company name, your confidence in any match will be low. The practical approach is to require a minimum data threshold for matching — for example, at least two identifying fields present and at least one strong match (email or phone). Records below the threshold go into a quarantine queue for enrichment before matching.

Should matching logic be centralized or distributed across systems?

Centralized matching is almost always better. When each system runs its own matching logic with its own rules and thresholds, you get inconsistent results — a record might match in HubSpot but not in Salesforce. Centralize your matching in a single service or platform that all systems call before creating or updating records. This ensures consistent identity resolution across your entire stack.

How do we match contacts who have changed companies?

Job changes are a special case. The person is the same, but the account association has changed. Your matching logic should detect this: same name + same personal email or phone + different company domain signals a job change, not a new person. When detected, update the contact's company association and trigger job-change outreach workflows rather than creating a new record.
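
A hypothetical detector for this heuristic might look like the following; the field names are illustrative, and a production version would score these signals rather than require all of them.

```python
def looks_like_job_change(existing: dict, incoming: dict) -> bool:
    """Same-person signals plus a different company domain suggest a job change."""
    same_name = (
        existing.get("name", "").strip().lower()
        == incoming.get("name", "").strip().lower()
    )
    # A stable personal identifier (personal email or phone) ties the records
    # to the same human even though the work email has changed.
    same_personal = (
        existing.get("personal_email")
        and existing.get("personal_email") == incoming.get("personal_email")
    ) or (
        existing.get("phone") and existing.get("phone") == incoming.get("phone")
    )
    different_domain = (
        existing.get("company_domain") and incoming.get("company_domain")
        and existing["company_domain"] != incoming["company_domain"]
    )
    return bool(same_name and same_personal and different_domain)
```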

What Changes at Scale

Matching 10,000 records against each other with a simple Python script takes minutes. Matching 1,000,000 records across five systems in real-time as data flows through your stack is a fundamentally different problem. The matching logic itself does not change, but the infrastructure requirements do — you need blocking strategies to reduce comparison space, caching to avoid redundant lookups, and asynchronous processing to handle volume without slowing down your ingestion pipelines.

The bigger challenge is maintaining matching consistency across an expanding stack. Every new tool you add — a new enrichment provider, a new marketing channel, a new product analytics platform — introduces a new source of records that needs to be matched against your existing identity graph. Building and maintaining point-to-point matching between every pair of systems becomes untenable as the number of systems grows.

Octave is an AI platform designed to automate and optimize outbound playbooks, and its Enrich Agent handles record matching as a core function. The Enrich Company and Enrich Person Agents pull in detailed profiles with product fit scores, and the Library's reference customers are auto-matched to prospects, ensuring that enrichment and matching happen as part of the same workflow rather than as separate infrastructure concerns. For teams running prospecting and enrichment at volume through Octave's native Clay integration, record matching is built into the pipeline rather than bolted on as an afterthought.

Conclusion

Record matching is the invisible foundation of GTM data operations. Every deduplication, every enrichment sync, every cross-system integration depends on correctly answering the question "is this the same entity?" Build your matching system with a layered approach — deterministic first, probabilistic second — and invest in field-specific matching rules that account for the quirks of each data type. Attach confidence scores to every match and use them to gate automated actions. Most importantly, centralize your matching logic so that every system in your stack is working from the same identity truth. The teams that get matching right build everything else on a solid foundation. The ones that do not spend their time reconciling conflicting data across fragmented systems.
