Modernizing Master Data Management: A Vector-First Approach to Entity Resolution at Scale

In the hierarchy of data engineering challenges, Entity Resolution (ER) sits near the top. It is the silent bottleneck in almost every Master Data Management (MDM) initiative. Whether you are merging CRM records, cleaning supply chain vendors, or detecting fraud, the fundamental question remains: Are “Acme Corp,” “Acme Inc.,” and “Acme Strategic Solutions” the same legal entity?

For decades, the industry relied on fuzzy matching algorithms like Levenshtein or Jaro-Winkler. Recently, there has been a massive pivot toward Generative LLMs (like GPT-4 or Llama 3) to solve the “semantic” gap in matching. However, as we discovered in our recent engineering sprint, Generative AI introduces a “resource war”—fighting for VRAM and compute—that makes it unscalable for high-volume pipelines.

This article outlines our architectural shift to a Vector-Native Architecture. By combining Apache Spark with Vector Embeddings, we achieved high-precision resolution with 100% stability, increasing throughput from ~15 pairs/second to over 2,000 pairs/second, all while running on standard commodity hardware.


The Problem: The “Resource War” of Generative ER

Traditional deterministic matching fails when data is semantically identical but syntactically distinct.

  • The Trap: A string distance algorithm sees G4S AMERICAS (UK) LIMITED versus G4S FINANCE LIMITED and scores them as a non-match (< 60% similarity) because of the different functional words (“Americas” vs. “Finance”); the short sketch below reproduces this behaviour.
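
To make the trap concrete, here is a minimal sketch of a normalized Levenshtein comparison. The implementation and the 60% threshold are illustrative rather than the exact scorer we benchmarked.

```python
# Minimal sketch of the string-distance trap.
# The edit-distance implementation and the 60% threshold are illustrative.
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def similarity(a: str, b: str) -> float:
    """Normalize edit distance into a 0..1 similarity score."""
    return 1 - levenshtein(a, b) / max(len(a), len(b))

a = "G4S AMERICAS (UK) LIMITED"
b = "G4S FINANCE LIMITED"
print(f"{similarity(a, b):.0%}")  # roughly 56%, below a typical 60% match threshold
```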

To fix this, many engineering teams deploy Generative LLMs as “Judges”: they feed each candidate pair to a model and ask, “Are these the same?” (a minimal version of this pattern is sketched after the list below). While accurate, this approach proved catastrophic at scale in our tests:

  1. Latency: Generative models produce their answers serially, one token at a time. Processing millions of rows can take days.
  2. Instability (The OOM Loop): We encountered severe Out of Memory (OOM) errors. The Spark JVM (managing data) and the LLM (managing weights) were competing for the same RAM, causing frequent crashes.
  3. Cost: Running 70B parameter models requires massive, expensive GPU clusters for what is essentially a simple boolean decision.
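
For reference, the judge pattern looks roughly like the sketch below: one generative call, and one full decode, per candidate pair. The endpoint, model name and prompt are placeholders (a local Ollama server), not the exact setup we benchmarked.

```python
# Illustrative sketch of the LLM-as-judge pattern we moved away from.
# Assumes a local Ollama server; model name and prompt are placeholders.
import requests

OLLAMA_GENERATE_URL = "http://localhost:11434/api/generate"

def judge_pair(name_a: str, name_b: str) -> bool:
    prompt = (
        f'Are "{name_a}" and "{name_b}" the same legal entity? '
        "Answer with exactly YES or NO."
    )
    resp = requests.post(
        OLLAMA_GENERATE_URL,
        json={"model": "llama3", "prompt": prompt, "stream": False},
        timeout=120,
    )
    resp.raise_for_status()
    # One generative call (and one full decode) per pair: this serial
    # token generation is what capped throughput at roughly 15 pairs/second.
    return resp.json()["response"].strip().upper().startswith("YES")

print(judge_pair("Acme Corp", "Acme Inc."))
```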

We realized we needed the intelligence of an LLM without the overhead of a chatbot.


The Solution: “Discrimination over Generation”

We re-architected the pipeline to move from Generative AI (creating text) to Discriminative AI (comparing positions in a shared vector space).

1. The Intelligence Layer: Vector Embeddings

Instead of asking a model to “think” and write a response, we use an embedding model (specifically mxbai-embed-large) to translate entity names into high-dimensional vector space.

  • How it works: The model maps each string to a point in a high-dimensional vector space. In that space, “Corp” and “Limited” point in similar directions, and the model treats “G4S” as the dominant semantic signal while “Americas” or “Finance” act as secondary descriptors.
  • Batch Optimization: We implemented a Unique Value Batching strategy: deduplicate the names first, then embed them in large batches (see the sketch below). Increasing the batch size to 512 saturated the T4 GPU’s compute capability.
  • The Result: We processed 4,749 unique entity names in just 10 HTTP calls. This reduced network overhead by roughly 99% compared to standard row-by-row processing.
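
A minimal sketch of the batching step is shown below, assuming a local Ollama server hosting mxbai-embed-large; the Spark plumbing around it (collecting distinct names, joining vectors back to the rows) is omitted for brevity.

```python
# Minimal sketch of Unique Value Batching against a local Ollama server.
# The endpoint and payload follow Ollama's /api/embed API; collecting the
# distinct names to the driver is a simplification for illustration.
import requests

OLLAMA_EMBED_URL = "http://localhost:11434/api/embed"
BATCH_SIZE = 512

def embed_unique(names: list[str], model: str = "mxbai-embed-large") -> dict[str, list[float]]:
    unique = sorted(set(names))              # deduplicate before embedding
    vectors: dict[str, list[float]] = {}
    for start in range(0, len(unique), BATCH_SIZE):
        batch = unique[start:start + BATCH_SIZE]
        resp = requests.post(
            OLLAMA_EMBED_URL,
            json={"model": model, "input": batch},   # one HTTP call per 512 names
            timeout=300,
        )
        resp.raise_for_status()
        for name, vec in zip(batch, resp.json()["embeddings"]):
            vectors[name] = vec
    return vectors

# 4,749 unique names / 512 names per batch -> 10 HTTP calls in total.
```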

2. The Clustering Layer: Native Graph Resolution

Identifying that “Record A” matches “Record B” is only half the battle. You must resolve the Transitive Closure:

  • If A matches B…
  • And B matches C…
  • Then A, B, and C are the same entity.

We implemented a Connected Components algorithm natively in PySpark. By treating matches as “edges” and entities as “nodes,” the algorithm iteratively propagates the smallest Component ID across the network. This “snaps” disparate records into a single “Golden Record” ID that persists across the dataset.
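
A simplified version of that propagation loop is sketched below. The toy edge list and column names are illustrative, not the production schema; the real job runs over the candidate-match output of the vector stage.

```python
# Simplified PySpark sketch of Connected Components via iterative
# minimum-ID propagation. Toy edges and column names are illustrative.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("entity-resolution-cc").getOrCreate()

# Matched pairs ("edges") coming out of the vector-similarity stage.
edges = spark.createDataFrame([("A", "B"), ("B", "C"), ("D", "E")], ["src", "dst"])

# Make the graph undirected so labels can flow in both directions.
sym = edges.union(edges.select(F.col("dst").alias("src"), F.col("src").alias("dst")))

# Every node starts in its own component, labelled by its own ID.
labels = (sym.select(F.col("src").alias("node"))
             .union(sym.select(F.col("dst").alias("node")))
             .distinct()
             .withColumn("component", F.col("node")))

changed = True
while changed:
    # Pull every neighbour's current label across each edge...
    neighbour = (sym.alias("e")
                    .join(labels.alias("l"), F.col("e.dst") == F.col("l.node"))
                    .select(F.col("e.src").alias("node"),
                            F.col("l.component").alias("component")))
    # ...and keep the smallest label each node has seen so far.
    new_labels = (labels.union(neighbour)
                        .groupBy("node")
                        .agg(F.min("component").alias("component")))
    # Stop once no node adopted a smaller component ID in this round.
    changed = (new_labels.alias("n")
                         .join(labels.alias("o"), "node")
                         .filter(F.col("n.component") != F.col("o.component"))
                         .count() > 0)
    labels = new_labels

labels.show()  # A, B and C share one "Golden Record" ID; D and E share another
```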


Real-World Benchmark Results

We benchmarked this architecture on a Google Colab T4 (Tesla T4 GPU) environment against a raw dataset of 40,109 company records.

| Metric | Generative LLM Approach | Vector Embedding Approach |
| --- | --- | --- |
| Dataset Scale | 40,109 records | 40,109 records |
| Ambiguous Candidates | 3,364 pairs | 3,364 pairs |
| Stability | Frequent OOM crashes | 100% stable |
| Throughput | ~15 pairs / sec | ~2,000+ pairs / sec |
| Network Efficiency | 3,364 API calls | 10 API calls (batch size 512) |
| Accuracy | Variable (hallucinations) | Deterministic |

Case Study: The “G4S” Cluster

The system successfully unified complex clusters that defeated traditional algorithms. For example, it correctly grouped:

  • G4S AMERICAS (UK) LIMITED
  • G4S FINANCE LIMITED
  • G4S LIMITED

Despite the different functional suffixes and geographical tags, the vector space correctly identified the dominant “G4S” signal, achieving perfect resolution where string distance algorithms failed.
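
The match decision between two embedded names typically reduces to cosine similarity; the helper below sketches that comparison, and the 0.85 cut-off is a placeholder rather than our tuned threshold.

```python
# Illustrative cosine-similarity check between two embedded names.
# The vectors would come from the batched embedding step above; the
# 0.85 threshold is a placeholder, not the tuned production value.
import math

def cosine_similarity(u: list[float], v: list[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# vec_a = vectors["G4S AMERICAS (UK) LIMITED"]
# vec_b = vectors["G4S FINANCE LIMITED"]
# is_match = cosine_similarity(vec_a, vec_b) >= 0.85
```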


Financial Analysis: Cloud Cost Implications

For technical leaders, performance is critical, but cost is paramount. Below is a cost projection for running this pipeline on AWS for a 40,000 record batch.

1. Serverless (AWS Bedrock / Lambda)

  • Architecture: Spark triggers a Lambda function or calls the Bedrock API directly.
  • Cost Structure: Pay per token.
  • Calculation: 47,490 tokens (across the unique names) × $0.00013 per 1,000 tokens.
  • Estimated Cost: ~$0.006 (just over half a cent).
  • Verdict: Ideal for sporadic batch jobs or getting started with zero operations overhead.

2. Dedicated Infrastructure (EC2 g4dn.xlarge)

  • Architecture: Self-hosted Spark + Ollama (Docker) on a GPU instance.
  • Cost Structure: Hourly compute ($0.526/hr for On-Demand).
  • Execution Time: ~10 minutes (Spin up, Install, Run, Spin down).
  • Estimated Cost: $0.09.
  • Verdict: Best for high-frequency jobs where data privacy is a concern (VPC isolation).
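
The arithmetic behind the two estimates above is simple enough to sanity-check in a few lines; the prices are the ones quoted above and will vary by region and model.

```python
# Back-of-envelope cost check using the figures quoted above.
tokens = 47_490
price_per_1k_tokens = 0.00013            # USD, serverless pay-per-token
serverless_cost = tokens / 1_000 * price_per_1k_tokens
print(f"Serverless: ${serverless_cost:.4f}")   # ~ $0.0062

hourly_rate = 0.526                      # USD/hr, g4dn.xlarge On-Demand
runtime_hours = 10 / 60                  # spin up, install, run, spin down
ec2_cost = hourly_rate * runtime_hours
print(f"Dedicated EC2: ${ec2_cost:.2f}")       # ~ $0.09
```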

3. Provisioned Throughput (Enterprise Grade)

  • Architecture: Buying “Model Units” (Bedrock) or “Provisioned Concurrency” (SageMaker).
  • Cost Structure: Monthly or Hourly commitment (e.g., ~$800/month for 1 Model Unit).
  • Verdict: Prohibitive. Provisioned throughput is designed for applications requiring guaranteed latency (SLA) for millions of real-time transactions. For batch jobs like Entity Resolution, it is financial overkill.

Executive Takeaway

We have moved from a “brute force” AI approach to an engineered solution. By utilizing Vector Embeddings with High-Density Batching (512), we reduced the API footprint from thousands of calls to just 10 calls.

This architecture transforms Entity Resolution from a fragile, expensive experiment into a robust, nearly free capability. It proves that in the age of AI, the smartest solution isn’t always the biggest model—it’s the right model applied with sound engineering principles.