In the hierarchy of data engineering challenges, Entity Resolution (ER) sits near the top. It is the silent bottleneck in almost every Master Data Management (MDM) initiative. Whether you are merging CRM records, cleaning supply chain vendors, or detecting fraud, the fundamental question remains: Are “Acme Corp,” “Acme Inc.,” and “Acme Strategic Solutions” the same legal entity?
For decades, the industry relied on fuzzy matching algorithms like Levenshtein or Jaro-Winkler. Recently, there has been a massive pivot toward Generative LLMs (like GPT-4 or Llama 3) to solve the “semantic” gap in matching. However, as we discovered in our recent engineering sprint, Generative AI introduces a “resource war”—fighting for VRAM and compute—that makes it unscalable for high-volume pipelines.
This article outlines our architectural shift to a Vector-Native Architecture. By combining Apache Spark with Vector Embeddings, we achieved high-precision resolution with 100% stability, increasing throughput from ~15 pairs/second to over 2,000 pairs/second, all while running on standard commodity hardware.
The Problem: The “Resource War” of Generative ER
Traditional deterministic and fuzzy matching fails when data is semantically identical but syntactically distinct.
- The Trap: A string distance algorithm compares “G4S AMERICAS (UK) LIMITED” against “G4S FINANCE LIMITED” and scores them as a non-match (< 60% similarity) because of the different functional words (“Americas” vs. “Finance”); see the sketch below.
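A minimal illustration of the trap, using Python's standard-library difflib as a stand-in for whichever string-distance library you prefer (the exact score varies by algorithm and library, so treat the number as indicative only):

```python
from difflib import SequenceMatcher

a = "G4S AMERICAS (UK) LIMITED"
b = "G4S FINANCE LIMITED"

# SequenceMatcher.ratio() returns a similarity score between 0.0 and 1.0.
# The shared "G4S" prefix and "LIMITED" suffix are not enough to push the
# pair past a conservative auto-match threshold, so it is rejected even
# though both names belong to the same corporate family.
score = SequenceMatcher(None, a, b).ratio()
print(f"string similarity: {score:.2f}")
```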
To fix this, many engineering teams deploy Generative LLMs as “Judges.” They feed pairs to a model and ask, “Are these the same?” While accurate, this approach proved catastrophic at scale in our tests:
- Latency: Generative models produce output serially, one token at a time. Processing millions of candidate pairs can take days.
- Instability (The OOM Loop): We encountered severe Out of Memory (OOM) errors. The Spark JVM (holding the data) and the LLM runtime (holding the model weights) competed for the same RAM, causing frequent crashes.
- Cost: Running 70B parameter models requires massive, expensive GPU clusters for what is essentially a simple boolean decision.
We realized we needed the intelligence of an LLM without the overhead of a chatbot.
The Solution: “Discrimination over Generation”
We re-architected the pipeline to move from Generative AI (creating text) to Discriminative AI (comparing mathematical positions).
1. The Intelligence Layer: Vector Embeddings
Instead of asking a model to “think” and write a response, we use an embedding model (specifically mxbai-embed-large) to translate entity names into high-dimensional vector space.
- How it works: The model maps each name to a point in a high-dimensional coordinate space. In that space, “Corp” and “Limited” point in similar directions, and “G4S” emerges as the dominant semantic signal, while “Americas” or “Finance” act as secondary descriptors.
- Batch Optimization: We implemented a Unique Value Batching strategy: deduplicate the names first, then embed them in large batches. By increasing the batch size to 512, we saturated the T4 GPU’s compute capacity (a minimal sketch follows this list).
- The Result: We processed 4,749 unique entity names in just 10 HTTP calls. This reduced network overhead by roughly 99% compared to standard row-by-row processing.
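Here is a minimal sketch of that batching step. It assumes a locally hosted Ollama instance serving mxbai-embed-large on the default port and uses Ollama's batch embedding endpoint; verify the exact request and response shape against your Ollama version.

```python
import requests

OLLAMA_URL = "http://localhost:11434"   # assumed local Ollama endpoint
MODEL = "mxbai-embed-large"
BATCH_SIZE = 512

def embed_unique_names(names):
    """Embed a list of entity names in large batches.

    Deduplicating first means each distinct string is embedded exactly once:
    4,749 unique names at a batch size of 512 is only 10 HTTP calls.
    """
    unique = sorted(set(names))
    vectors = {}
    for start in range(0, len(unique), BATCH_SIZE):
        batch = unique[start:start + BATCH_SIZE]
        # Ollama's batch embedding endpoint accepts a list of inputs and
        # returns one vector per input (check the shape for your version).
        resp = requests.post(
            f"{OLLAMA_URL}/api/embed",
            json={"model": MODEL, "input": batch},
            timeout=300,
        )
        resp.raise_for_status()
        for name, vec in zip(batch, resp.json()["embeddings"]):
            vectors[name] = vec
    return vectors
```

Because the map is keyed by unique name, the expensive model call scales with distinct strings rather than raw row count; the vectors are then joined back to the full record set in Spark.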
2. The Clustering Layer: Native Graph Resolution
Identifying that “Record A” matches “Record B” is only half the battle. You must resolve the Transitive Closure:
- If A matches B…
- And B matches C…
- Then A, B, and C are the same entity.
We implemented a Connected Components algorithm natively in PySpark. By treating matches as “edges” and entities as “nodes,” the algorithm iteratively propagates the smallest Component ID across the network. This “snaps” disparate records into a single “Golden Record” ID that persists across the dataset.
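A condensed sketch of that label-propagation loop, written directly against the PySpark DataFrame API. Column names (src, dst, node, component) and the fixed iteration cap are illustrative; a production job would add a convergence check and periodic checkpointing to keep the query plan from growing unbounded.

```python
from pyspark.sql import DataFrame, functions as F

def connected_components(edges: DataFrame, max_iter: int = 20) -> DataFrame:
    """Resolve transitive closure over match pairs (columns: src, dst).

    Every node starts as its own component; on each pass a node adopts the
    smallest component ID among itself and its neighbours, so A-B and B-C
    "snap" into a single cluster after a few iterations.
    """
    # Treat matches as undirected edges: add the reverse of every pair.
    sym = edges.select("src", "dst").union(
        edges.select(F.col("dst").alias("src"), F.col("src").alias("dst"))
    )

    # Initial labels: each node is its own component.
    labels = sym.select(F.col("src").alias("node")).distinct() \
                .withColumn("component", F.col("node"))

    for _ in range(max_iter):
        # Pull each neighbour's current label across every edge...
        propagated = (
            sym.join(labels, sym.dst == labels.node)
               .select(sym.src.alias("node"), labels.component.alias("component"))
        )
        # ...then keep the smallest label per node, including its own.
        labels = (
            labels.union(propagated)
                  .groupBy("node")
                  .agg(F.min("component").alias("component"))
        )

    return labels  # node -> resolved "Golden Record" component ID
```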
Real-World Benchmark Results
We benchmarked this architecture on a Google Colab T4 (Tesla T4 GPU) environment against a raw dataset of 40,109 company records.
| Metric | Generative LLM Approach | Vector Embedding Approach |
| --- | --- | --- |
| Dataset Scale | 40,109 records | 40,109 records |
| Ambiguous Candidates | 3,364 pairs | 3,364 pairs |
| Stability | Frequent OOM Crashes | 100% Stable |
| Throughput | ~15 pairs / sec | ~2,000+ pairs / sec |
| Network Efficiency | 3,364 API Calls | 10 API Calls (Batch Size 512) |
| Accuracy | Variable (hallucinations) | Deterministic |
Case Study: The “G4S” Cluster
The system successfully unified complex clusters that defeated traditional algorithms. For example, it correctly grouped:
- G4S AMERICAS (UK) LIMITED
- G4S FINANCE LIMITED
- G4S LIMITED
Despite the different functional suffixes and geographical tags, the vector space correctly identified the dominant “G4S” signal, achieving perfect resolution where string distance algorithms failed.
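To make the comparison step concrete, the match decision reduces to a cosine similarity between the stored vectors. The snippet below is a sketch: the 0.85 threshold and the add_match_edge helper are illustrative placeholders, not our tuned production values.

```python
import numpy as np

def cosine_similarity(a, b) -> float:
    """Cosine of the angle between two embedding vectors (1.0 = same direction)."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Using the name -> vector map from the batching sketch above:
# score = cosine_similarity(vectors["G4S AMERICAS (UK) LIMITED"],
#                           vectors["G4S FINANCE LIMITED"])
# if score >= 0.85:          # illustrative threshold, not the tuned value
#     add_match_edge(...)    # hypothetical downstream step that emits a graph edge
```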
Financial Analysis: Cloud Cost Implications
For technical leaders, performance is critical, but cost is paramount. Below is a cost projection for running this pipeline on AWS for a 40,000 record batch.
1. Serverless (AWS Bedrock / Lambda)
- Architecture: Spark triggers a Lambda function or calls the Bedrock API directly.
- Cost Structure: Pay per token.
- Calculation: 47,490 tokens (unique names) × $0.00013 per 1K tokens.
- Estimated Cost: $0.006 (approx. half a penny).
- Verdict: Ideal for sporadic batch jobs or getting started with zero operations overhead.
2. Dedicated Infrastructure (EC2 g4dn.xlarge)
- Architecture: Self-hosted Spark + Ollama (Docker) on a GPU instance.
- Cost Structure: Hourly compute ($0.526/hr for On-Demand).
- Execution Time: ~10 minutes (Spin up, Install, Run, Spin down).
- Estimated Cost: $0.09.
- Verdict: Best for high-frequency jobs where data privacy is a concern (VPC isolation).
3. Provisioned Throughput (Enterprise Grade)
- Architecture: Buying “Model Units” (Bedrock) or “Provisioned Concurrency” (SageMaker).
- Cost Structure: Monthly or Hourly commitment (e.g., ~$800/month for 1 Model Unit).
- Verdict: Prohibitive. Provisioned throughput is designed for applications requiring guaranteed latency (SLA) for millions of real-time transactions. For batch jobs like Entity Resolution, it is financial overkill.
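For readers who want to sanity-check the arithmetic, here is a quick back-of-envelope calculation using only the figures quoted above (regional pricing and token counts will vary):

```python
# Back-of-envelope check of the serverless and EC2 figures quoted above.
tokens = 47_490                     # token count for the 4,749 unique names
price_per_1k_tokens = 0.00013       # USD per 1K input tokens, as quoted above
serverless_cost = tokens / 1_000 * price_per_1k_tokens
print(f"Serverless (per-token): ${serverless_cost:.4f}")    # ~$0.006

hourly_rate = 0.526                 # USD/hr, g4dn.xlarge On-Demand
runtime_minutes = 10                # spin up, install, run, spin down
ec2_cost = hourly_rate * runtime_minutes / 60
print(f"EC2 g4dn.xlarge: ${ec2_cost:.2f}")                  # ~$0.09
```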
Executive Takeaway
We have moved from a “brute force” AI approach to an engineered solution. By utilizing Vector Embeddings with High-Density Batching (512), we reduced the API footprint from thousands of calls to just 10 calls.
This architecture transforms Entity Resolution from a fragile, expensive experiment into a robust, nearly free capability. It proves that in the age of AI, the smartest solution isn’t always the biggest model—it’s the right model applied with sound engineering principles.