In this article we will use Locality Sensitive Hashing (LSH) to find similar company names in the Companies House data introduced in the previous article, and we will use a PySpark machine learning Pipeline to streamline the process.
Background
Fuzzy (approximate) matching of two strings means calculating how similar they are. One approach is to compute the edit distance between the strings, that is, the number of changes (insertions, deletions, substitutions) needed to transform one string into the other. A popular algorithm for this is the Levenshtein distance; however, it is too computationally expensive and slow for large datasets. Apache Spark's machine learning library offers a cheaper alternative: the approxSimilarityJoin method of MinHashLSH calculates the Jaccard distance between sets of n-grams, which can be used for finding similar company names.
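To make the metric concrete, here is a minimal plain-Python sketch (not part of the Spark pipeline) of the Jaccard distance computed over character trigram sets; 0.0 means the sets are identical and 1.0 means they share nothing:
def jaccard_distance(a, b, n=3):
    # Jaccard distance = 1 - |intersection| / |union| of the n-gram sets
    grams_a = {a[i:i + n] for i in range(len(a) - n + 1)}
    grams_b = {b[i:i + n] for i in range(len(b) - n + 1)}
    return 1.0 - len(grams_a & grams_b) / len(grams_a | grams_b)
print(jaccard_distance("BROOKSON", "BROOKSON GROUP"))  # 0.5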
Use of PySpark Pipeline
We will use a PySpark machine learning Pipeline to chain the following stages (a short sketch after the list shows what the first two stages produce):
- RegexTokenizer: splits each name into individual characters (using an empty regex pattern)
- NGram: builds character 3-grams from those tokens
- HashingTF: hashes the n-grams into a sparse feature vector
- MinHashLSH: generates the locality sensitive hashes used for the approximate join
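As an illustration, here is a plain-Python approximation of what the first two stages do to a name:
name = "brookson wa1 1rg"
tokens = list(name)  # RegexTokenizer with an empty pattern splits into single characters
ngrams = [" ".join(tokens[i:i + 3]) for i in range(len(tokens) - 2)]  # NGram with n=3
print(ngrams[:5])  # ['b r o', 'r o o', 'o o k', 'o k s', 'k s o']
HashingTF then maps these n-grams into a fixed-size sparse vector, and MinHashLSH builds the hash tables used by the approximate join.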
Code Snippets
from pyspark.sql import functions as F
from pyspark.ml import Pipeline
from pyspark.ml.feature import RegexTokenizer, NGram, HashingTF, MinHashLSH

df = spark.read.parquet("spark-warehouse/ch_20210404")
df.createOrReplaceTempView("ch_20210404")
# extract features: strip LIMITED/LTD from the company name and append the postcode
df1 = spark.sql("""
select CompanyNumber as id,
       concat_ws(' ', trim(regexp_replace(CompanyName, "LIMITED|LTD", "")),
                 trim(RegAddressPostCode)) as names
from ch_20210404
where RegAddressPostCode is not null""").cache()
# build and fit the pipeline
model_pipeline = Pipeline(stages=[
    # empty pattern splits each name into individual characters
    RegexTokenizer(pattern="", inputCol="names", outputCol="tokens", toLowercase=True),
    # character 3-grams
    NGram(n=3, inputCol="tokens", outputCol="ngrams"),
    # hash the n-grams into a sparse vector of 2^21 features
    HashingTF(inputCol="ngrams", outputCol="vectors", numFeatures=2 << 20),
    # locality sensitive hashes used by approxSimilarityJoin
    MinHashLSH(inputCol="vectors", outputCol="lsh", numHashTables=4)
]).fit(df1)
# create hash dataframe
db_hashed = model_pipeline.transform(df1).cache()
# to find company with name BROOKSON and postcode WA1 1RG
df_query = spark.createDataFrame(["BROOKSON WA1 1RG"], "string").toDF("names")
source_hashed = model_pipeline.transform(df_query)
# use approxSimilarityJoin to find matches within a maximum Jaccard distance of 0.8;
# the "confidence" column actually holds the distance, so lower values mean closer matches
matches_df = model_pipeline.stages[-1].approxSimilarityJoin(db_hashed, source_hashed, 0.8, "confidence")
matches_df.cache()
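As an aside, the fitted MinHashLSHModel also provides approxNearestNeighbors, which returns the top-k closest rows for a single query vector instead of all matches under a distance threshold. A minimal sketch, reusing the query row hashed above:
key_vector = source_hashed.first()["vectors"]  # feature vector of the query row
model_pipeline.stages[-1].approxNearestNeighbors(db_hashed, key_vector, 5, distCol="confidence").show(truncate=False)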
Find the records with the lowest distance (the "confidence" column) for the queried company name:
matches_df.select(F.col('datasetA.names').alias('Company_Names'),
                  F.col('datasetA.id').alias('CompanyNumber'),
                  F.col('datasetB.names').alias('Source_CompanyNames'),
                  F.col('confidence')).filter('confidence <= 0.40').show(truncate=False)
Output
+----------------------+-------------+-------------------+-------------------+
|Company_Names |CompanyNumber|Source_CompanyNames|confidence |
+----------------------+-------------+-------------------+-------------------+
|BROOKSON 1989 WA1 1RG |08838314 |BROOKSON WA1 1RG |0.35 |
|BROOKSON GROUP WA1 1RG|05953666 |BROOKSON WA1 1RG |0.38095238095238093|
|BROOKSON WA1 1RG |03128631 |BROOKSON WA1 1RG |0.0 |
+----------------------+-------------+-------------------+-------------------+
It seems the dataset contains near-duplicate company names: besides the exact match for the queried company (distance 0.0), two similarly named companies are registered at the same postcode.
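To surface all such near-duplicates across the whole dataset in one pass, the same model can be used as a self-join. A minimal sketch, reusing the 0.4 distance cut-off from above; the id comparison keeps each pair only once and drops self-matches:
# self-join the hashed dataset to list near-duplicate name pairs
dup_df = model_pipeline.stages[-1].approxSimilarityJoin(db_hashed, db_hashed, 0.4, "confidence")
dup_df.filter('datasetA.id < datasetB.id') \
      .select(F.col('datasetA.names').alias('name_a'),
              F.col('datasetB.names').alias('name_b'),
              F.col('confidence')).show(truncate=False)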