In this article we will use Locality Sensitive Hashing (LSH) to find similar company names in the Companies House data introduced in the previous article, and we will use a PySpark machine learning Pipeline to streamline the process.
Background
Fuzzy (approximate) matching of two strings means calculating how similar they are. One approach is to compute the edit distance between the strings, that is, the number of changes (insertions, deletions, substitutions) needed to transform one string into the other. A popular algorithm for this is the Levenshtein distance; however, it is too computationally expensive and slow for large datasets. Apache Spark's machine learning library offers a cheaper alternative: the approxSimilarityJoin method of MinHashLSH calculates the Jaccard distance between sets of n-grams, which can be used for finding similar company names.
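To make the metric concrete, here is a minimal plain-Python sketch (not part of the Spark pipeline) of the Jaccard distance computed over character trigram sets; 0.0 means the sets are identical and 1.0 means they share nothing:
def jaccard_distance(a, b, n=3):
    # Jaccard distance = 1 - |intersection| / |union| of the n-gram sets
    grams_a = {a[i:i + n] for i in range(len(a) - n + 1)}
    grams_b = {b[i:i + n] for i in range(len(b) - n + 1)}
    return 1.0 - len(grams_a & grams_b) / len(grams_a | grams_b)
print(jaccard_distance("BROOKSON", "BROOKSON GROUP"))  # 0.5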
Use of PySpark Pipeline
We will use a PySpark machine learning Pipeline to chain the following stages (a short sketch after the list shows what the first two stages produce):
- RegexTokenizer: splits each name into individual characters (using an empty regex pattern)
- NGram: builds character 3-grams from those tokens
- HashingTF: hashes the n-grams into a sparse feature vector
- MinHashLSH: generates the locality sensitive hashes used for the approximate join
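As an illustration, here is a plain-Python approximation of what the first two stages do to a name:
name = "brookson wa1 1rg"
tokens = list(name)  # RegexTokenizer with an empty pattern splits into single characters
ngrams = [" ".join(tokens[i:i + 3]) for i in range(len(tokens) - 2)]  # NGram with n=3
print(ngrams[:5])  # ['b r o', 'r o o', 'o o k', 'o k s', 'k s o']
HashingTF then maps these n-grams into a fixed-size sparse vector, and MinHashLSH builds the hash tables used by the approximate join.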
Code Snippets
from pyspark.sql import functions as F
from pyspark.ml import Pipeline
from pyspark.ml.feature import RegexTokenizer, NGram, HashingTF, MinHashLSH

df = spark.read.parquet("spark-warehouse/ch_20210404")
df.createOrReplaceTempView("ch_20210404")
# extract features: strip LIMITED/LTD from the company name and append the postcode
df1 = spark.sql("""
select CompanyNumber as id,
       concat_ws(' ', trim(regexp_replace(CompanyName, "LIMITED|LTD", "")),
                 trim(RegAddressPostCode)) as names
from ch_20210404
where RegAddressPostCode is not null""").cache()
# build and fit the pipeline
model_pipeline = Pipeline(stages=[
    # empty pattern splits each name into individual characters
    RegexTokenizer(pattern="", inputCol="names", outputCol="tokens", toLowercase=True),
    # character 3-grams
    NGram(n=3, inputCol="tokens", outputCol="ngrams"),
    # hash the n-grams into a sparse vector of 2^21 features
    HashingTF(inputCol="ngrams", outputCol="vectors", numFeatures=2 << 20),
    # locality sensitive hashes used by approxSimilarityJoin
    MinHashLSH(inputCol="vectors", outputCol="lsh", numHashTables=4)
]).fit(df1)
# create hash dataframe
db_hashed = model_pipeline.transform(df1).cache()
# to find company with name BROOKSON and postcode WA1 1RG
df_query = spark.createDataFrame(["BROOKSON WA1 1RG"], "string").toDF("names")
source_hashed = model_pipeline.transform(df_query)
# use approxSimilarityJoin to find matches within a maximum Jaccard distance of 0.8;
# the "confidence" column actually holds the distance, so lower values mean closer matches
matches_df = model_pipeline.stages[-1].approxSimilarityJoin(db_hashed, source_hashed, 0.8, "confidence")
matches_df.cache()
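As an aside, the fitted MinHashLSHModel also provides approxNearestNeighbors, which returns the top-k closest rows for a single query vector instead of all matches under a distance threshold. A minimal sketch, reusing the query row hashed above:
key_vector = source_hashed.first()["vectors"]  # feature vector of the query row
model_pipeline.stages[-1].approxNearestNeighbors(db_hashed, key_vector, 5, distCol="confidence").show(truncate=False)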
Find the records with the lowest distance (the "confidence" column) for the queried company name:
matches_df.select(F.col('datasetA.names').alias('Company_Names'),
                  F.col('datasetA.id').alias('CompanyNumber'),
                  F.col('datasetB.names').alias('Source_CompanyNames'),
                  F.col('confidence')).filter('confidence <= 0.40').show(truncate=False)
Output
+----------------------+-------------+-------------------+-------------------+
|Company_Names |CompanyNumber|Source_CompanyNames|confidence |
+----------------------+-------------+-------------------+-------------------+
|BROOKSON 1989 WA1 1RG |08838314 |BROOKSON WA1 1RG |0.35 |
|BROOKSON GROUP WA1 1RG|05953666 |BROOKSON WA1 1RG |0.38095238095238093|
|BROOKSON WA1 1RG |03128631 |BROOKSON WA1 1RG |0.0 |
+----------------------+-------------+-------------------+-------------------+
It seems the dataset contains near-duplicate company names: besides the exact match for the queried company (distance 0.0), two similarly named companies are registered at the same postcode.
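To surface all such near-duplicates across the whole dataset in one pass, the same model can be used as a self-join. A minimal sketch, reusing the 0.4 distance cut-off from above; the id comparison keeps each pair only once and drops self-matches:
# self-join the hashed dataset to list near-duplicate name pairs
dup_df = model_pipeline.stages[-1].approxSimilarityJoin(db_hashed, db_hashed, 0.4, "confidence")
dup_df.filter('datasetA.id < datasetB.id') \
      .select(F.col('datasetA.names').alias('name_a'),
              F.col('datasetB.names').alias('name_b'),
              F.col('confidence')).show(truncate=False)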