Near duplicate detection using Locality Sensitivity Hashing (LSH)

Locality sensitive hashing (LSH) is a method for finding similar pairs in a large dataset. For a dataset of size N, the brute force method of comparing every possible pair would take N!/(2!(N-2)!) ~ N²/2 = O(N²) time. The LSH method aims to cut this down to O(N) time.

In this article we will show how to find near duplicate names in Companies House dataset.

STEPS FOR DETECTING DUPLICATES:

  • read companies house data into a dataframe
# read CompaniesHouse data
df = spark.read.parquet("spark-warehouse/ch_20210404")
df.createOrReplaceTempView("ch_20210404")

# create dataframe to return CompanyName in an array
df1 = spark.sql("""
select CompanyNumber as id,split(regexp_replace(CompanyName,"LIMITED|LTD"," ")," ") as names
from ch_20210404""")
  • create pipeline for creating TF vectors and hashes using – HasingTF and MinHashLSH
model = Pipeline(stages=[
        HashingTF(inputCol="names", outputCol="vectors"),
        MinHashLSH(inputCol="vectors", outputCol="lsh", numHashTables=10)
    ]).fit(df1)
db_hashed = model.transform(df1)
db_matches = model.stages[-1].approxSimilarityJoin(db_hashed, db_hashed, 0.9)
  • save data for later usage
#show duplicate matches and store it as parquet file
db_matches.select(F.col('datasetA.id').alias('id_A'),
                 F.col('datasetB.id').alias('id_B'),
                 F.col('distCol')).filter('distCol == 0 and id_A != id_B').write.format("parquet").mode("overwrite").save("spark-warehouse/matched_pairs")

Example

show all matches (including duplicates)
db_matches.select(F.col('datasetA.id').alias('id_A'),
                  F.col('datasetB.id').alias('id_B'),
                  F.col('distCol')).show()
show non-duplicate matches
db_matches.select(F.col('datasetA.id').alias('id_A'),
                  F.col('datasetB.id').alias('id_B'),
                  F.col('distCol')).filter('id_A < id_B').show()
show duplicate matches
db_matches.select(F.col('datasetA.id').alias('id_A'),
                  F.col('datasetB.id').alias('id_B'),
                  F.col('distCol')).filter('distCol == 0 and id_A != id_B').show()
+--------+--------+------------------+
|    id_A|    id_B|           distCol|
+--------+--------+------------------+
|LP010625|AC001580|0.8888888888888888|
|AC001821|SL000159|             0.875|
|00661188|AC000649|0.8571428571428572|
|NF003737|08711738|0.8333333333333334|
|12821321|LP009978|0.8888888888888888|
|03505733|NF003766|0.8571428571428572|
|06519926|02629724|               0.8|
|LP009130|LP011118|               0.4|
|LP009486|12778855|0.8888888888888888|
|AC001328|AC000649|              0.75|
|LP008435|LP009978|               0.4|
|LP009287|LP008435|               0.4|
|LP008660|LP007108|               0.4|
|SG000575|SG000576|               0.8|
|LP010654|LP010804|               0.4|
|LP010645|LP012737|               0.4|
|SL000467|SL001363|               0.5|
|LP009657|LP009979|               0.4|
|00073396|AC001429|0.6666666666666667|
|FC025026|00015228|              0.75|
+--------+--------+------------------+
only showing top 20 rows

+--------+--------+------------------+
|    id_A|    id_B|           distCol|
+--------+--------+------------------+
|12778855|LP010058|0.8888888888888888|
|00661294|08209948|              0.75|
|03248674|NF003700|               0.8|
|AC001495|FC025026|               0.8|
|LP008435|LP009130|               0.4|
|LP010826|LP013369|               0.4|
|LP009229|LP010053|               0.4|
|LP007177|LP009655|               0.4|
|LP006706|LP010804|               0.4|
|LP007216|LP009133|               0.4|
|LP006484|LP006895|               0.4|
|LP006895|LP008660|               0.4|
|LP007490|LP010629|               0.4|
|LP007490|LP009080|               0.4|
|LP008327|LP009721|               0.4|
|LP008660|LP010627|               0.4|
|04682464|AC001328|0.8333333333333334|
|04682464|LP007599|0.8888888888888888|
|04682464|AC001644|              0.75|
|04682464|AC001821|               0.8|
+--------+--------+------------------+
only showing top 20 rows

+--------+--------+-------+
|    id_A|    id_B|distCol|
+--------+--------+-------+
|LP009723|LP009721|    0.0|
|LP004405|LP004597|    0.0|
|CE020210|07351472|    0.0|
|01178454|AC001346|    0.0|
|FC004002|RC000040|    0.0|
|CE014383|04666284|    0.0|
|NI050682|CE023880|    0.0|
|IP000129|RS00129S|    0.0|
|SL003984|SG000522|    0.0|
|CE015388|CE020264|    0.0|
|AC001204|00973271|    0.0|
|LP010826|LP010966|    0.0|
|NF003737|FC025026|    0.0|
|04666284|CE014383|    0.0|
|AC000068|00021487|    0.0|
|09154454|CE014602|    0.0|
|SL001194|SL000927|    0.0|
|NF004123|02629724|    0.0|
|FC006104|AC001031|    0.0|
|09249197|CE008890|    0.0|
+--------+--------+-------+
only showing top 20 rows