Working with Companies House snapshot data

Get a snapshot of the latest live basic company data (dissolved companies are excluded) from http://download.companieshouse.gov.uk/en_output.html. The latest file (http://download.companieshouse.gov.uk/BasicCompanyDataAsOneFile-2022-01-01.zip) contains over 5 million records. It is a CSV file; however, 22 company records contain commas and quotes within the data itself.
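To see why such records are awkward, here is a minimal sketch using Python's standard csv module. The sample row and its two column names are made up for illustration and are not taken from the real file, and the sketch assumes embedded quotes are escaped by doubling them, which should be checked against the actual data.

```python
import csv
import io

# Hypothetical sample (not from the real file): a header plus one record whose
# company name contains both a comma and doubled quotes.
SAMPLE = 'CompanyName,CompanyNumber\n"ACME ""TRADING"", LTD",01234567\n'


def naive_split(line):
    """Split on every comma, ignoring quoting (what a naive parser does)."""
    return line.split(",")


def parse_rows(text):
    """Parse CSV text with a quote-aware reader."""
    return list(csv.reader(io.StringIO(text)))


rows = parse_rows(SAMPLE)
naive = naive_split(SAMPLE.splitlines()[1])

print(len(naive))    # the naive split yields one field too many
print(rows[1])       # the quote-aware reader keeps the name in one field
```

Comparing the number of fields per row against the header length is one simple way to find the problem records.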

You might have to do extra work to parse those records correctly. It is worth trying the options available on the spark.read method, for example:

spark.read.option("delimiter", ",").csv(file), together with an option for escaping quotes.
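A possible way to combine these options in PySpark is sketched below. The option values are assumptions to be verified against the real file: in particular, setting "escape" to a double quote assumes embedded quotes are doubled, whereas Spark's default escape character is a backslash. The names CSV_OPTIONS and read_company_data are hypothetical helpers, and spark is assumed to be an existing SparkSession.

```python
# Assumed option values; adjust them after inspecting the real file.
CSV_OPTIONS = {
    "header": "true",   # first line holds the column names
    "delimiter": ",",   # field separator
    "quote": '"',       # character that wraps fields containing commas
    "escape": '"',      # assumption: embedded quotes are escaped by doubling
}


def read_company_data(spark, path):
    """Read the CSV at `path` using an existing SparkSession `spark`."""
    return spark.read.options(**CSV_OPTIONS).csv(path)
```

For example, df = read_company_data(spark, file) followed by df.count() gives a quick check that the record count matches what you expect.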

Give it a try with a smaller file first. Example code can be found at https://gitlab.com/akkasali910/companieshouse-data and related articles at https://broadoakdata.uk/?s=Companies+House
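One way to produce a smaller file is to copy the first few lines out of the downloaded zip without extracting everything. The sketch below is only an illustration: write_sample is a hypothetical helper, it assumes the archive's first member is the CSV and that it is UTF-8 encoded, and cutting on line boundaries could split a record if any field spans several lines.

```python
import io
import itertools
import zipfile


def write_sample(zip_path, out_path, n_lines=1001):
    """Copy the first n_lines lines (header included) of the zip's first member."""
    count = 0
    with zipfile.ZipFile(zip_path) as zf:
        member = zf.namelist()[0]  # assumption: the CSV is the first member
        with zf.open(member) as raw, open(
            out_path, "w", encoding="utf-8", newline=""
        ) as out:
            # newline="" keeps the original line endings untouched
            text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
            for line in itertools.islice(text, n_lines):
                out.write(line)
                count += 1
    return count
```

Calling write_sample with the downloaded zip and an output path such as "sample.csv" returns the number of lines written, and the resulting file can then be passed to spark.read.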