Category: ETL

  • Process Companies House data using Google Colab

    In this article we will discuss how to process Companies House data using Google Colab. In previous articles we have demonstrated ETL processes using Companies House monthly snapshot data. Most of the processing was carried out on a local machine. The most recent monthly snapshot contains just over 5 million records. We are interested in building an ETL pipeline for…
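
    A minimal sketch of the starting point, assuming PySpark has been pip-installed in the Colab runtime and the snapshot has been downloaded as a single CSV (the file path and name below are illustrative):

      # Minimal sketch: load a Companies House monthly snapshot CSV in Google Colab.
      # Assumes pyspark is installed in the Colab runtime; adjust the path to the
      # snapshot file you actually download.
      from pyspark.sql import SparkSession

      spark = (SparkSession.builder
               .appName("companies-house-colab")
               .getOrCreate())

      # Header row and schema inference are assumptions about the snapshot file.
      df = (spark.read
            .option("header", True)
            .option("inferSchema", True)
            .csv("/content/BasicCompanyDataAsOneFile.csv"))

      print(df.count())   # roughly 5 million rows for a recent snapshot
      df.printSchema()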

  • Simple ETL pipeline using Apache Nifi

    In this article, we will compose a simple ETL pipeline using Apache NiFi. We want to read an Excel file and convert it to a CSV file using the Apache NiFi ConvertExcelToCSV processor. Prerequisites: install Apache NiFi (we used $ brew install nifi); basic knowledge of ETL concepts and data integration tools. Data Pipeline: Apache NiFi is…
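
    The article drives the conversion through NiFi's ConvertExcelToCSV processor inside the flow itself; purely as an illustration of the same transformation outside NiFi, a pandas equivalent (file names hypothetical) would look like:

      # Illustration only: the Excel-to-CSV conversion that the NiFi
      # ConvertExcelToCSV processor performs, done here with pandas.
      # File names are hypothetical; openpyxl is needed for .xlsx files.
      import pandas as pd

      df = pd.read_excel("input.xlsx", sheet_name=0)  # read the first sheet
      df.to_csv("output.csv", index=False)            # write without the index column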

  • Change data capture ETL pipelines

    In this article we will look into change data capture ETL pipelines. How do you implement change data capture (CDC)? We will use PySpark to implement CDC using data from Companies House. The ETL pipelines have been deployed and tested on Google Cloud Platform. Outline of ETL pipelines for change data capture: Code…
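
    A minimal sketch of the core CDC step in PySpark, assuming two snapshots are compared on a key column; the file paths and the CompanyNumber / CompanyStatus column names are illustrative, not the article's exact schema:

      # Classify rows as insert / delete / update / unchanged between two snapshots.
      from pyspark.sql import SparkSession, functions as F

      spark = SparkSession.builder.appName("cdc-sketch").getOrCreate()

      prev = spark.read.option("header", True).csv("snapshot_prev.csv") \
                  .withColumn("in_prev", F.lit(1))
      curr = spark.read.option("header", True).csv("snapshot_curr.csv") \
                  .withColumn("in_curr", F.lit(1))

      joined = prev.alias("p").join(curr.alias("c"), on="CompanyNumber", how="full_outer")

      changes = joined.withColumn(
          "change_type",
          F.when(F.col("c.in_curr").isNull(), "delete")      # present only in previous
           .when(F.col("p.in_prev").isNull(), "insert")      # present only in current
           .when(F.col("p.CompanyStatus") != F.col("c.CompanyStatus"), "update")
           .otherwise("unchanged"))

      changes.groupBy("change_type").count().show()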

  • Create CompaniesHouse index in Elasticsearch using PySpark

    We are using Spark 3.1.2 (spark._sc.version). Elasticsearch (7.9.3) is running in a Docker container with port 9200 exposed to the host. Prerequisites: get elasticsearch-spark-30_2.12-7.12.0.jar and add it to the Spark jar classpath; read Companies House data into a DataFrame; write the DataFrame to Elasticsearch. Code snippets are listed below.
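
    A minimal sketch of those steps, assuming Elasticsearch is reachable on localhost:9200 and the snapshot file name and index name are illustrative:

      # Write a Companies House DataFrame to an Elasticsearch index using the
      # elasticsearch-spark connector named in the prerequisites.
      from pyspark.sql import SparkSession

      spark = (SparkSession.builder
               .appName("companieshouse-to-es")
               .config("spark.jars", "elasticsearch-spark-30_2.12-7.12.0.jar")
               .getOrCreate())

      df = (spark.read
            .option("header", True)
            .csv("BasicCompanyDataAsOneFile.csv"))   # illustrative file name

      (df.write
         .format("org.elasticsearch.spark.sql")
         .option("es.nodes", "localhost")
         .option("es.port", "9200")
         .mode("append")
         .save("companieshouse"))                    # target index name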

  • Near duplicate detection using Locality Sensitive Hashing (LSH)

    Locality sensitive hashing (LSH) is a method for finding similar pairs in a large dataset. For a dataset of size N, the brute-force method of comparing every possible pair would take N!/(2!(N-2)!) = N(N-1)/2 ≈ N²/2, i.e. O(N²) time. The LSH method aims to cut this down to O(N) time. In this article we will show…
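
    A minimal sketch of the idea using Spark ML's MinHashLSH; the toy company names, tokenisation choices and distance threshold are illustrative assumptions, not the article's exact pipeline:

      # Near-duplicate candidate pairs via MinHash LSH over word-set vectors.
      from pyspark.sql import SparkSession
      from pyspark.ml.feature import Tokenizer, CountVectorizer, MinHashLSH

      spark = SparkSession.builder.appName("lsh-sketch").getOrCreate()

      df = spark.createDataFrame(
          [(1, "acme trading ltd"), (2, "acme trading limited"), (3, "widgets plc")],
          ["id", "name"])

      # Turn each name into a sparse binary term vector.
      tokens = Tokenizer(inputCol="name", outputCol="words").transform(df)
      cv = CountVectorizer(inputCol="words", outputCol="features", binary=True).fit(tokens)
      vectors = cv.transform(tokens)

      # Fit MinHash LSH and self-join to find candidate near-duplicate pairs
      # within a Jaccard distance of 0.6.
      lsh = MinHashLSH(inputCol="features", outputCol="hashes", numHashTables=5)
      model = lsh.fit(vectors)
      pairs = model.approxSimilarityJoin(vectors, vectors, 0.6, distCol="jaccard_dist")

      pairs.filter("datasetA.id < datasetB.id") \
           .select("datasetA.id", "datasetB.id", "jaccard_dist") \
           .show()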

  • Impute Standard Industrial Classification (SIC) 2003 from SIC 2007

    The United Kingdom Standard Industrial Classification of Economic Activities (SIC) is used to classify business establishments and other standard units by the type of economic activity in which they are engaged. The new version of these codes (SIC 2007) was adopted by the UK as from 1st January 2008. In this article, we will use…
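
    The imputation itself amounts to joining against a SIC 2007 to SIC 2003 correspondence table. A minimal sketch in PySpark follows; the two mapping rows shown are placeholders, not the real correspondence table:

      # Impute a SIC 2003 code from a SIC 2007 code via a lookup join.
      from pyspark.sql import SparkSession

      spark = SparkSession.builder.appName("sic-impute").getOrCreate()

      companies = spark.createDataFrame(
          [("00000001", "62012"), ("00000002", "47110")],
          ["CompanyNumber", "sic_2007"])

      # Hypothetical correspondence table; in practice this comes from the
      # published SIC 2007 / SIC 2003 mapping file.
      sic_map = spark.createDataFrame(
          [("62012", "7222"), ("47110", "5211")],
          ["sic_2007", "sic_2003"])

      imputed = companies.join(sic_map, on="sic_2007", how="left")
      imputed.show()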

  • Testing distributed ETL script

    Testing is an important part of the software development lifecycle. Identifying defects during development and fixing them before deploying code into production saves time and money. Most importantly, it gives the business assurance that the quality of the code is acceptable and fit for purpose. Good testing depends on agreed acceptance criteria…
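
    A minimal sketch of one way to unit-test a PySpark transformation with pytest and a local SparkSession; the transformation under test (add_status_flag) is a hypothetical example, not the article's actual ETL script:

      import pytest
      from pyspark.sql import SparkSession, functions as F


      def add_status_flag(df):
          """Hypothetical transformation: flag companies that are active."""
          return df.withColumn("is_active", F.col("CompanyStatus") == "Active")


      @pytest.fixture(scope="session")
      def spark():
          # Small local Spark session shared across the test session.
          return (SparkSession.builder
                  .master("local[2]")
                  .appName("etl-tests")
                  .getOrCreate())


      def test_add_status_flag(spark):
          df = spark.createDataFrame(
              [("00000001", "Active"), ("00000002", "Dissolved")],
              ["CompanyNumber", "CompanyStatus"])
          result = add_status_flag(df).collect()
          assert [row.is_active for row in result] == [True, False]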

  • Data validation checks

    When ingesting data into a target destination, it is important to make sure data from different sources conforms to business rules and does not become corrupted due to inconsistencies in type or context. The goal is to create data that is consistent, accurate and complete, so as to prevent data loss and errors during ingestion. Example of…
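
    A minimal sketch of a few such checks in PySpark; the column names, the pattern rule and the input path are illustrative assumptions, not the article's exact rules:

      # Simple completeness, uniqueness and format checks before ingestion.
      from pyspark.sql import SparkSession, functions as F

      spark = SparkSession.builder.appName("validation-sketch").getOrCreate()

      df = spark.read.option("header", True).csv("incoming_data.csv")  # illustrative path

      checks = {
          # completeness: the key column must never be null
          "null_company_number": df.filter(F.col("CompanyNumber").isNull()).count(),
          # uniqueness: no duplicate keys
          "duplicate_company_number":
              df.count() - df.select("CompanyNumber").distinct().count(),
          # format: an 8-character alphanumeric company number pattern
          "bad_company_number_format":
              df.filter(~F.col("CompanyNumber").rlike("^[A-Z0-9]{8}$")).count(),
      }

      failed = {name: n for name, n in checks.items() if n > 0}
      if failed:
          raise ValueError(f"Validation failed: {failed}")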

  • Capturing delta data changes between two datasets

    Suppose you want to capture delta data changes on key variables as part of the data received. In most cases the dataset will not provide a change history. You have to work out for yourself which variables were updated by comparing the values against the previously stored data. In this article, we will share an…
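
    A minimal sketch of that comparison in PySpark; the key column, the tracked variables and the file paths are illustrative assumptions:

      # Flag which tracked variables changed between the stored and the received data.
      from pyspark.sql import SparkSession, functions as F

      spark = SparkSession.builder.appName("delta-sketch").getOrCreate()

      previous = spark.read.option("header", True).csv("stored_data.csv")
      received = spark.read.option("header", True).csv("received_data.csv")

      key = "CompanyNumber"
      tracked = ["CompanyName", "RegAddress_PostCode", "CompanyStatus"]  # key variables

      joined = previous.alias("old").join(received.alias("new"), on=key, how="inner")

      # One boolean column per tracked variable, true when the value changed
      # (null-safe comparison so nulls do not mask genuine changes).
      for col in tracked:
          joined = joined.withColumn(
              f"{col}_changed",
              ~F.col(f"old.{col}").eqNullSafe(F.col(f"new.{col}")))

      changed_flags = [f"{col}_changed" for col in tracked]
      deltas = joined.filter(" OR ".join(changed_flags))
      deltas.select(key, *changed_flags).show()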