Process Companies House data using Google Colab

In this article we will discuss how to process Companies House data using Google Colab. In previous articles we demonstrated ETL processes using the Companies House monthly snapshot data, with most of the processing carried out on a local machine. The most recent monthly snapshot contains just over 5 million records. We want to build an ETL pipeline that creates a reference table for storing basic company details.

Google Colab

Google Cloud Platform provides distributed processing and storage resources at an affordable price, and in many cases free of charge. For Python developers, Google provides Google Colab, a platform “to write and execute arbitrary python code through the browser, and is especially well suited to machine learning, data analysis and education. More technically, Colab is a hosted Jupyter notebook service that requires no setup to use, while providing access free of charge to computing resources including GPUs”.

Companies House monthly snapshot data

We have developed PySpark ETL scripts to store and process Companies House monthly snapshot data. In this article, we will share our experience of running those scripts on Google Colab.
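PySpark is not pre-installed on a fresh Colab runtime, so a setup cell is typically needed before the ETL scripts can run. A minimal sketch (in a Colab notebook cell the command is prefixed with `!`):

```shell
# Install PySpark into the Colab runtime before running the ETL scripts
pip install -q pyspark
```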

Companies House offers a number of different data products. Here we use the free monthly snapshot of basic company data.

The monthly snapshot contains the company number, address, status, incorporation date, accounts and annual return filing dates, SIC code, URI, and basic information about mortgage charges. A full list of the data fields contained in the snapshot can be found here (PDF).

Data validation pipeline

With the processing power at our disposal, it is now easier to understand the data provided by Companies House and perform the following checks:

  • whether the list of fields in the snapshot file matches what we expect
  • whether there are any duplicate values
  • the total count of records, including births and deaths
  • the number of null values in each column
  • a quick sanity check on the incoming data to confirm nothing out of the ordinary is happening
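The checks above can be sketched in plain Python (our production scripts use PySpark; the column names and the small helper below are illustrative, not the actual snapshot schema):

```python
from collections import Counter

# Illustrative subset of columns; the real snapshot has many more fields.
EXPECTED_FIELDS = ["CompanyName", "CompanyNumber", "CompanyStatus", "IncorporationDate"]

def validate_snapshot(rows, fieldnames):
    """Run the basic validation checks on an already-parsed snapshot."""
    report = {}
    # 1. Are the fields what we expect?
    report["fields_ok"] = fieldnames == EXPECTED_FIELDS
    # 2. Any duplicate company numbers?
    counts = Counter(row["CompanyNumber"] for row in rows)
    report["duplicates"] = [k for k, v in counts.items() if v > 1]
    # 3. Total record count.
    report["total_records"] = len(rows)
    # 4. Null (empty) values per column.
    report["nulls"] = {
        field: sum(1 for row in rows if not row.get(field))
        for field in fieldnames
    }
    return report
```

In the PySpark version the same checks translate to column comparisons, `groupBy(...).count()` for duplicates, and per-column null counts.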

Reference Table

We populated the reference table with an initial data load and then use SCD Type 2 (Slowly Changing Dimension) to manage current and historical data.
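The core of SCD Type 2 is that a changed record never overwrites the existing row; instead the current row is closed off with an end date and a new current row is inserted. A simplified plain-Python sketch (field names and the single tracked attribute are illustrative; our actual implementation is in PySpark):

```python
from datetime import date

def scd2_upsert(table, incoming, today=None):
    """Apply one incoming record to an SCD Type 2 table (a list of dicts).

    Each row carries 'valid_from', 'valid_to' and 'is_current' metadata;
    history is preserved by closing the old row instead of updating it.
    """
    today = today or date.today().isoformat()
    key = incoming["CompanyNumber"]
    current = next(
        (r for r in table if r["CompanyNumber"] == key and r["is_current"]), None
    )
    if current is not None:
        if current["CompanyStatus"] == incoming["CompanyStatus"]:
            return table  # no change: nothing to do
        # Close off the existing current row.
        current["valid_to"] = today
        current["is_current"] = False
    # Insert the new current row.
    table.append({**incoming, "valid_from": today, "valid_to": None, "is_current": True})
    return table
```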

Pull ETL script from GitLab
Looking at the values from key variables
It seems there are typos in the PostCode field
Distribution of Companies by Category
Identifying changes by comparing incoming data with saved data in the reference table
Change Data Capture implementation
Number of changes
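One simple way to identify and count the changes described above is to hash each record's tracked fields and compare the hashes of incoming rows against the rows saved in the reference table. A plain-Python sketch (the tracked field names are illustrative assumptions):

```python
import hashlib

# Illustrative: the fields whose changes we want to capture.
TRACKED_FIELDS = ["CompanyName", "CompanyStatus", "RegAddress.PostCode"]

def row_hash(row):
    """Hash the tracked fields so changed rows can be spotted cheaply."""
    payload = "|".join(str(row.get(f, "")) for f in TRACKED_FIELDS)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

def identify_changes(saved, incoming):
    """Classify incoming rows as new, changed or unchanged by company number."""
    saved_hashes = {r["CompanyNumber"]: row_hash(r) for r in saved}
    changes = {"new": [], "changed": [], "unchanged": []}
    for row in incoming:
        key = row["CompanyNumber"]
        if key not in saved_hashes:
            changes["new"].append(key)
        elif saved_hashes[key] != row_hash(row):
            changes["changed"].append(key)
        else:
            changes["unchanged"].append(key)
    return changes
```

The lengths of the three lists give the number of changes; the "changed" keys then feed the SCD Type 2 update of the reference table.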