In this article we discuss how to process Companies House data using Google Colab. In previous articles we demonstrated ETL processes using the Companies House monthly snapshot data, with most of the processing carried out on a local machine. The most recent monthly snapshot contains just over 5 million records. Our goal is to build an ETL pipeline that creates a reference table for storing basic company details.
Google Colab
Google Cloud Platform provides distributed processing and storage resources at an affordable price, and in many cases free of charge. For Python developers, Google provides Google Colab, a platform “to write and execute arbitrary python code through the browser, and is especially well suited to machine learning, data analysis and education. More technically, Colab is a hosted Jupyter notebook service that requires no setup to use, while providing access free of charge to computing resources including GPUs”.
Companies House monthly snapshot data
We have developed PySpark ETL scripts to store and process the Companies House monthly snapshot data. In this article, we share our experience of running those scripts on Google Colab.
Companies House offers a number of different data products, including:
- bulk files of company data in CSV format
- bulk files of accounts data in iXBRL and HTML formats
- people with significant control (PSC) data
- uniform resource identifiers (URI)
- an XML gateway search service
- an application program interface (API) search service
The monthly snapshot contains the company number, address, status, incorporation date, accounts and annual return filing dates, SIC codes, URI and basic information about mortgage charges. A full list of the data fields contained in the snapshot is available from Companies House as a PDF.
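To build the reference table we only need a subset of these fields. The sketch below reads a snapshot-style CSV and keeps selected columns; it uses the standard csv module on a tiny in-memory sample so it stays self-contained, and the column names are illustrative (the real header list is the one in the Companies House PDF).

```python
import csv
import io

# A tiny stand-in for the snapshot file; the real field names may
# differ slightly from these illustrative ones.
sample = io.StringIO(
    "CompanyNumber,CompanyName,CompanyStatus,IncorporationDate,SICCode\n"
    "00000001,EXAMPLE LTD,Active,1990-01-15,62012\n"
    "00000002,ANOTHER LTD,Liquidation,2001-06-30,70229\n"
)

# Keep only the columns needed for the basic company details table.
wanted = ["CompanyNumber", "CompanyName", "CompanyStatus", "IncorporationDate"]
rows = [{k: r[k] for k in wanted} for r in csv.DictReader(sample)]

for row in rows:
    print(row)
```

In the actual pipeline the same projection is a one-line `select` on a PySpark DataFrame; the logic is identical.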
Data validation pipeline
With the processing power at our disposal, it is now easier to understand the data provided by Companies House and to perform the following checks:
- whether the list of fields in the snapshot file matches what we expect
- whether there are any duplicate records
- the total count of records, including birth and death
- the number of null values in each column
- a quick sanity check on the incoming data to confirm that nothing out of the ordinary is happening
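The checks above can be sketched in a few lines. This version uses plain Python on an in-memory sample purely to keep the example self-contained; in our pipeline the equivalent checks run in PySpark. The expected field list here is hypothetical, not the real snapshot schema.

```python
import csv
import io
from collections import Counter

# Hypothetical expected schema; the real list comes from the snapshot PDF.
EXPECTED_FIELDS = ["CompanyNumber", "CompanyName", "CompanyStatus"]

sample = io.StringIO(
    "CompanyNumber,CompanyName,CompanyStatus\n"
    "00000001,EXAMPLE LTD,Active\n"
    "00000002,ANOTHER LTD,\n"
    "00000001,EXAMPLE LTD,Active\n"
)
reader = csv.DictReader(sample)

# Check 1: the field list is exactly what we expect.
assert reader.fieldnames == EXPECTED_FIELDS, "unexpected schema"

rows = list(reader)

# Check 2: duplicate company numbers.
dupes = [k for k, n in Counter(r["CompanyNumber"] for r in rows).items() if n > 1]

# Check 3: total record count.
total = len(rows)

# Check 4: null/empty values per column.
nulls = {f: sum(1 for r in rows if not r[f]) for f in EXPECTED_FIELDS}

print(total, dupes, nulls)
```

In PySpark the same checks map onto `df.columns`, `dropDuplicates`, `count`, and a null-count aggregation per column.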
Reference Table
We populated the reference table with an initial data load and then used Slowly Changing Dimension (SCD) Type 2 to manage current and historical data.
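SCD Type 2 preserves history: when an attribute changes, the current row is closed off (its end date set, its current flag cleared) and a new row is inserted. A minimal pure-Python sketch of that logic follows; the column names are illustrative, not the actual table schema.

```python
from datetime import date

# Reference table rows: each carries validity dates and a current flag.
table = [
    {"company_number": "00000001", "status": "Active",
     "valid_from": date(2024, 1, 1), "valid_to": None, "is_current": True},
]

def apply_scd2(table, company_number, new_status, load_date):
    """Close the current row and open a new one if the status changed."""
    for row in table:
        if row["company_number"] == company_number and row["is_current"]:
            if row["status"] == new_status:
                return  # no change, nothing to do
            row["valid_to"] = load_date
            row["is_current"] = False
            break
    # Insert the new current version (also covers brand-new companies).
    table.append({"company_number": company_number, "status": new_status,
                  "valid_from": load_date, "valid_to": None, "is_current": True})

# A later snapshot reports the company has gone into liquidation.
apply_scd2(table, "00000001", "Liquidation", date(2024, 6, 1))
for row in table:
    print(row)
```

After the update the table holds two rows for the company: the closed historical row and the new current one, so point-in-time queries remain possible.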