Category: ETL

  • Process Companies House data using Google Colab

In this article we will discuss how to process Companies House data using Google Colab. In previous articles we demonstrated ETL processes using Companies House monthly snapshot data, with most of the processing carried out on a local machine. The most recent monthly snapshot contains just over 5 million records. We want to build an ETL pipeline that creates a reference table storing basic company details.

    Google Colab

Google Cloud Platform provides distributed processing and storage resources at an affordable price, and in many cases free of charge. For Python developers, Google provides Google Colab, a platform “to write and execute arbitrary python code through the browser, and is especially well suited to machine learning, data analysis and education. More technically, Colab is a hosted Jupyter notebook service that requires no setup to use, while providing access free of charge to computing resources including GPUs”.

    Companies House monthly snapshot data

    We have developed PySpark ETL scripts to store and process Companies House monthly snapshot data. In this article, we will share our experience of running those scripts on Google Colab.

Companies House offers a number of different data products. The monthly snapshot contains the company number, address, status, incorporation date, accounts and annual return filing dates, SIC code, URI, and basic information about mortgage charges. A full list of the data fields contained in the snapshot can be found in a PDF published by Companies House.

    Data validation pipeline

With the processing power at our disposal, it is now easier to understand the data provided by Companies House and to perform the following checks:

    • whether the list of fields in the snapshot file is as expected
    • whether there are any duplicate values
    • the total count of records, including births and deaths
    • the number of null values in each column
    • a quick sanity check on incoming data to make sure nothing out of the ordinary is happening

    Reference Table

We populated the reference table with an initial data load and then use SCD Type 2 to manage current and historical data.

    • Pull the ETL script from GitLab
    • Looking at the values of key variables
    • It seems there are typos in PostCode
    • Distribution of companies by category
    • Identifying changes by comparing incoming data with saved data in the reference table
    • Change Data Capture implementation
    • Number of changes