Category: ETL

  • Process Companies House data using Google Colab

    In this article we will discuss how to process Companies House data using Google Colab. In previous articles we have demonstrated ETL processes using Companies House monthly snapshot data. Most of the processing was carried out on a local machine. The most recent monthly snapshot contains just over 5 million records. We are interested in building an ETL pipeline for…
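
    A minimal sketch of the starting point, assuming PySpark has been pip-installed in the Colab runtime and the snapshot has been downloaded as a single CSV (the file path and name below are illustrative):

      # Minimal sketch: load a Companies House monthly snapshot CSV in Google Colab.
      # Assumes pyspark is installed in the Colab runtime; adjust the path to the
      # snapshot file you actually download.
      from pyspark.sql import SparkSession

      spark = (SparkSession.builder
               .appName("companies-house-colab")
               .getOrCreate())

      # Header row and schema inference are assumptions about the snapshot file.
      df = (spark.read
            .option("header", True)
            .option("inferSchema", True)
            .csv("/content/BasicCompanyDataAsOneFile.csv"))

      print(df.count())   # roughly 5 million rows for a recent snapshot
      df.printSchema()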

  • Simple ETL pipeline using Apache Nifi

    In this article, we will compose a simple ETL pipeline using Apache NiFi. We want to read an Excel file and convert it to a CSV file using the Apache NiFi ConvertExcelToCSV processor. Prerequisites: install Apache NiFi (we used $ brew install nifi); basic knowledge of ETL concepts and data integration tools. Data Pipeline: Apache NiFi is…
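
    The article drives the conversion through NiFi's ConvertExcelToCSV processor inside the flow itself; purely as an illustration of the same transformation outside NiFi, a pandas equivalent (file names hypothetical) would look like:

      # Illustration only: the Excel-to-CSV conversion that the NiFi
      # ConvertExcelToCSV processor performs, done here with pandas.
      # File names are hypothetical; openpyxl is needed for .xlsx files.
      import pandas as pd

      df = pd.read_excel("input.xlsx", sheet_name=0)  # read the first sheet
      df.to_csv("output.csv", index=False)            # write without the index column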

  • Change data capture ETL pipelines

    In this article we will look into change data capture ETL pipelines. How do you implement change data capture (CDC)? We will use PySpark to implement CDC using data from Companies House. The ETL pipelines have been deployed and tested on Google Cloud Platform. Outline of ETL pipelines for change data capture: Code…
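
    A minimal sketch of the core CDC step in PySpark, assuming two snapshots are compared on a key column; the file paths and the CompanyNumber / CompanyStatus column names are illustrative, not the article's exact schema:

      # Classify rows as insert / delete / update / unchanged between two snapshots.
      from pyspark.sql import SparkSession, functions as F

      spark = SparkSession.builder.appName("cdc-sketch").getOrCreate()

      prev = spark.read.option("header", True).csv("snapshot_prev.csv") \
                  .withColumn("in_prev", F.lit(1))
      curr = spark.read.option("header", True).csv("snapshot_curr.csv") \
                  .withColumn("in_curr", F.lit(1))

      joined = prev.alias("p").join(curr.alias("c"), on="CompanyNumber", how="full_outer")

      changes = joined.withColumn(
          "change_type",
          F.when(F.col("c.in_curr").isNull(), "delete")      # present only in previous
           .when(F.col("p.in_prev").isNull(), "insert")      # present only in current
           .when(F.col("p.CompanyStatus") != F.col("c.CompanyStatus"), "update")
           .otherwise("unchanged"))

      changes.groupBy("change_type").count().show()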

  • Create CompaniesHouse index in Elasticsearch using PySpark

    We are using Spark 3.1.2 (spark._sc.version). Elasticsearch (7.9.3) is running in a Docker container with port 9200 exposed to the host. Prerequisites: get elasticsearch-spark-30_2.12-7.12.0.jar and add it to the Spark jar classpath; read Companies House data into a DataFrame; write the DataFrame to Elasticsearch. Code snippets are listed below.
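
    A minimal sketch of those steps, assuming Elasticsearch is reachable on localhost:9200 and the snapshot file name and index name are illustrative:

      # Write a Companies House DataFrame to an Elasticsearch index using the
      # elasticsearch-spark connector named in the prerequisites.
      from pyspark.sql import SparkSession

      spark = (SparkSession.builder
               .appName("companieshouse-to-es")
               .config("spark.jars", "elasticsearch-spark-30_2.12-7.12.0.jar")
               .getOrCreate())

      df = (spark.read
            .option("header", True)
            .csv("BasicCompanyDataAsOneFile.csv"))   # illustrative file name

      (df.write
         .format("org.elasticsearch.spark.sql")
         .option("es.nodes", "localhost")
         .option("es.port", "9200")
         .mode("append")
         .save("companieshouse"))                    # target index name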

  • Near duplicate detection using Locality Sensitive Hashing (LSH)

    Locality sensitive hashing (LSH) is a method for finding similar pairs in a large dataset. For a dataset of size N, the brute-force method of comparing every possible pair would take N!/(2!(N-2)!) = N(N-1)/2 ≈ N²/2, i.e. O(N²) time. The LSH method aims to cut this down to O(N) time. In this article we will show…
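
    A minimal sketch of the idea using Spark ML's MinHashLSH; the toy company names, tokenisation choices and distance threshold are illustrative assumptions, not the article's exact pipeline:

      # Near-duplicate candidate pairs via MinHash LSH over word-set vectors.
      from pyspark.sql import SparkSession
      from pyspark.ml.feature import Tokenizer, CountVectorizer, MinHashLSH

      spark = SparkSession.builder.appName("lsh-sketch").getOrCreate()

      df = spark.createDataFrame(
          [(1, "acme trading ltd"), (2, "acme trading limited"), (3, "widgets plc")],
          ["id", "name"])

      # Turn each name into a sparse binary term vector.
      tokens = Tokenizer(inputCol="name", outputCol="words").transform(df)
      cv = CountVectorizer(inputCol="words", outputCol="features", binary=True).fit(tokens)
      vectors = cv.transform(tokens)

      # Fit MinHash LSH and self-join to find candidate near-duplicate pairs
      # within a Jaccard distance of 0.6.
      lsh = MinHashLSH(inputCol="features", outputCol="hashes", numHashTables=5)
      model = lsh.fit(vectors)
      pairs = model.approxSimilarityJoin(vectors, vectors, 0.6, distCol="jaccard_dist")

      pairs.filter("datasetA.id < datasetB.id") \
           .select("datasetA.id", "datasetB.id", "jaccard_dist") \
           .show()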

  • Impute Standard Industrial Classification (SIC) 2003 from SIC 2007

    The United Kingdom Standard Industrial Classification of Economic Activities (SIC) is used to classify business establishments and other standard units by the type of economic activity in which they are engaged. The new version of these codes (SIC 2007) was adopted by the UK as from 1st January 2008. In this article, we will use…
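
    The imputation itself amounts to joining against a SIC 2007 to SIC 2003 correspondence table. A minimal sketch in PySpark follows; the two mapping rows shown are placeholders, not the real correspondence table:

      # Impute a SIC 2003 code from a SIC 2007 code via a lookup join.
      from pyspark.sql import SparkSession

      spark = SparkSession.builder.appName("sic-impute").getOrCreate()

      companies = spark.createDataFrame(
          [("00000001", "62012"), ("00000002", "47110")],
          ["CompanyNumber", "sic_2007"])

      # Hypothetical correspondence table; in practice this comes from the
      # published SIC 2007 / SIC 2003 mapping file.
      sic_map = spark.createDataFrame(
          [("62012", "7222"), ("47110", "5211")],
          ["sic_2007", "sic_2003"])

      imputed = companies.join(sic_map, on="sic_2007", how="left")
      imputed.show()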

  • Testing distributed ETL script

    Testing is an important part of the software development lifecycle. Identifying defects during development and fixing them before deploying code into production saves time and money. Most importantly, it gives the business assurance that the quality of the code is acceptable and fit for purpose. Good testing depends on agreed acceptance criteria…
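
    A minimal sketch of one way to unit-test a PySpark transformation with pytest and a local SparkSession; the transformation under test (add_status_flag) is a hypothetical example, not the article's actual ETL script:

      import pytest
      from pyspark.sql import SparkSession, functions as F


      def add_status_flag(df):
          """Hypothetical transformation: flag companies that are active."""
          return df.withColumn("is_active", F.col("CompanyStatus") == "Active")


      @pytest.fixture(scope="session")
      def spark():
          # Small local Spark session shared across the test session.
          return (SparkSession.builder
                  .master("local[2]")
                  .appName("etl-tests")
                  .getOrCreate())


      def test_add_status_flag(spark):
          df = spark.createDataFrame(
              [("00000001", "Active"), ("00000002", "Dissolved")],
              ["CompanyNumber", "CompanyStatus"])
          result = add_status_flag(df).collect()
          assert [row.is_active for row in result] == [True, False]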

  • Data validation checks

    When ingesting data into a target destination, it is important to make sure data from different sources conforms to business rules and does not become corrupted due to inconsistencies in type or context. The goal is to create data that is consistent, accurate and complete, so as to prevent data loss and errors during ingestion. Example of…
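
    A minimal sketch of a few such checks in PySpark; the column names, the pattern rule and the input path are illustrative assumptions, not the article's exact rules:

      # Simple completeness, uniqueness and format checks before ingestion.
      from pyspark.sql import SparkSession, functions as F

      spark = SparkSession.builder.appName("validation-sketch").getOrCreate()

      df = spark.read.option("header", True).csv("incoming_data.csv")  # illustrative path

      checks = {
          # completeness: the key column must never be null
          "null_company_number": df.filter(F.col("CompanyNumber").isNull()).count(),
          # uniqueness: no duplicate keys
          "duplicate_company_number":
              df.count() - df.select("CompanyNumber").distinct().count(),
          # format: an 8-character alphanumeric company number pattern
          "bad_company_number_format":
              df.filter(~F.col("CompanyNumber").rlike("^[A-Z0-9]{8}$")).count(),
      }

      failed = {name: n for name, n in checks.items() if n > 0}
      if failed:
          raise ValueError(f"Validation failed: {failed}")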

  • Capturing delta data changes between two datasets

    Suppose you want to capture delta data changes on key variables as part of the data received. In most cases the dataset will not provide a change history. You have to work out for yourself which variables were updated by comparing the values against the previously stored data. In this article, we will share an…
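
    A minimal sketch of that comparison in PySpark; the key column, the tracked variables and the file paths are illustrative assumptions:

      # Flag which tracked variables changed between the stored and the received data.
      from pyspark.sql import SparkSession, functions as F

      spark = SparkSession.builder.appName("delta-sketch").getOrCreate()

      previous = spark.read.option("header", True).csv("stored_data.csv")
      received = spark.read.option("header", True).csv("received_data.csv")

      key = "CompanyNumber"
      tracked = ["CompanyName", "RegAddress_PostCode", "CompanyStatus"]  # key variables

      joined = previous.alias("old").join(received.alias("new"), on=key, how="inner")

      # One boolean column per tracked variable, true when the value changed
      # (null-safe comparison so nulls do not mask genuine changes).
      for col in tracked:
          joined = joined.withColumn(
              f"{col}_changed",
              ~F.col(f"old.{col}").eqNullSafe(F.col(f"new.{col}")))

      changed_flags = [f"{col}_changed" for col in tracked]
      deltas = joined.filter(" OR ".join(changed_flags))
      deltas.select(key, *changed_flags).show()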