Category: Data Migration

  • Processing GLEIF data in JSON format

    UPDATE – 13/04/2024 – There is a workaround that involves updating the JSON file using Linux tools such as sed and awk. The file contains an array of JSON objects; the issue is that Apache Spark reads the whole contents instead of processing each JSON object. Problem: get the dataset and unzip it, i.e. after…
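
    A minimal PySpark sketch of the underlying fix, assuming the unzipped file is called gleif.json (a placeholder name): the multiLine option tells Spark to parse the file as one standard JSON array rather than line by line.

      from pyspark.sql import SparkSession

      spark = SparkSession.builder.appName("gleif-json").getOrCreate()

      # The file is a single JSON array, so Spark's default line-oriented
      # JSON reader sees one giant record. multiLine=True parses the whole
      # file as standard JSON instead; "gleif.json" is a placeholder path.
      df = spark.read.option("multiLine", True).json("gleif.json")
      df.printSchema()
      print(df.count())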

  • Process Companies House data using Google Colab

    In this article we will discuss how to process Companies House data using Google Colab. In previous articles we have demonstrated ETL processes using Companies House monthly snapshot data. Most of the processing was carried out on a local machine. The most recent monthly snapshot contains just over 5 million records. We are interested in building an ETL pipeline for…
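
    A rough sketch of loading the snapshot in a Colab notebook, assuming the URL below still points at a current file; pandas can read a zipped CSV directly from a URL.

      import pandas as pd

      # Snapshot URL taken from the Companies House download page; swap in
      # the latest month's file. dtype=str avoids mixed-type warnings on
      # the ~5 million rows.
      url = "http://download.companieshouse.gov.uk/BasicCompanyDataAsOneFile-2022-01-01.zip"
      df = pd.read_csv(url, compression="zip", dtype=str)
      print(len(df), "records")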

  • Simple ETL pipeline using Apache NiFi

    In this article, we will compose a simple ETL pipeline using Apache NiFi. We want to read an Excel file and convert it to a CSV file using the Apache NiFi ConvertExcelToCSV processor. Prerequisites: install Apache NiFi (we used $ brew install nifi); basic knowledge of ETL concepts and data integration tools. Data Pipeline: Apache NiFi is…
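
    For comparison, the same Excel-to-CSV conversion the ConvertExcelToCSV processor performs can be sketched in a few lines of plain Python with pandas (file names are placeholders):

      import pandas as pd

      # Plain-Python equivalent of NiFi's ConvertExcelToCSV processor,
      # handy for sanity-checking the flow's output. Requires openpyxl.
      df = pd.read_excel("input.xlsx")
      df.to_csv("output.csv", index=False)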

  • Change data capture ETL pipelines

    In this article we will look into change data capture (CDC) ETL pipelines. How do you implement change data capture? We will use PySpark to implement CDC using data from Companies House. The ETL pipelines have been deployed and tested on Google Cloud Platform. Outline of the ETL pipelines for change data capture: Code…
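
    A minimal PySpark sketch of the idea, assuming two consecutive snapshots keyed on a CompanyNumber column (all column names and paths are assumptions): a full outer join classifies each record as an insert, delete or update.

      from pyspark.sql import SparkSession, functions as F

      spark = SparkSession.builder.appName("cdc-sketch").getOrCreate()

      # Placeholder paths for two consecutive snapshots.
      old = spark.read.csv("snapshot_old.csv", header=True)
      new = spark.read.csv("snapshot_new.csv", header=True)

      # Rows only in the new snapshot are inserts, rows only in the old one
      # are deletes, and rows present in both but differing are updates (in
      # practice compare a hash of all columns, not a single field).
      joined = old.alias("o").join(new.alias("n"), on="CompanyNumber", how="full_outer")
      changes = joined.withColumn(
          "change_type",
          F.when(F.col("o.CompanyName").isNull(), F.lit("insert"))
           .when(F.col("n.CompanyName").isNull(), F.lit("delete"))
           .when(F.col("o.CompanyName") != F.col("n.CompanyName"), F.lit("update")))
      changes.filter(F.col("change_type").isNotNull()).show()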

  • Using AWK to process Hive table metadata and Hadoop file listing output

    Following on from a previous article on Companies House data, in this article we will show how to extract metadata from Hive tables and the Hadoop filesystem using commands like hdfs, awk and grep. Commands: SQL – show table extended from {database} like {tablename}; HADOOP – hdfs dfs -ls -R {dirname} | grep {searchstr} |…
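
    The same pipeline restated in Python, for readers who prefer it over awk; the directory name and search string below are placeholders.

      import subprocess

      # Python restatement of: hdfs dfs -ls -R {dirname} | grep {searchstr} | ...
      dirname, searchstr = "/data/companieshouse", "parquet"
      out = subprocess.run(["hdfs", "dfs", "-ls", "-R", dirname],
                           capture_output=True, text=True, check=True).stdout
      for line in out.splitlines():
          if searchstr in line:
              fields = line.split()
              # -ls fields: permissions, replication, owner, group, size,
              # date, time, path
              print(fields[4], fields[-1])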

  • Working with Companies House snapshot data

    Get a snapshot of the latest live (excluding dissolved companies) basic company data from http://download.companieshouse.gov.uk/en_output.html. The latest file (http://download.companieshouse.gov.uk/BasicCompanyDataAsOneFile-2022-01-01.zip) contains over 5 million records. It’s a CSV file; however, there are records (22 companies) that contain commas and quotes in the data. You might have to do extra work to parse those records correctly. It is…
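
    As a quick check, Python's csv module honours standard CSV quoting, so the awkward records show up as rows whose column count differs from the header (the file name below assumes the unzipped snapshot):

      import csv

      with open("BasicCompanyDataAsOneFile-2022-01-01.csv", newline="",
                encoding="utf-8") as f:
          reader = csv.reader(f)
          header = next(reader)
          for row in reader:
              if len(row) != len(header):  # flag records that mis-parse
                  print(row[0], len(row))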

  • Create CompaniesHouse index in Elasticsearch using PySpark

    We are using Spark 3.1.2 (spark._sc.version). Elasticsearch (7.9.3) is running in a Docker container with port 9200 exposed to the host. Prerequisites: get elasticsearch-spark-30_2.12-7.12.0.jar and add it to the spark-jar classpath; read the Companies House data into a dataframe; write the dataframe to Elasticsearch. Code snippets are listed below.
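
    A sketch of the write step, assuming the connector jar is already on the classpath; the CSV path and index name are placeholders.

      from pyspark.sql import SparkSession

      spark = SparkSession.builder.appName("es-index").getOrCreate()

      # Placeholder path for the Companies House CSV.
      df = spark.read.csv("companieshouse.csv", header=True)

      # elasticsearch-spark connector options; "companieshouse" is an
      # assumed index name.
      (df.write
         .format("org.elasticsearch.spark.sql")
         .option("es.nodes", "localhost")
         .option("es.port", "9200")
         .mode("overwrite")
         .save("companieshouse"))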

  • Near-duplicate detection using Locality Sensitive Hashing (LSH)

    Locality sensitive hashing (LSH) is a method for finding similar pairs in a large dataset. For a dataset of size N, the brute-force method of comparing every possible pair would take N!/(2!(N-2)!) ≈ N²/2 = O(N²) time. LSH aims to cut this down to O(N) time. In this article we will show…
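
    A toy MinHash-with-banding sketch of the idea (all numbers and documents are made up): documents whose signatures agree in any band land in the same bucket and become candidate pairs, so only those pairs need a full comparison.

      import random

      random.seed(0)
      N_HASH, BANDS, ROWS = 16, 8, 2       # 16 minhashes = 8 bands x 2 rows
      PRIME = 2**31 - 1
      params = [(random.randrange(1, PRIME), random.randrange(PRIME))
                for _ in range(N_HASH)]

      def minhash(shingles):
          # one minimum per hash function, taken over the document's shingles
          return [min((a * hash(s) + b) % PRIME for s in shingles)
                  for a, b in params]

      docs = {"d1": {"the", "quick", "brown", "fox"},
              "d2": {"the", "quick", "brown", "cat"},
              "d3": {"lorem", "ipsum", "dolor", "sit"}}

      buckets = {}
      for name, shingles in docs.items():
          sig = minhash(shingles)
          for band in range(BANDS):
              key = (band, tuple(sig[band * ROWS:(band + 1) * ROWS]))
              buckets.setdefault(key, []).append(name)

      # any bucket holding more than one document yields candidate pairs
      candidates = {tuple(sorted(v)) for v in buckets.values() if len(v) > 1}
      print(candidates)   # d1/d2 should collide in some band; d3 should not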

  • Impute Standard Industrial Classification (SIC) 2003 from SIC 2007

    The United Kingdom Standard Industrial Classification of Economic Activities (SIC) is used to classify business establishments and other standard units by the type of economic activity in which they are engaged. The new version of these codes (SIC 2007) was adopted by the UK from 1 January 2008. In this article, we will use…
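
    Illustrative only: sic_2007_to_2003.csv is a hypothetical lookup table (columns sic2007, sic2003) built from the published correlation tables; since the real mapping is many-to-many, a merge keeps every candidate SIC 2003 code per SIC 2007 code.

      import pandas as pd

      # Hypothetical lookup table derived from the correlation tables.
      mapping = pd.read_csv("sic_2007_to_2003.csv", dtype=str)

      # Toy input frame; real data would come from the company records.
      companies = pd.DataFrame({"company": ["A", "B"],
                                "sic2007": ["62012", "47110"]})
      imputed = companies.merge(mapping, on="sic2007", how="left")
      print(imputed)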

  • Test Case

    A test case might be created, either manually or automated, to verify the functionality against the original acceptance criteria. High-Level Steps – Pre-Requisites: check connection details of source and target systems; user access to various tools, access rights, etc. Checks: check that the ETL script name(s) align with naming conventions; tables/files are present in the landing zone…
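
    A sketch of automating a couple of the checks above with pytest; the helpers are placeholders standing in for real source/target connections and a real naming convention.

      # run with: pytest test_etl_checks.py
      def landing_zone_files():
          # placeholder for a real listing of the landing zone
          return ["companies_2024-04.csv"]

      def source_row_count():
          return 5_000_000   # stub: count from the source system

      def target_row_count():
          return 5_000_000   # stub: count from the target system

      def test_files_follow_naming_convention():
          # assumed convention: {dataset}_{YYYY-MM}.csv
          for name in landing_zone_files():
              assert name.endswith(".csv") and "_" in name

      def test_source_and_target_row_counts_match():
          # source-vs-target reconciliation after the ETL run
          assert source_row_count() == target_row_count()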