Tag: PySpark

  • Saving data into BigQuery

    In this article, we will show how to export data to BigQuery. You may use Databricks or Google Colab to write a PySpark ETL script for saving data into BigQuery. We will use Databricks and Google Cloud Platform. Prerequisites: set up an account to use Databricks Community Edition; Google Cloud Platform –…
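    A minimal sketch of what the BigQuery write could look like with the spark-bigquery connector (the project, dataset, table and bucket names below are placeholders, not taken from the article):

      # Assumes the spark-bigquery connector is available on the cluster, e.g. the
      # com.google.cloud.spark:spark-bigquery-with-dependencies package.
      from pyspark.sql import SparkSession

      spark = SparkSession.builder.appName("export-to-bigquery").getOrCreate()

      df = spark.read.csv("/path/to/companies_house.csv", header=True, inferSchema=True)

      (df.write
         .format("bigquery")
         .option("table", "my_project.my_dataset.companies")  # placeholder table name
         .option("temporaryGcsBucket", "my-temp-gcs-bucket")  # staging bucket for the write
         .mode("overwrite")
         .save())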

  • Handling errors and warnings in PySpark

    In this article, we will describe how errors and warnings in PySpark can be handled using DataFrames. We will use Companies House data to implement methods for handling errors and warnings. The basic ask is to write out errors and warnings to a Hive table, or to an error or warning file. Requirements…
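    One way this could be approached (a sketch under assumed column and table names, not the article's full implementation): tag each row with a status, split the DataFrame, and write the error and warning rows out to a Hive table or a file.

      from pyspark.sql import SparkSession, functions as F

      spark = SparkSession.builder.enableHiveSupport().getOrCreate()

      df = spark.table("companies_house.accounts")  # hypothetical source table

      # Classify each row; the validation rules here are illustrative only.
      checked = df.withColumn(
          "status",
          F.when(F.col("company_number").isNull(), F.lit("error"))
           .when(F.length("company_name") < 2, F.lit("warning"))
           .otherwise(F.lit("ok")),
      )

      checked.filter(F.col("status") == "error") \
             .write.mode("append").saveAsTable("etl_audit.errors")    # Hive table
      checked.filter(F.col("status") == "warning") \
             .write.mode("append").csv("/data/etl_audit/warnings")    # warning file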

  • Generating synthetic data using HIVE SQL

    UPDATE – 07/10/2022: using sparkContext.range(). You can use the sparkContext.range() function to generate rows and then use withColumn to add variables to the dataframe. It generates a column named ‘id’; drop it if you do not need it using df.drop(‘id’). In this article, we will show a way of generating synthetic data using HIVE SQL. You can…
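    A minimal sketch of the row-generation idea, using the DataFrame-level spark.range() (which is what produces the ‘id’ column); the added columns are made-up examples, not from the article:

      from pyspark.sql import SparkSession, functions as F

      spark = SparkSession.builder.getOrCreate()

      # spark.range() returns a DataFrame with a single 'id' column.
      df = (spark.range(1000)
              .withColumn("company_number", F.concat(F.lit("SC"), F.col("id").cast("string")))
              .withColumn("turnover", (F.rand(seed=42) * 1000000).cast("int"))
              .drop("id"))  # drop 'id' if you do not need it

      df.show(5)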

  • Process Companies House data using Google Colab

    In this article we will discuss how to process Companies House data using Google Colab. In previous articles we have demonstrated ETL processes using Companies House monthly snapshot data. Most of the processing was carried out on a local machine. The recent monthly snapshot contains just over 5 million records. We are interested in building an ETL pipeline for…
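    A small setup sketch for running PySpark in a Colab notebook and loading the snapshot (the file path and memory setting are illustrative assumptions):

      # In a Colab cell, install PySpark first:  !pip install pyspark
      from pyspark.sql import SparkSession

      spark = (SparkSession.builder
                 .appName("companies-house-colab")
                 .config("spark.driver.memory", "8g")  # single-node Colab runtime
                 .getOrCreate())

      df = spark.read.csv("/content/BasicCompanyDataAsOneFile-2022-01-01.csv",
                          header=True, inferSchema=True)
      print(df.count())  # just over 5 million records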

  • Merge multiple rows sharing id into one row

    UPDATE – 07/07/2022: it can be achieved using a few lines of PySpark code. See below. In this article, we will show how to merge multiple rows sharing an id into one row using PySpark. We will use the Companies House dataset for this article. You may find previous articles about how to get Companies House data…
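    A sketch of the kind of approach this refers to (not necessarily the article's own code): group by the shared id and collect the remaining values into a single delimited column. Column names here are illustrative.

      from pyspark.sql import SparkSession, functions as F

      spark = SparkSession.builder.getOrCreate()

      df = spark.createDataFrame(
          [("001", "director"), ("001", "secretary"), ("002", "director")],
          ["company_number", "officer_role"],
      )

      merged = (df.groupBy("company_number")
                  .agg(F.concat_ws("; ", F.collect_list("officer_role")).alias("officer_roles")))

      merged.show(truncate=False)
      # e.g. 001 -> "director; secretary", 002 -> "director"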

  • Locality Sensitive Hashing for finding similar Company names

    In this article we will use Locality Sensitive Hashing for finding similar company names and will use data from Companies House as mentioned in the previous article. We will use a PySpark pipeline to streamline the process of finding similar company names. Background: fuzzy/approximate matching of two strings means calculating how similar two strings are, and one…
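    A sketch of one such pipeline using character trigrams, CountVectorizer and MinHashLSH from pyspark.ml (the stages and parameter values are illustrative assumptions, not the article's):

      from pyspark.sql import SparkSession
      from pyspark.ml import Pipeline
      from pyspark.ml.feature import RegexTokenizer, NGram, CountVectorizer, MinHashLSH

      spark = SparkSession.builder.getOrCreate()

      names = spark.createDataFrame(
          [(1, "ACME TRADING LIMITED"), (2, "ACME TRADING LTD"), (3, "GLOBEX CORP")],
          ["id", "company_name"],
      )

      pipeline = Pipeline(stages=[
          RegexTokenizer(inputCol="company_name", outputCol="chars", pattern=".", gaps=False),  # characters
          NGram(n=3, inputCol="chars", outputCol="ngrams"),                                     # char trigrams
          CountVectorizer(inputCol="ngrams", outputCol="features", binary=True),
          MinHashLSH(inputCol="features", outputCol="hashes", numHashTables=5),
      ])

      model = pipeline.fit(names)
      hashed = model.transform(names)

      # Self-join: keep pairs whose approximate Jaccard distance is below 0.6.
      pairs = model.stages[-1].approxSimilarityJoin(hashed, hashed, 0.6, distCol="jaccard_dist")
      pairs.filter("datasetA.id < datasetB.id").show(truncate=False)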

  • Change data capture ETL pipelines

    In this article we will look into change data capture ETL pipelines. How do you implement change data capture (CDC)? We will use PySpark to implement CDC using data from Companies House. The ETL pipelines have been deployed and tested on Google Cloud Platform. Outline of ETL pipelines for change data capture: Code…
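    A sketch of one common snapshot-comparison approach to CDC (an assumed outline, not necessarily the article's pipeline): full-outer-join the previous and current snapshots on the key, then label each row as insert, delete, update or unchanged. Paths and column names are placeholders.

      from pyspark.sql import SparkSession, functions as F

      spark = SparkSession.builder.getOrCreate()

      prev = spark.read.parquet("/data/companies/snapshot_2021_12")  # placeholder paths
      curr = spark.read.parquet("/data/companies/snapshot_2022_01")

      joined = curr.alias("c").join(
          prev.alias("p"),
          F.col("c.company_number") == F.col("p.company_number"),
          "full_outer",
      )

      changes = joined.withColumn(
          "change_type",
          F.when(F.col("p.company_number").isNull(), F.lit("insert"))
           .when(F.col("c.company_number").isNull(), F.lit("delete"))
           .when(F.col("c.company_name") != F.col("p.company_name"), F.lit("update"))
           .otherwise(F.lit("unchanged")),
      )

      changes.groupBy("change_type").count().show()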

  • Using AWK to process Hive tables metadata and Hadoop file listing output

    Following on from the previous article related to Companies House data, in this article we will show how to extract metadata from Hive tables and the Hadoop filesystem using commands like hdfs, awk and grep. Commands: SQL – show table extended from {database} like {tablename}; HADOOP – hdfs dfs -ls -R {dirname} | grep {searchstr} |…

  • Working with Companies House snapshot data

    Get a snapshot of the latest live (excluding dissolved companies) basic company data from http://download.companieshouse.gov.uk/en_output.html. The latest file (http://download.companieshouse.gov.uk/BasicCompanyDataAsOneFile-2022-01-01.zip) contains over 5 million records. It’s a CSV file; however, there are records (22 companies) which contain commas and quotes in the data. You might have to do extra work to parse those records correctly. It is…
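    One way that extra parsing work might look in PySpark (a sketch; the quote/escape option values are the usual CSV-source settings, shown explicitly, and the path is a placeholder):

      from pyspark.sql import SparkSession

      spark = SparkSession.builder.getOrCreate()

      df = (spark.read
              .option("header", True)
              .option("quote", '"')       # fields containing commas are wrapped in double quotes
              .option("escape", '"')      # embedded quotes appear as doubled quotes
              .option("multiLine", True)  # in case a quoted field spans lines
              .csv("/data/BasicCompanyDataAsOneFile-2022-01-01.csv"))

      print(df.count())  # should be just over 5 million rows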

  • Near duplicate detection using Locality Sensitivity Hashing (LSH)

    Locality sensitive hashing (LSH) is a method for finding similar pairs in a large dataset. For a dataset of size N, the brute force method of comparing every possible pair would take N!/(2!(N-2)!) ~ N²/2 = O(N²) time. The LSH method aims to cut this down to O(N) time. In this article we will show…
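    To make the scale concrete: N!/(2!(N-2)!) = N(N-1)/2, so for the roughly 5 million records in a Companies House snapshot mentioned elsewhere on this page, brute force would need about 1.25 × 10¹³ comparisons, while an O(N) approach stays on the order of millions of operations.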