Tag: PySpark
-
Hive array operations
In this article we describe how to add and remove items from an array in Hive using PySpark. We are going to use two array functions to accomplish the given task. UPDATE – the output might contain duplicate rows; use groupBy with agg to deal with multiple columns. You may need to use the following function: Ref:…
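A minimal sketch of the kind of array manipulation the article covers, using Spark's built-in array functions; the column names (enterprise, units) and the sample values are assumptions, not the article's data:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("hive-array-ops").getOrCreate()

# Hypothetical sample data: one row per enterprise with an array<string> of linked units
df = spark.createDataFrame(
    [("E1", ["paye1", "vat1"]), ("E1", ["crn1"]), ("E2", ["paye2"])],
    ["enterprise", "units"],
)

# Add an item to the array and drop any duplicates introduced
added = df.withColumn(
    "units", F.array_distinct(F.array_union("units", F.array(F.lit("vat9"))))
)

# Remove an item from the array
removed = added.withColumn("units", F.array_remove("units", "vat9"))

# The output may contain duplicate rows per key, so group and re-aggregate the arrays
deduped = (
    removed.groupBy("enterprise")
    .agg(F.array_distinct(F.flatten(F.collect_list("units"))).alias("units"))
)
deduped.show(truncate=False)
```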
-
Add or remove items from array using PySpark
In this article, we will use HIVE and PySpark to manipulate a complex datatype, i.e. array&lt;string&gt;. We show how to add or remove items from an array using PySpark. We will use datasets consisting of three units, representing paye, crn and vat units. For sample data see – https://broadoakdata.uk/synthetic-data-creation-linking-records/ We need to link and unlink a few units…
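A short sketch of linking and unlinking units held in an array&lt;string&gt; column, this time through Spark SQL; the view name enterprise_units, the key ern and the unit reference values are assumptions for illustration:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("link-unlink-units").getOrCreate()

# Hypothetical units table: an enterprise with an array<string> of paye/crn/vat units
spark.createDataFrame(
    [("E1", ["paye001", "crn001"]), ("E2", ["vat002"])],
    ["ern", "units"],
).createOrReplaceTempView("enterprise_units")

# Link a unit: append it to the array if it is not already present
linked = spark.sql("""
    SELECT ern,
           array_union(units, array('vat001')) AS units
    FROM enterprise_units
""")
linked.createOrReplaceTempView("linked")

# Unlink a unit: remove it from the array
spark.sql("""
    SELECT ern,
           array_except(units, array('crn001')) AS units
    FROM linked
""").show(truncate=False)
```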
-
Search for a dissolved company
We will show how to search for a dissolved company using the Companies House Public API, using Databricks to call the API and then creating a dataframe using PySpark. Steps: see How to get Companies House data using REST API. Code snippets
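A hedged sketch of the general approach, assuming the Companies House company-search endpoint and basic-auth API key scheme; the search term, the filtering on company_status and the selected fields are assumptions rather than the article's exact code:

```python
import requests
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ch-dissolved-search").getOrCreate()

API_KEY = "YOUR_COMPANIES_HOUSE_API_KEY"  # obtained from the Companies House developer hub
BASE_URL = "https://api.company-information.service.gov.uk"

# Search by company name; the API key is sent as the basic-auth username with a blank password
resp = requests.get(
    f"{BASE_URL}/search/companies",
    params={"q": "example trading ltd"},
    auth=(API_KEY, ""),
    timeout=30,
)
resp.raise_for_status()
items = resp.json().get("items", [])

# Keep only dissolved companies and build a small dataframe from the results
dissolved = [
    (i.get("company_number"), i.get("title"), i.get("company_status"), i.get("date_of_cessation"))
    for i in items
    if i.get("company_status") == "dissolved"
]
df = spark.createDataFrame(
    dissolved, ["company_number", "title", "company_status", "date_of_cessation"]
)
df.show(truncate=False)
```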
-
Synthetic data creation and linking records
In this article we will show how to generate synthetic data and link records on a particular field. Approach: we will use three dataframes and will link records by matching on a variable. See the previous article https://broadoakdata.uk/generating-synthetic-data-using-hive-sql/ for reference. Code snippets: Generating synthetic data using PostgreSQL
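A minimal sketch of linking three dataframes on a shared field; the frame names (paye, crn, vat), the linking variable name and the sample rows are assumptions used only to illustrate the join:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("link-records").getOrCreate()

# Hypothetical synthetic frames for paye, crn and vat units, each carrying a common linking field
paye = spark.createDataFrame([("paye001", "ACME LTD"), ("paye002", "BETA LTD")], ["payeref", "name"])
crn = spark.createDataFrame([("crn001", "ACME LTD")], ["crn", "name"])
vat = spark.createDataFrame([("vat001", "BETA LTD")], ["vatref", "name"])

# Link the three unit types by matching on the shared variable (here: name)
linked = (
    paye.join(crn, on="name", how="left")
        .join(vat, on="name", how="left")
)
linked.show(truncate=False)
```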
-
Processing UK Companies House PSC Data
We will look into the People with significant control (PSC) snapshot data, which can be downloaded from the Companies House website. The snapshot data file is provided in JSON format and can be downloaded as a single file or as multiple files for ease of downloading. We will use the single file in Google Colab to carry out…
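A small sketch of loading the snapshot in Colab, assuming the file has been downloaded and unzipped locally and contains one JSON document per line; the path and the nested field name used in the aggregation are assumptions to check against the actual schema:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("psc-snapshot").getOrCreate()

# Hypothetical local path after downloading and unzipping the single-file snapshot in Colab
psc_path = "/content/persons-with-significant-control-snapshot.txt"

# spark.read.json handles one JSON document per line by default
psc = spark.read.json(psc_path)
psc.printSchema()

# Field names are assumptions about the snapshot layout; adjust after inspecting the schema
psc.groupBy("data.kind").count().show(truncate=False)
```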
-
Synthetic data creation using mimesis and PySpark
In the previous article we described how to generate synthetic data using HIVE SQL. However, in this article we will use the Python package mimesis together with PySpark to generate synthetic data. Prerequisite for generating synthetic data: create a dataframe with 10 rows. For example, we will create a dataframe with 10 rows. See code snippets below: Synthetic data…
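A minimal sketch of the idea, assuming the mimesis Person and Address providers; the column names and the ten-row size follow the excerpt, but the exact fields generated are assumptions:

```python
from mimesis import Person, Address  # pip install mimesis
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("mimesis-synthetic").getOrCreate()

person = Person()
address = Address()

# Generate 10 synthetic rows with the mimesis providers, then build a Spark dataframe
rows = [
    (i, person.full_name(), person.email(), address.city())
    for i in range(1, 11)
]
df = spark.createDataFrame(rows, ["id", "full_name", "email", "city"])
df.show(truncate=False)
```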
-
Processing Charity Data using Google Colab
Data related to registered charities in England and Wales can be downloaded from https://register-of-charities.charitycommission.gov.uk/register/full-register-download. Charity data and running PySpark under Google Colab: we will use Google Colab to download publicly available data from the Charity Commission website. Transformed or enriched data will be saved in Google BigQuery. Please read the data definition before ingesting and carrying out exploratory…
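A hedged sketch of the read-then-write flow; the extract file name, the read options, the spark-bigquery connector version, and the project/dataset/bucket names are all placeholders or assumptions to adapt to your environment:

```python
from pyspark.sql import SparkSession

# The spark-bigquery connector package and version are assumptions; adjust to your cluster
spark = (
    SparkSession.builder.appName("charity-data")
    .config("spark.jars.packages",
            "com.google.cloud.spark:spark-bigquery-with-dependencies_2.12:0.36.1")
    .getOrCreate()
)

# Hypothetical path to one of the unzipped register extract files downloaded in Colab;
# the separator and header options are assumptions about the extract format
charity = (
    spark.read.option("sep", "\t").option("header", True)
    .csv("/content/publicextract.charity.txt")
)
charity.printSchema()

# Write the transformed data to BigQuery (project, dataset, table and bucket are placeholders)
(
    charity.write.format("bigquery")
    .option("table", "my_project.charity_dataset.charity")
    .option("temporaryGcsBucket", "my-temp-bucket")
    .mode("overwrite")
    .save()
)
```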
-
Reading GLEIF data in XML format and storing to BigQuery
In the previous article – Processing GLEIF data in JSON format – we described how to ingest data into a Databricks (community edition) data lake using PySpark. However, we were unable to process the GLEIF Golden Copy JSON format file due to memory issues and the complexity of nested JSON objects. The input file size (after unzipping the file size was…
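A hedged sketch of reading the XML with the spark-xml package; the package version, the input path and the rowTag element name are assumptions to verify against the actual GLEIF Golden Copy file:

```python
from pyspark.sql import SparkSession

# The spark-xml package/version and the rowTag below are assumptions; inspect the XML
# file to confirm the element that wraps each LEI record
spark = (
    SparkSession.builder.appName("gleif-xml")
    .config("spark.jars.packages", "com.databricks:spark-xml_2.12:0.17.0")
    .getOrCreate()
)

lei = (
    spark.read.format("xml")
    .option("rowTag", "lei:LEIRecord")
    .load("/path/to/gleif-goldencopy-lei2.xml")
)
lei.printSchema()
lei.show(5, truncate=False)
```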
-
Processing GLEIF data in JSON format
UPDATE – 13/04/2024 – There is a workaround which involves updating the JSON file using Linux tools such as sed and awk. The file contains an array of JSON objects. The issue is that Apache Spark reads the whole contents instead of processing each JSON object. Problem: get the dataset and unzip it, i.e. after…
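One possible shape of that workaround, sketched from Python: rewrite the array into newline-delimited JSON and then let Spark read it line by line. The sed expressions assume each JSON object sits on its own line inside the array, and the paths are placeholders; the actual file layout may need different expressions:

```python
import subprocess
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("gleif-json").getOrCreate()

src = "/path/to/gleif-goldencopy-lei2.json"    # large file containing a single JSON array
dst = "/path/to/gleif-goldencopy-lei2.ndjson"  # one JSON object per line after the rewrite

# Strip the opening '[' and closing ']' and drop the trailing comma after each object
# (an assumption about the file layout; adjust the expressions to the real file)
cmd = "sed -e 's/^\\[//' -e 's/\\]$//' -e 's/},$/}/' " + src + " > " + dst
subprocess.run(cmd, shell=True, check=True)

# Spark can now stream the file line by line instead of parsing one huge document
records = spark.read.json(dst)
records.printSchema()
```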
-
Change Data Capture Implementation using PySpark
In this article, we will describe an approach for Change Data Capture implementation using PySpark. All the functions are included in the example together with test data. This is an updated version of Change data capture ETL pipelines. UPDATE – 10/06/2023 – using HIVE SQL to implement find_changes will take less time than processing the dataframe using PySpark.…
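A minimal sketch of one common way to implement a find_changes step (not necessarily the article's version): hash the non-key columns of the previous and current snapshots, full-outer join on the key, and classify each row. The key name, sample data and hash separator are assumptions:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("cdc-sketch").getOrCreate()

# Hypothetical previous and current snapshots keyed on "id" (not the article's test data)
previous = spark.createDataFrame([(1, "alice", 100), (2, "bob", 200)], ["id", "name", "balance"])
current = spark.createDataFrame([(2, "bob", 250), (3, "carol", 300)], ["id", "name", "balance"])

def find_changes(prev, curr, key):
    """Classify rows as insert, update, delete or nochange by comparing row hashes."""
    non_key = [c for c in curr.columns if c != key]
    prev_h = prev.withColumn(
        "prev_hash", F.sha2(F.concat_ws("||", *non_key), 256)
    ).select(key, "prev_hash")
    curr_h = curr.withColumn("curr_hash", F.sha2(F.concat_ws("||", *non_key), 256))
    joined = curr_h.join(prev_h, on=key, how="full_outer")
    return joined.withColumn(
        "change_type",
        F.when(F.col("prev_hash").isNull(), "insert")
         .when(F.col("curr_hash").isNull(), "delete")
         .when(F.col("prev_hash") != F.col("curr_hash"), "update")
         .otherwise("nochange"),
    )

find_changes(previous, current, "id").select("id", "change_type").show()
```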