Tag: PySpark
-
Impute Standard Industrial Classification (SIC) 2003 from SIC 2007
The United Kingdom Standard Industrial Classification of Economic Activities (SIC) is used to classify business establishments and other standard units by the type of economic activity in which they are engaged. The new version of these codes (SIC 2007) was adopted by the UK from 1 January 2008. In this article, we will use…
-
Unit Test
Test description – check that data changes are captured correctly
Test Data (SQL or PySpark code) – test_companieshouse_data.py
Expected Result – capture the relevant changes in source data by comparing with the previous dataset
Actual Result – verify the test result
Test Result – output after running the test script (may be a screenshot)
-
Sprint Test Plan
A sprint test plan may contain the following information: Title, e.g. Extracting Companies House data and loading it to a master table (MT load) Introduction – Sprint x will consist of backlog items that need to be added as well as any enhancements needed. We will begin to look at the requirements and start test…
-
Testing distributed ETL script
Testing is an important part of the software development lifecycle. Identifying defects during development and fixing them before deploying code into production saves time and money. Most importantly, it gives the business assurance that the quality of the code is acceptable and that it is fit for purpose. Good testing depends on agreed acceptance criteria…
-
Data validation checks
When ingesting data into a target destination, it is important to make sure that data from different sources conforms to business rules and does not become corrupted due to inconsistencies in type or context. The goal is to create data that is consistent, accurate and complete, so as to prevent data loss and errors during ingestion. Example of…
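As a rough illustration of rule-based validation before ingestion, the sketch below checks each row against three rules. It is written in plain Python rather than PySpark so it runs anywhere. The column names (`company_name`, `company_number`, `incorporation_date`) and the rules themselves are assumptions made for this example and are not taken from the article.

```python
import re
from datetime import datetime
from typing import Dict, List


def validate_row(row: Dict[str, str]) -> List[str]:
    """Return a list of rule violations for one row (empty list = valid)."""
    errors = []

    # Completeness: the name must be present and not just whitespace.
    if not (row.get("company_name") or "").strip():
        errors.append("company_name is missing")

    # Format (assumed rule): the number is 8 upper-case alphanumeric characters.
    if not re.fullmatch(r"[A-Z0-9]{8}", row.get("company_number") or ""):
        errors.append("company_number is not 8 alphanumeric characters")

    # Type (assumed rule): the date must parse as DD/MM/YYYY.
    try:
        datetime.strptime(row.get("incorporation_date") or "", "%d/%m/%Y")
    except ValueError:
        errors.append("incorporation_date is not DD/MM/YYYY")

    return errors


# Hypothetical sample rows, one valid and one breaking every rule.
good_row = {
    "company_name": "EXAMPLE LTD",
    "company_number": "00000001",
    "incorporation_date": "01/01/2008",
}
bad_row = {
    "company_name": "  ",
    "company_number": "123",
    "incorporation_date": "2008-01-01",
}

good_errors = validate_row(good_row)
bad_errors = validate_row(bad_row)
```

Rows with a non-empty error list could be diverted to a reject table rather than loaded, which keeps the target consistent without silently dropping data.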
-
Capturing delta data changes between two datasets
Suppose you want to capture delta data changes on key variables as part of the data received. In most cases the dataset will not provide change history. You have to work out for yourself which variables were updated by comparing their values with the previously stored data. In this article, we will share an…
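The comparison idea can be sketched in a few lines. This is a minimal plain-Python version, not the article's PySpark implementation: it indexes both snapshots by a key and reports inserted, deleted and updated records. The function name `capture_delta`, the key `company_number` and the tracked columns are hypothetical choices for the example.

```python
from typing import Dict, List, Optional, Tuple

Row = Dict[str, str]
# (key value, change type, column, old value, new value)
Change = Tuple[str, str, Optional[str], Optional[str], Optional[str]]


def capture_delta(previous: List[Row], current: List[Row],
                  key: str, tracked: List[str]) -> List[Change]:
    """Compare two snapshots and list changes on the tracked columns."""
    prev_by_key = {row[key]: row for row in previous}
    curr_by_key = {row[key]: row for row in current}
    changes: List[Change] = []

    # Walk the union of keys so inserts and deletes are both seen.
    for k in sorted(prev_by_key.keys() | curr_by_key.keys()):
        old, new = prev_by_key.get(k), curr_by_key.get(k)
        if old is None:
            changes.append((k, "inserted", None, None, None))
        elif new is None:
            changes.append((k, "deleted", None, None, None))
        else:
            # Record present in both snapshots: compare column by column.
            for col in tracked:
                if old.get(col) != new.get(col):
                    changes.append((k, "updated", col, old.get(col), new.get(col)))
    return changes


# Hypothetical snapshots: one postcode change, one removal, one new record.
previous = [
    {"company_number": "00000001", "postcode": "AB1 2CD", "sic_code": "62010"},
    {"company_number": "00000002", "postcode": "ZZ9 9ZZ", "sic_code": "47110"},
]
current = [
    {"company_number": "00000001", "postcode": "EF3 4GH", "sic_code": "62010"},
    {"company_number": "00000003", "postcode": "GH5 6IJ", "sic_code": "01110"},
]

changes = capture_delta(previous, current, "company_number", ["postcode", "sic_code"])
```

On a distributed dataset the same logic would typically be expressed as a full outer join of the two snapshots on the key, followed by a column-wise comparison.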
-
Extracting Standard Industrial Classification (SIC) codes from ONS website
In this article, we extract SIC codes from https://www.ons.gov.uk/file?uri=/methodology/classificationsandstandards/ukstandardindustrialclassificationofeconomicactivities/uksic2007/uksic2007indexeswithaddendumnovember2020.xlsx and save them as a parquet file for further processing. We are only interested in data from the worksheet 'Alphabetical Index'.
Steps: download the file from the ONS website; extract the SIC codes; save as a parquet file.
import pandas as pd
sic_pd = pd.read_excel('spark-warehouse/uksic2007indexeswithaddendumnovember2020.xlsx',
                       sheet_name="Alphabetical Index", skiprows=1)
df_sic =…
-
Data Quality check
We will use basic company data (http://download.companieshouse.gov.uk/en_output.html) for this article. A PySpark script is written to carry out minimal data quality checks on the dataset from Companies House. The downloadable data snapshot contains basic company data for live companies on the Companies House register. This snapshot is provided as ZIP files containing data in…
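To give a feel for what minimal data quality checks might look like, here is a small dataset-level profile in plain Python: row count, missing values per column and duplicated keys. It is only a sketch under assumptions. The function `quality_report`, the column names and the sample rows are hypothetical and do not reflect the actual Companies House file layout.

```python
from collections import Counter
from typing import Dict, List


def quality_report(rows: List[Dict[str, str]], key: str,
                   columns: List[str]) -> dict:
    """Summarise row count, missing values per column and duplicate keys."""
    # A value counts as missing when it is absent, None, empty or whitespace.
    missing = {
        col: sum(1 for r in rows if not (r.get(col) or "").strip())
        for col in columns
    }
    # Keys that occur more than once indicate duplicated records.
    key_counts = Counter(r.get(key) for r in rows)
    duplicate_keys = sorted(
        k for k, n in key_counts.items() if k is not None and n > 1
    )
    return {
        "row_count": len(rows),
        "missing": missing,
        "duplicate_keys": duplicate_keys,
    }


# Hypothetical sample: one blank postcode and one repeated company number.
rows = [
    {"company_number": "00000001", "company_name": "EXAMPLE LTD", "postcode": "AB1 2CD"},
    {"company_number": "00000001", "company_name": "EXAMPLE LTD", "postcode": ""},
    {"company_number": "00000002", "company_name": "SAMPLE PLC", "postcode": "EF3 4GH"},
]

report = quality_report(rows, "company_number", ["company_name", "postcode"])
```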