Category: Data Migration
-
Test Plan
An example of a less rigorous test plan. SCOPE The scope of the test plan is to test the User Stories associated with each sprint. The QAs will test the User Stories one sprint behind. The User Stories with complete Acceptance Criteria will be tested. Developers should provide a complete walk-through of the User…
-
Test Strategy document
A good test strategy document may contain the following. Table of Contents GLOSSARY OF TERMS PURPOSE GUIDING PRINCIPLES CONFORMANCE WITH ORGANISATIONAL TEST STRATEGY PROJECT BACKGROUND PROCESS OVERVIEW DIAGRAM PROJECT SCOPE IN SCOPE (HIGH LEVEL) OUT OF SCOPE (HIGH LEVEL) TEST APPROACH TEST DELIVERABLES TEST RESOURCES/SUPPORT TEST PHASES AND ENVIRONMENTS TEST ENVIRONMENTS SYSTEM TEST SYSTEM INTEGRATION…
-
Unit Test
Test description – check that data changes are captured correctly. Test Data (SQL or PySpark code) – test_companieshouse_data.py. Expected Result – capture the relevant changes in the source data by comparing with the previous dataset. Actual Result – verify the test result. Test Result – output produced after running the test script; may be a screenshot.
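A minimal sketch of what such a unit test might look like, assuming pytest and a local SparkSession; the file name test_companieshouse_data.py comes from the article, but the column names and test data are illustrative assumptions:

import pytest
from pyspark.sql import SparkSession

@pytest.fixture(scope="session")
def spark():
    # A small local session is enough for unit tests.
    return SparkSession.builder.master("local[1]").appName("unit-test").getOrCreate()

def test_captures_changed_rows(spark):
    previous = spark.createDataFrame(
        [("001", "ACME LTD"), ("002", "OLD NAME LTD")],
        ["company_number", "company_name"])
    current = spark.createDataFrame(
        [("001", "ACME LTD"), ("002", "NEW NAME LTD")],
        ["company_number", "company_name"])
    # Rows whose values differ from the previous snapshot are the captured changes.
    changed = current.subtract(previous)
    assert changed.count() == 1
    assert changed.first()["company_number"] == "002"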
-
Sprint Test Plan
A sprint test plan may contain the following information: Title, e.g. Extracting Companies House data and loading it into a master table (MT load). Introduction – Sprint x will consist of backlog items that need to be added as well as any enhancements needed. We will begin to look at the requirements and start test…
-
Testing distributed ETL script
Testing is an important part of the software development lifecycle. Identifying defects during development and fixing them before deploying code into production saves time and money. Most importantly, it gives the business assurance that the quality of the code is acceptable and that it is fit for purpose. Good testing depends on agreed acceptance criteria…
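One way to make a distributed ETL script testable is to keep each transformation as a pure function over a DataFrame, so it can be exercised on a local SparkSession against the agreed acceptance criteria. A hedged sketch, with a hypothetical transformation:

from pyspark.sql import DataFrame, SparkSession
from pyspark.sql import functions as F

def standardise_postcode(df: DataFrame) -> DataFrame:
    # Hypothetical transformation under test: trim and upper-case postcodes.
    return df.withColumn("postcode", F.upper(F.trim(F.col("postcode"))))

def test_standardise_postcode():
    spark = SparkSession.builder.master("local[1]").getOrCreate()
    df = spark.createDataFrame([(" ab1 2cd ",)], ["postcode"])
    assert standardise_postcode(df).first()["postcode"] == "AB1 2CD"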
-
Data validation checks
When ingesting data into a target destination, it is important to make sure that data from different sources conforms to business rules and does not become corrupted due to inconsistencies in type or context. The goal is to create data that is consistent, accurate and complete, so as to prevent data loss and errors during ingestion. Example of…
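As a sketch of what such checks might look like in PySpark (the input file and column names are assumptions, not taken from the article):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.master("local[1]").getOrCreate()
df = spark.read.csv("BasicCompanyData.csv", header=True)  # assumed input

# Rule 1: the key column must never be null or empty.
missing_keys = df.filter(F.col("CompanyNumber").isNull() | (F.trim("CompanyNumber") == ""))

# Rule 2: dates must parse in the expected format; anything else is treated as corrupt.
bad_dates = df.filter(F.col("IncorporationDate").isNotNull()
                      & F.to_date("IncorporationDate", "dd/MM/yyyy").isNull())

# Fail the ingestion (or quarantine the offending rows) when a rule is violated.
if missing_keys.count() > 0 or bad_dates.count() > 0:
    raise ValueError("data validation failed: inconsistent or incomplete rows found")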
-
Capturing delta data changes between two datasets
Suppose you want to capture delta data changes on key variables as part of the data received. In most cases the dataset will not provide a change history; you have to work out for yourself which variables were updated by comparing values against the previously stored data. In this article, we will share an…
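The general shape of the comparison, as a hedged PySpark sketch (the paths, key column and tracked columns are assumptions):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.master("local[1]").getOrCreate()

# Previous and current snapshots of the same dataset.
prev = spark.read.parquet("warehouse/companies_previous.parquet")
curr = spark.read.parquet("warehouse/companies_current.parquet")

joined = curr.alias("c").join(prev.alias("p"), on="company_number", how="full_outer")

# A row is a delta when any tracked variable differs between the snapshots;
# eqNullSafe treats two nulls as equal, so inserts and deletions also surface.
changed = joined.filter(
    ~F.col("c.company_name").eqNullSafe(F.col("p.company_name"))
    | ~F.col("c.postcode").eqNullSafe(F.col("p.postcode")))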
-
Extracting Standard Industrial Classification (SIC) codes from ONS web site
In this article, we will extract SIC codes from https://www.ons.gov.uk/file?uri=/methodology/classificationsandstandards/ukstandardindustrialclassificationofeconomicactivities/uksic2007/uksic2007indexeswithaddendumnovember2020.xlsx and save them as a parquet file for further processing. We are only interested in data from the worksheet 'Alphabetical Index'. Steps: Download the file from the ONS website. Extract SIC codes. Save as a parquet file. import pandas as pd sic_pd = pd.read_excel('spark-warehouse/uksic2007indexeswithaddendumnovember2020.xlsx',\ sheet_name="Alphabetical Index", skiprows=1) df_sic =…
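For reference, a cleaned-up sketch of the whole flow (the output path and the column-normalisation step are assumptions; reading .xlsx with pandas needs openpyxl, and to_parquet needs pyarrow or fastparquet installed):

import pandas as pd

sic_pd = pd.read_excel(
    "spark-warehouse/uksic2007indexeswithaddendumnovember2020.xlsx",
    sheet_name="Alphabetical Index",
    skiprows=1)
# Normalise the headers before writing; the sheet's exact column names may differ.
sic_pd.columns = [str(c).strip().lower().replace(" ", "_") for c in sic_pd.columns]
sic_pd.to_parquet("spark-warehouse/sic_codes.parquet", index=False)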
-
Data Quality check
We will use basic company data (http://download.companieshouse.gov.uk/en_output.html) for this article. A PySpark script is written to carry out minimal data quality checks on the dataset from Companies House. The downloadable data snapshot contains basic company data of live companies on the Companies House register. This snapshot is provided as ZIP files containing data in…
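A minimal sketch of the kind of checks such a script might run (the file and column names are assumptions based on the snapshot layout, not the actual script):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.master("local[1]").getOrCreate()
df = spark.read.csv("BasicCompanyDataAsOneFile.csv", header=True)

# Report basic quality metrics rather than failing outright.
checks = {
    "row_count": df.count(),
    "duplicate_company_numbers":
        df.groupBy("CompanyNumber").count().filter("count > 1").count(),
    "null_company_names": df.filter(F.col("CompanyName").isNull()).count(),
}
for name, value in checks.items():
    print(f"{name}: {value}")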
-
Embedded PDI and use of Avro format file
As you may recall from the last article (Embedded PDI and big-data-plugin), it is difficult to configure the big-data-plugin for use from the Embedded PDI engine. However, you can use the deprecated big-data-plugin steps without any problem. Avro is a data format that bundles serialized data with the data's schema in the same file. Avro is a preferred format…
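To illustrate the schema-bundling property outside PDI, here is a sketch in Python using the fastavro library (not part of the Embedded PDI setup described in the article; the record schema is made up):

import fastavro

schema = {
    "type": "record",
    "name": "Company",
    "fields": [
        {"name": "company_number", "type": "string"},
        {"name": "company_name", "type": "string"}]}
records = [{"company_number": "001", "company_name": "ACME LTD"}]

# The schema is written into the file header, so any reader can decode
# the records without out-of-band schema information.
with open("companies.avro", "wb") as out:
    fastavro.writer(out, schema, records)

with open("companies.avro", "rb") as f:
    reader = fastavro.reader(f)
    print(reader.writer_schema)  # schema recovered from the file itself
    for record in reader:
        print(record)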