When ingesting data into a target destination, it is important to make sure that data from different sources conforms to business rules and does not become corrupted through inconsistencies in type or context. The goal is to produce data that is consistent, accurate and complete, preventing data loss and errors during ingestion.
Example of data validation checks using Companies House data
We want to run validation checks on key fields of the Companies House data before adding it to the target database. Any rows that fail the checks need to be identified for further analysis and, if necessary, removed from the source data.
Variable: MortgagesNumMortCharges

Checks:
- check the value is not null
- check the value is a one- to three-digit number, i.e. validation fails for any value >= 1000
```python
from pyspark.sql import functions as F
import data_checks as dc  # assumed module name; provides the validation helpers used below

# Column that accumulates the names of any checks each row fails.
validation_col = "data_validations"
df_changes = df_changes.withColumn(validation_col, F.lit(""))

# Fields to test for nulls.
cols = [
    "MortgagesNumMortCharges",
]

for col in cols:
    # Append "check_not_null(<col>)" to the validation column on rows where the check fails.
    df_changes = df_changes.withColumn(validation_col, dc.add_to_string_col(
        df_changes[validation_col], f"check_not_null({col})",
        dc.check_not_null(df_changes[col])))

df_changes.cache()
```
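The `dc` helper functions are not defined in this extract. The sketch below shows one plausible implementation, assuming `check_not_null` returns a boolean Column that is True on rows where the check fails and `add_to_string_col` appends the check name to the running validation string on those rows; the names and call signatures come from the code above, but the bodies are illustrative only.

```python
from pyspark.sql import Column
from pyspark.sql import functions as F

def check_not_null(col: Column) -> Column:
    """Return a boolean Column that is True where the value is null (i.e. the check failed)."""
    return col.isNull()

def add_to_string_col(existing: Column, check_name: str, failed: Column) -> Column:
    """Append check_name (semicolon-delimited) to existing on rows where failed is True."""
    return F.when(failed, F.concat(existing, F.lit(check_name + ";"))).otherwise(existing)
```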
```python
# Regex checks: each field maps to the pattern(s) its values must match.
# Anchors (^, $) make the whole value match, so e.g. 1000 fails rather than
# passing on a substring match (assuming check_regex uses Spark's rlike).
val_check_cols = {
    "MortgagesNumMortCharges": [r"^\d{1,3}$"],
}

for col in val_check_cols:
    # Append "check_regex(<col> <patterns>)" to the validation column on failing rows.
    df_changes = df_changes.withColumn(validation_col, dc.add_to_string_col(
        df_changes[validation_col], f"check_regex({col} {val_check_cols[col]})",
        dc.check_regex(df_changes[col], *val_check_cols[col])))

df_changes.cache()
```
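Continuing the sketch of the assumed helpers, `check_regex` can be pictured as flagging rows whose value matches none of the supplied patterns (again, an illustrative body rather than the actual implementation):

```python
from pyspark.sql import Column
from pyspark.sql import functions as F

def check_regex(col: Column, *patterns: str) -> Column:
    """Return a boolean Column that is True where the value matches none of the patterns."""
    matched = F.lit(False)
    for pattern in patterns:
        matched = matched | col.rlike(pattern)
    return ~matched
```

Note that null values fall through as null here, so they are not double-flagged by the regex check; the not-null check already records them.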
Finally, inspect the records that failed the checks.
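Rows that failed any check now carry a non-empty value in the `data_validations` column, so they can be pulled out with a simple filter (this continues from the DataFrame built above):

```python
# Rows with a non-empty validation column failed at least one check.
df_failed = df_changes.filter(F.col(validation_col) != "")
df_failed.select("MortgagesNumMortCharges", validation_col).show(truncate=False)
print(f"{df_failed.count()} rows failed validation")

# If required, keep only the clean rows before loading into the target database.
df_clean = df_changes.filter(F.col(validation_col) == "")
```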