Data validation checks

When ingesting data into a target destination, it is important to make sure that data from different sources conforms to business rules and does not become corrupted due to inconsistencies in type or context. The goal is to produce data that is consistent, accurate and complete, so as to prevent data loss and errors during ingestion.

Example of data validation checks using Companies House data

We want to run validation checks on key fields of Companies House data before adding them to the target database. Any rows that fail the checks need to be identified for further analysis and, if necessary, removed from the source data.

Variable:

  • MortgagesNumMortCharges
    • check the value is not null
    • check the value is a number of 1 to 3 digits, i.e. if it is >= 1000 the validation will fail
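The snippet below relies on a small project helper module, referred to as dc, that provides add_to_string_col, check_not_null and check_regex. Its real implementation is not shown here; the following is only a minimal sketch of how such helpers could behave, so that the snippet can be read end to end (the function bodies are assumptions, not the project's actual code).

  from pyspark.sql import Column
  from pyspark.sql import functions as F

  def check_not_null(col: Column) -> Column:
    # True when a value is present, false/null when it is missing.
    return col.isNotNull()

  def check_regex(col: Column, *patterns: str) -> Column:
    # True when the value matches every supplied pattern.
    passed = F.lit(True)
    for pattern in patterns:
      passed = passed & col.rlike(pattern)
    return passed

  def add_to_string_col(existing: Column, label: str, passed: Column) -> Column:
    # Append the label of a failed check to the running validation string.
    return F.when(passed, existing).otherwise(F.concat(existing, F.lit(f"{label};")))

With helpers of this shape in place, the checks on the change data are applied as follows.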
  from pyspark.sql import functions as F
  # dc refers to the project's data-checks helper module, which provides
  # add_to_string_col, check_not_null and check_regex.

  # Column that accumulates the name of every check a row fails.
  validation_col = "data_validations"
  df_changes = df_changes.withColumn(validation_col, F.lit(""))

  # Not-null checks.
  cols = [
    'MortgagesNumMortCharges',
  ]

  for col in cols:
    df_changes = df_changes.withColumn(validation_col, dc.add_to_string_col(
      df_changes[validation_col], f"check_not_null({col})",
      dc.check_not_null(df_changes[col])))
    df_changes.cache()

  # Regex checks: the value must be a number of 1 to 3 digits. If
  # dc.check_regex matches substrings rather than the whole value, anchor
  # the pattern as r'^\d{1,3}$' so that values >= 1000 fail.
  val_check_cols = {
    "MortgagesNumMortCharges": [r'\d{1,3}'],
  }

  for col in val_check_cols:
    df_changes = df_changes.withColumn(validation_col, dc.add_to_string_col(
      df_changes[validation_col], f"check_regex({col}, {val_check_cols[col]})",
      dc.check_regex(df_changes[col], *val_check_cols[col])))
    df_changes.cache()

15 records failed the validation checks.

Check the failed records
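
A row has failed validation if its data_validations column is non-empty, so the failed records can be pulled out with a simple filter (a short sketch using the names defined above):

  df_failed = df_changes.filter(F.col(validation_col) != "")
  print(f"{df_failed.count()} records failed validation checks")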

Details of failed records
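
For example, the key field and the checks each row failed can be displayed, and, if required, the failed rows dropped before loading (CompanyNumber is assumed here to be the identifier column in the Companies House extract):

  df_failed.select("CompanyNumber", "MortgagesNumMortCharges", validation_col) \
    .show(truncate=False)

  # Keep only rows that passed every check before writing to the target database.
  df_clean = df_changes.filter(F.col(validation_col) == "")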