When ingesting data into a target destination, it is important to make sure that data from different sources conforms to business rules and does not become corrupted through inconsistencies in type or context. The goal is to produce data that is consistent, accurate and complete, preventing data loss and errors during ingestion.
Example of data validation checks using Companies House data
We want to run validation checks on key fields of the Companies House data before adding it to the target database. Any rows that fail the checks need to be identified for further analysis and, if necessary, removed from the source data.
Variable: MortgagesNumMortCharges

Checks:
- check the value is not null
- check the value is a one- to three-digit number, i.e. validation fails for any value >= 1000
```python
from pyspark.sql import functions as F
import data_checks as dc  # assumed module name; provides the validation helpers used below

# Column that accumulates the names of any checks each row fails.
validation_col = "data_validations"
df_changes = df_changes.withColumn(validation_col, F.lit(""))

# Fields to test for nulls.
cols = [
    "MortgagesNumMortCharges",
]

for col in cols:
    # Append "check_not_null(<col>)" to the validation column on rows where the check fails.
    df_changes = df_changes.withColumn(validation_col, dc.add_to_string_col(
        df_changes[validation_col], f"check_not_null({col})",
        dc.check_not_null(df_changes[col])))

df_changes.cache()
```
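The `dc` helper functions are not defined in this extract. The sketch below shows one plausible implementation, assuming `check_not_null` returns a boolean Column that is True on rows where the check fails and `add_to_string_col` appends the check name to the running validation string on those rows; the names and call signatures come from the code above, but the bodies are illustrative only.

```python
from pyspark.sql import Column
from pyspark.sql import functions as F

def check_not_null(col: Column) -> Column:
    """Return a boolean Column that is True where the value is null (i.e. the check failed)."""
    return col.isNull()

def add_to_string_col(existing: Column, check_name: str, failed: Column) -> Column:
    """Append check_name (semicolon-delimited) to existing on rows where failed is True."""
    return F.when(failed, F.concat(existing, F.lit(check_name + ";"))).otherwise(existing)
```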
```python
# Regex checks: each field maps to the pattern(s) its values must match.
# Anchors (^, $) make the whole value match, so e.g. 1000 fails rather than
# passing on a substring match (assuming check_regex uses Spark's rlike).
val_check_cols = {
    "MortgagesNumMortCharges": [r"^\d{1,3}$"],
}

for col in val_check_cols:
    # Append "check_regex(<col> <patterns>)" to the validation column on failing rows.
    df_changes = df_changes.withColumn(validation_col, dc.add_to_string_col(
        df_changes[validation_col], f"check_regex({col} {val_check_cols[col]})",
        dc.check_regex(df_changes[col], *val_check_cols[col])))

df_changes.cache()
```
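Continuing the sketch of the assumed helpers, `check_regex` can be pictured as flagging rows whose value matches none of the supplied patterns (again, an illustrative body rather than the actual implementation):

```python
from pyspark.sql import Column
from pyspark.sql import functions as F

def check_regex(col: Column, *patterns: str) -> Column:
    """Return a boolean Column that is True where the value matches none of the patterns."""
    matched = F.lit(False)
    for pattern in patterns:
        matched = matched | col.rlike(pattern)
    return ~matched
```

Note that null values fall through as null here, so they are not double-flagged by the regex check; the not-null check already records them.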
Finally, inspect the records that failed the checks.
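Rows that failed any check now carry a non-empty value in the `data_validations` column, so they can be pulled out with a simple filter (this continues from the DataFrame built above):

```python
# Rows with a non-empty validation column failed at least one check.
df_failed = df_changes.filter(F.col(validation_col) != "")
df_failed.select("MortgagesNumMortCharges", validation_col).show(truncate=False)
print(f"{df_failed.count()} rows failed validation")

# If required, keep only the clean rows before loading into the target database.
df_clean = df_changes.filter(F.col(validation_col) == "")
```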