In this article, we describe how to handle errors and warnings in PySpark using DataFrames. We will use Companies House data to implement the methods. The basic aim is to write errors and warnings out to a Hive table, or to a dedicated error or warning file.
Requirements
The functionality should add two extra fields to your dataframe: one for warnings and one for capturing errors. Each distinct error/warning should be given a reference code, e.g. check_not_null, or a tag number. The log should include a count of how many times each error/warning code was raised.
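As a minimal sketch of what this might look like (the column and check names here are illustrative assumptions, not the final implementation), a failed check appends its reference code to the warning column:

```python
from pyspark.sql import functions as F

# hypothetical example: append the code "check_not_null" to data_warnings
# wherever CompanyName is null, assuming data_warnings was initialised
# to an empty string
df = df.withColumn(
    "data_warnings",
    F.when(
        F.col("CompanyName").isNull(),
        F.concat_ws(",", F.col("data_warnings"), F.lit("check_not_null")),
    ).otherwise(F.col("data_warnings")),
)
```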
Errors and warnings for Companies House monthly snapshot data processing are given below.
Warnings:
- CompanyNumber
  - check not null
  - check exactly 8 alphanumeric characters long
- CompanyName
  - check not null
- RegAddressAddressLine1
  - check not null
- RegAddressPostCode
  - check not null
  - check postcode is a valid UK format (see the sketch after this list)
  - check postcode is not in the address variables: RegAddressAddressLine1, RegAddressAddressLine2, RegAddressPostTown, RegAddressCounty
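A minimal sketch of the postcode format check, assuming a simplified UK postcode regex (real UK postcodes have more special cases than this pattern covers, so treat it as illustrative, and the function name is assumed):

```python
from pyspark.sql import functions as F

# simplified UK postcode pattern, e.g. "EC1A 1BB" or "M1 1AE"
UK_POSTCODE_PATTERN = r"^[A-Z]{1,2}[0-9][A-Z0-9]? ?[0-9][A-Z]{2}$"


def check_postcode_format(col):
    """Return a boolean Column: True where the value looks like a UK postcode."""
    return F.upper(F.trim(col)).rlike(UK_POSTCODE_PATTERN)
```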
Errors:
- CompanyNumber
  - check not null
  - check exactly 8 alphanumeric characters long (see the sketch below)
- CompanyName
  - check not null
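The eight-character rule translates directly into a regular expression; a minimal sketch (the function name is assumed):

```python
def check_8_alphanumeric(col):
    """Return a boolean Column: True where the value is exactly
    8 alphanumeric characters long."""
    return col.rlike(r"^[A-Za-z0-9]{8}$")
```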
Implementation
For each check, write a separate function and use it to carry out the check. For example:

```python
import logging

from pyspark.sql import functions as F


def check_not_null(col):
    """Return a boolean Column that is True where the value is not null."""
    assert_column(col)  # project helper (not shown here) validating the column argument
    logging.info(f"checking values in {col} are not null")
    return col.isNotNull()


# run warning checks: start with an empty warning column
warn_col = "data_warnings"
df = df.withColumn(warn_col, F.lit(""))

# columns that must not contain nulls
not_null_cols = [
    'CompanyNumber',
    'CompanyName',
]

# column_iterator (not shown here) applies the given check to each listed
# column and records the reference code in warn_col where the check fails
check_columns = column_iterator(not_null_cols, df, warn_col)
df = check_columns(check_not_null)(output_string="check_not_null")
```
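The snippet above relies on assert_column and column_iterator, which are not shown in this section. Purely as an assumption about their shape (the real helpers may differ), they could look like this:

```python
from pyspark.sql import Column, functions as F


def assert_column(col):
    """Assumed guard: fail fast if the argument is not a Spark Column."""
    assert isinstance(col, Column), f"expected a Column, got {type(col)}"


def column_iterator(cols, df, flag_col):
    """Assumed helper: returns a curried function that applies a check to each
    column in cols and appends output_string to flag_col where the check fails."""
    def apply_check(check):
        def run(output_string):
            out = df
            for name in cols:
                passed = check(F.col(name))
                out = out.withColumn(
                    flag_col,
                    F.when(
                        # treat nulls from the check expression as failures
                        ~F.coalesce(passed, F.lit(False)),
                        F.concat_ws(",", F.col(flag_col), F.lit(f"{output_string}:{name}")),
                    ).otherwise(F.col(flag_col)),
                )
            return out
        return run
    return apply_check
```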
Check if RegAddressPostTown variable contains a valid postcode
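A minimal sketch of this check, reusing the simplified postcode pattern assumed earlier but without anchors so it matches a postcode embedded anywhere in the field (the output column name is illustrative). Note that Spark's rlike returns true when any substring of the value matches the pattern:

```python
from pyspark.sql import functions as F

# unanchored version of the simplified UK postcode pattern assumed earlier
EMBEDDED_POSTCODE = r"[A-Z]{1,2}[0-9][A-Z0-9]? ?[0-9][A-Z]{2}"

# flag rows where a postcode-like token appears inside RegAddressPostTown
df = df.withColumn(
    "posttown_contains_postcode",
    F.upper(F.col("RegAddressPostTown")).rlike(EMBEDDED_POSTCODE),
)
```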
Inspecting rows with warnings
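Once the warning column is populated, flagged rows can be pulled out for inspection, and the per-code counts required by the log can be derived from it. A minimal sketch, assuming data_warnings holds comma-separated reference codes:

```python
from pyspark.sql import functions as F

# rows where at least one warning was recorded
flagged = df.filter(F.col("data_warnings") != "")
flagged.show(truncate=False)

# count how many times each warning code was raised
(
    flagged
    .withColumn("code", F.explode(F.split(F.col("data_warnings"), ",")))
    .filter(F.col("code") != "")  # drop empty fragments left by the separator
    .groupBy("code")
    .count()
    .show()
)
```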
QA
For example, postcode QA might require the following checks (a combined sketch follows the list):
- the postcode format must be valid
- the postcode is populated: no null values
- the postcode is not located in any of the address variables
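Under the same assumptions as the earlier sketches (the simplified UK_POSTCODE_PATTERN and illustrative qa_* column names; only RegAddressPostTown is checked here for brevity, whereas the full QA would cover every address variable):

```python
from pyspark.sql import functions as F

postcode = F.upper(F.trim(F.col("RegAddressPostCode")))

df = (
    df
    # the postcode format must be valid
    .withColumn("qa_format_ok", postcode.rlike(UK_POSTCODE_PATTERN))
    # the postcode is populated
    .withColumn("qa_populated", F.col("RegAddressPostCode").isNotNull())
    # the postcode does not appear inside the post town field
    .withColumn(
        "qa_not_in_posttown",
        ~F.coalesce(F.upper(F.col("RegAddressPostTown")).contains(postcode), F.lit(False)),
    )
)
```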