Handling errors and warnings in PySpark

This article describes how errors and warnings can be handled in PySpark using dataframes. We will use Companies House data to implement the methods. The basic requirement is to write errors and warnings out to a Hive table, or to an error or warning file.

Requirements

The functionality should add two extra fields to your dataframe: one for warnings and the other for errors. Each distinct error or warning should be given a reference code, e.g. check_not_null, or a tag number. The log should include a count of how many times each error or warning code was raised.
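As a rough sketch of the counting requirement, assume the codes raised for a row are stored in one string column, separated by semicolons. The article does not specify a storage format, so the delimiter and all function names below are illustrative. The Spark version uses only DataFrame methods and the built-in SQL functions split and explode; the plain-Python version does the same tally on ordinary strings, which is handy for checking the format without a Spark session.

```python
from collections import Counter

DELIMITER = ";"  # assumed separator between codes; not specified in the article


def split_codes(flag_value, delimiter=DELIMITER):
    """Split one flag value such as 'a;b;' into ['a', 'b']; empty or None gives []."""
    if not flag_value:
        return []
    return [code for code in flag_value.split(delimiter) if code]


def count_codes_local(flag_values, delimiter=DELIMITER):
    """Tally codes over plain Python strings, e.g. a small collected sample."""
    counts = Counter()
    for value in flag_values:
        counts.update(split_codes(value, delimiter))
    return dict(counts)


def count_codes(df, flag_col, delimiter=DELIMITER):
    """Return a Spark DataFrame with one row per code and its count.

    Spark's split() treats the delimiter as a regular expression, so a
    delimiter such as '|' would need escaping; ';' is safe as written.
    """
    return (
        df.selectExpr(f"explode(split({flag_col}, '{delimiter}')) AS code")
        .where("code != ''")  # drop the empty piece left by a trailing delimiter
        .groupBy("code")
        .count()
    )
```

The dataframe returned by count_codes could then be written to a Hive table or a file alongside the flagged rows.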

Errors and warnings for Companies House monthly snapshot data processing are given below.

Warnings:

  • CompanyNumber
    • check not null
    • check exactly 8 alphanumeric characters long
  • CompanyName
    • check not null
  • RegAddressAddressLine1
    • check not null
  • RegAddressPostCode
    • check not null
    • check postcode is a valid UK format
    • check postcode is not in address variables – RegAddressAddressLine1, RegAddressAddressLine2, RegAddressPostTown, RegAddressCounty

Errors:

  • CompanyNumber
    • check not null
    • check exactly 8 alphanumeric characters long
  • CompanyName
    • check not null
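Most of the checks listed above reduce to a pattern test on a single column. As a minimal sketch, the "exactly 8 alphanumeric characters" check could be written as a regular expression. The function names are illustrative and not taken from the project code; the first function expects a Spark Column and returns a boolean Column, the second is a plain-Python equivalent for quick testing.

```python
import re

# Exactly eight letters or digits, e.g. "01234567" or "SC123456".
COMPANY_NUMBER_PATTERN = r"^[A-Za-z0-9]{8}$"


def check_company_number_format(col):
    """Spark Column expression: true where the value is 8 alphanumeric characters.

    For a null input the result is null rather than false, so the caller
    decides whether null counts as a failure.
    """
    return col.rlike(COMPANY_NUMBER_PATTERN)


def is_company_number_format(value):
    """Plain-Python equivalent of the check above; None is treated as invalid."""
    return value is not None and re.fullmatch(COMPANY_NUMBER_PATTERN, value) is not None
```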

Implementation

For each check, write a separate function and use it to carry out the check. For example:

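import logging

from pyspark.sql import functions as F

# assert_column and column_iterator are project helpers not shown in this article


def check_not_null(col):
    """Return a boolean Column that is true where the value is not null."""
    assert_column(col)
    logging.info(f"checking values in {col} are not null")
    return col.isNotNull()


# run warning checks
warn_col = "data_warnings"
df = df.withColumn(warn_col, F.lit(""))
# checking for not null
not_null_cols = [
    'CompanyNumber',
    'CompanyName'
]
check_columns = column_iterator(not_null_cols, df, warn_col)
df = check_columns(check_not_null)(output_string="check_not_null")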
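The helpers assert_column and column_iterator are not shown in the article. The sketch below is one hypothetical way they could be written so that the calls above work; the tag format ("code:column;") and the validation of column names are assumptions, not the author's implementation. The Spark imports are placed inside the functions so that the name validation and tag formatting can be exercised without a Spark installation.

```python
def format_tag(output_string, column_name, delimiter=";"):
    """Build the text appended to the flag column (assumed format)."""
    return f"{output_string}:{column_name}{delimiter}"


def assert_column(col):
    """Fail early if a check is handed something other than a Spark Column."""
    from pyspark.sql import Column

    if not isinstance(col, Column):
        raise TypeError(f"expected a pyspark Column, got {type(col).__name__}")


def column_iterator(columns, df, flag_col):
    """Return a wrapper that applies one check to every column in `columns`."""
    missing = [name for name in columns if name not in df.columns]
    if missing:
        raise ValueError(f"columns not found in dataframe: {missing}")

    # Imported after validation so the check above runs without Spark installed.
    from pyspark.sql import functions as F

    def wrap(check):
        def run(output_string):
            result = df
            for name in columns:
                passed = check(F.col(name))
                tag = F.lit(format_tag(output_string, name))
                # Leave the flag unchanged when the check passes; when it fails
                # or evaluates to null, append the tag to the existing flags.
                result = result.withColumn(
                    flag_col,
                    F.when(passed, F.col(flag_col)).otherwise(
                        F.concat(F.col(flag_col), tag)
                    ),
                )
            return result

        return run

    return wrap
```

With this design a misspelled column name fails immediately with a clear message instead of surfacing later as a Spark analysis error.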

Check if RegAddressPostTown variable contains a valid postcode

[Image: code snippet and its output]
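A check of this kind might be sketched as follows. The regular expression is a simplified approximation of the UK postcode shape (outward code, optional space, inward code) and does not enforce the full official rules; the pattern and function names are assumptions for illustration. The first two functions take a Spark Column and return a boolean Column, and the last two are plain-Python equivalents for testing the patterns.

```python
import re

# Simplified UK postcode shape, e.g. "SW1A 1AA", "CF14 3UZ", "M1 1AE".
UK_POSTCODE_BODY = r"[A-Za-z]{1,2}[0-9][A-Za-z0-9]? ?[0-9][A-Za-z]{2}"
VALID_POSTCODE_PATTERN = rf"^{UK_POSTCODE_BODY}$"
CONTAINS_POSTCODE_PATTERN = rf"\b{UK_POSTCODE_BODY}\b"


def check_valid_postcode(col):
    """Spark Column expression: true where the whole value looks like a postcode."""
    return col.rlike(VALID_POSTCODE_PATTERN)


def check_no_postcode_in(col):
    """Spark Column expression: true where an address field holds no postcode.

    A null address field passes, since it cannot contain a postcode.
    """
    return col.isNull() | ~col.rlike(CONTAINS_POSTCODE_PATTERN)


def is_valid_postcode(value):
    """Plain-Python equivalent of check_valid_postcode; None is invalid."""
    return value is not None and re.fullmatch(UK_POSTCODE_BODY, value) is not None


def contains_postcode(value):
    """Plain-Python test for a postcode-like token anywhere in the text."""
    return value is not None and re.search(CONTAINS_POSTCODE_PATTERN, value) is not None
```

Applied to RegAddressPostTown, check_no_postcode_in would flag a value such as "Cardiff CF14 3UZ" while leaving "Cardiff" alone.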

Inspecting rows with warning

[Image: rows flagged with warnings or errors]

QA

For example, postcode QA might require the following checks:

  • postcode format must be valid
  • it is populated – no null value
  • postcode not located in any address variables

The code is available on GitLab.

