Data validation checks in streaming data

  • check_date_range
  • check_date
  • check_integer
  • check_numeric
  • check_positive_number
  • check_specific_values (missing values)
  • check_regex
  • check_not_null
  • check_not_blank
  • check_postcode
  • check_duplicates
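
A few of the checks listed above could look roughly like the sketch below. The function signatures, the boolean return convention and the default date format are assumptions made for illustration; only the check names come from the list.

```python
import math
import re
from collections import Counter
from datetime import date, datetime


def check_not_null(value):
    """Pass when the value is present (not None)."""
    return value is not None


def check_not_blank(value):
    """Pass when the value is a string with at least one non-whitespace character."""
    return isinstance(value, str) and value.strip() != ""


def check_integer(value):
    """Pass when the value looks like a whole number, e.g. "42" or -7."""
    return re.fullmatch(r"[+-]?\d+", str(value).strip()) is not None


def check_positive_number(value):
    """Pass when the value converts to a finite number greater than zero."""
    try:
        number = float(value)
    except (TypeError, ValueError):
        return False
    return math.isfinite(number) and number > 0


def check_regex(value, pattern):
    """Pass when the whole string matches the given pattern."""
    return isinstance(value, str) and re.fullmatch(pattern, value) is not None


def check_date(value, fmt="%Y-%m-%d"):
    """Pass when the value parses as a date in the assumed format."""
    try:
        datetime.strptime(value, fmt)
        return True
    except (TypeError, ValueError):
        return False


def check_date_range(value, start, end, fmt="%Y-%m-%d"):
    """Pass when the value is a valid date between start and end, inclusive."""
    try:
        parsed = datetime.strptime(value, fmt).date()
    except (TypeError, ValueError):
        return False
    return start <= parsed <= end


def check_duplicates(values):
    """Return the set of values that occur more than once."""
    return {v for v, n in Counter(values).items() if n > 1}


# Example calls with made-up inputs.
in_2024 = check_date_range("2024-06-01", date(2024, 1, 1), date(2024, 12, 31))
repeated = check_duplicates(["a", "b", "a"])
```

Returning a plain boolean keeps each check independent of how failures are recorded, so the same functions can feed either the error field or the warning field.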

Validation Requirements

The basic ask is to write errors and warnings to two extra fields in your dataframe: one field for warnings and one for errors. Each distinct error or warning should be given a reference code, e.g. check_not_null or a tag number. The log should include a count of how many times each error/warning code was raised.
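
As a rough illustration of this requirement, the sketch below works on plain Python dicts rather than a real dataframe. The rule table, the field names postcode and address_line_1, the output field names errors and warnings, and the choice of which check is an error versus a warning are all hypothetical.

```python
from collections import Counter

# Hypothetical rule table: (reference code, field name, severity, predicate).
RULES = [
    ("check_not_null", "postcode", "error",
     lambda v: v is not None),
    ("check_not_blank", "address_line_1", "warning",
     lambda v: isinstance(v, str) and v.strip() != ""),
]


def validate(record, rules=RULES):
    """Return a copy of the record with two extra fields listing failed codes."""
    errors, warnings = [], []
    for code, field, severity, passes in rules:
        if not passes(record.get(field)):
            (errors if severity == "error" else warnings).append(code)
    return {**record, "errors": errors, "warnings": warnings}


def count_codes(validated_records):
    """Count how many times each error/warning code was raised."""
    counts = Counter()
    for record in validated_records:
        counts.update(record["errors"])
        counts.update(record["warnings"])
    return counts


# Made-up sample batch.
batch = [
    {"postcode": None, "address_line_1": " "},
    {"postcode": "AB1 2CD", "address_line_1": "1 High St"},
    {"address_line_1": "Flat 2"},
]
validated = [validate(r) for r in batch]
counts = count_codes(validated)
for code, n in sorted(counts.items()):
    print(f"{code}: {n}")
```

Storing the codes as lists means a single record can carry several errors and warnings at once, and the per-code counts for the log fall out of a single pass over the batch.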

Use case

  • ingest data from Kafka in bounded micro-batches
  • transform dataframe: deserialise the Kafka payload and apply the validation checks
  • save data (writeStream)
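
The three steps above can be sketched in plain Python, without Spark or a Kafka client, to show the shape of the per-batch logic. This assumes the Kafka message value is UTF-8 encoded JSON, which the notes do not state; the function names, the list used as a sink and the stand-in validation step are hypothetical. In Spark Structured Streaming, logic of this kind would typically be applied to each micro-batch before the writeStream sink.

```python
import json


def deserialise(payload):
    """Decode one Kafka message value (bytes) into a dict, or None if malformed."""
    try:
        record = json.loads(payload.decode("utf-8"))
    except (AttributeError, UnicodeDecodeError, json.JSONDecodeError):
        # AttributeError covers a missing (None) message value.
        return None
    return record if isinstance(record, dict) else None


def process_batch(payloads, validate, sink):
    """Deserialise, validate and save one bounded micro-batch.

    Returns the number of payloads that could not be deserialised.
    """
    rejected = 0
    for payload in payloads:
        record = deserialise(payload)
        if record is None:
            rejected += 1
            continue
        sink.append(validate(record))
    return rejected


def mark_missing_postcode(record):
    """Hypothetical stand-in for the real validation step."""
    errors = [] if record.get("postcode") else ["check_not_null"]
    return {**record, "errors": errors, "warnings": []}


# Made-up micro-batch: two JSON payloads and one malformed message.
batch = [b'{"postcode": "AB1 2CD"}', b'{"postcode": null}', b"not json"]
saved = []
rejected = process_batch(batch, mark_missing_postcode, saved)
```

Counting malformed payloads separately keeps records that never reached validation distinct from records that were validated and failed a check.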

Run Python script

Check the number of records in the ch-address Kafka topic