I found the following comments very relevant and useful; please visit the link and read the full post.
“I’ve been using Spark for some time now, it has not always been smooth sailing. I can understand a tool’s limitations as long as I’m told so, explicitly. The trouble with Apache Spark has been its insistence on having the wrong defaults. How wrong? So wrong they lose your data in unexpected ways.”
https://kokes.github.io/blog/2019/07/09/losing-data-apache-spark.html
I decided to investigate this on the latest Apache Spark using PySpark. My Spark context looks like this:
SparkSession - in-memory

SparkContext
- Spark UI
- Version: v3.5.0
- Master: local[*]
- AppName: SparkConfigExample
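For reference, here is a minimal sketch of how a session like the one above might be created; the app name and master simply mirror the output shown.

from pyspark.sql import SparkSession

# Minimal sketch: create a local SparkSession matching the context shown above.
spark = (SparkSession.builder
         .appName("SparkConfigExample")
         .master("local[*]")
         .getOrCreate())

print(spark.version)  # e.g. 3.5.0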
Things you need to be aware of
- Missing overflows – always check your data fields, types and values
- Dealing with date types – non-existent/wrong dates
- Quotes in your fields (see the sketch after this list):
  - Quoted fields can contain newlines.
  - Quotation marks within a field’s content are escaped with another quotation mark (not a backslash).
- Be careful when performing casts
- Untyped partitions are untyped
- Check the integrity of your CSV files and data
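Some of these quoting and date pitfalls are easy to reproduce. The sketch below assumes a hypothetical sample.csv whose comment field contains a doubled quotation mark and an embedded newline, plus a non-existent date.

from pyspark.sql import functions as F

# Hypothetical file for illustration:
#   id,comment,created
#   1,"He said ""hello""
#   over two lines",2019-02-29
df_csv = (spark.read
          .option("header", True)
          .option("multiLine", True)  # keep quoted newlines inside one record
          .option("escape", '"')      # treat a doubled quotation mark as an escaped quote
          .csv("sample.csv"))

# 2019-02-29 does not exist; with Spark's default (non-ANSI) settings the cast
# quietly returns NULL instead of raising an error, so check for unexpected NULLs.
df_csv.select(F.col("created").cast("date").alias("created_date")).show()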
Concluding remark
It’s easy to use Apache Spark to process large data files. In fact, you can read a file and create a DataFrame with a single line of code:
df = spark.read.csv(file_path)
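Convenient as that is, with no options Spark guesses everything and reads every column as a string. A safer pattern is to supply an explicit schema and a strict parse mode so that bad values fail loudly rather than slipping through; the column names below are only an assumption for illustration.

from pyspark.sql.types import StructType, StructField, IntegerType, StringType, DateType

# Hypothetical schema: replace the field names and types with your file's real columns.
schema = StructType([
    StructField("id", IntegerType()),
    StructField("name", StringType()),
    StructField("created", DateType()),
])

df = (spark.read
      .option("header", True)
      .option("mode", "FAILFAST")  # fail on malformed rows instead of silently NULLing them
      .schema(schema)
      .csv(file_path))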
You have invested millions of dollars/pounds in your data pipelines and are now eagerly waiting to deploy them to your glorious Cloud. Remember: garbage in, garbage out! Who will deal with the technical debt left behind for you by those smart-ass data engineers?
Finally, ensure that your data pipelines accomplish two things:
- Atomicity – every ETL or data pipeline job either writes everything or nothing; there are no partial results.
- Idempotency – this applies to any job, whether triggered by a scheduler or run manually. It should not matter how many times you run a job; it should always behave the same way (see the sketch below).
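As a rough illustration of idempotency, writing each run to a deterministic partition with dynamic overwrite means a re-run replaces that partition instead of appending duplicates; the partition column and output path here are assumptions.

# Sketch of an idempotent write: overwrite only the partitions present in this
# run's output so a re-run replaces them rather than duplicating rows.
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

(df.write
   .mode("overwrite")
   .partitionBy("run_date")         # hypothetical partition column
   .parquet("/data/output/events")) # hypothetical output path

Note that plain file writes are not fully atomic on their own; where strict atomicity matters, a transactional table format such as Delta Lake or Apache Iceberg is the usual answer.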
QA
Read https://broadoakdata.uk/test-strategy-document/ for more about QA.
Example of missing overflows