I found the following comments very relevant and useful; please visit the link and read the full post.
“I’ve been using Spark for some time now, it has not always been smooth sailing. I can understand a tool’s limitations as long as I’m told so, explicitly. The trouble with Apache Spark has been its insistence on having the wrong defaults. How wrong? So wrong they lose your data in unexpected ways.”
https://kokes.github.io/blog/2019/07/09/losing-data-apache-spark.html
I decided to investigate this on the latest Apache Spark using PySpark. My Spark context looks like this:
SparkSession - in-memory

SparkContext
- Spark UI
- Version: v3.5.0
- Master: local[*]
- AppName: SparkConfigExample
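For reference, here is a minimal sketch of how a session like the one above might be created; the app name and master simply mirror the output shown.

from pyspark.sql import SparkSession

# Minimal sketch: create a local SparkSession matching the context shown above.
spark = (SparkSession.builder
         .appName("SparkConfigExample")
         .master("local[*]")
         .getOrCreate())

print(spark.version)  # e.g. 3.5.0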
Things you need to be aware of
- Missing overflows – always check your data fields, types and values
- Dealing with date types – non-existent/wrong dates
- Quotes in your fields (see the sketch after this list):
  - Quoted fields can contain newlines.
  - Quotation marks within a field’s content are escaped with another quotation mark (not a backslash).
- Be careful when performing casts
- Untyped partitions are untyped
- Check the integrity of your CSV files and data
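Some of these quoting and date pitfalls are easy to reproduce. The sketch below assumes a hypothetical sample.csv whose comment field contains a doubled quotation mark and an embedded newline, plus a non-existent date.

from pyspark.sql import functions as F

# Hypothetical file for illustration:
#   id,comment,created
#   1,"He said ""hello""
#   over two lines",2019-02-29
df_csv = (spark.read
          .option("header", True)
          .option("multiLine", True)  # keep quoted newlines inside one record
          .option("escape", '"')      # treat a doubled quotation mark as an escaped quote
          .csv("sample.csv"))

# 2019-02-29 does not exist; with Spark's default (non-ANSI) settings the cast
# quietly returns NULL instead of raising an error, so check for unexpected NULLs.
df_csv.select(F.col("created").cast("date").alias("created_date")).show()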
Concluding remark
It’s easy to use Apache Spark to process large data files. In fact, you can read a file and create a DataFrame with a single line of code:
df = spark.read.csv(file_path)
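Convenient as that is, with no options Spark guesses everything and reads every column as a string. A safer pattern is to supply an explicit schema and a strict parse mode so that bad values fail loudly rather than slipping through; the column names below are only an assumption for illustration.

from pyspark.sql.types import StructType, StructField, IntegerType, StringType, DateType

# Hypothetical schema: replace the field names and types with your file's real columns.
schema = StructType([
    StructField("id", IntegerType()),
    StructField("name", StringType()),
    StructField("created", DateType()),
])

df = (spark.read
      .option("header", True)
      .option("mode", "FAILFAST")  # fail on malformed rows instead of silently NULLing them
      .schema(schema)
      .csv(file_path))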
You have invested millions of dollars/pounds in your data pipelines and are now eagerly waiting to deploy them to your glorious Cloud. Remember: garbage in, garbage out! Who will deal with the technical debt left behind for you by those smart-ass data engineers?
Finally, ensure that your data pipelines accomplish two things:
- Atomicity – every ETL or data pipeline job either writes everything or nothing; there are no partial results.
- Idempotency – this applies to any job, whether triggered by a scheduler or run manually. It should not matter how many times you run a job; it should always behave the same way (see the sketch below).
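As a rough illustration of idempotency, writing each run to a deterministic partition with dynamic overwrite means a re-run replaces that partition instead of appending duplicates; the partition column and output path here are assumptions.

# Sketch of an idempotent write: overwrite only the partitions present in this
# run's output so a re-run replaces them rather than duplicating rows.
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

(df.write
   .mode("overwrite")
   .partitionBy("run_date")         # hypothetical partition column
   .parquet("/data/output/events")) # hypothetical output path

Note that plain file writes are not fully atomic on their own; where strict atomicity matters, a transactional table format such as Delta Lake or Apache Iceberg is the usual answer.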
QA
Read https://broadoakdata.uk/test-strategy-document/ for more about QA.
Example of missing overflows