Embedded PDI and use of the Avro file format

As you may recall from the last article (Embedded PDI and big-data-plugin), it is difficult to configure the big-data-plugin for use from an embedded PDI engine. However, you can use the deprecated big-data-plugin steps without any problem.

Avro is a data format that bundles serialized data with the data’s schema in the same file. It is the preferred format for loading data into BigQuery: loading is faster, and the data can be read in parallel even when the data blocks are compressed. BigQuery supports the SNAPPY and DEFLATE compression codecs for Avro data blocks.
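As a minimal sketch of this, here is how a Snappy-compressed Avro file can be written with the Apache Avro Java API (the Listing schema, its fields, and the output file name are hypothetical placeholders, not the Rightmove schema used later; the snappy-java library must be on the classpath):

    import org.apache.avro.Schema;
    import org.apache.avro.file.CodecFactory;
    import org.apache.avro.file.DataFileWriter;
    import org.apache.avro.generic.GenericData;
    import org.apache.avro.generic.GenericDatumWriter;
    import org.apache.avro.generic.GenericRecord;

    import java.io.File;
    import java.io.IOException;

    public class AvroSnappyExample {
        public static void main(String[] args) throws IOException {
            // Hypothetical schema; the schema is embedded in the file header,
            // so readers (including BigQuery) do not need it separately.
            Schema schema = new Schema.Parser().parse(
                "{\"type\":\"record\",\"name\":\"Listing\",\"fields\":["
              + "{\"name\":\"address\",\"type\":\"string\"},"
              + "{\"name\":\"price\",\"type\":\"long\"}]}");

            GenericRecord record = new GenericData.Record(schema);
            record.put("address", "1 Example Street");
            record.put("price", 250000L);

            try (DataFileWriter<GenericRecord> writer =
                     new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema))) {
                writer.setCodec(CodecFactory.snappyCodec()); // SNAPPY data-block compression
                writer.create(schema, new File("listings.avro"));
                writer.append(record);
            }
        }
    }

Because the schema travels inside the file, BigQuery can derive the table structure from the Avro header during load.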

A simple data flow to load data into BigQuery:

  1. Extract the data from the source and save it in Avro format with compression.
  2. Use the Google BigQuery REST API (or a client library wrapping it, sketched below) or the bq command-line tool to load the data into BigQuery.
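The second step can also be submitted through the Google Cloud BigQuery client library for Java, which wraps the REST API. A rough sketch, assuming the Avro file has already been staged in a Cloud Storage bucket (the dataset, table, and bucket names are hypothetical):

    import com.google.cloud.bigquery.BigQuery;
    import com.google.cloud.bigquery.BigQueryOptions;
    import com.google.cloud.bigquery.FormatOptions;
    import com.google.cloud.bigquery.Job;
    import com.google.cloud.bigquery.JobInfo;
    import com.google.cloud.bigquery.LoadJobConfiguration;
    import com.google.cloud.bigquery.TableId;

    public class BigQueryAvroLoad {
        public static void main(String[] args) throws InterruptedException {
            BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();

            // Hypothetical dataset/table and staging location.
            TableId tableId = TableId.of("my_dataset", "rightmove");
            LoadJobConfiguration config =
                LoadJobConfiguration.newBuilder(tableId, "gs://my-bucket/rightmove.avro")
                    .setFormatOptions(FormatOptions.avro())
                    .setUseAvroLogicalTypes(true) // equivalent of --use_avro_logical_types
                    .setAutodetect(true)          // equivalent of --autodetect
                    .build();

            Job job = bigquery.create(JobInfo.of(config)).waitFor();
            if (job == null || job.getStatus().getError() != null) {
                throw new RuntimeException("Load job failed: "
                    + (job == null ? "job no longer exists" : job.getStatus().getError()));
            }
        }
    }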

An example implementation of the above data pipeline using PDI:

  • PDI transformation which extracts data from the Rightmove site for the given estate agent(s) and saves the result as an Avro file in local file storage.
  • PDI job which invokes the above transformation and then runs the following bq command from a Shell step (a sketch of launching this job from an embedded engine follows the list):

    bq load --use_avro_logical_types --autodetect --source_format=AVRO ${P_DATASET}.${P_TABLENAME} ${P_DATADIR}/rightmove.avro
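KaratePDI’s internals are not shown here, but the same job can be launched from any embedded PDI engine with the standard Kettle API. A minimal sketch, assuming a hypothetical job file name load_rightmove.kjb and that the job declares the three parameters referenced by the bq command:

    import org.pentaho.di.core.KettleEnvironment;
    import org.pentaho.di.job.Job;
    import org.pentaho.di.job.JobMeta;

    public class RunLoadJob {
        public static void main(String[] args) throws Exception {
            KettleEnvironment.init(); // initialize the embedded PDI (Kettle) engine

            // Hypothetical job file; null repository means load from the filesystem.
            JobMeta jobMeta = new JobMeta("load_rightmove.kjb", null);
            jobMeta.setParameterValue("P_DATASET", "my_dataset");
            jobMeta.setParameterValue("P_TABLENAME", "rightmove");
            jobMeta.setParameterValue("P_DATADIR", "/tmp/data");

            Job job = new Job(null, jobMeta);
            job.start();
            job.waitUntilFinished();

            if (job.getErrors() > 0) {
                throw new RuntimeException("Job finished with " + job.getErrors() + " error(s)");
            }
        }
    }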

The result of calling the PDI job using KaratePDI: