Writing AVRO file using PDI

Pentaho ships with a number of Hadoop distribution adapters (shims) included in the distribution. You need to activate a shim in order to write and read data in Avro or Parquet format:

  1. Locate the pentaho-big-data-plugin and shims directory at ${PENTAHO}/data-integration/plugins/pentaho-big-data-plugin, then edit plugin.properties and set active.hadoop.configuration=hdp30 (see the snippet after this list).
  2. Cross-check that the shim name matches a directory under pentaho-big-data-plugin/hadoop-configurations.
  3. You also need the google-bigquery plugin, and PDI must be given the location of the service account JSON key file used to access GCP. This can be done by setting a system environment variable named GOOGLE_APPLICATION_CREDENTIALS (see the example after this list). The following minimum roles are required on Google Cloud Storage for this to work:
    1. Storage Object Creator
    2. Storage Object Viewer
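
For reference, the shim activation from step 1 is a single line in plugin.properties; hdp30 is just the shim used here, so substitute whichever shim matches your cluster:

```properties
# ${PENTAHO}/data-integration/plugins/pentaho-big-data-plugin/plugin.properties
# The value must match a directory name under hadoop-configurations/
active.hadoop.configuration=hdp30
```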

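A minimal sketch of setting the credentials variable from step 3, assuming the key file lives at /opt/pentaho/keys/gcp-service-account.json (a placeholder path; adjust for your environment):

```sh
# Linux/macOS: export before starting Spoon/Kitchen/Pan so PDI inherits it
export GOOGLE_APPLICATION_CREDENTIALS=/opt/pentaho/keys/gcp-service-account.json

# Windows: set it persistently for the current user
setx GOOGLE_APPLICATION_CREDENTIALS "C:\pentaho\keys\gcp-service-account.json"
```
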
For example, you can load the Avro file into BigQuery using the bq command-line tool, e.g.: bq load --autodetect --source_format=AVRO project-name:dataset.table_name gs://bucketname/test.avro
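
A sketch of the full round trip, assuming the file was written locally as /tmp/test.avro and that bucketname, project-name, dataset and table_name are placeholders for your own resources:

```sh
# Copy the Avro file produced by PDI to Cloud Storage
gsutil cp /tmp/test.avro gs://bucketname/test.avro

# Load it into BigQuery; the schema comes from the Avro file itself
bq load --autodetect --source_format=AVRO \
  project-name:dataset.table_name \
  gs://bucketname/test.avro
```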