Pentaho ships with support for a number of Hadoop distributions in the form of shims, which are included in the PDI distribution. You need to activate a shim in order to read and write data in Avro or Parquet format:
- Locate the pentaho-big-data-plugin directory at ${PENTAHO}/data-integration/plugins/pentaho-big-data-plugin, edit plugin.properties, and set active.hadoop.configuration=hdp30.
- Cross-check the shim name against the directories under pentaho-big-data-plugin/hadoop-configurations.
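
A minimal shell sketch of these two steps, assuming a Linux install with ${PENTAHO} pointing at your PDI directory:

```bash
# Assumption: PENTAHO points at your PDI install directory
PENTAHO=/opt/pentaho

# List the shims shipped with the plugin to confirm the exact directory name (e.g. hdp30)
ls "${PENTAHO}/data-integration/plugins/pentaho-big-data-plugin/hadoop-configurations"

# Activate the hdp30 shim in plugin.properties
sed -i 's/^active.hadoop.configuration=.*/active.hadoop.configuration=hdp30/' \
  "${PENTAHO}/data-integration/plugins/pentaho-big-data-plugin/plugin.properties"
```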
You also need the google-bigquery plugin, and PDI must be given the location of the service account JSON key file used to access GCP. This is done by setting a system environment variable named GOOGLE_APPLICATION_CREDENTIALS on your operating system (see the example after the roles list below). The service account needs at least the following roles on Google Cloud Storage for this to work:
- Storage Object Creator
- Storage Object Viewer
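
For example, on Linux you could set the variable before launching Spoon so that PDI inherits it (the key file path below is a placeholder):

```bash
# Placeholder path to the service account JSON key file
export GOOGLE_APPLICATION_CREDENTIALS=/path/to/service-account-key.json

# The service account behind this key needs roles/storage.objectCreator
# and roles/storage.objectViewer on the target bucket.

# Start PDI so it picks up the variable
cd "${PENTAHO}/data-integration" && ./spoon.sh
```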
For example, you can load the Avro file from Cloud Storage into BigQuery using the bq command-line tool, as in the sketch below (the project, dataset, table, and bucket names are placeholders):
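
```bash
# Load the Avro file from Cloud Storage into BigQuery
bq load --autodetect --source_format=AVRO \
  project-name:dataset.table_name \
  gs://bucketname/test.avro

# Optionally verify the schema of the resulting table
bq show --schema --format=prettyjson project-name:dataset.table_name
```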