Writing AVRO file using PDI

Pentaho ships with a number of Hadoop distribution adapters (shims) included in the distribution. You need to activate a shim in order to write and read data in Avro or Parquet format:

  1. Locate the pentaho-big-data-plugin and shims directory at ${PENTAHO}/data-integration/plugins/pentaho-big-data-plugin, then edit plugin.properties and set active.hadoop.configuration=hdp30 (see the snippet after this list).
  2. Cross-check that the shim name matches a directory under pentaho-big-data-plugin/hadoop-configurations.
  3. You also need the google-bigquery plugin, and PDI must be given the location of the service account JSON key file used to access GCP. This can be done by setting a system environment variable named GOOGLE_APPLICATION_CREDENTIALS (see the example after this list). The following minimum roles are required on Google Cloud Storage for this to work:
    1. Storage Object Creator
    2. Storage Object Viewer
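
For reference, the shim activation from step 1 is a single line in plugin.properties; hdp30 is just the shim used here, so substitute whichever shim matches your cluster:

```properties
# ${PENTAHO}/data-integration/plugins/pentaho-big-data-plugin/plugin.properties
# The value must match a directory name under hadoop-configurations/
active.hadoop.configuration=hdp30
```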

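A minimal sketch of setting the credentials variable from step 3, assuming the key file lives at /opt/pentaho/keys/gcp-service-account.json (a placeholder path; adjust for your environment):

```sh
# Linux/macOS: export before starting Spoon/Kitchen/Pan so PDI inherits it
export GOOGLE_APPLICATION_CREDENTIALS=/opt/pentaho/keys/gcp-service-account.json

# Windows: set it persistently for the current user
setx GOOGLE_APPLICATION_CREDENTIALS "C:\pentaho\keys\gcp-service-account.json"
```
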
For example, you can load the Avro file into BigQuery using the bq command-line tool, e.g.: bq load --autodetect --source_format=AVRO project-name:dataset.table_name gs://bucketname/test.avro
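
A sketch of the full round trip, assuming the file was written locally as /tmp/test.avro and that bucketname, project-name, dataset and table_name are placeholders for your own resources:

```sh
# Copy the Avro file produced by PDI to Cloud Storage
gsutil cp /tmp/test.avro gs://bucketname/test.avro

# Load it into BigQuery; the schema comes from the Avro file itself
bq load --autodetect --source_format=AVRO \
  project-name:dataset.table_name \
  gs://bucketname/test.avro
```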