Refer to my previous article, Processing GLIEF data in JSON format. I wanted to refresh my knowledge of the Pentaho Data Integration tool and see whether I could process a huge XML file without running into a Java OOM (Out of Memory) problem.
Pentaho offers an input step called XML Input Stream (StAX – Streaming API for XML), which can be used to parse a large XML file iteratively. I used the example at Parsing Huge XML Files in Pentaho Kettle to read and process the GLEIF XML datasets.
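Outside of Pentaho, the same streaming idea can be sketched in plain Python with xml.etree.ElementTree.iterparse: records are handled one at a time and released immediately, so the whole file never has to fit in memory. The element name LEIRecord and the file path below are assumptions for illustration, not the exact GLEIF schema.
import xml.etree.ElementTree as ET

def stream_records(path):
    # iterparse yields elements as they are parsed instead of building the full tree
    for event, elem in ET.iterparse(path, events=("end",)):
        if elem.tag.endswith("LEIRecord"):
            yield {child.tag: child.text for child in elem}
            elem.clear()  # release the element so memory use stays flat

for record in stream_records("/content/drive/MyDrive/data/gleif-records.xml"):
    pass  # process one record at a time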
Tasks
- Read the XML file
- Filter out rows with null values, i.e. keep only rows where xml_data_value IS NOT NULL
- Transform using JavaScript: increment a group variable using the first attribute in a record (make sure it is unique); in our case it is the LEI
- Use the group variable in the Row denormaliser step to combine data from several rows into one row by creating new target columns/fields (a pandas sketch of this grouping and pivoting follows this list)
- Save the data as a Parquet file
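To make the grouping and denormalising steps concrete, here is a rough pandas sketch of the equivalent logic. It assumes the StAX step emits xml_data_name / xml_data_value pairs (as in the example the transformation is based on), and the sample values are made up for illustration; this is not the actual Kettle JavaScript.
import pandas as pd

# Illustrative rows in the shape the StAX step produces (name/value pairs)
rows = pd.DataFrame({
    "xml_data_name":  ["LEI", "LegalName", "Country", "LEI", "LegalName", "Country"],
    "xml_data_value": ["LEI0001", "Acme Ltd", "GB", "LEI0002", "Example AG", "DE"],
})

# Filter: keep only rows where xml_data_value IS NOT NULL
rows = rows[rows["xml_data_value"].notna()]

# JavaScript step equivalent: start a new group each time the first attribute (LEI) appears
rows["group"] = (rows["xml_data_name"] == "LEI").cumsum()

# Row denormaliser equivalent: pivot each group into a single wide row
wide = rows.pivot(index="group", columns="xml_data_name", values="xml_data_value")

# Parquet output (requires pyarrow or fastparquet)
wide.to_parquet("glief.parquet")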
Validating Parquet output file
I used DuckDB in Google Colab to verify the result. DuckDB supports both reading and writing Parquet, a columnar file format that is efficient for storing and querying data.
import duckdb

# Open an in-memory DuckDB connection and scan the Parquet file produced by the pipeline
conn = duckdb.connect()
results = conn.execute('''
SELECT *
FROM read_parquet('/content/drive/MyDrive/data/glief.parquet20240319.parquet')
''').df()
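A couple of quick follow-up checks on the same connection, a row count and the inferred column types, help confirm that each LEI record landed as a single denormalised row (this is just a usage sketch, not part of the original notebook):
# Row count straight from the Parquet file
print(conn.execute('''
SELECT COUNT(*) FROM read_parquet('/content/drive/MyDrive/data/glief.parquet20240319.parquet')
''').fetchone())

# Column names and dtypes inferred by pandas
print(results.dtypes)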
Using Apache Hop
“The Hop Orchestration Platform, or Apache Hop, aims to facilitate all aspects of data and metadata orchestration.”
https://hop.apache.org/
It was surprisingly easy to import the Pentaho transformation into Apache Hop using the importer tool that ships with it.
However, Hop could not recognise the Pentaho Parquet Output step, so I had to replace it with Hop's Parquet Output File step.
Running Hop in Google Colab
It is possible to run Hop in Google Colab. You just need to download the Hop runtime and run hop-run.sh.
%%bash
# I copied the Hop runtime into Google Drive
cd /content/drive/MyDrive/hop/
# Run the pipeline (.hpl) that was imported from the Pentaho transformation
bash ./hop-run.sh -f /content/drive/MyDrive/data/xml-test.hpl
Last thoughts
Apache Hop (created from the forked Kettle source code) and Pentaho are great tools for working with smaller datasets (<= 2 GB for better performance), and their drag-and-drop interface (hop-gui.sh) makes it easy to create ETL pipelines. Apache Hop offers more, and can also run ETL pipelines on different engines: locally, or on Spark, Flink and Dataproc using Apache Beam.
Pentaho/Kettle has been around for a long time, and many small, medium and large corporations use it for their data integration and ETL needs.
I would recommend Apache Hop over Pentaho/Kettle. However, for large data engineering tasks, use managed or serverless ETL services such as AWS Glue or Google Dataproc. For processing large datasets, one should consider an engine like Apache Spark, Apache Flink or Trino.