Reading large XML file using Pentaho and Apache Hop

Pentaho offers an input step called XML Input Stream (StAX, the Streaming API for XML), which can be used to parse large XML files iteratively. I followed the example in Parsing Huge XML Files in Pentaho Kettle to read and process the GLEIF XML datasets.
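The key point of the StAX step is that it streams the document element by element instead of loading it all into memory. Outside Pentaho, the same idea looks roughly like the sketch below using Python's standard-library iterparse; the file name and record tag are assumptions for illustration, not details taken from the transformation.

```python
# A rough Python equivalent of the streaming approach (not the Pentaho step itself).
import xml.etree.ElementTree as ET

def stream_records(path, record_tag):
    """Yield one record element at a time without loading the whole file."""
    for _, elem in ET.iterparse(path, events=("end",)):
        tag = elem.tag.rsplit("}", 1)[-1]   # drop any namespace prefix
        if tag == record_tag:
            yield elem
            elem.clear()                    # free memory used by the finished subtree

# File name and tag are placeholders for the GLEIF dataset.
for record in stream_records("gleif-lei-records.xml", "LEIRecord"):
    pass  # process each record here
```

Inside Pentaho, the transformation consists of the following steps: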

  • Read the XML file
  • Filter out null values, i.e. keep rows where xml_data_value IS NOT NULL
  • Transform using JavaScript: increment a group variable on the first attribute of each record (make sure it is unique); in our case that is the LEI (see the Python sketch after this list)
  • Use the group variable in the Row denormaliser step, which combines data from several rows into one row by creating new target columns/fields
  • Save the data as a Parquet file
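That group-then-pivot logic is easier to follow in plain code than in step settings, so here is a minimal Python sketch of the same idea rather than the actual Pentaho JavaScript; the field names, sample values and output file name are placeholders.

```python
# Plain-Python sketch of the group/denormalise logic (not the Pentaho steps themselves).
import pandas as pd

def denormalise(rows, group_field="LEI"):
    """Turn a stream of (name, value) pairs into one dict per record.

    A new record starts each time the group field (LEI) appears again,
    mirroring the JavaScript step that increments the group variable.
    """
    record = None
    for name, value in rows:
        if value is None:            # the xml_data_value IS NOT NULL filter
            continue
        if name == group_field:      # first attribute of a record -> new group
            if record:
                yield record
            record = {}
        if record is not None:
            record[name] = value
    if record:
        yield record

# Tiny stand-in for the (name, value) pairs the StAX step would emit.
rows = [
    ("LEI", "LEI-0001"), ("LegalName", "Example Entity Ltd"), ("Country", None),
    ("LEI", "LEI-0002"), ("LegalName", "Another Entity AG"), ("Country", "DE"),
]

df = pd.DataFrame(list(denormalise(rows)))
df.to_parquet("lei_records.parquet")   # needs pyarrow; file name is an assumption
```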
[Screenshots: the Row denormaliser processing the large XML file, a preview of the data, and the Row denormaliser settings in Pentaho]

I used DuckDB in Google Colab to verify the result. DuckDB supports both reading and writing Parquet, a columnar file format that is efficient for storing and querying data.

[Screenshot: DuckDB query result saved as a pandas DataFrame]
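A minimal version of that check in a Colab cell could look like the snippet below; the Parquet file name and the LEI column carry over from the sketch above and are assumptions.

```python
import duckdb

# Query the Parquet file directly and pull the result into a pandas DataFrame.
df = duckdb.query("""
    SELECT LEI, count(*) AS n
    FROM 'lei_records.parquet'
    GROUP BY LEI
    ORDER BY n DESC
    LIMIT 10
""").df()
print(df)
```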

“The Hop Orchestration Platform, or Apache Hop, aims to facilitate all aspects of data and metadata orchestration.”

https://hop.apache.org/

It was surprisingly easy to import the Pentaho transformation into Apache Hop using the import tool that comes with it.

Hop could not recognise the Parquet Output step, so I had to use Hop's own Parquet File Output step instead.

It is possible to run Hop on Google Colab: you just need to download the Hop runtime and run hop-run.sh.
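In a Colab notebook that amounts to a couple of shell calls, wrapped here in Python's subprocess. The Hop version, download URL, extracted directory, pipeline file name and run configuration below are all assumptions, so check them against the Hop download page and the hop-run documentation.

```python
import subprocess

# Download and unpack the Hop client (version and URL are assumptions).
subprocess.run(
    ["wget", "-q",
     "https://downloads.apache.org/hop/2.8.0/apache-hop-client-2.8.0.zip"],
    check=True,
)
subprocess.run(["unzip", "-q", "apache-hop-client-2.8.0.zip"], check=True)

# Run the pipeline imported from Pentaho (file and run configuration names are assumptions).
subprocess.run(
    ["bash", "hop/hop-run.sh",
     "--file=parse-gleif-xml.hpl",
     "--runconfig=local"],
    check=True,
)
```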

Hop processing the large XML file took 2 hours 28 minutes!

