Tag: Java
-
Simple ETL pipeline using Apache NiFi
In this article, we will compose a simple ETL pipeline using Apache NiFi. We want to read an Excel file and convert it to a CSV file using the Apache NiFi ConvertExcelToCSV processor. Prerequisites: install Apache NiFi (we used $ brew install nifi); basic knowledge of ETL concepts and data integration tools. Data Pipeline: Apache NiFi is…
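The pipeline itself is composed in the NiFi UI, so there is no code in the article; purely as an illustration of what the ConvertExcelToCSV processor does conceptually, here is a minimal Java sketch using Apache POI (the file names and the single-sheet, no-CSV-escaping assumptions are mine, not the article's):

```java
import org.apache.poi.ss.usermodel.Cell;
import org.apache.poi.ss.usermodel.DataFormatter;
import org.apache.poi.ss.usermodel.Row;
import org.apache.poi.ss.usermodel.Sheet;
import org.apache.poi.ss.usermodel.Workbook;
import org.apache.poi.ss.usermodel.WorkbookFactory;

import java.io.File;
import java.io.FileWriter;
import java.io.PrintWriter;

public class ExcelToCsvSketch {
    public static void main(String[] args) throws Exception {
        // "input.xlsx" and "output.csv" are placeholder names
        try (Workbook workbook = WorkbookFactory.create(new File("input.xlsx"));
             PrintWriter out = new PrintWriter(new FileWriter("output.csv"))) {
            Sheet sheet = workbook.getSheetAt(0);          // first sheet only
            DataFormatter formatter = new DataFormatter(); // renders cells as displayed text
            for (Row row : sheet) {
                StringBuilder line = new StringBuilder();
                for (int c = 0; c < row.getLastCellNum(); c++) {
                    if (c > 0) line.append(',');
                    Cell cell = row.getCell(c);
                    if (cell != null) line.append(formatter.formatCellValue(cell));
                }
                out.println(line); // note: no quoting of commas inside cell values
            }
        }
    }
}
```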
-
Installing JDK 1.8 on Raspberry Pi 2
The embedded Pentaho engine requires Java 8. Follow the steps below to install JDK 1.8 on a Raspberry Pi: get JDK 1.8 for Linux ARM v6/v7 Hard Float ABI from http://www.oracle.com/technetwork/java/javase/downloads/jdk8-downloads-2133151.html sudo tar zxvf jdk-8u271-linux-arm32-vfp-hflt.tar.gz -C /opt Set the default java and javac to the newly installed JDK 8. sudo update-alternatives --install /usr/bin/javac javac /opt/jdk1.8.0_271/bin/javac 1 sudo update-alternatives --install /usr/bin/java…
-
Embedded PDI and use of Avro format file
As you may recall from the last article (Embedded PDI and big-data-plugin), it is difficult to configure the big-data-plugin for use from the embedded PDI engine. However, you can use the deprecated big-data-plugin steps without any problem. Avro is a data format that bundles serialized data with the data's schema in the same file. Avro is the preferred format…
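The article itself works through PDI steps; as a point of reference for how Avro stores the schema alongside the serialized records, here is a rough sketch with the plain Avro Java library (the schema, field names and output path are invented for illustration):

```java
import org.apache.avro.Schema;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

import java.io.File;

public class AvroWriteSketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical schema: a single "customer" record type
        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"customer\",\"fields\":["
          + "{\"name\":\"id\",\"type\":\"long\"},"
          + "{\"name\":\"name\",\"type\":\"string\"}]}");

        GenericRecord record = new GenericData.Record(schema);
        record.put("id", 1L);
        record.put("name", "Ada");

        // The schema is written into the file header, followed by the data blocks
        try (DataFileWriter<GenericRecord> writer =
                 new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema))) {
            writer.create(schema, new File("customers.avro"));
            writer.append(record);
        }
    }
}
```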
-
Embedded PDI and big-data-plugin
How do you make pentaho-big-data-plugin available to your application? From version 6, PDI has added support for Karaf, an OSGi platform that allows developers to create dynamically bound modules which can be added to and removed from a running platform without a restart. This feature makes it difficult to use…
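For readers who have not embedded PDI before, the baseline (before any big-data-plugin or Karaf concerns enter the picture) looks roughly like the sketch below; the .ktr path is a placeholder:

```java
import org.pentaho.di.core.KettleEnvironment;
import org.pentaho.di.trans.Trans;
import org.pentaho.di.trans.TransMeta;

public class EmbeddedPdiSketch {
    public static void main(String[] args) throws Exception {
        // Initialise the Kettle environment (loads core plugins and step types)
        KettleEnvironment.init();

        // "my_transformation.ktr" is a placeholder path
        TransMeta transMeta = new TransMeta("my_transformation.ktr");
        Trans trans = new Trans(transMeta);

        trans.execute(null);       // no extra command-line arguments
        trans.waitUntilFinished();

        if (trans.getErrors() > 0) {
            throw new RuntimeException("Transformation finished with errors");
        }
    }
}
```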
-
Deploy Embedded PDI in a Kubernetes cluster
Suppose you have a Kubernetes cluster up and running. You want to deploy a Docker container (https://hub.docker.com/repository/docker/aliakkas/karatepdi). KaratePDI is a Spring Boot REST application for automating ETL testing using the Behaviour Driven Development framework Karate and the embedded Pentaho Data Integration engine. You can use it to execute a PDI job/transformation or run a Karate feature file to produce…
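As a minimal sketch (not necessarily the article's approach), assuming kubectl is pointed at the cluster and that the container listens on port 8080 (my assumption), the deployment could be created and exposed imperatively:

```
# assumes kubectl is configured for the cluster and the app listens on 8080
kubectl create deployment karatepdi --image=aliakkas/karatepdi
kubectl expose deployment karatepdi --port=8080 --type=LoadBalancer
kubectl get pods -l app=karatepdi
```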
-
How to copy BigQuery tables between locations with PDI
Suppose you have accidentally created a dataset in the US region instead of the EU. Your dataset has a few tables with a large amount of data. How do you copy or move the data from one region to another? You could try the command-line tool bq cp: unfortunately, the copy command does not support cross-region copies. You can…
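The article solves this with PDI; purely to illustrate the general extract-and-reload idea, here is a hedged Java sketch using the google-cloud-bigquery client (project, dataset, table and bucket names are placeholders, and the usual GCS bucket location rules apply to both the extract and the load jobs):

```java
import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.FormatOptions;
import com.google.cloud.bigquery.Job;
import com.google.cloud.bigquery.JobInfo;
import com.google.cloud.bigquery.LoadJobConfiguration;
import com.google.cloud.bigquery.TableId;

public class CrossRegionCopySketch {
    public static void main(String[] args) throws InterruptedException {
        BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();

        // Placeholder identifiers; choose a bucket location valid for both jobs
        TableId source = TableId.of("my-project", "us_dataset", "my_table");
        TableId target = TableId.of("my-project", "eu_dataset", "my_table");
        String gcsUri = "gs://my-staging-bucket/my_table-*.avro";

        // 1. Export the source table to Cloud Storage as Avro
        Job extract = bigquery.getTable(source).extract("AVRO", gcsUri);
        extract.waitFor();

        // 2. Load the exported files into the dataset in the destination region
        LoadJobConfiguration load = LoadJobConfiguration.newBuilder(target, gcsUri)
                .setFormatOptions(FormatOptions.avro())
                .build();
        bigquery.create(JobInfo.of(load)).waitFor();
    }
}
```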
-
Writing an Avro file using PDI
Pentaho provides a number of Hadoop distributions (shims), which are included in the distribution. You need to enable a shim in order to write and read data in Avro or Parquet format: locate the pentaho-big-data-plugin and shims directory ${PENTAHO}/data-integration/plugins/pentaho-big-data-plugin, edit plugin.properties and set active.hadoop.configuration=hdp30, then cross-check the shim name under the pentaho-big-data-plugin/hadoop-configurations directory. You need the google-bigquery plugin and give…
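In other words, the relevant change is a single property (the hdp30 value comes from the article; adjust it to whatever shim sits under hadoop-configurations in your install):

```
# ${PENTAHO}/data-integration/plugins/pentaho-big-data-plugin/plugin.properties
active.hadoop.configuration=hdp30
```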
-
No joy with Google App Engine
Deploying KaratePDI (a Spring Boot application) to Google App Engine. It was an expensive mistake for a number of reasons: unable to integrate Google Storage with the embedded Pentaho Data Integration engine; requires refactoring of the code; Google App Engine charges by the hour per instance, so you will get charged even when not using the app! What a scam!!!!…
-
Generate a Google access token from a JSON service account file in a PDI transformation step
Overview: This article explains how to generate an access token from a service account JSON file using the User Defined Java Class (UDJC) step. Once a token is generated, you can use it to access Google APIs from the Pentaho REST Client or other steps. Prerequisites: Create a project in Google Cloud Platform. Activate the API…
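The article's code runs inside a UDJC step; stripped of the PDI scaffolding, the core of it could look roughly like this with google-auth-library (the key file path and the cloud-platform scope are illustrative placeholders):

```java
import com.google.auth.oauth2.AccessToken;
import com.google.auth.oauth2.GoogleCredentials;

import java.io.FileInputStream;
import java.util.Collections;

public class TokenFromServiceAccount {
    public static void main(String[] args) throws Exception {
        // "service-account.json" and the cloud-platform scope are placeholders
        GoogleCredentials credentials = GoogleCredentials
                .fromStream(new FileInputStream("service-account.json"))
                .createScoped(Collections.singletonList(
                        "https://www.googleapis.com/auth/cloud-platform"));

        credentials.refreshIfExpired();            // fetches a token if none is cached
        AccessToken token = credentials.getAccessToken();

        // The token value is what goes into the Authorization: Bearer header
        System.out.println(token.getTokenValue());
    }
}
```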
-
Automating Pentaho Data Integration testing using a Behaviour Driven Development framework
Pentaho Data Integration facilitates the process of capturing, cleansing, and storing data in a uniform and consistent format that is accessible and relevant, solving many data integration challenges. Testing an ETL process can be accomplished using the embedded PDI engine and Cucumber. I developed a Java-based tool which provides a set of HTTP endpoints…
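A Karate feature driving such an endpoint might look something like the sketch below; the base URL, the '/run' path and the response shape are hypothetical placeholders, not the tool's actual API:

```gherkin
Feature: execute a PDI transformation through the testing tool's HTTP API

  Scenario: transformation completes without errors
    # 'baseUrl', the '/run' path and the response fields are hypothetical
    Given url baseUrl + '/run'
    And request { transformation: 'load_customers.ktr' }
    When method post
    Then status 200
    And match response.result == 'SUCCESS'
```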