Tag: GCP

  • Google Earth Engine and Geospatial analysis

    We have just started looking at Google Earth Engine (GEE) and are interested in using satellite images to explore potential use cases. There are a total of 1052 satellites in orbit, and they have generated exabytes of data, a volume that is growing fast. Google has built one of the largest and most sophisticated data infrastructures in the…

  • Loading JSON data into Neo4j database

    Following on from a recent article, which describes a method for using a Neo4j database on Google Colab, we will show how to load JSON data into a Neo4j database, covering the prerequisites, the implementation and the output. We will use the Pandas and py2neo packages for the task in hand.
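
    A minimal sketch of the approach, assuming py2neo's Graph/Node API and a hypothetical people.json file with name and city fields; the connection details and the article's actual code may differ:

    ```python
    import pandas as pd
    from py2neo import Graph, Node

    # Placeholder connection details; replace with your Neo4j instance.
    graph = Graph("bolt://localhost:7687", auth=("neo4j", "password"))

    # Load the JSON file into a DataFrame first (hypothetical file and fields).
    df = pd.read_json("people.json")

    # Create one node per row; merge on a unique key to avoid duplicates.
    for row in df.to_dict(orient="records"):
        node = Node("Person", name=row["name"], city=row.get("city"))
        graph.merge(node, "Person", "name")
    ```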

  • Google Translation API

    We will show how to use the Google Translation API from Google Colab, covering the prerequisites, the steps involved, code to translate text, and translating an HTML file.
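
    A minimal sketch using the google-cloud-translate v2 client, assuming a service-account key is already configured in the session and a hypothetical page.html input file:

    ```python
    # Requires: pip install google-cloud-translate
    # and GOOGLE_APPLICATION_CREDENTIALS pointing at a service-account key.
    from google.cloud import translate_v2 as translate

    client = translate.Client()

    # Translate a plain-text string into English.
    result = client.translate("Bonjour tout le monde", target_language="en")
    print(result["translatedText"])

    # Translating an HTML file: the same call accepts HTML input via format_="html".
    with open("page.html", encoding="utf-8") as f:
        html = f.read()
    translated = client.translate(html, target_language="en", format_="html")
    with open("page_en.html", "w", encoding="utf-8") as f:
        f.write(translated["translatedText"])
    ```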

  • Scheduling Python Script to run on GCP

    I came across the article Setting up a Recurring Google Cloud Function With Terraform and used a serverless approach to schedule a Python script that runs periodically, invokes the Companies House PSC stream to fetch data, and saves the results into a free PostgreSQL database. I am using a Google Cloud Function and Cloud Scheduler to accomplish this task without the need for…
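
    As a rough sketch of the serverless piece, this is what an HTTP-triggered Cloud Function entry point might look like; fetch_psc_events, the psc_events table and the DATABASE_URL environment variable are hypothetical placeholders, with Cloud Scheduler configured separately to call the function's URL on a cron schedule:

    ```python
    # main.py for a Python Cloud Function (HTTP trigger).
    # Cloud Scheduler calls this endpoint on a cron schedule.
    import os
    import psycopg2  # any PostgreSQL client would do


    def fetch_psc_events():
        """Hypothetical helper: pull a batch of events from the PSC stream."""
        return []


    def run_psc_job(request):
        """Entry point: fetch PSC events and save them to PostgreSQL."""
        rows = fetch_psc_events()
        conn = psycopg2.connect(os.environ["DATABASE_URL"])  # assumed env var
        with conn, conn.cursor() as cur:
            for row in rows:
                cur.execute(
                    "INSERT INTO psc_events (payload) VALUES (%s)", (str(row),)
                )
        conn.close()
        return f"saved {len(rows)} rows", 200
    ```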

  • Processing Companies House PSC stream data

    We need to access the PSC stream API and extract information about persons with significant control. However, the result contains personal information such as name, partial date of birth, nationality and address with postcode. Please read the following articles, Processing UK Companies House PSC Data and Companies House Stream API, for more detailed information on Companies House PSC…
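
    A hedged sketch of consuming the stream with requests, assuming the persons-with-significant-control streaming endpoint and an API key held in an environment variable; the field names kept in the filtering step are illustrative only:

    ```python
    import json
    import os
    import requests

    # Assumed endpoint; check the Companies House Streaming API docs for the exact URL.
    STREAM_URL = "https://stream.companieshouse.gov.uk/persons-with-significant-control"
    API_KEY = os.environ["CH_API_KEY"]  # Companies House API key

    # The streaming API uses HTTP basic auth with the key as the username.
    with requests.get(STREAM_URL, auth=(API_KEY, ""), stream=True) as resp:
        resp.raise_for_status()
        for line in resp.iter_lines():
            if not line:
                continue  # keep-alive heartbeats arrive as empty lines
            event = json.loads(line)
            data = event.get("data", {})
            # Keep only non-sensitive fields for downstream processing (illustrative).
            record = {
                "kind": data.get("kind"),
                "nationality": data.get("nationality"),
                "notified_on": data.get("notified_on"),
            }
            print(record)
    ```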

  • Reading GLIEF data in XML format and storing to BigQuery

    In the previous article, Processing GLIEF data in JSON format, we described how to ingest data into the Databricks (community edition) data lake using PySpark. However, we were unable to process the GLEIF Golden Copy JSON format file due to a memory issue and the complexity of the nested JSON objects. The input file size (after unzipping the file size was…
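
    One way to avoid holding the whole file in memory is to stream-parse the XML and load batches of rows into BigQuery; this sketch uses xml.etree.ElementTree.iterparse and google-cloud-bigquery, with the tag names, field names and table ID as assumptions rather than the real GLEIF schema:

    ```python
    # Stream-parse a large XML file and push batches of rows into BigQuery.
    import xml.etree.ElementTree as ET
    from google.cloud import bigquery

    client = bigquery.Client()
    table_id = "my-project.gleif.lei_records"  # placeholder table

    rows, batch_size = [], 500
    # "record" and the child tags below are illustrative, not the actual GLEIF schema.
    for event, elem in ET.iterparse("gleif_golden_copy.xml", events=("end",)):
        if elem.tag.endswith("record"):
            rows.append({
                "lei": elem.findtext("lei"),
                "legal_name": elem.findtext("legal_name"),
            })
            elem.clear()  # free memory as we go
        if len(rows) >= batch_size:
            client.insert_rows_json(table_id, rows)  # streaming insert
            rows = []

    if rows:
        client.insert_rows_json(table_id, rows)
    ```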

  • Process Companies House data using Google Colab

    In this article we will discuss how to process Companies House data using Google Colab. In previous articles we demonstrated ETL processes using Companies House monthly snapshot data, with most of the processing carried out on a local machine. A recent monthly snapshot contains just over 5 million records. We are interested in building an ETL pipeline for…
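
    A minimal sketch of processing the snapshot in chunks with pandas so that a 5-million-row CSV fits comfortably in Colab memory; the file name and column names are assumptions, not the actual snapshot layout:

    ```python
    import pandas as pd

    # Hypothetical snapshot file name and columns; adjust to the real snapshot layout.
    SNAPSHOT = "BasicCompanyDataAsOneFile.csv"
    USECOLS = ["CompanyName", "CompanyNumber", "CompanyStatus"]

    chunks = pd.read_csv(SNAPSHOT, usecols=USECOLS, chunksize=200_000)

    active = 0
    for chunk in chunks:
        # Simple transform step: count active companies chunk by chunk.
        active += (chunk["CompanyStatus"] == "Active").sum()

    print(f"Active companies: {active}")
    ```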

  • Install Google Cloud CLI in Termux

    Install the gcloud CLI to access Google Cloud Shell via SSH on Android using Termux. First, run curl https://sdk.cloud.google.com | bash. Note: this will fail when trying to install components; ignore this. Then run $PREFIX/google-cloud-sdk/install.sh --override-components (without specifying components), which will add gcloud to $PATH. Then run gcloud components install gsutil. Finally, gcloud --console-only. Using the --console-only flag is useful if you’re running…

  • Creating Cloud MySQL instance with Terraform

    “Terraform is a tool for building, changing, and versioning infrastructure safely and efficiently. Terraform can manage existing and popular service providers as well as custom in-house solutions.” In this post, we will show how to use Terraform to manage Google Cloud SQL resources. Prerequisites: a Google Cloud Platform account and Terraform installed on your machine (you can find…

  • Writing AVRO file using PDI

    Pentaho provides a number of Hadoop distributions (shims), which are included in the distribution. You need to enable a shim in order to write and read data in Avro or Parquet format: locate the pentaho-big-data-plugin and shims directory ${PENTAHO}/data-integration/plugins/pentaho-big-data-plugin, edit plugin.properties and set active.hadoop.configuration=hdp30, then cross check the shim name under the pentaho-big-data-plugin/hadoop-configurations directory. You need the google-bigquery plugin and give…