Tag: BigQuery

  • Google CoLab and relational database

    We will show how to access a relational database from Google Colab, using both a plain Python module and PySpark to read from and write to a MySQL database. See other articles related to BigQuery and Google Colab. Prerequisites: for example, you will only need mysql-connector-python for accessing the MySQL database. Connect to MySQL database: assuming you…
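
    A minimal sketch of the kind of connection the article describes, using mysql-connector-python; the host, credentials and table name below are placeholders rather than values from the article:

        # Install first in the Colab notebook: pip install mysql-connector-python
        import mysql.connector

        # Placeholder connection details; replace with your own MySQL host and credentials.
        conn = mysql.connector.connect(
            host="your-mysql-host",
            user="your_user",
            password="your_password",
            database="your_database",
        )

        cursor = conn.cursor()
        cursor.execute("SELECT * FROM your_table LIMIT 5")   # read a few rows back
        for row in cursor.fetchall():
            print(row)

        cursor.close()
        conn.close()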

  • Load data into BigQuery using Terraform

    We will load data into BigQuery using Terraform. Our task: we will use Google Cloud Shell, a tool for managing resources hosted on Google Cloud Platform. It provides a VM with a 5 GB home directory that persists across sessions, but the VM is ephemeral and will be reset approximately 20 minutes after your session ends. Terraform…

  • Processing UK Companies House PSC Data

    We will look into the People with significant control (PSC) snapshot data, which can be downloaded from the Companies House website. The snapshot data is provided in JSON format and can be downloaded as a single file or as multiple files for ease of downloading. We will use the single file in Google Colab to carry out…
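
    As a rough illustration of the Colab/PySpark setup the article walks through, the sketch below reads a local copy of the snapshot with spark.read.json; the file name is a placeholder and it assumes the snapshot is newline-delimited JSON (one record per line), so check the Companies House data definition:

        from pyspark.sql import SparkSession

        # Start a local Spark session inside the Colab runtime.
        spark = SparkSession.builder.appName("psc-snapshot").getOrCreate()

        # spark.read.json expects one JSON record per line by default.
        df = spark.read.json("/content/persons-with-significant-control-snapshot.txt")

        df.printSchema()      # inspect the inferred (nested) schema
        print(df.count())     # number of PSC records in the snapshot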

  • Processing Charity Data using Google Colab

    Data related to registered charities in England and Wales can be downloaded from https://register-of-charities.charitycommission.gov.uk/register/full-register-download. Charity data and running PySpark under Google Colab: we will use Google Colab to download publicly available data from the Charity Commission website. Transformed or enriched data will be saved in Google BigQuery. Please read the data definition before ingesting and carrying out exploratory…
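
    A hedged sketch of saving a transformed DataFrame to BigQuery from Colab with the google-cloud-bigquery client; the project, dataset, table and column names are placeholders, not the article's:

        import pandas as pd
        from google.cloud import bigquery
        from google.colab import auth

        auth.authenticate_user()                      # authenticate the Colab session

        client = bigquery.Client(project="your-project-id")

        # Stand-in for the transformed/enriched charity data.
        df = pd.DataFrame({"charity_number": ["123456"], "charity_name": ["Example Charity"]})

        table_id = "your-project-id.charity_dataset.charity_main"
        job = client.load_table_from_dataframe(df, table_id)
        job.result()                                  # wait for the load job to finish
        print(f"Loaded {job.output_rows} rows into {table_id}")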

  • Reading GLIEF data in XML format and storing to BigQuery

    The previous article, Processing GLIEF data in JSON format, described how to ingest data into a Databricks (Community Edition) data lake using PySpark. However, we were unable to process the GLEIF Golden Copy JSON-format file due to memory issues and the complexity of the nested JSON objects. The input file size (after unzipping the file size was…
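
    The sketch below illustrates one way to read a large XML file with PySpark via the spark-xml package; the package version, file name and the lei:LEIRecord row tag are assumptions, so check the GLEIF XML schema for the actual record element:

        from pyspark.sql import SparkSession

        # Pull in the spark-xml package so format("xml") is available.
        spark = (
            SparkSession.builder
            .appName("gleif-xml")
            .config("spark.jars.packages", "com.databricks:spark-xml_2.12:0.15.0")
            .getOrCreate()
        )

        # rowTag selects the repeating record element to turn into DataFrame rows.
        df = (
            spark.read.format("xml")
            .option("rowTag", "lei:LEIRecord")
            .load("/content/gleif-goldencopy-lei2.xml")
        )

        df.printSchema()
        print(df.count())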

  • Embedded PDI and use of Avro format file

    As you may recall from the last article (Embedded PDI and big-data-plugin), it is difficult to configure the big-data-plugin for use from the Embedded PDI engine. However, you can use the deprecated big-data-plugin steps without any problem. Avro is a data format that bundles serialized data with the data’s schema in the same file. Avro is the preferred format…
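
    To illustrate the point that an Avro file bundles the schema with the serialized data, here is a small sketch using the fastavro library; the schema and records are invented for illustration and are unrelated to the PDI example in the article:

        # pip install fastavro
        from fastavro import writer, reader, parse_schema

        # A made-up record schema for illustration.
        schema = parse_schema({
            "name": "Person",
            "type": "record",
            "fields": [
                {"name": "name", "type": "string"},
                {"name": "age", "type": "int"},
            ],
        })

        records = [{"name": "Alice", "age": 30}, {"name": "Bob", "age": 41}]

        with open("people.avro", "wb") as out:
            writer(out, schema, records)          # the schema is embedded in the file

        with open("people.avro", "rb") as inp:
            avro_reader = reader(inp)
            print(avro_reader.writer_schema)      # schema recovered from the file itself
            for rec in avro_reader:
                print(rec)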

  • How to copy BigQuery tables between locations with PDI

    Suppose you have accidentally created a dataset in the US region instead of the EU. Your dataset has a few tables with a large amount of data. How do you copy or move data from one region to another? You could use the command-line tool bq cp. Unfortunately, the copy command does not support cross-region copies. You can…
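
    The article's solution uses PDI, but as a point of comparison here is a rough sketch of the generic export-and-reload route with the google-cloud-bigquery Python client; the project, dataset, table and bucket names are placeholders, and the step of copying the exported files between US and EU buckets is only noted in a comment:

        from google.cloud import bigquery

        client = bigquery.Client(project="your-project-id")

        source_table = "your-project-id.us_dataset.my_table"   # dataset created in US
        dest_table = "your-project-id.eu_dataset.my_table"     # target dataset in EU
        us_uri = "gs://your-us-bucket/my_table-*.avro"
        eu_uri = "gs://your-eu-bucket/my_table-*.avro"

        # 1. Export the US table to a US-located bucket.
        extract_cfg = bigquery.ExtractJobConfig(destination_format="AVRO")
        client.extract_table(source_table, us_uri, job_config=extract_cfg, location="US").result()

        # 2. Copy the exported objects from the US bucket to the EU bucket
        #    (e.g. with gsutil cp; not shown here).

        # 3. Load the files from the EU bucket into the EU dataset.
        load_cfg = bigquery.LoadJobConfig(source_format=bigquery.SourceFormat.AVRO)
        client.load_table_from_uri(eu_uri, dest_table, job_config=load_cfg, location="EU").result()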