Saving data into BigQuery

In this article, we show how to export data to BigQuery. You can write the PySpark ETL script in Databricks or Google Colab; here we use Databricks together with Google Cloud Platform to save the data into BigQuery.

Prerequisites

  • Set up an account for Databricks Community Edition
  • Google Cloud Platform – obtain a service account key file, create a bucket in Google Cloud Storage and a dataset in BigQuery
  • Configure the Databricks cluster to use the BigQuery Hadoop connector

Adding data to BigQuery via Google Cloud Storage

Spark -> GCS -> BigQuery

Add the following Spark configuration when creating your cluster:

  • spark.hadoop.google.cloud.auth.service.account.enable true
  • spark.hadoop.google.cloud.auth.service.account.json.keyfile /tmp/service.json
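
If the cluster is already running, the same two settings can also be applied from a notebook. The sketch below is an alternative, not part of the cluster setup above: it sets the keys directly on the session's Hadoop configuration (so the spark.hadoop. prefix is dropped) and assumes the key file has already been copied to /tmp/service.json on the driver, as shown in the next step.

# A minimal sketch: apply the same GCS auth settings to a running session.
# Assumes /tmp/service.json already exists on the driver (see the copy step below).
hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()
hadoop_conf.set("google.cloud.auth.service.account.enable", "true")
hadoop_conf.set("google.cloud.auth.service.account.json.keyfile", "/tmp/service.json")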

Make sure you upload your service credentials JSON file to Databricks DBFS and then copy it to the cluster's local filesystem:

  • dbutils.fs.cp("/first_skein_329508_2055ecccecec.json", "file:///tmp/service.json")
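
To double-check that the copy worked, an optional sanity check along these lines can be run in the same notebook; the paths simply mirror the copy command above.

# Optional check: the key file should be visible in DBFS and on the driver's local disk
import os
display(dbutils.fs.ls("/"))                  # the uploaded JSON should appear at the DBFS root
print(os.path.exists("/tmp/service.json"))   # True once the copy to the driver has succeeded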

Create a DataFrame and save it to BigQuery

Our PySpark script uses a local copy of the Companies House data and the following steps to move it from local storage to BigQuery via an intermediary (a temporary Cloud Storage bucket):

  • create a DataFrame
  • save the DataFrame to BigQuery
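
# Create a DataFrame from the Companies House parquet data stored in DBFS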
df = spark.read.format("parquet").load("dbfs:/user/hive/warehouse/companies_house_master")

# Saving the data to BigQuery
df.write \
  .format("bigquery") \
  .option("temporaryGcsBucket", "temporary_bucket") \
  .option("parentProject", "project_name") \
  .option("dataset", "dataset_name") \
  .mode("overwrite")
  .save("project_name.dataset_name.utcdb.companies_house_master")