We will load data into BigQuery using Terraform. Our tasks:
- create a Google Cloud Storage bucket with two folders for holding source and target data
- create a dataset and two tables in BigQuery
- move the local data files to Cloud Storage and load them into the respective BigQuery tables
We will use Google Cloud Shell, a tool for managing resources hosted on Google Cloud Platform. It provides a VM with a 5 GB home directory that persists across sessions, but the VM itself is ephemeral and is reset approximately 20 minutes after your session ends.
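Before doing anything else, point Cloud Shell at the project you want to use. A minimal sketch – your_project_id is a placeholder for your own project ID:

gcloud config set project your_project_id   # set the active project
terraform -version                          # Terraform comes pre-installed in Cloud Shell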
Terraform commands
- terraform init – initialise the working directory and download the required providers
- terraform plan – preview the changes Terraform will make
- terraform apply – create the resources
Steps
- define the required resources in a Terraform configuration file (HCL format) – I will name it bigquery.tf
- create subfolders – data/source and data/target
- place the files containing data and metadata (schema) into the respective folders
- run the terraform commands – init -> plan -> apply (see the shell sketch after this list)
- use terraform destroy to tear down the created resources when finished
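A possible shell sequence for steps 2–4, assuming the file names used in the Terraform configuration below:

mkdir -p data/source data/target
# copy the CSV and JSON schema files into place:
#   data/source/companies_house.csv, data/source/companies_house.json
#   data/target/companies_house.csv, data/target/companies_house.json
terraform init    # initialise the working directory and download providers
terraform plan    # preview the resources to be created
terraform apply   # create the bucket, upload the files, create the dataset and tables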
Terraform for creating resources
# Provide your own project ID
provider "google" {
  project = "your_project_id"
}

# Random suffix so the bucket name is globally unique
resource "random_id" "storage" {
  byte_length = 4
  prefix      = "bi_"
}
resource "google_storage_bucket" "bi_bucket" {
name = random_id.storage.hex
location = "europe-west2"
force_destroy = true
lifecycle_rule {
condition {
age = 2
}
action {
type = "Delete"
}
}
}
resource "google_storage_bucket_object" "companieshouse_source_csv" {
name = "source/companies_house.csv"
source = "./data/source/companies_house.csv"
bucket = random_id.storage.hex
depends_on = [
google_storage_bucket.bi_bucket,
]
}
resource "google_storage_bucket_object" "companieshouse_target_csv" {
name = "target/companies_house.csv"
source = "./data/target/companies_house.csv"
bucket = random_id.storage.hex
depends_on = [
google_storage_bucket.bi_bucket,
]
}
resource "google_bigquery_dataset" "default" {
dataset_id = "bidb"
friendly_name = "bi dataset"
description = "Ingest daily companies house data"
location = "EU"
default_table_expiration_ms = 10800000 // 3 hr
depends_on = [
google_storage_bucket_object.companieshouse_target_csv,
google_storage_bucket_object.companieshouse_source_csv,
]
}
resource "google_bigquery_table" "companies_house" {
dataset_id = google_bigquery_dataset.default.dataset_id
table_id = "companies_house"
deletion_protection = false
schema = file("./data/target/companies_house.json")
external_data_configuration {
autodetect = true
source_format = "CSV"
csv_options {
quote = "\""
skip_leading_rows = 1
}
source_uris = [
"${google_storage_bucket.bi_bucket.url}/target/companies_house.csv",
]
}
depends_on = [
google_bigquery_dataset.default,
]
}
resource "google_bigquery_table" "companies_house_stg" {
dataset_id = google_bigquery_dataset.default.dataset_id
table_id = "companies_house_stg"
deletion_protection = false
schema = file("./data/source/companies_house.json")
external_data_configuration {
autodetect = true
source_format = "CSV"
csv_options {
quote = "\""
skip_leading_rows = 1
}
source_uris = [
"${google_storage_bucket.bi_bucket.url}/source/companies_house.csv",
]
}
depends_on = [
google_bigquery_dataset.default,
]
}
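As a variation, the project ID could be passed in as a variable instead of being hard-coded in the provider block – a small sketch, assuming a variable named project_id:

variable "project_id" {
  type        = string
  description = "GCP project to deploy into"
}

provider "google" {
  project = var.project_id
}

It can then be supplied at apply time, for example terraform apply -var="project_id=your_project_id".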
Metadata – companies_house.json (placed in both data/source and data/target)
[
  {
    "name": "companynumber",
    "type": "STRING",
    "mode": "NULLABLE"
  },
  {
    "name": "companyname",
    "type": "STRING",
    "mode": "NULLABLE"
  },
  {
    "name": "legalstatus",
    "type": "STRING",
    "mode": "NULLABLE"
  },
  {
    "name": "incorporationdate",
    "type": "STRING",
    "mode": "NULLABLE"
  },
  {
    "name": "dissolutiondate",
    "type": "STRING",
    "mode": "NULLABLE"
  }
]
Data files
Source data – data/source/companies_house.csv
companynumber,companyname,legalstatus,incorporationdate,dissolutiondate
c1,"BigData Company",1,"2019-01-23",
c2,"Data Ltd",1,"2018-01-30",
c3,"Hello Data",1,"2017-02-21","2023-05-19"
Target data – data/target/companies_house.csv
companynumber,companyname,legalstatus,incorporationdate,dissolutiondate
c1,"BigData Company",1,"2019-01-23",
c3,"Hello Data",1,"2017-02-21",
SQL for comparing datasets
SELECT
companynumber,companyname,legalstatus,incorporationdate,dissolutiondate
FROM `bidb.companies_house`
EXCEPT DISTINCT
SELECT
companynumber,companyname,legalstatus,incorporationdate,dissolutiondate
FROM `bidb.companies_house_stg`;
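The same comparison can be run from Cloud Shell with the bq CLI (table names as created above):

bq query --use_legacy_sql=false '
SELECT companynumber, companyname, legalstatus, incorporationdate, dissolutiondate
FROM `bidb.companies_house`
EXCEPT DISTINCT
SELECT companynumber, companyname, legalstatus, incorporationdate, dissolutiondate
FROM `bidb.companies_house_stg`'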
Destroy resources – when finished, clean up everything Terraform created with terraform destroy.
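terraform destroy asks for confirmation before deleting; a quick sketch, including the optional flag that skips the prompt:

terraform destroy                  # review the deletion plan and confirm
terraform destroy -auto-approve    # non-interactive variant, use with care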