We will load data into BigQuery using Terraform. Our tasks:
- create a Google Cloud Storage bucket with two folders for holding source and target data
- create a dataset and two tables in BigQuery
- move the local data files to Cloud Storage and load them into the respective BigQuery tables
We will use Google Cloud Shell, a tool for managing resources hosted on Google Cloud Platform. It provides a VM with a 5 GB home directory that persists across sessions, but the VM itself is ephemeral and is reset approximately 20 minutes after your session ends.
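Before doing anything else, point Cloud Shell at the project you want to use. A minimal sketch – your_project_id is a placeholder for your own project ID:

gcloud config set project your_project_id   # set the active project
terraform -version                          # Terraform comes pre-installed in Cloud Shell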
Terraform commands
- terraform init – initialise the working directory and download the required providers
- terraform plan – preview the changes Terraform will make
- terraform apply – create the resources
Steps
- define the required resources in a Terraform configuration file (HCL format) – I will name it bigquery.tf
- create subfolders – data/source and data/target
- place the files containing data and metadata (schema) into the respective folders
- run the terraform commands – init -> plan -> apply (see the shell sketch after this list)
- use terraform destroy to tear down the created resources when finished
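A possible shell sequence for steps 2–4, assuming the file names used in the Terraform configuration below:

mkdir -p data/source data/target
# copy the CSV and JSON schema files into place:
#   data/source/companies_house.csv, data/source/companies_house.json
#   data/target/companies_house.csv, data/target/companies_house.json
terraform init    # initialise the working directory and download providers
terraform plan    # preview the resources to be created
terraform apply   # create the bucket, upload the files, create the dataset and tables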
Terraform for creating resources
# Provide your own project ID
provider "google" {
  project = "your_project_id"
}

# Random suffix so the bucket name is globally unique
resource "random_id" "storage" {
  byte_length = 4
  prefix      = "bi_"
}
resource "google_storage_bucket" "bi_bucket" {
name = random_id.storage.hex
location = "europe-west2"
force_destroy = true
lifecycle_rule {
condition {
age = 2
}
action {
type = "Delete"
}
}
}
resource "google_storage_bucket_object" "companieshouse_source_csv" {
name = "source/companies_house.csv"
source = "./data/source/companies_house.csv"
bucket = random_id.storage.hex
depends_on = [
google_storage_bucket.bi_bucket,
]
}
resource "google_storage_bucket_object" "companieshouse_target_csv" {
name = "target/companies_house.csv"
source = "./data/target/companies_house.csv"
bucket = random_id.storage.hex
depends_on = [
google_storage_bucket.bi_bucket,
]
}
resource "google_bigquery_dataset" "default" {
dataset_id = "bidb"
friendly_name = "bi dataset"
description = "Ingest daily companies house data"
location = "EU"
default_table_expiration_ms = 10800000 // 3 hr
depends_on = [
google_storage_bucket_object.companieshouse_target_csv,
google_storage_bucket_object.companieshouse_source_csv,
]
}
resource "google_bigquery_table" "companies_house" {
dataset_id = google_bigquery_dataset.default.dataset_id
table_id = "companies_house"
deletion_protection = false
schema = file("./data/target/companies_house.json")
external_data_configuration {
autodetect = true
source_format = "CSV"
csv_options {
quote = "\""
skip_leading_rows = 1
}
source_uris = [
"${google_storage_bucket.bi_bucket.url}/target/companies_house.csv",
]
}
depends_on = [
google_bigquery_dataset.default,
]
}
resource "google_bigquery_table" "companies_house_stg" {
dataset_id = google_bigquery_dataset.default.dataset_id
table_id = "companies_house_stg"
deletion_protection = false
schema = file("./data/source/companies_house.json")
external_data_configuration {
autodetect = true
source_format = "CSV"
csv_options {
quote = "\""
skip_leading_rows = 1
}
source_uris = [
"${google_storage_bucket.bi_bucket.url}/source/companies_house.csv",
]
}
depends_on = [
google_bigquery_dataset.default,
]
}
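As a variation, the project ID could be passed in as a variable instead of being hard-coded in the provider block – a small sketch, assuming a variable named project_id:

variable "project_id" {
  type        = string
  description = "GCP project to deploy into"
}

provider "google" {
  project = var.project_id
}

It can then be supplied at apply time, for example terraform apply -var="project_id=your_project_id".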
Metadata – companies_house.json (placed in both data/source and data/target)
[
  {
    "name": "companynumber",
    "type": "STRING",
    "mode": "NULLABLE"
  },
  {
    "name": "companyname",
    "type": "STRING",
    "mode": "NULLABLE"
  },
  {
    "name": "legalstatus",
    "type": "STRING",
    "mode": "NULLABLE"
  },
  {
    "name": "incorporationdate",
    "type": "STRING",
    "mode": "NULLABLE"
  },
  {
    "name": "dissolutiondate",
    "type": "STRING",
    "mode": "NULLABLE"
  }
]
Data files
Source data – data/source/companies_house.csv
companynumber,companyname,legalstatus,incorporationdate,dissolutiondate
c1,"BigData Company",1,"2019-01-23",
c2,"Data Ltd",1,"2018-01-30",
c3,"Hello Data",1,"2017-02-21","2023-05-19"
Target data – data/target/companies_house.csv
companynumber,companyname,legalstatus,incorporationdate,dissolutiondate
c1,"BigData Company",1,"2019-01-23",
c3,"Hello Data",1,"2017-02-21",
SQL for comparing datasets
SELECT
companynumber,companyname,legalstatus,incorporationdate,dissolutiondate
FROM `bidb.companies_house`
EXCEPT DISTINCT
SELECT
companynumber,companyname,legalstatus,incorporationdate,dissolutiondate
FROM `bidb.companies_house_stg`;
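The same comparison can be run from Cloud Shell with the bq CLI (table names as created above):

bq query --use_legacy_sql=false '
SELECT companynumber, companyname, legalstatus, incorporationdate, dissolutiondate
FROM `bidb.companies_house`
EXCEPT DISTINCT
SELECT companynumber, companyname, legalstatus, incorporationdate, dissolutiondate
FROM `bidb.companies_house_stg`'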
Destroy resources – when finished, clean up everything Terraform created with terraform destroy.
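terraform destroy asks for confirmation before deleting; a quick sketch, including the optional flag that skips the prompt:

terraform destroy                  # review the deletion plan and confirm
terraform destroy -auto-approve    # non-interactive variant, use with care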