Processing Charity Data using Google Colab

Data related to registered charities in England and Wales can be downloaded from https://register-of-charities.charitycommission.gov.uk/register/full-register-download.

Charity data and running PySpark under Google Colab

We will use Google Colab to download publicly available data from the Charity Commission website. Transformed and enriched data will be saved in Google BigQuery. Please read the data definitions before ingesting the data and carrying out exploratory data analysis.

Prerequisites

  • install PySpark
  • download the connector jars for Google Cloud Storage (GCS) and BigQuery
  • have a service account key file for the Google Cloud project
  • have access to BigQuery – project ID, dataset and tables
  • set spark.conf.set("viewsEnabled", "true"), which is needed for reading data via queries and views
  • set the GOOGLE_APPLICATION_CREDENTIALS environment variable to the key file path
  • mount Google Drive, where the connector jars are kept (see the setup sketch below):

  from google.colab import drive
  drive.mount('/content/drive')

Note: we were unable to get the connectors to load via
.config('spark.jars.packages', 'com.google.cloud.spark:spark-bigquery-with-dependencies_2.12:0.23.1')
(the gcs-connector is a separate artifact, not part of that Maven coordinate), so the jars are loaded from Drive through the class path instead.
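
A minimal setup sketch for a fresh Colab runtime, assuming the jars live under /content/drive/MyDrive/api/ and the key file is at /content/drive/MyDrive/api/key.json (both paths are assumptions, as is the choice of the "latest" jar builds):

# Install PySpark into the Colab runtime
!pip install -q pyspark

# Download the connector jars once and keep them on Drive for reuse
!wget -q https://storage.googleapis.com/hadoop-lib/gcs/gcs-connector-hadoop3-latest.jar -P /content/drive/MyDrive/api/
!wget -q https://storage.googleapis.com/spark-lib/bigquery/spark-bigquery-latest_2.12.jar -P /content/drive/MyDrive/api/

import os

# Point Google client libraries at the service account key (path is an assumption)
GOOGLE_APPLICATION_CREDENTIALS = '/content/drive/MyDrive/api/key.json'
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = GOOGLE_APPLICATION_CREDENTIALS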

Configure Spark Context

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName('charity_etl')
        .config("spark.executor.memory", "8g")  # valid size suffix is "8g", not "8gb"
        .config("spark.executor.cores", 4)
        .config("spark.dynamicAllocation.enabled", "true")
        .config("viewsEnabled", "true")  # required for query/view reads from BigQuery
        .config("spark.hadoop.fs.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem")
        .config("credentialsFile", GOOGLE_APPLICATION_CREDENTIALS)
        .config("spark.hadoop.google.cloud.auth.service.account.enable", "true")
        .config("spark.hadoop.google.cloud.auth.service.account.json.keyfile", GOOGLE_APPLICATION_CREDENTIALS)
        .config("spark.driver.extraClassPath", "/content/drive/MyDrive/api/spark-bigquery-latest_2.12.jar:/content/drive/MyDrive/api/gcs-connector-hadoop3-latest.jar")
        .config("spark.executor.extraClassPath", "/content/drive/MyDrive/api/spark-bigquery-latest_2.12.jar:/content/drive/MyDrive/api/gcs-connector-hadoop3-latest.jar")
        .enableHiveSupport()
        .getOrCreate()
)
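
A quick sanity check that the session came up and the settings took effect (optional):

# Confirm the session is live and the executor memory setting was applied
print(spark.version)
print(spark.conf.get("spark.executor.memory"))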

Get charity data and create a dataframe

def get_charity_data(spark):
  # Download the latest public extract from the Charity Commission and unzip it
  !wget -q https://ccewuksprdoneregsadata1.blob.core.windows.net/data/txt/publicextract.charity.zip
  !unzip -o publicextract.charity.zip -d /tmp
  # The extract is a tab-delimited text file with a header row
  df_charity = spark.read.option("delimiter", "\t").option("header", "true").csv("file:///tmp/publicextract.charity.txt")
  # Keep a single-file parquet copy on Drive
  df_charity.coalesce(1).write.format("parquet").mode("append").save("file:///content/drive/MyDrive/data/charity_202301")
  return df_charity
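
Calling the function and taking a first look at the extract (no column names are assumed here; the schema comes from the file's header row):

df_charity = get_charity_data(spark)
df_charity.printSchema()            # columns as described in the data definitions
print(df_charity.count())           # number of charity records
df_charity.show(5, truncate=False)  # eyeball a few rows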

Save data into a BigQuery table

# Saving the data to BigQuery
# temporaryGcsBucket takes a bucket name; the connector stages the load files under it
df_charity.write \
  .format("bigquery") \
  .option("temporaryGcsBucket", "gsdata") \
  .option("parentProject", "ello-project") \
  .save("ello-project.ellodb.charity")

Reading data from BigQuery

from pyspark import StorageLevel

# viewsEnabled must be true (set in the session config) for query-based reads;
# materialized results land in the materializationDataset
sql_query = """select * from `ello-project.ellodb.charity`"""
df = (spark.read.format("bigquery")
      .option("parentProject", "ello-project")
      .option("materializationDataset", "ellodb")
      .option("query", sql_query)
      .load()
     )
# Keep the result cached across the exploratory queries that follow
df.persist(StorageLevel.MEMORY_AND_DISK)
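
With the data cached, a first exploratory aggregation; charity_registration_status is a column name taken from the data definitions and should be verified against the extract:

from pyspark.sql import functions as F

# Count charities by registration status (column name assumed from the data definitions)
(df.groupBy("charity_registration_status")
   .agg(F.count("*").alias("n"))
   .orderBy(F.desc("n"))
   .show())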