Data related to registered charities in England and Wales can be downloaded from https://register-of-charities.charitycommission.gov.uk/register/full-register-download.
Charity data and running PySpark under Google Colab
We will use Google Colab to download publicly available data from the Charity Commission website. Transformed or enriched data will be saved in Google BigQuery. Please read the data definitions before ingesting the data and carrying out exploratory data analysis.
Prerequisites
- install PySpark
- get the jars for the Google Cloud Storage connector and the BigQuery connector (see the sketch after this list)
- a service account key file for the Google Cloud project
- access to BigQuery: project ID, dataset and tables
- spark.conf.set("viewsEnabled", "true") is needed for reading data via SQL queries and views
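In Colab these prerequisites come down to one pip install and two jar downloads. A minimal sketch, assuming the jars are kept under /content/drive/MyDrive/api (the directory referenced on the classpath below); the URLs are the publicly documented download locations for the two connectors, so verify them against the current connector documentation:

!pip install pyspark
# download the connector jars to Drive (Drive must already be mounted; see the mount step below)
!wget -P /content/drive/MyDrive/api https://storage.googleapis.com/hadoop-lib/gcs/gcs-connector-hadoop3-latest.jar
!wget -P /content/drive/MyDrive/api https://storage.googleapis.com/spark-lib/bigquery/spark-bigquery-latest_2.12.jar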
# Note: loading the connectors via Maven coordinates did not work under Colab, e.g.
# .config('spark.jars.packages', 'com.google.cloud.spark:spark-bigquery-with-dependencies_2.12:0.23.1')
# so the jars are supplied on the driver and executor classpath instead (see the Spark config below).
Set the GOOGLE_APPLICATION_CREDENTIALS environment variable to point at the service account key file.
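A minimal sketch of that step, assuming a hypothetical key file location on the mounted Drive; the Python variable is reused in the Spark configuration below:

import os

# hypothetical path to the service account key file on Drive
GOOGLE_APPLICATION_CREDENTIALS = "/content/drive/MyDrive/api/service-account.json"
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = GOOGLE_APPLICATION_CREDENTIALS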
Access Google Drive
from google.colab import drive
drive.mount('/content/drive')
Configure Spark Context
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName('charity_etl')
    .config("spark.executor.memory", "8g")
    .config("spark.executor.cores", 4)
    .config("spark.dynamicAllocation.enabled", "true")
    .config("viewsEnabled", "true")  # required by the BigQuery connector for reading views/queries
    .config("spark.hadoop.fs.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem")
    .config("credentialsFile", GOOGLE_APPLICATION_CREDENTIALS)
    .config("spark.hadoop.google.cloud.auth.service.account.enable", "true")
    .config("spark.hadoop.google.cloud.auth.service.account.json.keyfile", GOOGLE_APPLICATION_CREDENTIALS)
    # BigQuery and GCS connector jars downloaded to Drive earlier
    .config("spark.driver.extraClassPath", "/content/drive/MyDrive/api/spark-bigquery-latest_2.12.jar:/content/drive/MyDrive/api/gcs-connector-hadoop3-latest.jar")
    .config("spark.executor.extraClassPath", "/content/drive/MyDrive/api/spark-bigquery-latest_2.12.jar:/content/drive/MyDrive/api/gcs-connector-hadoop3-latest.jar")
    .enableHiveSupport()
    .getOrCreate()
)
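A quick sanity check that the session came up with the jars on the classpath:

# confirm the session is live
print(spark.version)
spark.range(5).show()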
Get charity data and create a dataframe
def get_charity_data(spark):
    # download and unzip the public extract (shell magics work inside Colab functions)
    !wget https://ccewuksprdoneregsadata1.blob.core.windows.net/data/txt/publicextract.charity.zip
    !unzip publicextract.charity.zip -d /tmp
    # the extract is a tab-separated text file with a header row
    df_charity = (spark.read.option("delimiter", "\t")
                  .option("header", "true")
                  .csv("file:///tmp/publicextract.charity.txt"))
    # save a single-file Parquet copy of the data to Drive
    df_charity.coalesce(1).write.format("parquet").mode("append").save("file:///content/drive/MyDrive/data/charity_202301")
    return df_charity
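Typical usage, pulling the extract and checking what arrived:

df_charity = get_charity_data(spark)
df_charity.printSchema()
print(df_charity.count())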
Save data into a BigQuery table
# Saving the data to BigQuery (an indirect write: rows are staged in the
# temporary GCS bucket and then loaded into the target table)
df_charity.write \
    .format("bigquery") \
    .option("temporaryGcsBucket", "gsdata/temp/") \
    .option("parentProject", "ello-project") \
    .option("dataset", "ellodb") \
    .save("ello-project.ellodb.charity")
Reading data from BigQuery
from pyspark import StorageLevel

sql_query = """select * from `ello-project.ellodb.charity`"""
df = (spark.read.format("bigquery")
    .option("parentProject", "ello-project")
    .option("materializationDataset", "ellodb")  # query results are materialized in this dataset; needs viewsEnabled=true
    .option("query", sql_query)
    .load()
)
df.persist(StorageLevel.MEMORY_AND_DISK)
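persist only marks the dataframe for caching; the cache is materialized by the first action, after which later exploratory queries reuse the in-memory/on-disk copy:

df.count()  # first action materializes the cache
df.show(5)  # served from the cached data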