Tag: PySpark

  • Processing Charity Data using Google Colab

    Data related to registered charities in England and Wales can be downloaded from https://register-of-charities.charitycommission.gov.uk/register/full-register-download.

    Charity data and running PySpark under Google Colab

    We will use Google Colab to download publicly available data from the Charity Commission website. Transformed or enriched data will be saved in Google BigQuery. Please read the data definition before ingesting the data and carrying out exploratory data analysis.

    Prerequisites

    • install PySpark
    • get the client libraries for the Google Cloud Storage connector and BigQuery
    • a service account key file for the Google Cloud project
    • access to BigQuery – project ID, dataset and tables;
      reading views also needs spark.conf.set("viewsEnabled", "true")
    • loading the connectors via package coordinates did not work for me:
      .config('spark.jars.packages','com.google.cloud.spark:spark-bigquery-with-dependencies_2.12:0.23.1:gcs-connector-hadoop3-latest')
    • set the GOOGLE_APPLICATION_CREDENTIALS environment variable to point at the key file
    • mount Google Drive (a combined setup sketch follows this list):
      from google.colab import drive
      drive.mount('/content/drive')
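
    A minimal setup sketch covering these prerequisites; the key file path under Drive is an assumption, so adjust it to wherever your key actually lives.

    !pip install pyspark

    # Mount Google Drive so the key file and jars are reachable
    from google.colab import drive
    drive.mount('/content/drive')

    import os
    # Hypothetical key file location – change to your own path
    GOOGLE_APPLICATION_CREDENTIALS = '/content/drive/MyDrive/api/key.json'
    os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = GOOGLE_APPLICATION_CREDENTIALS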

    Configure the Spark session

    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder.appName('charity_etl')
           # executor resources ("8g", not "8gb", is the valid size format)
           .config("spark.executor.memory", "8g")
           .config("spark.executor.cores", 4)
           .config("spark.dynamicAllocation.enabled", "true")
           # let the BigQuery connector read views, not just tables
           .config("viewsEnabled", "true")
           # register the GCS connector as the handler for gs:// paths
           .config("spark.hadoop.fs.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem")
           # authenticate both connectors with the service account key file
           .config("credentialsFile", GOOGLE_APPLICATION_CREDENTIALS)
           .config("spark.hadoop.google.cloud.auth.service.account.enable", "true")
           .config("spark.hadoop.google.cloud.auth.service.account.json.keyfile", GOOGLE_APPLICATION_CREDENTIALS)
           # jars downloaded to Drive (see the sketch below) go on both classpaths
           .config("spark.driver.extraClassPath", "/content/drive/MyDrive/api/spark-bigquery-latest_2.12.jar:/content/drive/MyDrive/api/gcs-connector-hadoop3-latest.jar")
           .config("spark.executor.extraClassPath", "/content/drive/MyDrive/api/spark-bigquery-latest_2.12.jar:/content/drive/MyDrive/api/gcs-connector-hadoop3-latest.jar")
           .enableHiveSupport()
           .getOrCreate()
    )
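
    For the extraClassPath entries above to resolve, the two jars need to be in Drive already. A sketch of fetching them into the assumed /content/drive/MyDrive/api folder, using Google's public download locations:

    !wget -P /content/drive/MyDrive/api https://storage.googleapis.com/spark-lib/bigquery/spark-bigquery-latest_2.12.jar
    !wget -P /content/drive/MyDrive/api https://storage.googleapis.com/hadoop-lib/gcs/gcs-connector-hadoop3-latest.jar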

    Get charity data and create a DataFrame

    def get_charity_data(spark):
      # download and unzip the public extract (Colab lets us shell out with !)
      !wget https://ccewuksprdoneregsadata1.blob.core.windows.net/data/txt/publicextract.charity.zip
      !unzip publicextract.charity.zip -d /tmp
      # the extract is a tab-delimited text file with a header row
      df_charity = spark.read.option("delimiter", "\t").option("header", "true").csv("file:///tmp/publicextract.charity.txt")
      # save a single-file parquet copy to Drive
      df_charity.coalesce(1).write.format("parquet").mode("append").save("file:///content/drive/MyDrive/data/charity_202301")
      return df_charity
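
    Calling the function, with a schema print and row count as quick sanity checks:

    df_charity = get_charity_data(spark)
    df_charity.printSchema()
    print(df_charity.count())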

    Save the data into a BigQuery table

    # Saving the data to BigQuery; temporaryGcsBucket takes the bucket name only
    # (the connector generates its own staging path inside it)
    df_charity.write \
      .format("bigquery") \
      .option("temporaryGcsBucket", "gsdata") \
      .option("parentProject", "ello-project") \
      .option("dataset", "ellodb") \
      .save("ello-project.ellodb.charity")

    Reading data from BigQuery

    sql_query = """select * from `ello-project.ellodb.charity"""
    df = (spark.read.format("bigquery")
          .option("parentProject", "ello-project")
          .option("project", "ellodb")
          .option("materializationDataset", "ellodb")
          .option("query",sql_query).load()
         )
    df.persist(StorageLevel.MEMORY_AND_DISK)
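
    With the DataFrame cached, exploratory queries are cheap to repeat. An illustrative group-by; the column name is an assumption taken from the extract's data definition:

    # "charity_registration_status" is assumed from the data definition
    df.groupBy("charity_registration_status").count().show()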