Read data from Google Sheets and save a worksheet into BigQuery and MySQL.
Steps:
- Upload the service key file into Google Colab
- Use gspread to extract data from the worksheet
- Save the result as a pandas dataframe
- Create a PySpark dataframe
- Write the dataframe into MySQL and BigQuery
Writing data into MySQL
See the previous article, Google CoLab and relational database, for the Spark context details and how to get data from a MySQL database. A sketch of the JDBC write is shown below.
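The MySQL write itself goes through Spark's standard JDBC writer. Here is a minimal sketch, assuming a dataframe like the invoice frame built later in this article; the hostname, database, table name, credentials, and a mysql-connector-java jar on the classpath are placeholders, not values from this article.

# Minimal JDBC write sketch -- host, database, table and
# credentials below are placeholders.
invoice.write \
    .format("jdbc") \
    .option("url", "jdbc:mysql://localhost:3306/ellodb") \
    .option("driver", "com.mysql.cj.jdbc.Driver") \
    .option("dbtable", "invoice") \
    .option("user", "mysql_user") \
    .option("password", "mysql_password") \
    .mode("append") \
    .save()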
from pyspark.sql import SparkSession

# GOOGLE_APPLICATION_CREDENTIALS and sparkjars must already be defined
# (see the sketch after this block).
spark = SparkSession.builder \
    .appName('BigQuery') \
    .config("spark.hadoop.fs.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem") \
    .config("spark.hadoop.google.cloud.auth.service.account.enable", "true") \
    .config("spark.hadoop.google.cloud.auth.service.account.json.keyfile", GOOGLE_APPLICATION_CREDENTIALS) \
    .config("spark.jars", ",".join(sparkjars)) \
    .enableHiveSupport() \
    .getOrCreate()
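The builder above relies on two variables that the snippet does not define. A minimal sketch of what they could look like; the jar file names and versions are assumptions and should match the connector builds you actually download (the GCS connector from gs://hadoop-lib/gcs/, the BigQuery connector from Maven Central).

# Assumed setup -- paths and versions are illustrative, not prescriptive.
# Path to the service-account key uploaded to Colab (see the next section).
GOOGLE_APPLICATION_CREDENTIALS = "service.json"

# Local paths to the GCS and BigQuery connector jars.
sparkjars = [
    "gcs-connector-hadoop3-latest.jar",
    "spark-bigquery-with-dependencies_2.12-0.32.2.jar",
]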
Getting data from Google Sheets
import pandas as pd
import gspread
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from oauth2client.service_account import ServiceAccountCredentials
from google.colab import files

# Upload the service key, e.g. with files.upload(), and rename/copy it
# as "service.json" in Google Colab.
# Init Google (with Auth)
file_path = r"service.json"
scope = ['https://spreadsheets.google.com/feeds',
         'https://www.googleapis.com/auth/drive']
credentials = ServiceAccountCredentials.from_json_keyfile_name(file_path, scope)
# Used to auth Google Sheets
gc = gspread.authorize(credentials)
# Used to get all files in a Google Drive folder (usage sketch below)
gauth = GoogleAuth()
gauth.credentials = credentials
drive = GoogleDrive(gauth)
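The drive handle is not used again in this article, but as an illustration, listing the files the service account can see could look like this (a sketch using PyDrive's ListFile with a standard Drive query string):

# List non-trashed files in the Drive root visible to the service account.
file_list = drive.ListFile({'q': "'root' in parents and trashed=false"}).GetList()
for f in file_list:
    print(f['title'], f['id'])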
# open the spreadsheet and extract the data
spreadsheet_id = '12LXnqs5vwoVj7rDjmlntO_V4LAeEGG5wM-whjEnzgo4'
wb = gc.open_by_key(spreadsheet_id)
sheet1 = wb.worksheet('A4. Base Data')  # Enter your sheet name.
rows = sheet1.get('A2:J2803335')  # a list of rows, every cell as a string
df = pd.DataFrame(rows, columns=['DriverNo', 'Make', 'Model', 'Description',
                                 'RegisteredDate', 'EngineCC', 'Fuel', 'Co2',
                                 'FinanceCost', 'FinanceHouseName'])
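Because gspread returns every cell as text, the numeric columns arrive as strings. A short sketch of converting them before handing the frame to Spark; which columns are numeric is an assumption based on their names.

# Assumed numeric columns -- inferred from the column names, not the source data.
for col_name in ['EngineCC', 'Co2', 'FinanceCost']:
    df[col_name] = pd.to_numeric(df[col_name], errors='coerce')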
Saving the dataframe into BigQuery
# Create a PySpark DataFrame from the pandas one
invoice = spark.createDataFrame(df)
invoice.printSchema()
invoice.show()
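createDataFrame infers the Spark schema from the pandas dtypes. If you want explicit control, you can pass a schema instead; a minimal sketch, with field types that are assumptions based on the column names:

from pyspark.sql.types import StructType, StructField, StringType, DoubleType

# Illustrative schema -- the field types are assumptions.
schema = StructType([
    StructField('DriverNo', StringType()),
    StructField('Make', StringType()),
    StructField('Model', StringType()),
    StructField('Description', StringType()),
    StructField('RegisteredDate', StringType()),
    StructField('EngineCC', DoubleType()),
    StructField('Fuel', StringType()),
    StructField('Co2', DoubleType()),
    StructField('FinanceCost', DoubleType()),
    StructField('FinanceHouseName', StringType()),
])
invoice = spark.createDataFrame(df, schema=schema)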
# Saving the data into BigQuery
# "bucket_name" is a placeholder for your GCS staging bucket.
invoice.write \
    .format("bigquery") \
    .option("temporaryGcsBucket", "bucket_name") \
    .option("parentProject", "quickstart-25853678031891") \
    .option("dataset", "ellodb") \
    .save("quickstart-25853678031891.ellodb.invoice")
Reading data from BigQuery
# read the data back from BigQuery
stmt = """select * from `quickstart-25853678031891.ellodb.invoice`"""
df = (spark.read.format("bigquery")
      .option("parentProject", "quickstart-25853678031891")
      .option("viewsEnabled", "true")  # required for query (rather than table) reads
      .option("materializationDataset", "ellodb_tmp")
      .option("query", stmt)
      .load())
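A quick check that the round trip worked:

df.printSchema()
df.show(5)

If you do not need SQL pushdown, the table can also be loaded directly, which avoids the materialization dataset:

# Direct table read -- no query, no materializationDataset needed.
df = spark.read.format("bigquery") \
    .option("parentProject", "quickstart-25853678031891") \
    .load("quickstart-25853678031891.ellodb.invoice")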
Ref: Google Colab