Tag: MySQL

  • Google Colab and relational databases

    We will show how to access a relational database from Google Colab. We will use both a Python module and PySpark to read data from a MySQL database and write data to it.

    See other articles related to BigQuery and Google Colab

    Prerequisites

    • Install the relevant Python packages
      • !pip3 install pyspark mysql-connector-python
    • Upload JAR files for PySpark
      • gcs-connector-hadoop3-latest.jar
      • spark-bigquery-latest_2.12.jar
      • mariadb-java-client-2.4.4.jar

    If you only access MySQL through the Python module, mysql-connector-python is all you need.

    Connect to MySQL database

    The examples below assume you have access to a MySQL server. If you don't, you can get a free MySQL instance from https://www.db4free.net/ or https://pantheon.io/register

    # importing required libraries
    import mysql.connector
      
    dataBase = mysql.connector.connect(
        host="ipaddress or hostname",
        port=11111,
        user="pantheon",
        passwd="password",
        database="pantheon"
    )
    # preparing a cursor object
    cursorObject = dataBase.cursor()
      
    query = "SELECT * FROM wp_users"
    cursorObject.execute(query)
       
    myresult = cursorObject.fetchall()
       
    for x in myresult:
      print(x)
     
    # disconnecting from server
    dataBase.close()
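    The same cursor pattern works for writes: execute a parameterized INSERT and then commit. The sketch below uses Python's built-in sqlite3 as a stand-in so it runs without a server; both libraries follow DB-API 2.0, but mysql.connector uses %s placeholders where sqlite3 uses ?, and the table and rows here are hypothetical.

```python
import sqlite3

# in-memory stand-in for a real MySQL connection
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE wp_users (display_name TEXT, user_email TEXT)")

# parameterized insert - with mysql.connector the placeholder is %s, not ?
rows = [("Alice", "alice@example.com"), ("Bob", "bob@example.com")]
cur.executemany("INSERT INTO wp_users VALUES (?, ?)", rows)
conn.commit()  # mysql.connector also requires an explicit commit for writes

cur.execute("SELECT user_email FROM wp_users ORDER BY display_name")
result = cur.fetchall()
print(result)  # [('alice@example.com',), ('bob@example.com',)]
conn.close()
```

    Never build the VALUES clause with string formatting; placeholders let the driver escape the data for you.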

    Accessing MySQL using PySpark

    We need to set up a SparkSession with the spark.jars option. I am also using other options to enable BigQuery and Cloud Storage access.

    Upload your JAR files to a folder and provide their location in the spark.jars option. For more than one JAR file, separate the file names with commas.

    • mysql-connector-java-8.0.19.jar or mariadb-java-client-2.4.4.jar (for MariaDB)
    • gcs-connector-hadoop3-latest.jar
    • spark-bigquery-latest_2.12.jar
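    The SparkSession builder below references a GOOGLE_APPLICATION_CREDENTIALS variable, which must be defined first and point at your service-account key file. A minimal sketch, assuming the key was uploaded to /content (the filename is hypothetical):

```python
import os

# path to the uploaded service-account key (hypothetical filename)
GOOGLE_APPLICATION_CREDENTIALS = "/content/service-account.json"
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = GOOGLE_APPLICATION_CREDENTIALS
```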
    import os
    sparkjarsdir = "/content/companieshouse-data/jars"
    sparkjars = [os.path.join(sparkjarsdir, x) for x in os.listdir(sparkjarsdir)]
    
    from pyspark.sql import SparkSession
    spark = SparkSession.builder \
      .appName('BigQuery')\
      .config("spark.hadoop.fs.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem") \
      .config("spark.hadoop.google.cloud.auth.service.account.enable", "true") \
      .config("spark.hadoop.google.cloud.auth.service.account.json.keyfile", GOOGLE_APPLICATION_CREDENTIALS) \
      .config("spark.jars", ",".join(sparkjars))\
      .enableHiveSupport()\
      .getOrCreate()

    Read data from a table

    # note: with the jdbc:mariadb URL prefix the driver may return the column
    # name instead of the field value, so we use jdbc:mysql below
    # either the "query" or the "dbtable" option can be used - here we use "query"
    query = f"""select display_name, user_email from {table}
    """
    driver = "org.mariadb.jdbc.Driver"
    url = f"jdbc:mysql://{database_host}:{database_port}/{database_name}"
    # create dataframe
    df = (spark.read
      .format("jdbc")
      .option("driver", driver)
      .option("url", url)
      .option("query", query)
      .option("user", user)
      .option("password", password)
      .load()
      )

    Save dataframe to database

    # write to MySQL database - use mode("append") instead to add new rows
    # without replacing the table
    (df.write
      .format("jdbc")
      .option("driver", driver)
      .option("url", url)
      .option("dbtable", "aa_user")
      .option("user", user)
      .option("password", password)
      .mode("overwrite")
      .save()
    )
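    The reader and writer above repeat the same driver, URL, and credential options. One way to avoid that is to collect them in a dict and pass it through .options(); a minimal sketch with hypothetical connection values (the Spark calls are left commented since they need a live server and a running SparkSession):

```python
def jdbc_options(host, port, db, user, password,
                 driver="org.mariadb.jdbc.Driver"):
    """Build the common JDBC options shared by reader and writer."""
    return {
        "driver": driver,
        "url": f"jdbc:mysql://{host}:{port}/{db}",
        "user": user,
        "password": password,
    }

opts = jdbc_options("db.example.com", 3306, "pantheon", "pantheon", "secret")
print(opts["url"])  # jdbc:mysql://db.example.com:3306/pantheon

# df = spark.read.format("jdbc").options(**opts).option("query", query).load()
# (df.write.format("jdbc").options(**opts)
#    .option("dbtable", "aa_user").mode("append").save())
```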