Tag: PySpark

  • AWS Data Wrangler (AWS SDK for Pandas)

    AWS Data Wrangler (awswrangler), now called AWS SDK for pandas, is a Python library that makes it easier to integrate AWS services for ETL (Extract, Transform, Load) work. I will use Google Colab and the AWS SDK for pandas for this walkthrough: Prerequisites Steps Let’s get started. Verify your table Go to the AWS Console and check…
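
    A minimal sketch of the kind of task the excerpt describes, assuming awswrangler is installed and AWS credentials are configured; the bucket, Glue database and table names are placeholders, not the article's own.

    ```python
    import pandas as pd
    import awswrangler as wr

    df = pd.DataFrame({"id": [1, 2, 3], "value": ["a", "b", "c"]})

    # Write the DataFrame to S3 as Parquet and register it in the Glue Data Catalog.
    wr.s3.to_parquet(
        df=df,
        path="s3://my-example-bucket/etl/demo/",   # hypothetical bucket/prefix
        dataset=True,
        database="demo_db",                        # hypothetical Glue database
        table="demo_table",                        # hypothetical Glue table
        mode="overwrite",
    )

    # Query it back through Athena to verify the table is usable.
    result = wr.athena.read_sql_query("SELECT * FROM demo_table", database="demo_db")
    print(result.head())
    ```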

  • Data validation check in streaming data

    In this article, I will share my approach (see my previous article for further information on data validation) to carrying out data validation checks on streaming data from a Kafka topic. Checking data is very important, yet this task often gets overlooked. One cannot trust source data, especially if you are ingesting data for…
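
    A hedged sketch of one possible validation pass on a Kafka stream with PySpark Structured Streaming; the topic name, schema and rules below are illustrative assumptions, not the article's own approach, and the spark-sql-kafka package must be on the classpath.

    ```python
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, from_json
    from pyspark.sql.types import StructType, StructField, StringType, IntegerType

    spark = SparkSession.builder.appName("stream-validation").getOrCreate()

    schema = StructType([
        StructField("user_id", IntegerType()),
        StructField("email", StringType()),
    ])

    raw = (spark.readStream
           .format("kafka")
           .option("kafka.bootstrap.servers", "localhost:9092")
           .option("subscribe", "events")          # hypothetical topic
           .load())

    parsed = raw.select(from_json(col("value").cast("string"), schema).alias("r")).select("r.*")

    # Flag rows that break simple rules: missing id or malformed email address.
    validated = parsed.withColumn(
        "is_valid",
        col("user_id").isNotNull() & col("email").rlike(r"^[^@]+@[^@]+\.[^@]+$"),
    )

    query = (validated.writeStream
             .format("console")
             .outputMode("append")
             .start())
    query.awaitTermination()
    ```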

  • Read data from and write data to Kafka using PySpark

    I use Google Colab for my development work. I have set up a Kafka server on my local machine (a Raspberry Pi). The task at hand is to use PySpark to read and write data; you can use either a batch or a streaming query. Spark has extensive documentation on its website under the heading – Structured Streaming + Kafka Integration…
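
    A short sketch of the batch read/write path mentioned in the excerpt, assuming a broker at localhost:9092 and hypothetical topic names; the spark-sql-kafka-0-10 package is required on the classpath.

    ```python
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col

    spark = SparkSession.builder.appName("kafka-batch").getOrCreate()

    # Batch read: pull everything currently on the topic.
    df = (spark.read
          .format("kafka")
          .option("kafka.bootstrap.servers", "localhost:9092")
          .option("subscribe", "input-topic")
          .option("startingOffsets", "earliest")
          .option("endingOffsets", "latest")
          .load())

    messages = df.select(col("key").cast("string"), col("value").cast("string"))
    messages.show(truncate=False)

    # Batch write: Kafka expects string/binary "key" and "value" columns.
    (messages
     .write
     .format("kafka")
     .option("kafka.bootstrap.servers", "localhost:9092")
     .option("topic", "output-topic")
     .save())
    ```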

  • Apache Spark UI and key metrics

    “Apache Spark™ is a multi-language engine for executing data engineering, data science, and machine learning on single-node machines or clusters.” https://spark.apache.org/ I have been using Apache Spark for developing and testing data pipelines on a single-node machine, mainly using Google Colaboratory. It is vital to understand how resources are being used and how you can…
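
    A small sketch of locating the Spark UI from code, where the jobs, stages, storage and executor metrics live; the port choice is illustrative, and on a hosted notebook such as Colab you would still need a tunnel or proxy to open the URL in a browser.

    ```python
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("ui-demo")
             .config("spark.ui.port", "4050")   # pin the UI port instead of the default 4040
             .getOrCreate())

    # The driver exposes the UI URL for the running application.
    print(spark.sparkContext.uiWebUrl)
    ```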

  • Take extra care when processing CSV file using Apache Spark

    I found the following comments very relevant and useful. Please visit the link and read. “I’ve been using Spark for some time now, it has not always been smooth sailing. I can understand a tool’s limitations as long as I’m told so, explicitly. The trouble with Apache Spark has been its insistence on having the…
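
    One defensive pattern when reading CSV with Spark, offered here as a hedged illustration rather than the linked post's advice: supply an explicit schema and capture malformed rows instead of letting them be silently dropped or mis-cast. The file path and columns are placeholders.

    ```python
    from pyspark.sql import SparkSession
    from pyspark.sql.types import StructType, StructField, StringType, IntegerType

    spark = SparkSession.builder.appName("csv-care").getOrCreate()

    schema = StructType([
        StructField("id", IntegerType()),
        StructField("name", StringType()),
        StructField("_corrupt_record", StringType()),   # holds rows that fail parsing
    ])

    df = (spark.read
          .option("header", "true")
          .option("mode", "PERMISSIVE")
          .option("columnNameOfCorruptRecord", "_corrupt_record")
          .schema(schema)
          .csv("/path/to/input.csv"))

    # Cache first: Spark restricts queries that reference only the corrupt-record column.
    df.cache()
    df.filter(df["_corrupt_record"].isNotNull()).show(truncate=False)
    ```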

  • Saving data as parquet table with partition key

    We will show how to add the latest changes into a Parquet table with a partition key. Approach Code snippets CDC implementation Synthetic data Example code for using the above functions
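
    A hedged sketch of appending a new slice of changes to a partitioned Parquet table; the partition column, paths and sample rows are illustrative, and the article's own CDC helper functions are not reproduced here.

    ```python
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("parquet-partition").getOrCreate()

    changes = spark.createDataFrame(
        [(1, "alice", "2024-01-02"), (2, "bob", "2024-01-02")],
        ["id", "name", "load_date"],
    )

    (changes.write
     .mode("append")                 # add the new slice without rewriting old partitions
     .partitionBy("load_date")       # partition key
     .parquet("/tmp/customer_table"))

    # Reading back: Spark prunes partitions when you filter on load_date.
    latest = spark.read.parquet("/tmp/customer_table").filter("load_date = '2024-01-02'")
    latest.show()
    ```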

  • Generating delta for Synthetic data

    How do you include delta records in a new synthetic dataset? In a previous article we described how to generate synthetic data, and we will use the same method to answer this question. Approach Code showing how to generate the delta See Google CoLab notebook Implementation – code Screenshots
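
    A minimal illustration of the idea in the excerpt: start from an existing synthetic dataset and produce a new batch containing updates and inserts (the delta). The column names and the update rule are assumptions, not the notebook's implementation.

    ```python
    import random
    import pandas as pd

    base = pd.DataFrame({"id": range(1, 6), "amount": [random.randint(10, 100) for _ in range(5)]})

    # Updates: change a few existing rows.
    updates = base.sample(2, random_state=1).copy()
    updates["amount"] = updates["amount"] + 5

    # Inserts: brand-new ids continuing after the current maximum.
    inserts = pd.DataFrame({"id": [6, 7], "amount": [random.randint(10, 100) for _ in range(2)]})

    delta = pd.concat([updates, inserts], ignore_index=True)
    print(delta)
    ```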

  • Save Google Sheets data into BigQuery and MySQL

    Read data from Google Sheets and save a worksheet into BigQuery and MySQL. Steps: Writing data into MySQL See the previous article – Google CoLab and relational database – for Spark Context details and how to get data from a MySQL database. Getting data from Google Sheets Save dataframe to BigQuery Read data from BigQuery Ref: Google CoLab
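
    A condensed, hedged sketch of the Sheets-to-BigQuery leg only, assuming authentication is already configured; the sheet, dataset, table and project names are placeholders, and the MySQL leg is covered in the linked article.

    ```python
    import gspread
    import pandas as pd
    import pandas_gbq

    # One auth option: a service-account JSON key file; adapt to your own setup.
    gc = gspread.service_account()
    worksheet = gc.open("my_sheet").sheet1
    df = pd.DataFrame(worksheet.get_all_records())

    # Write the worksheet to BigQuery, then read it back to confirm.
    pandas_gbq.to_gbq(df, "my_dataset.my_table", project_id="my-project", if_exists="replace")
    check = pandas_gbq.read_gbq("SELECT COUNT(*) AS n FROM my_dataset.my_table", project_id="my-project")
    print(check)
    ```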

  • Google CoLab and relational database

    We will show how to access a relational database using Google CoLab. We will use both a Python module and PySpark to access a MySQL database and write data to it. See other articles related to BigQuery and Google Colab. Prerequisites For example, you will only need mysql-connector-python for accessing a MySQL database. Connect to MySQL database Assuming you…
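
    Two hedged variants of connecting to MySQL, along the lines the excerpt mentions: the plain mysql-connector-python route and the PySpark JDBC reader. Host, credentials, database and table names are placeholders, and the JDBC path also needs the MySQL Connector/J JAR available to Spark.

    ```python
    import mysql.connector
    from pyspark.sql import SparkSession

    # Plain Python: mysql-connector-python
    conn = mysql.connector.connect(host="localhost", user="demo", password="secret", database="demo_db")
    cur = conn.cursor()
    cur.execute("SELECT COUNT(*) FROM customers")
    print(cur.fetchone())
    conn.close()

    # PySpark: read the same table over JDBC
    spark = SparkSession.builder.appName("mysql-jdbc").getOrCreate()
    df = (spark.read.format("jdbc")
          .option("url", "jdbc:mysql://localhost:3306/demo_db")
          .option("dbtable", "customers")
          .option("user", "demo")
          .option("password", "secret")
          .option("driver", "com.mysql.cj.jdbc.Driver")
          .load())
    df.show(5)
    ```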

  • Load data into BigQuery using Terraform

    We will load data into BigQuery using Terraform. Our task: We will use Google Cloud Shell, a tool for managing resources hosted on Google Cloud Platform. It provides a VM with a 5GB home directory that persists across sessions, but the VM is ephemeral and will be reset approximately 20 minutes after your session ends. Terraform…