Category: articles

  • Using Apache Flink to process data from Kafka and MySQL database

    I need to use Apache Flink to process data stored in Kafka and MySQL. In my previous article I shared my notes on how to use a free MySQL server (db4free.net) instance for development work. Apache Flink is a good processing engine and has nice features for manipulating data using batch and/or streaming processing. I…
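
    A minimal sketch of the idea in PyFlink is shown below; the topic, table names, database URL and credentials are placeholders rather than the values used in the article, and the Kafka and JDBC connector JARs need to be on the classpath.

      # Sketch: read JSON events from Kafka and write them to MySQL with PyFlink.
      # Topic, table names and credentials are illustrative placeholders.
      from pyflink.table import EnvironmentSettings, TableEnvironment

      t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

      # Kafka source table (needs flink-sql-connector-kafka on the classpath)
      t_env.execute_sql("""
          CREATE TABLE events (
              id BIGINT,
              payload STRING
          ) WITH (
              'connector' = 'kafka',
              'topic' = 'events',
              'properties.bootstrap.servers' = 'localhost:9092',
              'scan.startup.mode' = 'earliest-offset',
              'format' = 'json'
          )
      """)

      # MySQL sink table (needs flink-connector-jdbc and the MySQL driver)
      t_env.execute_sql("""
          CREATE TABLE events_sink (
              id BIGINT,
              payload STRING
          ) WITH (
              'connector' = 'jdbc',
              'url' = 'jdbc:mysql://db4free.net:3306/mydb',
              'table-name' = 'events',
              'username' = 'user',
              'password' = 'secret'
          )
      """)

      # Continuously copy records from the Kafka topic into the MySQL table
      t_env.execute_sql("INSERT INTO events_sink SELECT id, payload FROM events").wait()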

  • Using Apache Flink to process Apache web log files

    In my previous article I shared my approach for processing Apache web server log files using PySpark. Here I will try to accomplish the same task using Apache Flink. I am using the Apache Flink Python (PyFlink) package and Flink SQL. Flink SQL is an ANSI-standard-compliant SQL engine that can process data both using…
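
    As a rough sketch of the Flink SQL side, the snippet below pulls the HTTP status code out of two made-up combined-log lines; a real job would read the actual log file through a filesystem source instead of from_elements.

      # Sketch: count hits per status code from Apache access-log lines with PyFlink SQL.
      # The two sample lines are invented for illustration.
      from pyflink.table import EnvironmentSettings, TableEnvironment

      t_env = TableEnvironment.create(EnvironmentSettings.in_batch_mode())

      sample_lines = [
          ('127.0.0.1 - - [10/Oct/2023:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 2326',),
          ('127.0.0.1 - - [10/Oct/2023:13:55:37 +0000] "GET /missing HTTP/1.1" 404 209',),
      ]
      t_env.create_temporary_view("access_log", t_env.from_elements(sample_lines, ['line']))

      # The status code is the three-digit field after the quoted request line
      t_env.sql_query("""
          SELECT REGEXP_EXTRACT(line, '" (\\d{3}) ', 1) AS status,
                 COUNT(*) AS hits
          FROM access_log
          GROUP BY REGEXP_EXTRACT(line, '" (\\d{3}) ', 1)
      """).execute().print()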

  • Migrate WordPress to Drupal

    I discovered that my web server has been getting too many xmlrpc.php requests, so it is about time I took action to tighten the security of my site. Most of the leading hosting providers recommend disabling calls to the xmlrpc.php file. I have decided to: In this article, I will share the…

  • Read data from and write data to Kafka using PySpark

    I use Google Colab for my development work. I have set up a Kafka server on my local machine (Raspberry Pi). The task in hand is to use PySpark to read and write data. You can use a batch or streaming query. Spark has extensive documentation on its website under the heading – Structured Streaming + Kafka Integration…
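
    A sketch of the batch path is shown below, assuming a broker reachable at raspberrypi:9092, placeholder topic names, and a spark-sql-kafka package that matches your Spark and Scala versions.

      # Sketch: batch read from and write to Kafka with PySpark.
      # Broker address and topic names are placeholders.
      from pyspark.sql import SparkSession
      from pyspark.sql.functions import col

      spark = (
          SparkSession.builder
          .appName("kafka-read-write")
          .config("spark.jars.packages",
                  "org.apache.spark:spark-sql-kafka-0-10_2.12:3.5.0")
          .getOrCreate()
      )

      # Batch read: pull everything currently in the input topic
      df = (
          spark.read.format("kafka")
          .option("kafka.bootstrap.servers", "raspberrypi:9092")
          .option("subscribe", "input-topic")
          .option("startingOffsets", "earliest")
          .load()
      )

      # Kafka delivers key/value as binary; cast to strings to inspect them
      messages = df.select(col("key").cast("string"), col("value").cast("string"))
      messages.show(truncate=False)

      # Batch write: each row needs a string/binary 'value' column (key is optional)
      (
          messages.selectExpr("key", "value")
          .write.format("kafka")
          .option("kafka.bootstrap.servers", "raspberrypi:9092")
          .option("topic", "output-topic")
          .save()
      )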

  • Apache Spark UI and key metrics

    “Apache Spark™ is a multi-language engine for executing data engineering, data science, and machine learning on single-node machines or clusters.” https://spark.apache.org/ I have been using Apache Spark for developing and testing data pipelines on a single-node machine, mainly using Google Colaboratory. It is vital to understand how the resources are being used and how you can…
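
    As a minimal example, once a SparkSession is running you can print the address of its web UI (the Jobs, Stages, Storage, Executors and SQL tabs) from the driver; on Colab the port usually needs a proxy or tunnel before you can open it.

      # Sketch: locate the Spark UI for a running session.
      from pyspark.sql import SparkSession

      spark = SparkSession.builder.appName("spark-ui-demo").getOrCreate()

      # URL of the live Spark UI for this session (typically port 4040 on the driver)
      print(spark.sparkContext.uiWebUrl)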

  • Take extra care when processing CSV file using Apache Spark

    I found the following comments very relevant and useful. Please visit the link and read. “I’ve been using Spark for some time now, it has not always been smooth sailing. I can understand a tool’s limitations as long as I’m told so, explicitly. The trouble with Apache Spark has been its insistence on having the…
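
    One defensive pattern, sketched below with a hypothetical people.csv, is to declare the schema yourself rather than rely on inferSchema and to keep malformed rows in a corrupt-record column so they can be inspected.

      # Sketch: read a CSV defensively with an explicit schema instead of inferSchema.
      # File name and columns are illustrative.
      from pyspark.sql import SparkSession
      from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DateType

      spark = SparkSession.builder.appName("careful-csv").getOrCreate()

      schema = StructType([
          StructField("id", IntegerType(), True),
          StructField("name", StringType(), True),
          StructField("joined", DateType(), True),
          StructField("_corrupt_record", StringType(), True),  # holds rows Spark cannot parse
      ])

      df = (
          spark.read
          .option("header", "true")
          .option("mode", "PERMISSIVE")                         # keep bad rows instead of failing
          .option("columnNameOfCorruptRecord", "_corrupt_record")
          .option("dateFormat", "yyyy-MM-dd")
          .schema(schema)                                       # explicit schema: no inferSchema surprises
          .csv("people.csv")
      )

      # Cache first so the corrupt-record column can be queried on its own,
      # then review anything Spark failed to parse before trusting the data
      df.cache()
      df.filter(df["_corrupt_record"].isNotNull()).show(truncate=False)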

  • Streaming data processing using Apache Kafka and Flink

    We will use Apache Kafka and Apache Flink to process data from the Companies House Stream API. First, we will set up Apache Kafka and Flink on the Google Colab platform. You can download the script to install both from here – https://gitlab.com/akkasali910/companieshouse-data – or you can use the following bash code: After running the above…
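
    Once Kafka is running, feeding the stream into a topic could look roughly like the sketch below; it assumes the requests and kafka-python packages, a local broker, an API key exported as COMPANIES_HOUSE_KEY, and the public companies stream endpoint, none of which are taken from the article's script.

      # Sketch: forward Companies House stream events into a local Kafka topic.
      # The topic name and environment variable are assumptions for illustration.
      import os
      import requests
      from kafka import KafkaProducer

      STREAM_URL = "https://stream.companieshouse.gov.uk/companies"
      api_key = os.environ["COMPANIES_HOUSE_KEY"]  # hypothetical env var holding your key

      producer = KafkaProducer(bootstrap_servers="localhost:9092")

      # The streaming API keeps the HTTP connection open and sends one JSON event per line
      with requests.get(STREAM_URL, auth=(api_key, ""), stream=True) as response:
          response.raise_for_status()
          for line in response.iter_lines():
              if line:  # skip keep-alive heartbeats (blank lines)
                  producer.send("companies-stream", value=line)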

  • Quick look at Apache Flink

    What is Apache Flink? According to Apache Flink’s website: “Stateful Computations over Data Streams. Apache Flink is a framework and distributed processing engine for stateful computations over unbounded and bounded data streams. Flink has been designed to run in all common cluster environments, perform computations at in-memory speed and at any scale.” https://flink.apache.org We will…

  • Welcome to the world of Generative AI and LLMs

    It seems we are witnessing a digital revolution in the field of Natural Language Processing (NLP). Recent developments and use cases of LLMs (Large Language Models) have created opportunities for knowledge discovery and novel uses of data, making it easier to ask human-like questions and get back generated answers from LLMs. Thanks to OpenAI…

  • Google Earth Engine and Geospatial analysis

    I have just started looking at Google Earth Engine (GEE). We are interested in using satellite images to explore potential use cases: There are a total of 1052 satellites in orbit, and they have generated exabytes of data. The volume of satellite data is growing fast. Google has built one of the largest and most sophisticated data infrastructures in the…