Category: articles

  • Running PostgreSQL database in Termux

    UPDATE – 06/01/2024: added a few useful commands. UPDATE – 10/03/2023: added bytes JSON data using the dataset and psycopg2-binary Python packages. UPDATE – 25/02/2023: after a database update, PostgreSQL may not start and may throw error messages. In this article, we will show how to run a PostgreSQL database in Termux on an Android device. Create skeleton…

  • Change Data Capture Implementation using PySpark

    In this article, we will describe an approach for Change Data Capture Implementation using PySpark. All the functions are included in the example together with test data. This is an updated version of Change data capture ETL pipelines. UPDATE – 10/06/2023: using HIVE SQL to implement find_changes takes less time than processing the dataframe using PySpark.…
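    The idea of change data capture can be illustrated with a minimal plain-Python sketch that compares two snapshots keyed by id. This is not the article's find_changes function, nor its PySpark or HIVE SQL implementation; the name detect_changes, the record layout, the sample data and the I/U/D flags are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple

Row = Dict[str, object]


def detect_changes(previous: List[Row], current: List[Row],
                   key: str = "id") -> List[Tuple[str, Row]]:
    """Compare two snapshots and flag each change.

    Flags (hypothetical convention): I = insert, U = update, D = delete.
    """
    prev_by_key = {row[key]: row for row in previous}
    curr_by_key = {row[key]: row for row in current}

    changes: List[Tuple[str, Row]] = []
    # Rows present now: either new (insert) or different from before (update).
    for k, row in curr_by_key.items():
        if k not in prev_by_key:
            changes.append(("I", row))
        elif row != prev_by_key[k]:
            changes.append(("U", row))
    # Rows that were present before but are missing now (delete).
    for k, row in prev_by_key.items():
        if k not in curr_by_key:
            changes.append(("D", row))
    return changes


# Made-up sample snapshots.
previous = [
    {"id": 1, "name": "Alpha Ltd"},
    {"id": 2, "name": "Beta Ltd"},
    {"id": 3, "name": "Gamma Ltd"},
]
current = [
    {"id": 1, "name": "Alpha Ltd"},
    {"id": 2, "name": "Beta Limited"},
    {"id": 4, "name": "Delta Ltd"},
]

changes = detect_changes(previous, current)
for flag, row in changes:
    print(flag, row)
```

    Unchanged rows are skipped, so only the delta between the two snapshots needs to be loaded downstream.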

  • Saving data into BigQuery

    In this article, we will show how to export data to BigQuery. You may use Databricks or Google Colab to write a PySpark ETL script for saving data into BigQuery. We will use Databricks and Google Cloud Platform for saving data into BigQuery. Prerequisites: set up an account to use Databricks Community Edition; Google Cloud Platform –…

  • Handling errors and warnings in PySpark

    In this article, we will describe how errors and warnings in PySpark can be handled using a dataframe. We will use Companies House data to implement methods for handling errors and warnings. The basic requirement is to write errors and warnings to a Hive table, or to an error or warning file. Requirements…

  • Generating synthetic data using HIVE SQL

    UPDATE – 07/10/2022: using sparkContext.range(). You can use the sparkContext.range() function to generate rows and then use withColumn to add variables to a dataframe. It generates a column named 'id'. If you do not need it, drop it using df.drop('id'). In this article, we will show a way of generating synthetic data using HIVE SQL. You can…

  • Companies House Data – Free and Paid (Part 3 of 3)

    What it does not include (only available in XML / Application Programming Interface – API): Registered office address, Company profile, Search, Officers, Registers, Charges, Filing history, Insolvency, Exemptions, Officer disqualifications, Officer appointments, UK Establishments, Persons with significant control (PSC). For programmers and technical officers, we provide a link to the API reference: https://developer-specs.company-information.service.gov.uk/companies-house-public-data-api/reference Web address, Directors Details,…

  • Running Jupyter Notebook on Android phone

    Running Jupyter Notebook on an Android phone is doable. We will show you the steps to run Jupyter Notebook on your Android phone. Steps to follow: make sure you have Termux installed. Open Termux, then run the following commands at the prompt: $ apt install clang python fftw libzmq freetype libpng pkg-config libcrypt $ LDFLAGS="-lm -lcompiler_rt" pip install…

  • Merge multiple rows sharing id into one row

    UPDATE – 07/07/2022: it can be achieved using a few lines of PySpark code. In this article, we will show how to merge multiple rows sharing an id into one row using PySpark. We will use the Companies House dataset for this article. You may find previous articles about how to get Companies House data…
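    As a rough illustration of the idea, the plain-Python sketch below groups rows by id and collects their values into one row per id. It is not the article's PySpark code; the function name merge_rows and the sample company ids and codes are assumptions. In PySpark the same effect is commonly obtained with groupBy on the id column followed by an aggregation such as collect_list.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def merge_rows(rows: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Collapse (id, value) rows into one entry per id, keeping input order."""
    merged: Dict[str, List[str]] = defaultdict(list)
    for row_id, value in rows:
        merged[row_id].append(value)
    return dict(merged)


# Made-up sample: several rows share the same company id.
rows = [
    ("C1", "62020"),
    ("C1", "62090"),
    ("C2", "70229"),
]

merged = merge_rows(rows)
for row_id, values in merged.items():
    # Join the collected values so each id prints as a single row.
    print(row_id, ",".join(values))
```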

  • Locality Sensitive Hashing for finding similar Company names

    In this article, we will use Locality Sensitive Hashing for finding similar Company names, using data from Companies House as mentioned in the previous article. We will use a PySpark pipeline to streamline the process of finding similar company names. Background: fuzzy (approximate) matching of two strings means calculating how similar the two strings are and one…
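    To show what "how similar two strings are" can mean in practice, here is a minimal plain-Python sketch of Jaccard similarity over character trigrams, the measure that MinHash-based Locality Sensitive Hashing approximates. It is not the article's PySpark pipeline; the function names, the trigram size and the sample company names are assumptions chosen for illustration.

```python
from typing import Set


def shingles(text: str, n: int = 3) -> Set[str]:
    """Return the set of character n-grams of a lower-cased, whitespace-normalised string."""
    normalised = " ".join(text.lower().split())
    if len(normalised) < n:
        # Too short to shingle: treat the whole string as a single token.
        return {normalised}
    return {normalised[i:i + n] for i in range(len(normalised) - n + 1)}


def jaccard(a: str, b: str, n: int = 3) -> float:
    """Jaccard similarity of two strings: shared shingles / all distinct shingles."""
    sa, sb = shingles(a, n), shingles(b, n)
    return len(sa & sb) / len(sa | sb)


# Made-up company names.
close = jaccard("Acme Trading Ltd", "Acme Trading Limited")
far = jaccard("Acme Trading Ltd", "Zenith Foods plc")
print(f"close pair: {close:.2f}, distant pair: {far:.2f}")
```

    Computing this score for every pair of names grows quadratically with the number of companies, which is why LSH is used to narrow the comparison down to likely candidates.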

  • Install Google Cloud CLI in Termux

    Install the gcloud CLI to access Google Cloud Shell via SSH on Android using Termux. First, run curl https://sdk.cloud.google.com | bash. Note: this will fail when trying to install components; ignore this. Then run $PREFIX/google-cloud-sdk/install.sh --override-components (without specifying components), which will add gcloud to $PATH. And then gcloud components install gsutil. Finally, gcloud --console-only. Using the --console-only flag is useful if you’re running…