Author: broadoakdata
-
Change data capture ETL pipelines
In this article we will look into change data capture ETL pipelines. How do you implement change data capture (CDC)? We will use PySpark to implement CDC using data from Companies House. The ETL pipelines have been deployed and tested on Google Cloud Platform. Outline of ETL pipelines for change data capture: Code…
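One common way to implement CDC between two snapshots is to match rows on a key and classify each one as an insert, an update or a delete. The sketch below illustrates that idea in plain Python rather than PySpark, so it runs without a cluster. The `detect_changes` helper, the field names and the sample rows are assumptions made for illustration; they are not the article's code.

```python
from typing import Dict, List

Row = Dict[str, str]


def detect_changes(previous: List[Row], current: List[Row], key: str) -> Dict[str, List[Row]]:
    """Compare two snapshots and classify rows by the kind of change."""
    prev_by_key = {row[key]: row for row in previous}
    curr_by_key = {row[key]: row for row in current}

    changes: Dict[str, List[Row]] = {"insert": [], "update": [], "delete": []}

    for k, row in curr_by_key.items():
        if k not in prev_by_key:
            changes["insert"].append(row)   # key is new in the current snapshot
        elif row != prev_by_key[k]:
            changes["update"].append(row)   # key exists but some field differs

    for k, row in prev_by_key.items():
        if k not in curr_by_key:
            changes["delete"].append(row)   # key disappeared from the current snapshot

    return changes


# Hypothetical sample data, invented for this sketch.
previous = [
    {"CompanyNumber": "001", "CompanyName": "ALPHA LIMITED"},
    {"CompanyNumber": "002", "CompanyName": "BETA LIMITED"},
]
current = [
    {"CompanyNumber": "002", "CompanyName": "BETA HOLDINGS LIMITED"},
    {"CompanyNumber": "003", "CompanyName": "GAMMA LIMITED"},
]

changes = detect_changes(previous, current, key="CompanyNumber")
for kind, rows in changes.items():
    print(kind, [row["CompanyNumber"] for row in rows])
```

In PySpark the same classification is usually expressed as a full outer join of the two DataFrames on the key column, which scales to snapshots that do not fit in memory.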
-
Using AWK to process Hive tables metadata and Hadoop file listing output
Following on from the previous article on Companies House data, this article shows how to extract metadata from Hive tables and the Hadoop filesystem using commands like hdfs, awk and grep. Commands: SQL – show table extended from {database} like {tablename}; HADOOP – hdfs dfs -ls -R {dirname} | grep {searchstr} } |…
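For readers who prefer Python to awk, the filtering and field extraction can be sketched as below. It assumes the usual eight-column layout of `hdfs dfs -ls -R` output (permissions, replication, owner, group, size, date, time, path). The sample listing, its paths and the `parse_listing` helper are hypothetical and only stand in for real command output.

```python
from typing import Dict, List

# Hypothetical listing, shaped like `hdfs dfs -ls -R {dirname}` output.
SAMPLE_LISTING = """\
drwxr-xr-x   - hive hadoop          0 2022-01-05 10:15 /warehouse/companies.db/basic
-rw-r--r--   3 hive hadoop    1048576 2022-01-05 10:15 /warehouse/companies.db/basic/part-00000.parquet
-rw-r--r--   3 hive hadoop    2097152 2022-01-05 10:16 /warehouse/companies.db/basic/part-00001.parquet
"""


def parse_listing(text: str, search: str) -> List[Dict[str, object]]:
    """Keep lines containing `search` (like grep) and split out fields (like awk)."""
    entries = []
    for line in text.splitlines():
        # Split into at most 8 fields so a path containing spaces stays intact.
        fields = line.split(None, 7)
        if len(fields) != 8 or search not in line:
            continue
        perms, _replication, owner, _group, size, date, time, path = fields
        entries.append({
            "is_dir": perms.startswith("d"),
            "owner": owner,
            "size": int(size),
            "modified": f"{date} {time}",
            "path": path,
        })
    return entries


files = parse_listing(SAMPLE_LISTING, "part-")
total_bytes = sum(entry["size"] for entry in files)
print(len(files), "files,", total_bytes, "bytes")
```

In practice the text would come from the command itself, for example through `subprocess.run`, instead of a hard-coded string.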
-
Working with Companies House snapshot data
Get a snapshot of the latest live basic company data (excluding dissolved companies) from http://download.companieshouse.gov.uk/en_output.html. The latest file (http://download.companieshouse.gov.uk/BasicCompanyDataAsOneFile-2022-01-01.zip) contains over 5 million records. It is a CSV file; however, 22 company records contain commas and quotes in the data, so you might have to do extra work to parse those records correctly. It is…
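To see why commas and quotes inside a field need care, compare a naive split on commas with Python's `csv` module. This is a minimal sketch with an invented company name and a reduced set of columns, and it assumes the file follows standard CSV quoting (fields wrapped in double quotes, embedded quotes doubled); the real file may need further handling.

```python
import csv
import io

# Hypothetical two-record sample; the first name contains a comma and quotes.
SAMPLE = (
    'CompanyName,CompanyNumber,PostTown\n'
    '"EXAMPLE, SONS ""AND"" CO LIMITED",01234567,LEEDS\n'
    'PLAIN LIMITED,07654321,YORK\n'
)

# Naive approach: splitting on commas breaks the quoted name into two fields.
naive_fields = SAMPLE.splitlines()[1].split(",")
print(len(naive_fields))  # one field more than the three columns

# csv.DictReader honours the quoting and keeps the name in one field.
rows = list(csv.DictReader(io.StringIO(SAMPLE)))
for row in rows:
    print(row["CompanyNumber"], "|", row["CompanyName"])
```

Spark's CSV reader exposes comparable `quote` and `escape` options, which is one place to look if the default settings split these records incorrectly.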
-
How to get Companies House data using REST API
Companies House provides a REST API, which lets you retrieve information about limited companies in the UK. In this article, we will use a Python script to get Companies House data using the REST API. Prerequisites: we need to register with Companies House to access the REST API. Follow the steps below: Create a developer account Create an application…
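Once an API key is available, a request for a company profile might look like the sketch below. It assumes the public endpoint `https://api.company-information.service.gov.uk/company/{company_number}` and HTTP Basic authentication with the API key as the username and an empty password. The helper names, the placeholder key and the example company number are illustrative only, and this is not the article's script.

```python
import base64
import json
import urllib.request

# Assumed base URL of the Companies House public data API.
BASE_URL = "https://api.company-information.service.gov.uk"


def build_company_request(company_number: str, api_key: str) -> urllib.request.Request:
    """Build an authenticated GET request for one company profile."""
    # Basic auth: the API key is the username, the password is left empty.
    token = base64.b64encode(f"{api_key}:".encode("ascii")).decode("ascii")
    return urllib.request.Request(
        f"{BASE_URL}/company/{company_number}",
        headers={"Authorization": f"Basic {token}"},
    )


def fetch_company_profile(company_number: str, api_key: str) -> dict:
    """Send the request and decode the JSON body (needs network access and a valid key)."""
    request = build_company_request(company_number, api_key)
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


# Building the request does not contact the server, so this runs offline.
request = build_company_request("00000006", "my-api-key")  # placeholder values
print(request.full_url)
```

Calling `fetch_company_profile` with a real key returns the profile as a dictionary; keep the key out of source control, for example by reading it from an environment variable.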
-
permalink – page not found issue
Solving the page not found issue when using permalinks. This post is about how to solve the page not found problem when using the Permalinks feature of WordPress on a Raspberry Pi. About the issue: when using “Post Name” or “Day and name” as the permalink, a not found error occurs when accessing the link in…
-
Update WordPress without FTP
This is a common issue whereby the WordPress system cannot write to the /wp-content folder directly. Solution: to solve this issue you need to define the FTP details in the wp-config.php file so WordPress will remember them. Alternatively, you may also provide WordPress with write access to the /wp-content folder by accessing the FTP root file and…
-
Raspberry Pi 4 and 400 with portable monitor
Investigating the usability of a single-board personal computer with a portable keyboard and monitor. Software: downloaded the latest Raspberry Pi OS from https://www.raspberrypi.com/software/operating-systems/ and installed Apache2, LibreOffice, WordPress and GIMP. Examples of a few hardware configurations
-
Python code for connecting to remote MySQL database using SSH
Working code is given below: log in to the remote database and map the port to the host
-
Create CompaniesHouse index in Elasticsearch using PySpark
We are using Spark 3.1.2 (spark._sc.version). Elasticsearch (7.9.3) is running in a Docker container with port 9200 exposed to the host. Prerequisites: get elasticsearch-spark-30_2.12-7.12.0.jar and add it to the spark-jar classpath; read the Companies House data into a DataFrame; write the DataFrame to Elasticsearch. Code snippets are listed below
-
Deploy WordPress, MySQL, phpMyAdmin, Elasticsearch and Kibana with docker-compose
This assumes you already have Docker installed. Create a docker-compose.yml file with the following code: