Companies House provides several streaming endpoints for obtaining company data. The endpoint https://stream.companieshouse.gov.uk/companies streams the latest changes recorded against company profiles. Not every data item will be present for every company.
Data capture and storage options
- A streaming process to extract data from the endpoint:
  - An HTTP library for sending the GET request and handling the response from the server, e.g. response = requests.get(url, auth=(api_key, ''), stream=True)
  - Processing of the response data and saving it to persistent storage. There are many options:
    - Kafka
    - a relational database such as PostgreSQL
    - cloud storage such as Google Cloud Storage or BigQuery, or AWS DynamoDB or S3
- Deploy the streaming ETL script on a local server or with your preferred cloud provider, for example as a serverless function on Google Cloud Functions or AWS Lambda.
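The capture step can be sketched as follows. This is a minimal illustration rather than a reference client: the function names are made up for this example, and it assumes the stream delivers one JSON document per line, that blank lines are keep-alive heartbeats, and that a `timepoint` query parameter can be used to resume from a known position.

```python
import json
from typing import Iterable, Iterator, Optional

STREAM_URL = "https://stream.companieshouse.gov.uk/companies"


def parse_stream_lines(lines: Iterable[bytes]) -> Iterator[dict]:
    """Yield one decoded JSON document per non-empty line.

    Blank lines are skipped; they are assumed to be keep-alive heartbeats.
    """
    for line in lines:
        if not line.strip():
            continue
        yield json.loads(line)


def stream_company_events(api_key: str, timepoint: Optional[int] = None) -> Iterator[dict]:
    """Open the stream and yield each event as a Python dict (hypothetical helper)."""
    # Third-party dependency, imported here so the parser above works without it.
    import requests

    # Assumption: passing a timepoint resumes the stream from that position.
    params = {"timepoint": timepoint} if timepoint is not None else None
    with requests.get(
        STREAM_URL,
        auth=(api_key, ""),  # API key as the user name, empty password
        params=params,
        stream=True,
        timeout=(10, None),  # 10 s to connect, no read timeout on a long-lived stream
    ) as response:
        response.raise_for_status()
        yield from parse_stream_lines(response.iter_lines())
```

Each dict yielded by `stream_company_events(api_key)` can then be handed to whichever storage option you choose. Keeping the line parsing separate from the network call makes it easy to test against saved sample lines.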
Processing data using AWS DynamoDB Local
DynamoDB Local stores its data in an SQLite database file, so it is possible to pull data from the database directly. However, the data is saved in the DynamoDB-specific JSON format and needs to be converted to normal JSON for compatibility and ease of use.
SQLite
sqlite3 -header -csv ./data/shared-local-instance.db "select * from company_profile;" > company_profile.csv
Or, using the AWS command line tool:
aws dynamodb scan --table-name company_profile --endpoint-url http://localhost:8000 --max-items 2 --output json > ./export.json
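The SQLite export can also be scripted with Python's standard library. The sketch below is an assumption-laden illustration: `export_table_to_csv` is a hypothetical helper, and it assumes the SQLite table carries the same name as the DynamoDB table, as in the sqlite3 command above. Binary columns are decoded as UTF-8 text so that they are readable in the CSV.

```python
import csv
import sqlite3


def export_table_to_csv(db_path: str, table: str, csv_path: str) -> int:
    """Write every row of `table` to `csv_path` with a header row; return the row count."""
    # Table names cannot be bound as SQL parameters, so validate before interpolating.
    if not table.replace("_", "").isalnum():
        raise ValueError(f"unexpected table name: {table!r}")

    connection = sqlite3.connect(db_path)
    try:
        cursor = connection.execute(f'SELECT * FROM "{table}"')
        header = [column[0] for column in cursor.description]
        count = 0
        with open(csv_path, "w", newline="", encoding="utf-8") as handle:
            writer = csv.writer(handle)
            writer.writerow(header)
            for row in cursor:
                # Decode BLOB columns so they are written as text, not as b'...'.
                writer.writerow(
                    [v.decode("utf-8", "replace") if isinstance(v, bytes) else v for v in row]
                )
                count += 1
    finally:
        connection.close()
    return count
```

For example, `export_table_to_csv("./data/shared-local-instance.db", "company_profile", "company_profile.csv")` mirrors the command-line export.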
Processing the DynamoDB JSON file
- Extract data from DynamoDB Local using the AWS command line tool or the SQLite database, e.g. aws dynamodb scan --table-name company_profile --endpoint-url http://localhost:8000 --max-items 2 --output json > ./export.json
- Read the exported file and convert it to normal JSON format
- Create a dataframe for further use or data analysis with Pandas or Spark (provided you have a Spark session running)
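import json

import pandas as pd


def cast_number(value):
    # DynamoDB JSON stores numbers as strings, e.g. {"N": "42"}
    try:
        return int(value)
    except ValueError:
        return float(value)


# converters keyed by DynamoDB type descriptor
DATA_TYPES = {
    "S": lambda x: str(x),
    "BOOL": lambda x: bool(x),  # "B" is binary in DynamoDB; booleans use "BOOL"
    "N": cast_number,
    "NULL": lambda x: None,
    "L": lambda x: cast_value_by_type(x),
    "M": lambda x: cast_value_by_type(x),
}


# convert DynamoDB JSON to normal JSON
def cast_value_by_type(raw):
    if isinstance(raw, list):
        return [cast_value_by_type(item) for item in raw]
    if isinstance(raw, dict):
        # a typed value is a single-key dict such as {"S": "ltd"}
        if len(raw) == 1:
            key, value = next(iter(raw.items()))
            if key in DATA_TYPES:
                return DATA_TYPES[key](value)
        # otherwise it is a mapping of attribute names to (typed) values
        return {key: cast_value_by_type(value) for key, value in raw.items()}
    # scalars are returned unchanged
    return raw


# read the DynamoDB JSON file
file_path = "/content/export.json"
with open(file_path, "r") as f:
    dynamo_data = json.load(f)

# DynamoDB JSON to normal JSON
dynamo_json_data = cast_value_by_type(dynamo_data)

# save the converted data so that it can be loaded into Pandas
with open("/content/dump-export.json", "w") as f:
    json.dump(dynamo_json_data, f)

# create a Pandas dataframe
df = pd.read_json("/content/dump-export.json")

# create another dataframe for the nested JSON field
df1 = pd.json_normalize(df['Items'])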
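To make the conversion concrete, here is a small self-contained worked example of DynamoDB JSON before and after. It is an alternative sketch, not the code above: `from_dynamodb_value` and `from_dynamodb_item` are names invented for this illustration, the sample item is hand-written, and sets and binary values are deliberately left untouched.

```python
from decimal import Decimal


def from_dynamodb_value(value: dict):
    """Convert one typed value such as {"S": "ltd"} to a plain Python value."""
    if len(value) != 1:
        raise ValueError("a typed DynamoDB value must have exactly one key")
    type_code, payload = next(iter(value.items()))
    if type_code == "S":
        return payload
    if type_code == "N":
        # Numbers arrive as strings; return int when the value is integral.
        number = Decimal(payload)
        return int(number) if number == number.to_integral_value() else float(number)
    if type_code == "BOOL":
        return bool(payload)
    if type_code == "NULL":
        return None
    if type_code == "L":
        return [from_dynamodb_value(item) for item in payload]
    if type_code == "M":
        return {key: from_dynamodb_value(item) for key, item in payload.items()}
    if type_code in ("B", "SS", "NS", "BS"):
        return payload  # binary and set types are passed through in this sketch
    raise ValueError(f"unknown DynamoDB type code: {type_code!r}")


def from_dynamodb_item(item: dict) -> dict:
    """Convert an item, i.e. a mapping of attribute names to typed values."""
    return {name: from_dynamodb_value(value) for name, value in item.items()}


# Hand-written sample item in DynamoDB JSON format.
sample_item = {
    "company_number": {"S": "08632930"},
    "can_file": {"BOOL": True},
    "sic_codes": {"L": [{"S": "96090"}]},
    "accounting_reference_date": {"M": {"day": {"S": "31"}, "month": {"S": "08"}}},
}
print(from_dynamodb_item(sample_item))
```

Because this version dispatches only on the type descriptor of each value, an attribute that happens to be named "S" or "N" is not mistaken for a type code. If boto3 is installed, its `boto3.dynamodb.types.TypeDeserializer` class performs a similar conversion, returning numbers as `Decimal`.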
Company Profile data items description
The company profile resource data.
Company accounts information.
The Accounting Reference Date (ARD) of the company.
The Accounting Reference Date (ARD) day.
The Accounting Reference Date (ARD) month.
The last company accounts filed.
The date the last company accounts were made up to.
The type of the last company accounts filed. For enumeration descriptions see account_type section in the enumeration mappings. Possible values are: null, full, small, medium, group, dormant, interim, initial, total-exemption-full, total-exemption-small, partial-exemption, audit-exemption-subsidiary, filing-exemption-subsidiary, micro-entity, no-accounts-type-available, audited-abridged, unaudited-abridged.
The date the next company accounts are due.
The date the next company accounts should be made up to.
Flag indicating if the company accounts are overdue.
Annual return information. This member is only returned if a confirmation statement has not been filed.
The date the last annual return was made up to.
The date the next annual return is due. This member will only be returned if a confirmation statement has not been filed and the date is before 28th July 2016, otherwise refer to confirmation_statement.next_due.
The date the next annual return should be made up to. This member will only be returned if a confirmation statement has not been filed and the date is before 30th July 2016, otherwise refer to confirmation_statement.next_made_up_to.
Flag indicating if the annual return is overdue.
UK branch of a foreign company.
Type of business undertaken by the UK establishment.
Parent company name.
Parent company number.
Flag indicating whether this company can file.
The name of the company.
The number of the company.
The status of the company. For enumeration descriptions see company_status section in the enumeration mappings. Possible values are: active, dissolved, liquidation, receivership, administration, voluntary-arrangement, converted-closed, insolvency-proceedings, registered, removed, closed, open.
Extra details about the status of the company. For enumeration descriptions see company_status_detail section in the enumeration mappings. Possible values are: transferred-from-uk, active-proposal-to-strike-off, petition-to-restore-dissolved, transformed-to-se, converted-to-plc.
Confirmation statement information (N.B. refers to the Annual Statement where type is registered-overseas-entity).
The date to which the company last made a confirmation statement.
The date by which the next confirmation statement must be received.
The date to which the company must next make a confirmation statement.
Flag indicating if the confirmation statement is overdue.
The date on which the company was converted/closed, dissolved or removed. Please refer to company status to determine which.
The date when the company was created.
The ETag of the resource.
Foreign company details.
Accounts requirement.
Type of accounting requirement that applies. For enumeration descriptions see foreign_account_type section in the enumeration mappings. Possible values are: accounting-requirements-of-originating-country-apply, accounting-requirements-of-originating-country-do-not-apply.
Describes how the publication date is derived. For enumeration descriptions see terms_of_account_publication section in the enumeration mappings. Possible values are: accounts-publication-date-supplied-by-company, accounting-publication-date-does-not-need-to-be-supplied-by-company, accounting-reference-date-allocated-by-companies-house.
Foreign company account information.
Date account period starts under parent law.
Day on which accounting period starts under parent law.
Month in which accounting period starts under parent law.
Date account period ends under parent law.
Day on which accounting period ends under parent law.
Month in which accounting period ends under parent law.
Time allowed from period end for disclosure of accounts under parent law.
Number of months within which to file.
Type of business undertaken by the company.
Legal form of the company in the country of incorporation.
Law governing the company in country of incorporation.
Whether the company is a financial or credit institution.
Company origin information.
Country in which company was incorporated.
Identity of register in country of incorporation.
Registration number in country of incorporation.
The flag indicating if the company has been liquidated in the past.
The flag indicating if the company has any charges.
The flag indicating if the company has insolvency history.
The flag indicating if the company is a Community Interest Company.
The jurisdiction specifies the political body responsible for the company. Possible values are: england-wales, wales, scotland, northern-ireland, european-union, united-kingdom, england, noneu.
The date of last full members list update.
A set of URLs related to the resource, including self.
The URL of the persons with significant control list resource.
The URL of the persons with significant control statements list resource.
The URL of the registers resource for this company.
The URL of the resource.
The previous names of this company.
The date on which the company name ceased.
The date from which the company name was effective.
The previous company name.
The address of the company’s registered office.
The first line of the address.
The second line of the address.
The care of name.
The country. Possible values are: Wales, England, Scotland, Great Britain, Not specified, United Kingdom, Northern Ireland.
The locality e.g. London.
The post-office box number.
The postal code e.g. CF14 3UZ.
The property name or number.
The region e.g. Surrey.
Flag indicating registered office address has been replaced.
The correspondence address of a registered overseas entity.
The first line of the address.
The second line of the address.
The care of name.
The country e.g. United Kingdom.
The locality e.g. London.
The post-office box number.
The postal code e.g. CF14 3UZ.
The region e.g. Surrey.
SIC codes for this company.
The total count of super secure managing officers for a registered-overseas-entity.
The type of the company. For enumeration descriptions see company_type section in the enumeration mappings. Possible values are: private-unlimited, ltd, plc, old-public-company, private-limited-guarant-nsc-limited-exemption, limited-partnership, private-limited-guarant-nsc, converted-or-closed, private-unlimited-nsc, private-limited-shares-section-30-exemption, protected-cell-company, assurance-company, oversea-company, eeig, icvc-securities, icvc-warrant, icvc-umbrella, registered-society-non-jurisdictional, industrial-and-provident-society, northern-ireland, northern-ireland-other, royal-charter, investment-company-with-variable-capital, unregistered-company, llp, other, european-public-limited-liability-company-se, uk-establishment, scottish-partnership, charitable-incorporated-organisation, scottish-charitable-incorporated-organisation, further-education-or-sixth-form-college-corporation, registered-overseas-entity.
Flag indicating whether post can be delivered to the registered office.
Link to the related resource.
Array of fields that have been changed by this event. Nested fields are referenced by dot notation e.g. links.document_metadata.
The date and time the data notification was raised.
The point-in-time identifier for this stream document. Use to re-establish a connection to the stream at this point.
The type of event denoted by this stream document. Possible values are: changed, deleted.
The ID of the resource.
The type of resource contained within the stream document. Possible values are: company-profile#company-profile, filing-history#filing-history.
The URI of the resource.
With an understanding of these data items, you should be able to draw better business insight from the streaming data. A sample record:
{
  "Item": [
    {
      "resource_id": "08632930",
      "data.company_status": "active",
      "item": {
        "resource_id": "08632930",
        "resource_kind": "company-profile",
        "data": {
          "jurisdiction": "england-wales",
          "type": "ltd",
          "sic_codes": [
            "96090"
          ],
          "company_number": "08632930",
          "last_full_members_list_date": "2015-08-01",
          "confirmation_statement": {
            "last_made_up_to": "2023-08-01",
            "next_made_up_to": "2024-08-01",
            "next_due": "2024-08-15"
          },
          "company_name": "AR ART CONSULTING LTD.",
          "date_of_creation": "2013-08-01",
          "registered_office_address": {
            "address_line_1": "111a High Street",
            "locality": "Harrow",
            "care_of": "MUNNA MANJI ACCOUNTANTS",
            "address_line_2": "Wealdstone",
            "postal_code": "HA3 5DL",
            "region": "Middlesex"
          },
          "company_status": "active",
          "etag": "a642233483044571b496286e48ec1ff32dc94469",
          "links": {
            "self": "/company/08632930",
            "persons_with_significant_control": "/company/08632930/persons-with-significant-control",
            "filing_history": "/company/08632930/filing-history",
            "officers": "/company/08632930/officers"
          },
          "accounts": {
            "next_made_up_to": "2024-08-31",
            "last_accounts": {
              "period_end_on": "2023-08-31",
              "made_up_to": "2023-08-31",
              "period_start_on": "2022-09-01",
              "type": "total-exemption-full"
            },
            "next_accounts": {
              "period_end_on": "2024-08-31",
              "due_on": "2025-05-31",
              "period_start_on": "2023-09-01"
            },
            "next_due": "2025-05-31",
            "accounting_reference_date": {
              "month": "08",
              "day": "31"
            }
          },
          "can_file": {
            "BOOL": true
          }
        },
        "event": {
          "published_at": "2024-01-23T22:21:05",
          "type": "changed",
          "timepoint": "72513023"
        },
        "resource_uri": "/company/08632930"
      }
    }
  ]
}
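As one illustration of turning such a record into something usable, the sketch below pulls a few headline fields out of a stream document and parses the accounts due date. The field names are taken from the sample record above; `summarise_event` is a hypothetical helper, and it assumes the stream document is the object stored under the "item" key of each record.

```python
from datetime import date


def summarise_event(document: dict) -> dict:
    """Extract a few headline fields from one stream document."""
    data = document.get("data", {})
    event = document.get("event", {})
    accounts_due = data.get("accounts", {}).get("next_due")
    return {
        "company_number": data.get("company_number"),
        "company_name": data.get("company_name"),
        "company_status": data.get("company_status"),
        "event_type": event.get("type"),
        # Keep the timepoint so a dropped connection can be resumed from it.
        "timepoint": event.get("timepoint"),
        # Dates arrive as ISO strings (YYYY-MM-DD); missing values stay None.
        "accounts_next_due": date.fromisoformat(accounts_due) if accounts_due else None,
    }


def days_until_accounts_due(summary: dict, today: date):
    """Return the number of days until accounts are due, or None if unknown."""
    due = summary["accounts_next_due"]
    return (due - today).days if due is not None else None
```

With the sample loaded into a variable named `sample`, the summaries would be built with `[summarise_event(record["item"]) for record in sample["Item"]]`. Using `.get` throughout reflects the fact that not every data item is present for every company.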