Looking at Companies House Profile stream data

  • Streaming process to extract data from the endpoint
    • An HTTP library for sending the GET request and handling the response from the server, e.g. response = requests.get(url, auth=(api_key, ''), stream=True)
  • Process the response data and save it to persistent storage. There are many options:
    • Kafka
    • a relational database such as PostgreSQL
    • cloud storage or databases such as Google Cloud Storage, BigQuery, AWS DynamoDB or S3
  • Deploy the streaming ETL script on your local server or with your preferred cloud provider, for example as a serverless function on Google Cloud Functions or AWS Lambda.
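The extraction step above can be sketched roughly as follows. This is a minimal sketch, not a definitive client: the stream URL, the function names parse_events and stream_events, and the idea that blank lines are keep-alive heartbeats are all assumptions that should be checked against the streaming API reference.

```python
import json
from typing import Iterable, Iterator

# Assumed endpoint for the company profile stream; verify it in the API reference.
STREAM_URL = "https://stream.companieshouse.gov.uk/companies"


def parse_events(lines: Iterable[bytes]) -> Iterator[dict]:
    """Yield one dict per non-empty line, assuming one JSON document per line."""
    for line in lines:
        if not line.strip():
            continue  # blank lines are assumed to be keep-alive heartbeats
        yield json.loads(line)


def stream_events(url: str, api_key: str) -> Iterator[dict]:
    """Open the stream with HTTP basic auth (API key as username, empty password)."""
    import requests  # third-party; imported here so parse_events works without it

    with requests.get(url, auth=(api_key, ""), stream=True) as response:
        response.raise_for_status()
        yield from parse_events(response.iter_lines())


# Example use (needs a valid streaming API key and network access):
# for event in stream_events(STREAM_URL, api_key="your-api-key"):
#     print(event["resource_id"], event["event"]["type"])
```

Keeping the parsing separate from the HTTP call makes it possible to test the processing logic on canned lines before connecting to the live stream.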
SQLite 
sqlite3 -header -csv ./data/shared-local-instance.db "select * from company_profile;" > company_profile.csv

or using the AWS command line tool
aws dynamodb scan --table-name company_profile --endpoint-url http://localhost:8000 --max-items 2 --output json > ./export.json
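The SQLite export above can also be sketched in Python with only the standard library. The helper name export_table_to_csv and the two-column demo schema are made up for illustration; the real company_profile table will have whatever columns your ETL script created.

```python
import csv
import io
import sqlite3


def export_table_to_csv(conn: sqlite3.Connection, table: str, out) -> int:
    """Write every row of `table` to `out` as CSV with a header row; return the row count."""
    # The table name is interpolated directly, so it must come from trusted code.
    cursor = conn.execute(f'SELECT * FROM "{table}"')
    writer = csv.writer(out)
    writer.writerow([column[0] for column in cursor.description])
    count = 0
    for row in cursor:
        writer.writerow(row)
        count += 1
    return count


# Demo on an in-memory database with a hypothetical two-column schema.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE company_profile (resource_id TEXT, company_status TEXT)")
conn.execute("INSERT INTO company_profile VALUES ('08632930', 'active')")
buffer = io.StringIO()
rows_written = export_table_to_csv(conn, "company_profile", buffer)
print(buffer.getvalue())
```

To write a file instead of an in-memory buffer, pass open("company_profile.csv", "w", newline="") as the out argument and connect to ./data/shared-local-instance.db.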
DynamoDB output
  • Extract data from DynamoDB Local using the AWS command line tool, or from the SQLite database, e.g. aws dynamodb scan --table-name company_profile --endpoint-url http://localhost:8000 --max-items 2 --output json > ./export.json
  • Read the exported file and convert it to plain JSON
  • Create a dataframe for further use or data analysis with Pandas or Spark (provided you have a running Spark session)
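# map DynamoDB type descriptors to Python converters
DATA_TYPES = {
  "S": str,                              # string
  "N": str,                              # number, kept as a string
  "B": str,                              # binary, kept as its base64 string
  "BOOL": bool,                          # boolean
  "L": lambda x: cast_value_by_type(x),  # list
  "M": lambda x: cast_value_by_type(x),  # map
}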

# convert DynamoDB JSON to plain JSON
def cast_value_by_type(raw):
  if isinstance(raw, list):
    # convert every element of the list
    return [cast_value_by_type(item) for item in raw]

  if isinstance(raw, dict):
    for key in raw:
      if key in DATA_TYPES:
        # type descriptor such as {"S": "text"}: unwrap the value
        return DATA_TYPES[key](raw[key])
      # ordinary attribute name: convert its value in place
      raw[key] = cast_value_by_type(raw[key])
    return raw

  # plain values (str, int, bool, None) are returned unchanged
  return raw
import json
import pandas as pd

# read the DynamoDB JSON file
file_path = "/content/export.json"
with open(file_path, "r") as f:
  dynamo_data = json.load(f)

# DynamoDB JSON to plain JSON
dynamo_json_data = cast_value_by_type(dynamo_data)

# save the converted data so it can be loaded with Pandas
with open("/content/dump-export.json", "w") as f:
  json.dump(dynamo_json_data, f)

# create a Pandas dataframe
df = pd.read_json("/content/dump-export.json")

# create another dataframe from the nested JSON field
df1 = pd.json_normalize(df['Items'])
DynamoDB json to normal JSON
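To make the conversion concrete, here is a small worked example on a hand-written item. It is a stricter variant for illustration, not the listing above: the names unwrap and unwrap_item are hypothetical, only a subset of type tags is handled, and numbers are converted to int or float rather than kept as strings.

```python
def unwrap(descriptor: dict):
    """Convert one DynamoDB attribute value, e.g. {"S": "x"}, to a plain Python value."""
    tag, inner = next(iter(descriptor.items()))
    if tag in ("S", "B"):
        return inner  # strings and base64-encoded binary are left as they are
    if tag == "N":
        # DynamoDB JSON carries numbers as strings
        return int(inner) if inner.lstrip("-").isdigit() else float(inner)
    if tag == "BOOL":
        return bool(inner)
    if tag == "NULL":
        return None
    if tag == "L":
        return [unwrap(item) for item in inner]
    if tag == "M":
        return {key: unwrap(item) for key, item in inner.items()}
    raise ValueError(f"unsupported DynamoDB type tag: {tag}")


def unwrap_item(item: dict) -> dict:
    """Convert a whole item, i.e. a mapping of attribute name to descriptor."""
    return {name: unwrap(descriptor) for name, descriptor in item.items()}


item = {
    "resource_id": {"S": "08632930"},
    "data": {"M": {
        "can_file": {"BOOL": True},
        "sic_codes": {"L": [{"S": "96090"}]},
    }},
}
plain = unwrap_item(item)
print(plain)
# {'resource_id': '08632930', 'data': {'can_file': True, 'sic_codes': ['96090']}}
```

Because every value is treated as a descriptor and every key of an item or map as an attribute name, an attribute that happens to be called "S" or "N" is not mistaken for a type tag.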

Company Profile data items description

Ref: https://developer-specs.company-information.service.gov.uk/streaming-api/resources/companyprofilestream?v=latest

The company Profile resource data.
Company accounts information.
The Accounting Reference Date (ARD) of the company.
The Accounting Reference Date (ARD) day.
The Accounting Reference Date (ARD) month.
The last company accounts filed.
The date the last company accounts were made up to.
The type of the last company accounts filed.
For enumeration descriptions see account_type section in the enumeration mappings.
Possible values are:
null
full
small
medium
group
dormant
interim
initial
total-exemption-full
total-exemption-small
partial-exemption
audit-exemption-subsidiary
filing-exemption-subsidiary
micro-entity
no-accounts-type-available
audited-abridged
unaudited-abridged
The date the next company accounts are due.
The date the next company accounts should be made up to.
Flag indicating if the company accounts are overdue.
Annual return information. This member is only returned if a confirmation statement has not been filed.
The date the last annual return was made up to.
The date the next annual return is due. This member will only be returned if a confirmation statement has not been filed and the date is before 28th July 2016, otherwise refer to confirmation_statement.next_due
The date the next annual return should be made up to. This member will only be returned if a confirmation statement has not been filed and the date is before 30th July 2016, otherwise refer to confirmation_statement.next_made_up_to
Flag indicating if the annual return is overdue.
UK branch of a foreign company.
Type of business undertaken by the UK establishment.
Parent company name.
Parent company number.
Flag indicating whether this company can file.
The name of the company.
The number of the company.
The status of the company.
For enumeration descriptions see company_status section in the enumeration mappings
Possible values are:
active
dissolved
liquidation
receivership
administration
voluntary-arrangement
converted-closed
insolvency-proceedings
registered
removed
closed
open
Extra details about the status of the company.
For enumeration descriptions see company_status_detail section in the enumeration mappings.
Possible values are:
transferred-from-uk
active-proposal-to-strike-off
petition-to-restore-dissolved
transformed-to-se
converted-to-plc
Confirmation statement information (N.B. refers to the Annual Statement where type is registered-overseas-entity)
The date to which the company last made a confirmation statement.
The date by which the next confirmation statement must be received.
The date to which the company must next make a confirmation statement.
Flag indicating if the confirmation statement is overdue
The date which the company was converted/closed, dissolved or removed. Please refer to company status to determine which.
The date when the company was created.
The ETag of the resource.
Foreign company details.
Accounts requirement.
Type of accounting requirement that applies.
For enumeration descriptions see foreign_account_type section in the enumeration mappings.
Possible values are:
accounting-requirements-of-originating-country-apply
accounting-requirements-of-originating-country-do-not-apply
Describes how the publication date is derived.
For enumeration descriptions see terms_of_account_publication section in the enumeration mappings.
Possible values are:
accounts-publication-date-supplied-by-company
accounting-publication-date-does-not-need-to-be-supplied-by-company
accounting-reference-date-allocated-by-companies-house
Foreign company account information.
Date account period starts under parent law.
Day on which accounting period starts under parent law.
Month in which accounting period starts under parent law.
Date account period ends under parent law.
Day on which accounting period ends under parent law.
Month in which accounting period ends under parent law.
Time allowed from period end for disclosure of accounts under parent law.
Number of months within which to file.
Type of business undertaken by the company.
Legal form of the company in the country of incorporation.
Law governing the company in country of incorporation.
Is it a financial or credit institution.
Company origin information.
Country in which company was incorporated.
Identity of register in country of incorporation.
Registration number in country of incorporation.
The flag indicating if the company has been liquidated in the past.
The flag indicating if the company has any charges.
The flag indicating if the company has insolvency history.
The flag indicating if the company is a Community Interest Company.
The jurisdiction specifies the political body responsible for the company.
Possible values are:
england-wales
wales
scotland
northern-ireland
european-union
united-kingdom
england
noneu
The date of last full members list update.
A set of URLs related to the resource, including self.
The URL of the persons with significant control list resource.
The URL of the persons with significant control statements list resource.
The URL of the registers resource for this company
The URL of the resource.
The previous names of this company.
The date on which the company name ceased.
The date from which the company name was effective.
The previous company name
The address of the company’s registered office.
The first line of the address.
The second line of the address.
The care of name.
The country.
Possible values are:
Wales
England
Scotland
Great Britain
Not specified
United Kingdom
Northern Ireland
The locality e.g. London.
The post-office box number.
The postal code e.g. CF14 3UZ.
The property name or number.
The region e.g. Surrey.
Flag indicating registered office address has been replaced.
The correspondence address of a Registered overseas entity
The first line of the address.
The second line of the address.
The care of name.
The country e.g. United Kingdom.
The locality e.g. London.
The post-office box number.
The postal code e.g. CF14 3UZ.
The region e.g. Surrey.
SIC codes for this company.
The total count of super secure managing officers for a registered-overseas-entity.
The type of the company.
For enumeration descriptions see company_type section in the enumeration mappings
Possible values are:
private-unlimited
ltd
plc
old-public-company
private-limited-guarant-nsc-limited-exemption
limited-partnership
private-limited-guarant-nsc
converted-or-closed
private-unlimited-nsc
private-limited-shares-section-30-exemption
protected-cell-company
assurance-company
oversea-company
eeig
icvc-securities
icvc-warrant
icvc-umbrella
registered-society-non-jurisdictional
industrial-and-provident-society
northern-ireland
northern-ireland-other
royal-charter
investment-company-with-variable-capital
unregistered-company
llp
other
european-public-limited-liability-company-se
uk-establishment
scottish-partnership
charitable-incorporated-organisation
scottish-charitable-incorporated-organisation
further-education-or-sixth-form-college-corporation
registered-overseas-entity
Flag indicating whether post can be delivered to the registered office.
Link to the related resource
Array of fields that have been changed by this event. Nested fields are referenced by dot notation e.g. links.document_metadata
The date and time the data notification was raised
The point-in-time identifier for this stream document. Use to re-establish a connection to the stream at this point.
The type of event denoted by this stream document.
Possible values are:
changed
deleted
The ID of the resource.
The type of resource contained within the stream document.
Possible values are:
company-profile#company-profile
filing-history#filing-history
The URI of the resource.
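The event fields described above (event type, timepoint and resource ID) are what a consumer typically needs to keep a local copy in sync and to reconnect after a dropped connection. The following is a minimal sketch under stated assumptions: the helper names apply_event and resume_url are hypothetical, the in-memory dict stands in for real storage, and passing the timepoint as a timepoint query parameter is an assumption to confirm in the API reference.

```python
from urllib.parse import urlencode


def apply_event(store: dict, event: dict) -> str:
    """Apply one stream document to a store keyed by resource ID; return its timepoint."""
    if event["event"]["type"] == "deleted":
        store.pop(event["resource_id"], None)
    else:  # "changed": insert or overwrite the profile
        store[event["resource_id"]] = event.get("data", {})
    return event["event"]["timepoint"]


def resume_url(base_url: str, timepoint) -> str:
    """Build a URL that asks the stream to resume from a given timepoint (assumed parameter)."""
    return f"{base_url}?{urlencode({'timepoint': timepoint})}"


store = {}
changed = {
    "resource_id": "08632930",
    "resource_kind": "company-profile",
    "data": {"company_status": "active"},
    "event": {"type": "changed", "timepoint": "72513023"},
}
last_timepoint = apply_event(store, changed)
print(resume_url("https://stream.companieshouse.gov.uk/companies", last_timepoint))
```

Persisting the last timepoint alongside the data means a restarted consumer can pick up where it stopped instead of reprocessing the whole stream.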
Company Profile data
{
   "Item": [
      {
         "resource_id": "08632930",
         "data.company_status": "active",
         "item": {
            "resource_id": "08632930",
            "resource_kind": "company-profile",
            "data": {
               "jurisdiction": "england-wales",
               "type": "ltd",
               "sic_codes": [
                  "96090"
               ],
               "company_number": "08632930",
               "last_full_members_list_date": "2015-08-01",
               "confirmation_statement": {
                  "last_made_up_to": "2023-08-01",
                  "next_made_up_to": "2024-08-01",
                  "next_due": "2024-08-15"
               },
               "company_name": "AR ART CONSULTING LTD.",
               "date_of_creation": "2013-08-01",
               "registered_office_address": {
                  "address_line_1": "111a High Street",
                  "locality": "Harrow",
                  "care_of": "MUNNA MANJI ACCOUNTANTS",
                  "address_line_2": "Wealdstone",
                  "postal_code": "HA3 5DL",
                  "region": "Middlesex"
               },
               "company_status": "active",
               "etag": "a642233483044571b496286e48ec1ff32dc94469",
               "links": {
                  "self": "/company/08632930",
                  "persons_with_significant_control": "/company/08632930/persons-with-significant-control",
                  "filing_history": "/company/08632930/filing-history",
                  "officers": "/company/08632930/officers"
               },
               "accounts": {
                  "next_made_up_to": "2024-08-31",
                  "last_accounts": {
                     "period_end_on": "2023-08-31",
                     "made_up_to": "2023-08-31",
                     "period_start_on": "2022-09-01",
                     "type": "total-exemption-full"
                  },
                  "next_accounts": {
                     "period_end_on": "2024-08-31",
                     "due_on": "2025-05-31",
                     "period_start_on": "2023-09-01"
                  },
                  "next_due": "2025-05-31",
                  "accounting_reference_date": {
                     "month": "08",
                     "day": "31"
                  }
               },
               "can_file": {
                  "BOOL": true
               }
            },
            "event": {
               "published_at": "2024-01-23T22:21:05",
               "type": "changed",
               "timepoint": "72513023"
            },
            "resource_uri": "/company/08632930"
         }
      }
   ]
}
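A nested record like the one above is often easier to analyse once flattened into dot-separated keys such as data.company_status. The sketch below uses a hypothetical helper named flatten and a shortened copy of the sample record; pd.json_normalize produces similar dotted column names if you prefer to stay in Pandas.

```python
def flatten(value: dict, prefix: str = "") -> dict:
    """Flatten nested dicts into one dict with dot-separated keys; lists are kept as values."""
    flat = {}
    for key, item in value.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(item, dict):
            flat.update(flatten(item, path))
        else:
            flat[path] = item
    return flat


record = {
    "resource_id": "08632930",
    "data": {
        "company_status": "active",
        "accounts": {"accounting_reference_date": {"month": "08", "day": "31"}},
        "sic_codes": ["96090"],
    },
    "event": {"type": "changed", "timepoint": "72513023"},
}
flat = flatten(record)
print(flat["data.company_status"])  # active
```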