We will look into the People with significant control (PSC) snapshot data which can be downloaded from Companies House website. Snapshot data file is provided in JSON format and can be downloaded as a single file or as multiple files for ease of downloading.
We will use single file in Google CoLab to carry out basic data analysis.
Data fields
“The snapshot files will provide a copy of director, secretary, LLP member and SE member details as they are held in the Companies House Directors Register at a given point in time. These details are provided for each company, limited liability partnership (LLP) and Societas Europaea (SE) and, as such, the snapshot is a corporate body based file.” ref: Company Appointments Snapshot (Product 195) and Update (Product 198)
root
|-- company_number: string (nullable = true)
|-- data: struct (nullable = true)
| |-- address: struct (nullable = true)
| | |-- address_line_1: string (nullable = true)
| | |-- address_line_2: string (nullable = true)
| | |-- country: string (nullable = true)
| | |-- locality: string (nullable = true)
| | |-- po_box: string (nullable = true)
| | |-- postal_code: string (nullable = true)
| | |-- premises: string (nullable = true)
| | |-- region: string (nullable = true)
| |-- ceased_on: string (nullable = true)
| |-- country_of_residence: string (nullable = true)
| |-- date_of_birth: struct (nullable = true)
| | |-- month: long (nullable = true)
| | |-- year: long (nullable = true)
| |-- etag: string (nullable = true)
| |-- kind: string (nullable = true)
| |-- links: struct (nullable = true)
| | |-- self: string (nullable = true)
| |-- name: string (nullable = true)
| |-- name_elements: struct (nullable = true)
| | |-- forename: string (nullable = true)
| | |-- middle_name: string (nullable = true)
| | |-- surname: string (nullable = true)
| | |-- title: string (nullable = true)
| |-- nationality: string (nullable = true)
| |-- natures_of_control: array (nullable = true)
| | |-- element: string (containsNull = true)
| |-- notified_on: string (nullable = true)
Size of the file
A full snapshot of current appointments is approximately 8Gb of data after unzipping. It contains the following number of records for:
- Company – 7888535
- Appointments – 11165674 records
spark.sql("""select
count(distinct company_number) from psc""").show(truncate=False)
+------------------------------+
|count(DISTINCT company_number)|
+------------------------------+
|7888535 |
+------------------------------+
Diagnostic GroupBy started(limit 20 records): 202302051844
+--------------+----------+-----+----------+
|company_number|total_rows|count|proportion|
+--------------+----------+-----+----------+
|null |11165674 |1 |0.0 |
|00000086 |11165674 |1 |0.0 |
|00000118 |11165674 |2 |0.0 |
|00000121 |11165674 |1 |0.0 |
|00000133 |11165674 |1 |0.0 |
|00000140 |11165674 |1 |0.0 |
|00000182 |11165674 |1 |0.0 |
|00000295 |11165674 |2 |0.0 |
|00000371 |11165674 |1 |0.0 |
|00000402 |11165674 |1 |0.0 |
|00000452 |11165674 |2 |0.0 |
|00000529 |11165674 |1 |0.0 |
|00000537 |11165674 |1 |0.0 |
|00000599 |11165674 |1 |0.0 |
|00000687 |11165674 |1 |0.0 |
|00000827 |11165674 |1 |0.0 |
|00000866 |11165674 |1 |0.0 |
|00000871 |11165674 |1 |0.0 |
|00000950 |11165674 |1 |0.0 |
|00001122 |11165674 |1 |0.0 |
+--------------+----------+-----+----------+
None
+---------------------------+----------+-------+----------+
|address_line_1 |total_rows|count |proportion|
+---------------------------+----------+-------+----------+
|null |11165674 |1177960|0.11 |
|! Osmond Way |11165674 |1 |0.0 |
|! Owl End |11165674 |1 |0.0 |
|!, Southend Cottage |11165674 |1 |0.0 |
|!0 Elm Road |11165674 |1 |0.0 |
|!01 Blackfriars |11165674 |1 |0.0 |
|!6-18 Devonshire Road |11165674 |1 |0.0 |
|!72 Kew Road |11165674 |1 |0.0 |
|!St Floor |11165674 |2 |0.0 |
|" Omkar", Vhbcs Layout |11165674 |1 |0.0 |
|"2a Maygrove Road |11165674 |1 |0.0 |
|"Anndale", Stannerford Road|11165674 |1 |0.0 |
|"Auchengibbert" |11165674 |3 |0.0 |
|"Beverley House" |11165674 |1 |0.0 |
|"Corinya" |11165674 |1 |0.0 |
|"D-R Ivan Basanovich" Str. |11165674 |1 |0.0 |
|"Donore" |11165674 |1 |0.0 |
|"Glenesk",Lade Braes |11165674 |1 |0.0 |
|"Lovage" |11165674 |1 |0.0 |
|"Lyndhurst" |11165674 |1 |0.0 |
+---------------------------+----------+-------+----------+
None
+----------------------+----------+-------+----------+
|country |total_rows|count |proportion|
+----------------------+----------+-------+----------+
|null |11165674 |2064726|0.18 |
| Guernsey |11165674 |1 |0.0 |
|(Netherlands Antilles)|11165674 |5 |0.0 |
|(Northern Ireland) |11165674 |15 |0.0 |
|. |11165674 |15 |0.0 |
|01067590321 |11165674 |1 |0.0 |
|0250 Oslo |11165674 |1 |0.0 |
|100048 |11165674 |1 |0.0 |
|10022 |11165674 |1 |0.0 |
|103-8686 Japan |11165674 |1 |0.0 |
|1204 Switzerland |11165674 |5 |0.0 |
|122154 |11165674 |1 |0.0 |
|200070 |11165674 |1 |0.0 |
|2106 |11165674 |1 |0.0 |
|225-002 Japan |11165674 |1 |0.0 |
|250132 |11165674 |1 |0.0 |
|432708 |11165674 |1 |0.0 |
|510280 China |11165674 |1 |0.0 |
|7 Rue Robert Stumper |11165674 |1 |0.0 |
|7777788888 |11165674 |4 |0.0 |
+----------------------+----------+-------+----------+
None
+---------------------+----------+------+----------+
|ocality |total_rows|count |proportion|
+---------------------+----------+------+----------+
|null |11165674 |695554|0.06 |
| 21 Old Channel Road |11165674 |2 |0.0 |
| Angus |11165674 |1 |0.0 |
| Ashton Under Lyne |11165674 |2 |0.0 |
| Bellshill |11165674 |1 |0.0 |
| Blaenannerch |11165674 |1 |0.0 |
| Bucks |11165674 |1 |0.0 |
| Cambs |11165674 |1 |0.0 |
| Co Antrim |11165674 |5 |0.0 |
| Co Ondonderry |11165674 |2 |0.0 |
| Devon |11165674 |3 |0.0 |
| District 11 |11165674 |1 |0.0 |
| Doncaster |11165674 |2 |0.0 |
| Dyfed |11165674 |1 |0.0 |
| Headley Down |11165674 |1 |0.0 |
| Ind, Hemel Hempstead|11165674 |1 |0.0 |
| Ipswich |11165674 |2 |0.0 |
| Liskeard |11165674 |1 |0.0 |
| London |11165674 |1 |0.0 |
| Normanton Ind Est |11165674 |1 |0.0 |
+---------------------+----------+------+----------+
None
+---------------+----------+------+----------+
|postal_code |total_rows|count |proportion|
+---------------+----------+------+----------+
|null |11165674 |805840|0.07 |
|!G10 4PL |11165674 |1 |0.0 |
|#03-129 |11165674 |1 |0.0 |
|#03-73 |11165674 |1 |0.0 |
|#05-04 |11165674 |1 |0.0 |
|#30-01 |11165674 |1 |0.0 |
|#32-01 |11165674 |1 |0.0 |
|#3532 |11165674 |1 |0.0 |
|'6562 |11165674 |1 |0.0 |
|(050336) |11165674 |1 |0.0 |
|(10500) |11165674 |1 |0.0 |
|(238859) |11165674 |1 |0.0 |
|(427-724) |11165674 |1 |0.0 |
|(CEP) 04571 010|11165674 |1 |0.0 |
|(S-105) |11165674 |1 |0.0 |
|*** |11165674 |7 |0.0 |
|+23484 |11165674 |1 |0.0 |
|+263 |11165674 |1 |0.0 |
|+447732232184 |11165674 |1 |0.0 |
|+5001 |11165674 |1 |0.0 |
+---------------+----------+------+----------+
None
+-------------------------+----------+------+----------+
|premises |total_rows|count |proportion|
+-------------------------+----------+------+----------+
|null |11165674 |959425|0.09 |
|! |11165674 |2 |0.0 |
|! Beacon Buildings |11165674 |1 |0.0 |
|! Horsingtons Yard |11165674 |1 |0.0 |
|! Jesmond Business Court |11165674 |1 |0.0 |
|! Treadgold Warehouse |11165674 |1 |0.0 |
|!0 |11165674 |1 |0.0 |
|!0 Harrington Road |11165674 |1 |0.0 |
|!2 |11165674 |1 |0.0 |
|!2a |11165674 |2 |0.0 |
|!5, Mulberry Court |11165674 |1 |0.0 |
|!6 |11165674 |3 |0.0 |
|!7 |11165674 |1 |0.0 |
|!8 |11165674 |1 |0.0 |
|!8 Charman House |11165674 |1 |0.0 |
|!8a |11165674 |1 |0.0 |
|!A |11165674 |2 |0.0 |
|!C |11165674 |1 |0.0 |
|!St Floor Cavendish House|11165674 |1 |0.0 |
|!St Floor Tudor Cottage |11165674 |1 |0.0 |
+-------------------------+----------+------+----------+
None
+----------------------+----------+-------+----------+
|country_of_residence |total_rows|count |proportion|
+----------------------+----------+-------+----------+
|null |11165674 |1458560|0.13 |
| Australia |11165674 |1 |0.0 |
| Brunei Darrusalam |11165674 |1 |0.0 |
| India |11165674 |1 |0.0 |
| Ireland |11165674 |1 |0.0 |
| Netherlands |11165674 |2 |0.0 |
| Sweden |11165674 |1 |0.0 |
| Thailand |11165674 |1 |0.0 |
| United Kinddom |11165674 |1 |0.0 |
| United Kingdom |11165674 |2 |0.0 |
|(Multiple) |11165674 |1 |0.0 |
|(Netherlands Antilles)|11165674 |1 |0.0 |
|(North) Cyprus |11165674 |1 |0.0 |
|(Northern Ireland) |11165674 |15 |0.0 |
|(Vr) Italy |11165674 |2 |0.0 |
|,Exico |11165674 |1 |0.0 |
|. |11165674 |10 |0.0 |
|:Polish |11165674 |1 |0.0 |
|<Country></Country> |11165674 |3 |0.0 |
|Aabenraa, Denmark |11165674 |1 |0.0 |
+----------------------+----------+-------+----------+
None
+-------------------+----------+-------+----------+
|date_of_birth_month|total_rows|count |proportion|
+-------------------+----------+-------+----------+
|null |11165674 |1438982|0.13 |
|1 |11165674 |854929 |0.08 |
|2 |11165674 |751255 |0.07 |
|3 |11165674 |846487 |0.08 |
|4 |11165674 |815007 |0.07 |
|5 |11165674 |849474 |0.08 |
|6 |11165674 |814361 |0.07 |
|7 |11165674 |830929 |0.07 |
|8 |11165674 |816054 |0.07 |
|9 |11165674 |810701 |0.07 |
|10 |11165674 |805091 |0.07 |
|11 |11165674 |756869 |0.07 |
|12 |11165674 |775535 |0.07 |
+-------------------+----------+-------+----------+
None
+------------------+----------+-------+----------+
|date_of_birth_year|total_rows|count |proportion|
+------------------+----------+-------+----------+
|null |11165674 |1438982|0.13 |
|1079 |11165674 |1 |0.0 |
|1089 |11165674 |1 |0.0 |
|1355 |11165674 |1 |0.0 |
|1609 |11165674 |1 |0.0 |
|1654 |11165674 |1 |0.0 |
|1696 |11165674 |1 |0.0 |
|1698 |11165674 |3 |0.0 |
|1775 |11165674 |1 |0.0 |
|1790 |11165674 |1 |0.0 |
|1795 |11165674 |1 |0.0 |
|1864 |11165674 |1 |0.0 |
|1875 |11165674 |1 |0.0 |
|1885 |11165674 |1 |0.0 |
|1897 |11165674 |2 |0.0 |
|1898 |11165674 |1 |0.0 |
|1899 |11165674 |1 |0.0 |
|1900 |11165674 |19 |0.0 |
|1901 |11165674 |6 |0.0 |
|1903 |11165674 |3 |0.0 |
+------------------+----------+-------+----------+
None
+----------------------------------------+----------+-----+----------+
|etag |total_rows|count|proportion|
+----------------------------------------+----------+-----+----------+
|null |11165674 |1 |0.0 |
|0000004904b433bcf0cf48d45280e7c95753bab5|11165674 |1 |0.0 |
|000003d2b80da00c2b38780d09e23553d5b63787|11165674 |1 |0.0 |
|000007d09b4a95b381542c05ac59da3f4f824a28|11165674 |1 |0.0 |
|000008e7c9debfa43b3f6fe004e93175df3ddf1e|11165674 |1 |0.0 |
|000008eaf57249f4133e19b56002f6c0b36b1361|11165674 |1 |0.0 |
|000009b0b45bc791db15ee1338f52a424c2e16f4|11165674 |1 |0.0 |
|00000d0f02871af168e4084e99a9e55ee135c99b|11165674 |1 |0.0 |
|000013a839288556fa455c595712f4f05b21876c|11165674 |1 |0.0 |
|00001598895cd06984aa146b512e120c24d175b0|11165674 |1 |0.0 |
|000017dc191abf923b1e106b2b35e1dfc652fdb2|11165674 |1 |0.0 |
|00001a6d63e36726bdedf695c58798633d21a072|11165674 |1 |0.0 |
|00001cb3ca996e2c352cc5521f631817d58c643b|11165674 |1 |0.0 |
|00001d0b7595254cbdfd48cd3b6c286b79229339|11165674 |1 |0.0 |
|00001fa246acbc3825538df08e7e77f99e1953d2|11165674 |1 |0.0 |
|00001fb1d231410c260d9f113efa6d0d48d4cda2|11165674 |1 |0.0 |
|0000251b8be3822ffd18237133da62df4a045b64|11165674 |1 |0.0 |
|000028ac4731566d5b451731b53ad0912c499029|11165674 |1 |0.0 |
|000028e3e2b063fbe2b4ba32323fec8b0cb42c9d|11165674 |1 |0.0 |
|00002ad805287f6473f2dd1178a0c4a71e641c37|11165674 |1 |0.0 |
+----------------------------------------+----------+-----+----------+
None
+------------------------------------------------+----------+-------+----------+
|kind |total_rows|count |proportion|
+------------------------------------------------+----------+-------+----------+
|corporate-entity-beneficial-owner |11165674 |8280 |0.0 |
|corporate-entity-person-with-significant-control|11165674 |754137 |0.07 |
|exemptions |11165674 |74 |0.0 |
|individual-beneficial-owner |11165674 |20002 |0.0 |
|individual-person-with-significant-control |11165674 |9707114|0.87 |
|legal-person-beneficial-owner |11165674 |356 |0.0 |
|legal-person-person-with-significant-control |11165674 |13119 |0.0 |
|persons-with-significant-control-statement |11165674 |662243 |0.06 |
|super-secure-person-with-significant-control |11165674 |348 |0.0 |
|totals#persons-of-significant-control-snapshot |11165674 |1 |0.0 |
+------------------------------------------------+----------+-------+----------+
None
+-------------------------------------------------------------------------------------------------+----------+-----+----------+
|links |total_rows|count|proportion|
+-------------------------------------------------------------------------------------------------+----------+-----+----------+
|null |11165674 |1 |0.0 |
|{/company/00000086/persons-with-significant-control-statements/Jx8L7o209uDh1F8zJJ2qKorRwOc} |11165674 |1 |0.0 |
|{/company/00000118/persons-with-significant-control-statements/z7i5TYnBmWyFrG7AkPJBwr6UjfE} |11165674 |1 |0.0 |
|{/company/00000118/persons-with-significant-control/corporate-entity/wpoPoCDyGKSkIqg-IYm2fH_SyKo}|11165674 |1 |0.0 |
|{/company/00000121/persons-with-significant-control-statements/F5C2_Xq_WX3CxXO8MqTBSaknNwg} |11165674 |1 |0.0 |
|{/company/00000133/persons-with-significant-control/corporate-entity/KEDWxS-rs5givHEdBMz1OJMKMu4}|11165674 |1 |0.0 |
|{/company/00000140/persons-with-significant-control/legal-person/jLDoDjO8FrfvqENk9t6dITvkCs0} |11165674 |1 |0.0 |
|{/company/00000182/persons-with-significant-control/corporate-entity/zazZhwv3El93n8YEr0WyLQV_drE}|11165674 |1 |0.0 |
|{/company/00000295/persons-with-significant-control/individual/E_53PisIXIw_18aMtlyVE9v2Okg} |11165674 |1 |0.0 |
|{/company/00000295/persons-with-significant-control/individual/tQVKwAdPtDSSk6eY-C3vuV3pZYY} |11165674 |1 |0.0 |
|{/company/00000371/persons-with-significant-control/individual/KobFaFniBauWoOvTWqlnIxW7AB0} |11165674 |1 |0.0 |
|{/company/00000402/persons-with-significant-control/corporate-entity/G_HfV07BhFNT3Ppu27hro1V1H5o}|11165674 |1 |0.0 |
|{/company/00000452/persons-with-significant-control/individual/4MMoVHnHgJaZNjeMdcOSwOpja4I} |11165674 |1 |0.0 |
|{/company/00000452/persons-with-significant-control/individual/VEzGWVLMk9rGw_yt1QLKrgJeHaY} |11165674 |1 |0.0 |
|{/company/00000529/persons-with-significant-control/corporate-entity/BudQseMp7pgowMJlrLPCCxia9RA}|11165674 |1 |0.0 |
|{/company/00000537/persons-with-significant-control/corporate-entity/ASY8o9g64OX414YRoAfXwj9cTdM}|11165674 |1 |0.0 |
|{/company/00000599/persons-with-significant-control/corporate-entity/0qb1GkhVg4YXHo88ZlfhNzKD1X4}|11165674 |1 |0.0 |
|{/company/00000687/persons-with-significant-control/corporate-entity/LYXJYcU_4dYL31o2xsJyn8drQxw}|11165674 |1 |0.0 |
|{/company/00000827/persons-with-significant-control-statements/ZmZNQgs4DOANB2jnqe4QNjVbBWA} |11165674 |1 |0.0 |
|{/company/00000866/persons-with-significant-control/corporate-entity/gfIviBeZ1wc4xDPzn4lG-GHivs8}|11165674 |1 |0.0 |
+-------------------------------------------------------------------------------------------------+----------+-----+----------+
None
Diagnostic GroupBy started(limit 20 records): 202302052019
+--------------+----------+-------+----------+
|nationality |total_rows|count |proportion|
+--------------+----------+-------+----------+
|null |11165674 |1438563|0.13 |
|,Indian |11165674 |1 |0.0 |
|- |11165674 |1 |0.0 |
|. |11165674 |1 |0.0 |
|.Indian |11165674 |1 |0.0 |
|13/05/1981 |11165674 |1 |0.0 |
|515520 British|11165674 |1 |0.0 |
|Abanian |11165674 |1 |0.0 |
|Accountant |11165674 |1 |0.0 |
|Acottish |11165674 |1 |0.0 |
|Afgahn |11165674 |1 |0.0 |
|Afganistan |11165674 |3 |0.0 |
|Afganistani |11165674 |1 |0.0 |
|Afghan |11165674 |6519 |0.0 |
|Afghan British|11165674 |1 |0.0 |
|Afghan, |11165674 |1 |0.0 |
|Afghan,British|11165674 |9 |0.0 |
|Afghan,Turkish|11165674 |1 |0.0 |
|Afghani |11165674 |2 |0.0 |
|Afghanistan |11165674 |11 |0.0 |
+--------------+----------+-------+----------+
None
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------+------+----------+
|natures_of_control |total_rows|count |proportion|
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------+------+----------+
|null |11165674 |662754|0.06 |
|[ownership-of-shares-25-to-50-percent] |11165674 |939481|0.08 |
|[ownership-of-shares-25-to-50-percent, ownership-of-shares-25-to-50-percent-as-firm] |11165674 |3219 |0.0 |
|[ownership-of-shares-25-to-50-percent, ownership-of-shares-25-to-50-percent-as-firm, right-to-appoint-and-remove-directors] |11165674 |239 |0.0 |
|[ownership-of-shares-25-to-50-percent, ownership-of-shares-25-to-50-percent-as-firm, right-to-appoint-and-remove-directors, right-to-appoint-and-remove-directors-as-firm] |11165674 |26 |0.0 |
|[ownership-of-shares-25-to-50-percent, ownership-of-shares-25-to-50-percent-as-firm, right-to-appoint-and-remove-directors, right-to-appoint-and-remove-directors-as-firm, significant-influence-or-control, significant-influence-or-control-as-firm] |11165674 |6 |0.0 |
|[ownership-of-shares-25-to-50-percent, ownership-of-shares-25-to-50-percent-as-firm, right-to-appoint-and-remove-directors, right-to-appoint-and-remove-directors-as-firm, significant-influence-or-control-as-firm] |11165674 |2 |0.0 |
|[ownership-of-shares-25-to-50-percent, ownership-of-shares-25-to-50-percent-as-firm, right-to-appoint-and-remove-directors, right-to-appoint-and-remove-directors-as-firm, significant-influence-or-control-as-trust] |11165674 |1 |0.0 |
|[ownership-of-shares-25-to-50-percent, ownership-of-shares-25-to-50-percent-as-firm, right-to-appoint-and-remove-directors, right-to-appoint-and-remove-directors-as-trust] |11165674 |1 |0.0 |
|[ownership-of-shares-25-to-50-percent, ownership-of-shares-25-to-50-percent-as-firm, right-to-appoint-and-remove-directors, right-to-appoint-and-remove-directors-as-trust, right-to-appoint-and-remove-directors-as-firm] |11165674 |20 |0.0 |
|[ownership-of-shares-25-to-50-percent, ownership-of-shares-25-to-50-percent-as-firm, right-to-appoint-and-remove-directors, right-to-appoint-and-remove-directors-as-trust, right-to-appoint-and-remove-directors-as-firm, significant-influence-or-control, significant-influence-or-control-as-trust, significant-influence-or-control-as-firm]|11165674 |5 |0.0 |
|[ownership-of-shares-25-to-50-percent, ownership-of-shares-25-to-50-percent-as-firm, right-to-appoint-and-remove-directors, right-to-appoint-and-remove-directors-as-trust, right-to-appoint-and-remove-directors-as-firm, significant-influence-or-control-as-firm] |11165674 |2 |0.0 |
|[ownership-of-shares-25-to-50-percent, ownership-of-shares-25-to-50-percent-as-firm, right-to-appoint-and-remove-directors, significant-influence-or-control] |11165674 |17 |0.0 |
|[ownership-of-shares-25-to-50-percent, ownership-of-shares-25-to-50-percent-as-firm, right-to-appoint-and-remove-directors, significant-influence-or-control-as-firm] |11165674 |16 |0.0 |
|[ownership-of-shares-25-to-50-percent, ownership-of-shares-25-to-50-percent-as-firm, right-to-appoint-and-remove-directors, significant-influence-or-control-as-trust] |11165674 |1 |0.0 |
|[ownership-of-shares-25-to-50-percent, ownership-of-shares-25-to-50-percent-as-firm, right-to-appoint-and-remove-directors-as-firm] |11165674 |10 |0.0 |
|[ownership-of-shares-25-to-50-percent, ownership-of-shares-25-to-50-percent-as-firm, right-to-appoint-and-remove-directors-as-firm, significant-influence-or-control] |11165674 |2 |0.0 |
|[ownership-of-shares-25-to-50-percent, ownership-of-shares-25-to-50-percent-as-firm, right-to-appoint-and-remove-directors-as-firm, significant-influence-or-control-as-firm] |11165674 |8 |0.0 |
|[ownership-of-shares-25-to-50-percent, ownership-of-shares-25-to-50-percent-as-firm, right-to-appoint-and-remove-directors-as-firm, significant-influence-or-control-as-trust] |11165674 |1 |0.0 |
|[ownership-of-shares-25-to-50-percent, ownership-of-shares-25-to-50-percent-as-firm, right-to-appoint-and-remove-directors-as-trust] |11165674 |19 |0.0 |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------+------+----------+
None
Diagnostic GroupBy started(limit 20 records): 202302052125
+-----------+----------+-----+----------+
|notified_on|total_rows|count|proportion|
+-----------+----------+-----+----------+
|null |11165674 |423 |0.0 |
|1083-01-01 |11165674 |1 |0.0 |
|1776-07-04 |11165674 |1 |0.0 |
|1827-01-01 |11165674 |1 |0.0 |
|1918-07-24 |11165674 |1 |0.0 |
|1919-03-14 |11165674 |1 |0.0 |
|1930-11-21 |11165674 |1 |0.0 |
|1939-03-29 |11165674 |1 |0.0 |
|1942-05-01 |11165674 |1 |0.0 |
|1942-07-10 |11165674 |1 |0.0 |
|1949-10-01 |11165674 |2 |0.0 |
|1950-03-10 |11165674 |1 |0.0 |
|1950-04-18 |11165674 |1 |0.0 |
|1955-07-01 |11165674 |1 |0.0 |
|1960-01-14 |11165674 |1 |0.0 |
|1960-03-01 |11165674 |1 |0.0 |
|1961-10-04 |11165674 |1 |0.0 |
|1962-09-16 |11165674 |1 |0.0 |
|1964-09-04 |11165674 |2 |0.0 |
|1965-07-29 |11165674 |1 |0.0 |
+-----------+----------+-----+----------+
None
Diagnostic Group By completed: 0:05:41.158327
PySpark and BigQuery
We have used PySpark to process snapshot JSON file and BigQuery for data storage.
# https://download.companieshouse.gov.uk/en_pscdata.html
def get_ch_psc_date():
!wget https://download.companieshouse.gov.uk/persons-with-significant-control-snapshot-2023-02-04.zip
!unzip persons-with-significant-control-snapshot-2023-02-04.zip -d /tmp/
return True
get_ch_psc_date()
# reading JSON file
df=spark.read.json('file:///tmp/persons-with-significant-control-snapshot-2023-02-04.txt',schema=schema)
# create a temp table
df.createOrReplaceTempView('ch_psc')
# extract data
df_psc = spark.sql(f"""
select
company_number,
data.address.address_line_1 as address_line_1,
data.address.country as country,
data.address.locality as locality,
data.address.postal_code as postal_code,
data.address.premises as premises,
data.country_of_residence as country_of_residence,
data.date_of_birth.month as date_of_birth_month,
data.date_of_birth.year as date_of_birth_year,
data.etag as etag,
data.kind as kind,
data.links as links,
data.name as name,
data.name_elements.forename as forename,
data.name_elements.middle_name as middle_name,
data.name_elements.surname as surname,
data.nationality as nationality,
data.natures_of_control as natures_of_control,
data.notified_on as notified_on
from
ch_psc
""")
# Saving dataframe to BigQuery
df_psc.write \
.format("bigquery") \
.option("temporaryGcsBucket", "ellodata/temp/") \
.option("parentProject", "first-skein-329508") \
.option("dataset", "utcdb") \
.mode("overwrite") \
.save("first-skein-329508.utcdb.ch_psc2023")