Tag: PySpark

  • Processing UK Companies House PSC Data

    We will look into the People with significant control (PSC) snapshot data which can be downloaded from Companies House website. Snapshot data file is provided in JSON format and can be downloaded as a single file or as multiple files for ease of downloading.

    We will use single file in Google CoLab to carry out basic data analysis.

    Data fields

    “The snapshot files will provide a copy of director, secretary, LLP member and SE member details as they are held in the Companies House Directors Register at a given point in time.  These details are provided for each company, limited liability partnership (LLP) and Societas Europaea (SE) and, as such, the snapshot is a corporate body based file.” ref: Company Appointments Snapshot (Product 195) and Update (Product 198)

    root
     |-- company_number: string (nullable = true)
     |-- data: struct (nullable = true)
     |    |-- address: struct (nullable = true)
     |    |    |-- address_line_1: string (nullable = true)
     |    |    |-- address_line_2: string (nullable = true)
     |    |    |-- country: string (nullable = true)
     |    |    |-- locality: string (nullable = true)
     |    |    |-- po_box: string (nullable = true)
     |    |    |-- postal_code: string (nullable = true)
     |    |    |-- premises: string (nullable = true)
     |    |    |-- region: string (nullable = true)
     |    |-- ceased_on: string (nullable = true)
     |    |-- country_of_residence: string (nullable = true)
     |    |-- date_of_birth: struct (nullable = true)
     |    |    |-- month: long (nullable = true)
     |    |    |-- year: long (nullable = true)
     |    |-- etag: string (nullable = true)
     |    |-- kind: string (nullable = true)
     |    |-- links: struct (nullable = true)
     |    |    |-- self: string (nullable = true)
     |    |-- name: string (nullable = true)
     |    |-- name_elements: struct (nullable = true)
     |    |    |-- forename: string (nullable = true)
     |    |    |-- middle_name: string (nullable = true)
     |    |    |-- surname: string (nullable = true)
     |    |    |-- title: string (nullable = true)
     |    |-- nationality: string (nullable = true)
     |    |-- natures_of_control: array (nullable = true)
     |    |    |-- element: string (containsNull = true)
     |    |-- notified_on: string (nullable = true)
    

    Size of the file

    A full snapshot of current appointments is approximately 8Gb of data after unzipping. It contains the following number of records for:

    • Company – 7888535
    • Appointments – 11165674 records
    spark.sql("""select 
    count(distinct company_number) from psc""").show(truncate=False)
    +------------------------------+
    |count(DISTINCT company_number)|
    +------------------------------+
    |7888535                       |
    +------------------------------+
    
    Diagnostic GroupBy started(limit 20 records): 202302051844
    +--------------+----------+-----+----------+
    |company_number|total_rows|count|proportion|
    +--------------+----------+-----+----------+
    |null          |11165674  |1    |0.0       |
    |00000086      |11165674  |1    |0.0       |
    |00000118      |11165674  |2    |0.0       |
    |00000121      |11165674  |1    |0.0       |
    |00000133      |11165674  |1    |0.0       |
    |00000140      |11165674  |1    |0.0       |
    |00000182      |11165674  |1    |0.0       |
    |00000295      |11165674  |2    |0.0       |
    |00000371      |11165674  |1    |0.0       |
    |00000402      |11165674  |1    |0.0       |
    |00000452      |11165674  |2    |0.0       |
    |00000529      |11165674  |1    |0.0       |
    |00000537      |11165674  |1    |0.0       |
    |00000599      |11165674  |1    |0.0       |
    |00000687      |11165674  |1    |0.0       |
    |00000827      |11165674  |1    |0.0       |
    |00000866      |11165674  |1    |0.0       |
    |00000871      |11165674  |1    |0.0       |
    |00000950      |11165674  |1    |0.0       |
    |00001122      |11165674  |1    |0.0       |
    +--------------+----------+-----+----------+
    
    None
    +---------------------------+----------+-------+----------+
    |address_line_1             |total_rows|count  |proportion|
    +---------------------------+----------+-------+----------+
    |null                       |11165674  |1177960|0.11      |
    |! Osmond Way               |11165674  |1      |0.0       |
    |! Owl End                  |11165674  |1      |0.0       |
    |!, Southend Cottage        |11165674  |1      |0.0       |
    |!0 Elm Road                |11165674  |1      |0.0       |
    |!01 Blackfriars            |11165674  |1      |0.0       |
    |!6-18 Devonshire Road      |11165674  |1      |0.0       |
    |!72 Kew Road               |11165674  |1      |0.0       |
    |!St Floor                  |11165674  |2      |0.0       |
    |" Omkar", Vhbcs Layout     |11165674  |1      |0.0       |
    |"2a Maygrove Road          |11165674  |1      |0.0       |
    |"Anndale", Stannerford Road|11165674  |1      |0.0       |
    |"Auchengibbert"            |11165674  |3      |0.0       |
    |"Beverley House"           |11165674  |1      |0.0       |
    |"Corinya"                  |11165674  |1      |0.0       |
    |"D-R Ivan Basanovich" Str. |11165674  |1      |0.0       |
    |"Donore"                   |11165674  |1      |0.0       |
    |"Glenesk",Lade Braes       |11165674  |1      |0.0       |
    |"Lovage"                   |11165674  |1      |0.0       |
    |"Lyndhurst"                |11165674  |1      |0.0       |
    +---------------------------+----------+-------+----------+
    
    None
    +----------------------+----------+-------+----------+
    |country               |total_rows|count  |proportion|
    +----------------------+----------+-------+----------+
    |null                  |11165674  |2064726|0.18      |
    | Guernsey             |11165674  |1      |0.0       |
    |(Netherlands Antilles)|11165674  |5      |0.0       |
    |(Northern Ireland)    |11165674  |15     |0.0       |
    |.                     |11165674  |15     |0.0       |
    |01067590321           |11165674  |1      |0.0       |
    |0250 Oslo             |11165674  |1      |0.0       |
    |100048                |11165674  |1      |0.0       |
    |10022                 |11165674  |1      |0.0       |
    |103-8686 Japan        |11165674  |1      |0.0       |
    |1204 Switzerland      |11165674  |5      |0.0       |
    |122154                |11165674  |1      |0.0       |
    |200070                |11165674  |1      |0.0       |
    |2106                  |11165674  |1      |0.0       |
    |225-002 Japan         |11165674  |1      |0.0       |
    |250132                |11165674  |1      |0.0       |
    |432708                |11165674  |1      |0.0       |
    |510280 China          |11165674  |1      |0.0       |
    |7 Rue Robert Stumper  |11165674  |1      |0.0       |
    |7777788888            |11165674  |4      |0.0       |
    +----------------------+----------+-------+----------+
    
    None
    +---------------------+----------+------+----------+
    |ocality              |total_rows|count |proportion|
    +---------------------+----------+------+----------+
    |null                 |11165674  |695554|0.06      |
    | 21 Old Channel Road |11165674  |2     |0.0       |
    | Angus               |11165674  |1     |0.0       |
    | Ashton Under Lyne   |11165674  |2     |0.0       |
    | Bellshill           |11165674  |1     |0.0       |
    | Blaenannerch        |11165674  |1     |0.0       |
    | Bucks               |11165674  |1     |0.0       |
    | Cambs               |11165674  |1     |0.0       |
    | Co Antrim           |11165674  |5     |0.0       |
    | Co Ondonderry       |11165674  |2     |0.0       |
    | Devon               |11165674  |3     |0.0       |
    | District 11         |11165674  |1     |0.0       |
    | Doncaster           |11165674  |2     |0.0       |
    | Dyfed               |11165674  |1     |0.0       |
    | Headley Down        |11165674  |1     |0.0       |
    | Ind, Hemel Hempstead|11165674  |1     |0.0       |
    | Ipswich             |11165674  |2     |0.0       |
    | Liskeard            |11165674  |1     |0.0       |
    | London              |11165674  |1     |0.0       |
    | Normanton Ind Est   |11165674  |1     |0.0       |
    +---------------------+----------+------+----------+
    
    None
    +---------------+----------+------+----------+
    |postal_code    |total_rows|count |proportion|
    +---------------+----------+------+----------+
    |null           |11165674  |805840|0.07      |
    |!G10 4PL       |11165674  |1     |0.0       |
    |#03-129        |11165674  |1     |0.0       |
    |#03-73         |11165674  |1     |0.0       |
    |#05-04         |11165674  |1     |0.0       |
    |#30-01         |11165674  |1     |0.0       |
    |#32-01         |11165674  |1     |0.0       |
    |#3532          |11165674  |1     |0.0       |
    |'6562          |11165674  |1     |0.0       |
    |(050336)       |11165674  |1     |0.0       |
    |(10500)        |11165674  |1     |0.0       |
    |(238859)       |11165674  |1     |0.0       |
    |(427-724)      |11165674  |1     |0.0       |
    |(CEP) 04571 010|11165674  |1     |0.0       |
    |(S-105)        |11165674  |1     |0.0       |
    |***            |11165674  |7     |0.0       |
    |+23484         |11165674  |1     |0.0       |
    |+263           |11165674  |1     |0.0       |
    |+447732232184  |11165674  |1     |0.0       |
    |+5001          |11165674  |1     |0.0       |
    +---------------+----------+------+----------+
    
    None
    +-------------------------+----------+------+----------+
    |premises                 |total_rows|count |proportion|
    +-------------------------+----------+------+----------+
    |null                     |11165674  |959425|0.09      |
    |!                        |11165674  |2     |0.0       |
    |! Beacon Buildings       |11165674  |1     |0.0       |
    |! Horsingtons Yard       |11165674  |1     |0.0       |
    |! Jesmond Business Court |11165674  |1     |0.0       |
    |! Treadgold Warehouse    |11165674  |1     |0.0       |
    |!0                       |11165674  |1     |0.0       |
    |!0 Harrington Road       |11165674  |1     |0.0       |
    |!2                       |11165674  |1     |0.0       |
    |!2a                      |11165674  |2     |0.0       |
    |!5, Mulberry Court       |11165674  |1     |0.0       |
    |!6                       |11165674  |3     |0.0       |
    |!7                       |11165674  |1     |0.0       |
    |!8                       |11165674  |1     |0.0       |
    |!8 Charman House         |11165674  |1     |0.0       |
    |!8a                      |11165674  |1     |0.0       |
    |!A                       |11165674  |2     |0.0       |
    |!C                       |11165674  |1     |0.0       |
    |!St Floor Cavendish House|11165674  |1     |0.0       |
    |!St Floor Tudor Cottage  |11165674  |1     |0.0       |
    +-------------------------+----------+------+----------+
    
    None
    +----------------------+----------+-------+----------+
    |country_of_residence  |total_rows|count  |proportion|
    +----------------------+----------+-------+----------+
    |null                  |11165674  |1458560|0.13      |
    | Australia            |11165674  |1      |0.0       |
    | Brunei Darrusalam    |11165674  |1      |0.0       |
    | India                |11165674  |1      |0.0       |
    | Ireland              |11165674  |1      |0.0       |
    | Netherlands          |11165674  |2      |0.0       |
    | Sweden               |11165674  |1      |0.0       |
    | Thailand             |11165674  |1      |0.0       |
    | United Kinddom       |11165674  |1      |0.0       |
    | United Kingdom       |11165674  |2      |0.0       |
    |(Multiple)            |11165674  |1      |0.0       |
    |(Netherlands Antilles)|11165674  |1      |0.0       |
    |(North) Cyprus        |11165674  |1      |0.0       |
    |(Northern Ireland)    |11165674  |15     |0.0       |
    |(Vr) Italy            |11165674  |2      |0.0       |
    |,Exico                |11165674  |1      |0.0       |
    |.                     |11165674  |10     |0.0       |
    |:Polish               |11165674  |1      |0.0       |
    |<Country></Country>   |11165674  |3      |0.0       |
    |Aabenraa, Denmark     |11165674  |1      |0.0       |
    +----------------------+----------+-------+----------+
    
    None
    +-------------------+----------+-------+----------+
    |date_of_birth_month|total_rows|count  |proportion|
    +-------------------+----------+-------+----------+
    |null               |11165674  |1438982|0.13      |
    |1                  |11165674  |854929 |0.08      |
    |2                  |11165674  |751255 |0.07      |
    |3                  |11165674  |846487 |0.08      |
    |4                  |11165674  |815007 |0.07      |
    |5                  |11165674  |849474 |0.08      |
    |6                  |11165674  |814361 |0.07      |
    |7                  |11165674  |830929 |0.07      |
    |8                  |11165674  |816054 |0.07      |
    |9                  |11165674  |810701 |0.07      |
    |10                 |11165674  |805091 |0.07      |
    |11                 |11165674  |756869 |0.07      |
    |12                 |11165674  |775535 |0.07      |
    +-------------------+----------+-------+----------+
    
    None
    +------------------+----------+-------+----------+
    |date_of_birth_year|total_rows|count  |proportion|
    +------------------+----------+-------+----------+
    |null              |11165674  |1438982|0.13      |
    |1079              |11165674  |1      |0.0       |
    |1089              |11165674  |1      |0.0       |
    |1355              |11165674  |1      |0.0       |
    |1609              |11165674  |1      |0.0       |
    |1654              |11165674  |1      |0.0       |
    |1696              |11165674  |1      |0.0       |
    |1698              |11165674  |3      |0.0       |
    |1775              |11165674  |1      |0.0       |
    |1790              |11165674  |1      |0.0       |
    |1795              |11165674  |1      |0.0       |
    |1864              |11165674  |1      |0.0       |
    |1875              |11165674  |1      |0.0       |
    |1885              |11165674  |1      |0.0       |
    |1897              |11165674  |2      |0.0       |
    |1898              |11165674  |1      |0.0       |
    |1899              |11165674  |1      |0.0       |
    |1900              |11165674  |19     |0.0       |
    |1901              |11165674  |6      |0.0       |
    |1903              |11165674  |3      |0.0       |
    +------------------+----------+-------+----------+
    
    None
    +----------------------------------------+----------+-----+----------+
    |etag                                    |total_rows|count|proportion|
    +----------------------------------------+----------+-----+----------+
    |null                                    |11165674  |1    |0.0       |
    |0000004904b433bcf0cf48d45280e7c95753bab5|11165674  |1    |0.0       |
    |000003d2b80da00c2b38780d09e23553d5b63787|11165674  |1    |0.0       |
    |000007d09b4a95b381542c05ac59da3f4f824a28|11165674  |1    |0.0       |
    |000008e7c9debfa43b3f6fe004e93175df3ddf1e|11165674  |1    |0.0       |
    |000008eaf57249f4133e19b56002f6c0b36b1361|11165674  |1    |0.0       |
    |000009b0b45bc791db15ee1338f52a424c2e16f4|11165674  |1    |0.0       |
    |00000d0f02871af168e4084e99a9e55ee135c99b|11165674  |1    |0.0       |
    |000013a839288556fa455c595712f4f05b21876c|11165674  |1    |0.0       |
    |00001598895cd06984aa146b512e120c24d175b0|11165674  |1    |0.0       |
    |000017dc191abf923b1e106b2b35e1dfc652fdb2|11165674  |1    |0.0       |
    |00001a6d63e36726bdedf695c58798633d21a072|11165674  |1    |0.0       |
    |00001cb3ca996e2c352cc5521f631817d58c643b|11165674  |1    |0.0       |
    |00001d0b7595254cbdfd48cd3b6c286b79229339|11165674  |1    |0.0       |
    |00001fa246acbc3825538df08e7e77f99e1953d2|11165674  |1    |0.0       |
    |00001fb1d231410c260d9f113efa6d0d48d4cda2|11165674  |1    |0.0       |
    |0000251b8be3822ffd18237133da62df4a045b64|11165674  |1    |0.0       |
    |000028ac4731566d5b451731b53ad0912c499029|11165674  |1    |0.0       |
    |000028e3e2b063fbe2b4ba32323fec8b0cb42c9d|11165674  |1    |0.0       |
    |00002ad805287f6473f2dd1178a0c4a71e641c37|11165674  |1    |0.0       |
    +----------------------------------------+----------+-----+----------+
    
    None
    +------------------------------------------------+----------+-------+----------+
    |kind                                            |total_rows|count  |proportion|
    +------------------------------------------------+----------+-------+----------+
    |corporate-entity-beneficial-owner               |11165674  |8280   |0.0       |
    |corporate-entity-person-with-significant-control|11165674  |754137 |0.07      |
    |exemptions                                      |11165674  |74     |0.0       |
    |individual-beneficial-owner                     |11165674  |20002  |0.0       |
    |individual-person-with-significant-control      |11165674  |9707114|0.87      |
    |legal-person-beneficial-owner                   |11165674  |356    |0.0       |
    |legal-person-person-with-significant-control    |11165674  |13119  |0.0       |
    |persons-with-significant-control-statement      |11165674  |662243 |0.06      |
    |super-secure-person-with-significant-control    |11165674  |348    |0.0       |
    |totals#persons-of-significant-control-snapshot  |11165674  |1      |0.0       |
    +------------------------------------------------+----------+-------+----------+
    
    None
    +-------------------------------------------------------------------------------------------------+----------+-----+----------+
    |links                                                                                            |total_rows|count|proportion|
    +-------------------------------------------------------------------------------------------------+----------+-----+----------+
    |null                                                                                             |11165674  |1    |0.0       |
    |{/company/00000086/persons-with-significant-control-statements/Jx8L7o209uDh1F8zJJ2qKorRwOc}      |11165674  |1    |0.0       |
    |{/company/00000118/persons-with-significant-control-statements/z7i5TYnBmWyFrG7AkPJBwr6UjfE}      |11165674  |1    |0.0       |
    |{/company/00000118/persons-with-significant-control/corporate-entity/wpoPoCDyGKSkIqg-IYm2fH_SyKo}|11165674  |1    |0.0       |
    |{/company/00000121/persons-with-significant-control-statements/F5C2_Xq_WX3CxXO8MqTBSaknNwg}      |11165674  |1    |0.0       |
    |{/company/00000133/persons-with-significant-control/corporate-entity/KEDWxS-rs5givHEdBMz1OJMKMu4}|11165674  |1    |0.0       |
    |{/company/00000140/persons-with-significant-control/legal-person/jLDoDjO8FrfvqENk9t6dITvkCs0}    |11165674  |1    |0.0       |
    |{/company/00000182/persons-with-significant-control/corporate-entity/zazZhwv3El93n8YEr0WyLQV_drE}|11165674  |1    |0.0       |
    |{/company/00000295/persons-with-significant-control/individual/E_53PisIXIw_18aMtlyVE9v2Okg}      |11165674  |1    |0.0       |
    |{/company/00000295/persons-with-significant-control/individual/tQVKwAdPtDSSk6eY-C3vuV3pZYY}      |11165674  |1    |0.0       |
    |{/company/00000371/persons-with-significant-control/individual/KobFaFniBauWoOvTWqlnIxW7AB0}      |11165674  |1    |0.0       |
    |{/company/00000402/persons-with-significant-control/corporate-entity/G_HfV07BhFNT3Ppu27hro1V1H5o}|11165674  |1    |0.0       |
    |{/company/00000452/persons-with-significant-control/individual/4MMoVHnHgJaZNjeMdcOSwOpja4I}      |11165674  |1    |0.0       |
    |{/company/00000452/persons-with-significant-control/individual/VEzGWVLMk9rGw_yt1QLKrgJeHaY}      |11165674  |1    |0.0       |
    |{/company/00000529/persons-with-significant-control/corporate-entity/BudQseMp7pgowMJlrLPCCxia9RA}|11165674  |1    |0.0       |
    |{/company/00000537/persons-with-significant-control/corporate-entity/ASY8o9g64OX414YRoAfXwj9cTdM}|11165674  |1    |0.0       |
    |{/company/00000599/persons-with-significant-control/corporate-entity/0qb1GkhVg4YXHo88ZlfhNzKD1X4}|11165674  |1    |0.0       |
    |{/company/00000687/persons-with-significant-control/corporate-entity/LYXJYcU_4dYL31o2xsJyn8drQxw}|11165674  |1    |0.0       |
    |{/company/00000827/persons-with-significant-control-statements/ZmZNQgs4DOANB2jnqe4QNjVbBWA}      |11165674  |1    |0.0       |
    |{/company/00000866/persons-with-significant-control/corporate-entity/gfIviBeZ1wc4xDPzn4lG-GHivs8}|11165674  |1    |0.0       |
    +-------------------------------------------------------------------------------------------------+----------+-----+----------+
    
    None
    
    Diagnostic GroupBy started(limit 20 records): 202302052019
    +--------------+----------+-------+----------+
    |nationality   |total_rows|count  |proportion|
    +--------------+----------+-------+----------+
    |null          |11165674  |1438563|0.13      |
    |,Indian       |11165674  |1      |0.0       |
    |-             |11165674  |1      |0.0       |
    |.             |11165674  |1      |0.0       |
    |.Indian       |11165674  |1      |0.0       |
    |13/05/1981    |11165674  |1      |0.0       |
    |515520 British|11165674  |1      |0.0       |
    |Abanian       |11165674  |1      |0.0       |
    |Accountant    |11165674  |1      |0.0       |
    |Acottish      |11165674  |1      |0.0       |
    |Afgahn        |11165674  |1      |0.0       |
    |Afganistan    |11165674  |3      |0.0       |
    |Afganistani   |11165674  |1      |0.0       |
    |Afghan        |11165674  |6519   |0.0       |
    |Afghan British|11165674  |1      |0.0       |
    |Afghan,       |11165674  |1      |0.0       |
    |Afghan,British|11165674  |9      |0.0       |
    |Afghan,Turkish|11165674  |1      |0.0       |
    |Afghani       |11165674  |2      |0.0       |
    |Afghanistan   |11165674  |11     |0.0       |
    +--------------+----------+-------+----------+
    
    None
    +-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------+------+----------+
    |natures_of_control                                                                                                                                                                                                                                                                                                                               |total_rows|count |proportion|
    +-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------+------+----------+
    |null                                                                                                                                                                                                                                                                                                                                             |11165674  |662754|0.06      |
    |[ownership-of-shares-25-to-50-percent]                                                                                                                                                                                                                                                                                                           |11165674  |939481|0.08      |
    |[ownership-of-shares-25-to-50-percent, ownership-of-shares-25-to-50-percent-as-firm]                                                                                                                                                                                                                                                             |11165674  |3219  |0.0       |
    |[ownership-of-shares-25-to-50-percent, ownership-of-shares-25-to-50-percent-as-firm, right-to-appoint-and-remove-directors]                                                                                                                                                                                                                      |11165674  |239   |0.0       |
    |[ownership-of-shares-25-to-50-percent, ownership-of-shares-25-to-50-percent-as-firm, right-to-appoint-and-remove-directors, right-to-appoint-and-remove-directors-as-firm]                                                                                                                                                                       |11165674  |26    |0.0       |
    |[ownership-of-shares-25-to-50-percent, ownership-of-shares-25-to-50-percent-as-firm, right-to-appoint-and-remove-directors, right-to-appoint-and-remove-directors-as-firm, significant-influence-or-control, significant-influence-or-control-as-firm]                                                                                           |11165674  |6     |0.0       |
    |[ownership-of-shares-25-to-50-percent, ownership-of-shares-25-to-50-percent-as-firm, right-to-appoint-and-remove-directors, right-to-appoint-and-remove-directors-as-firm, significant-influence-or-control-as-firm]                                                                                                                             |11165674  |2     |0.0       |
    |[ownership-of-shares-25-to-50-percent, ownership-of-shares-25-to-50-percent-as-firm, right-to-appoint-and-remove-directors, right-to-appoint-and-remove-directors-as-firm, significant-influence-or-control-as-trust]                                                                                                                            |11165674  |1     |0.0       |
    |[ownership-of-shares-25-to-50-percent, ownership-of-shares-25-to-50-percent-as-firm, right-to-appoint-and-remove-directors, right-to-appoint-and-remove-directors-as-trust]                                                                                                                                                                      |11165674  |1     |0.0       |
    |[ownership-of-shares-25-to-50-percent, ownership-of-shares-25-to-50-percent-as-firm, right-to-appoint-and-remove-directors, right-to-appoint-and-remove-directors-as-trust, right-to-appoint-and-remove-directors-as-firm]                                                                                                                       |11165674  |20    |0.0       |
    |[ownership-of-shares-25-to-50-percent, ownership-of-shares-25-to-50-percent-as-firm, right-to-appoint-and-remove-directors, right-to-appoint-and-remove-directors-as-trust, right-to-appoint-and-remove-directors-as-firm, significant-influence-or-control, significant-influence-or-control-as-trust, significant-influence-or-control-as-firm]|11165674  |5     |0.0       |
    |[ownership-of-shares-25-to-50-percent, ownership-of-shares-25-to-50-percent-as-firm, right-to-appoint-and-remove-directors, right-to-appoint-and-remove-directors-as-trust, right-to-appoint-and-remove-directors-as-firm, significant-influence-or-control-as-firm]                                                                             |11165674  |2     |0.0       |
    |[ownership-of-shares-25-to-50-percent, ownership-of-shares-25-to-50-percent-as-firm, right-to-appoint-and-remove-directors, significant-influence-or-control]                                                                                                                                                                                    |11165674  |17    |0.0       |
    |[ownership-of-shares-25-to-50-percent, ownership-of-shares-25-to-50-percent-as-firm, right-to-appoint-and-remove-directors, significant-influence-or-control-as-firm]                                                                                                                                                                            |11165674  |16    |0.0       |
    |[ownership-of-shares-25-to-50-percent, ownership-of-shares-25-to-50-percent-as-firm, right-to-appoint-and-remove-directors, significant-influence-or-control-as-trust]                                                                                                                                                                           |11165674  |1     |0.0       |
    |[ownership-of-shares-25-to-50-percent, ownership-of-shares-25-to-50-percent-as-firm, right-to-appoint-and-remove-directors-as-firm]                                                                                                                                                                                                              |11165674  |10    |0.0       |
    |[ownership-of-shares-25-to-50-percent, ownership-of-shares-25-to-50-percent-as-firm, right-to-appoint-and-remove-directors-as-firm, significant-influence-or-control]                                                                                                                                                                            |11165674  |2     |0.0       |
    |[ownership-of-shares-25-to-50-percent, ownership-of-shares-25-to-50-percent-as-firm, right-to-appoint-and-remove-directors-as-firm, significant-influence-or-control-as-firm]                                                                                                                                                                    |11165674  |8     |0.0       |
    |[ownership-of-shares-25-to-50-percent, ownership-of-shares-25-to-50-percent-as-firm, right-to-appoint-and-remove-directors-as-firm, significant-influence-or-control-as-trust]                                                                                                                                                                   |11165674  |1     |0.0       |
    |[ownership-of-shares-25-to-50-percent, ownership-of-shares-25-to-50-percent-as-firm, right-to-appoint-and-remove-directors-as-trust]                                                                                                                                                                                                             |11165674  |19    |0.0       |
    +-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------+------+----------+
    
    None
    
    Diagnostic GroupBy started(limit 20 records): 202302052125
    +-----------+----------+-----+----------+
    |notified_on|total_rows|count|proportion|
    +-----------+----------+-----+----------+
    |null       |11165674  |423  |0.0       |
    |1083-01-01 |11165674  |1    |0.0       |
    |1776-07-04 |11165674  |1    |0.0       |
    |1827-01-01 |11165674  |1    |0.0       |
    |1918-07-24 |11165674  |1    |0.0       |
    |1919-03-14 |11165674  |1    |0.0       |
    |1930-11-21 |11165674  |1    |0.0       |
    |1939-03-29 |11165674  |1    |0.0       |
    |1942-05-01 |11165674  |1    |0.0       |
    |1942-07-10 |11165674  |1    |0.0       |
    |1949-10-01 |11165674  |2    |0.0       |
    |1950-03-10 |11165674  |1    |0.0       |
    |1950-04-18 |11165674  |1    |0.0       |
    |1955-07-01 |11165674  |1    |0.0       |
    |1960-01-14 |11165674  |1    |0.0       |
    |1960-03-01 |11165674  |1    |0.0       |
    |1961-10-04 |11165674  |1    |0.0       |
    |1962-09-16 |11165674  |1    |0.0       |
    |1964-09-04 |11165674  |2    |0.0       |
    |1965-07-29 |11165674  |1    |0.0       |
    +-----------+----------+-----+----------+
    
    None
    Diagnostic Group By completed:  0:05:41.158327

    PySpark and BigQuery

    We have used PySpark to process snapshot JSON file and BigQuery for data storage.

    # https://download.companieshouse.gov.uk/en_pscdata.html
    def get_ch_psc_date():
      !wget https://download.companieshouse.gov.uk/persons-with-significant-control-snapshot-2023-02-04.zip
      !unzip persons-with-significant-control-snapshot-2023-02-04.zip -d /tmp/
      return True
    get_ch_psc_date()
    
    # reading JSON file
    df=spark.read.json('file:///tmp/persons-with-significant-control-snapshot-2023-02-04.txt',schema=schema)
    
    # create a temp table
    df.createOrReplaceTempView('ch_psc')
    
    # extract data
    df_psc = spark.sql(f"""
    select 
    company_number,
    data.address.address_line_1 as address_line_1,
    data.address.country  as country,
    data.address.locality as locality,
    data.address.postal_code as postal_code,
    data.address.premises as premises,
    data.country_of_residence as country_of_residence,
    data.date_of_birth.month as date_of_birth_month,
    data.date_of_birth.year as date_of_birth_year,
    data.etag as etag,
    data.kind as kind,
    data.links as links,
    data.name as name,
    data.name_elements.forename as forename,
    data.name_elements.middle_name as middle_name,
    data.name_elements.surname as surname,
    data.nationality as nationality,
    data.natures_of_control as natures_of_control,
    data.notified_on as notified_on
    from 
    ch_psc
    """)
    
    # Saving dataframe to BigQuery
    df_psc.write \
      .format("bigquery") \
      .option("temporaryGcsBucket", "ellodata/temp/") \
      .option("parentProject", "first-skein-329508") \
      .option("dataset", "utcdb") \
      .mode("overwrite") \
      .save("first-skein-329508.utcdb.ch_psc2023")
    Get PSC data
    Get PSC data
    Display PSC data
    Display PSC data
    PSC data by filter
    PSC data by filter