Extracting Standard Industrial Classification (SIC) codes from the ONS website

In this article, we extract SIC codes from https://www.ons.gov.uk/file?uri=/methodology/classificationsandstandards/ukstandardindustrialclassificationofeconomicactivities/uksic2007/uksic2007indexeswithaddendumnovember2020.xlsx and save them as a Parquet file for further processing.

We are only interested in the data in the ‘Alphabetical Index’ worksheet.

Steps:

  • Download the file from the ONS website
  • Extract the SIC codes
  • Save them as a Parquet file
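
The download step is not shown in the article's code, which reads the workbook from a local path. A minimal sketch of that step, using only the Python standard library, might look like the following. The helper name `download_file` and the constants `SIC_URL` and `LOCAL_PATH` are hypothetical names introduced here for illustration; the URL and the local path are the ones used in the article. Whether the ONS server accepts Python's default request headers is an assumption that has not been verified.

```python
import shutil
import urllib.request
from pathlib import Path

# URL quoted in the article (split across lines only for readability).
SIC_URL = (
    "https://www.ons.gov.uk/file?uri=/methodology/classificationsandstandards/"
    "ukstandardindustrialclassificationofeconomicactivities/uksic2007/"
    "uksic2007indexeswithaddendumnovember2020.xlsx"
)
# Local path that the article's pandas code reads from.
LOCAL_PATH = "spark-warehouse/uksic2007indexeswithaddendumnovember2020.xlsx"


def download_file(url: str, dest: str) -> Path:
    """Stream the resource at `url` into `dest`, creating parent folders if needed."""
    target = Path(dest)
    target.parent.mkdir(parents=True, exist_ok=True)
    # Copy in chunks so the whole workbook is never held in memory at once.
    with urllib.request.urlopen(url) as response, open(target, "wb") as out:
        shutil.copyfileobj(response, out)
    return target
```

Calling `download_file(SIC_URL, LOCAL_PATH)` once before the extraction step should place the workbook where `pd.read_excel` expects it.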
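
import pandas as pd
from pyspark.sql import SparkSession
# Reuse the notebook's Spark session, or create one when running as a script
spark = SparkSession.builder.getOrCreate()
# Read only the 'Alphabetical Index' worksheet, skipping the first row
sic_pd = pd.read_excel('spark-warehouse/uksic2007indexeswithaddendumnovember2020.xlsx',
                       sheet_name="Alphabetical Index", skiprows=1)
df_sic = spark.createDataFrame(sic_pd)
# coalesce(1) writes a single part file inside the output directory
df_sic.coalesce(1).write.format("parquet").mode("overwrite").save('spark-warehouse/sic0307')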