In this article, we extract SIC codes from https://www.ons.gov.uk/file?uri=/methodology/classificationsandstandards/ukstandardindustrialclassificationofeconomicactivities/uksic2007/uksic2007indexeswithaddendumnovember2020.xlsx and save them as a Parquet file for further processing.
We are only interested in the data from the ‘Alphabetical Index’ worksheet.
Steps:
- Download the file from the ONS website
- Extract the SIC codes
- Save them as a Parquet file
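The download step can also be scripted rather than done by hand. The following is a minimal sketch using only the Python standard library; the helper name `download_file` and the constant `SIC_URL` are illustrative and not part of the original workflow, and the sketch assumes the server accepts a plain request without extra headers.

```python
import shutil
import urllib.request
from pathlib import Path

# URL of the ONS workbook, as given at the top of the article
SIC_URL = (
    "https://www.ons.gov.uk/file?uri=/methodology/classificationsandstandards/"
    "ukstandardindustrialclassificationofeconomicactivities/uksic2007/"
    "uksic2007indexeswithaddendumnovember2020.xlsx"
)


def download_file(url: str, dest: str) -> Path:
    """Hypothetical helper: stream the resource at `url` into the file `dest`."""
    dest_path = Path(dest)
    # Create the target folder if it does not exist yet
    dest_path.parent.mkdir(parents=True, exist_ok=True)
    # Copy the response body to disk in chunks instead of loading it all into memory
    with urllib.request.urlopen(url) as response, open(dest_path, "wb") as out:
        shutil.copyfileobj(response, out)
    return dest_path
```

Calling `download_file(SIC_URL, 'spark-warehouse/uksic2007indexeswithaddendumnovember2020.xlsx')` would place the workbook at the path that the read step expects.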
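    import pandas as pd
    from pyspark.sql import SparkSession

    # Reuse the active Spark session, or create one if none exists
    spark = SparkSession.builder.getOrCreate()

    # Read only the 'Alphabetical Index' worksheet, skipping its first row
    sic_pd = pd.read_excel(
        'spark-warehouse/uksic2007indexeswithaddendumnovember2020.xlsx',
        sheet_name="Alphabetical Index",
        skiprows=1,
    )

    # Convert to a Spark DataFrame and write it out as a single Parquet file
    df_sic = spark.createDataFrame(sic_pd)
    df_sic.coalesce(1).write.format("parquet").mode("overwrite").save('spark-warehouse/sic0307')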
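Depending on the Spark version, writing Parquet can fail when column names contain spaces or characters such as `,;{}()=`. Whether the worksheet's headers trigger this has not been confirmed here, so treat the following as an optional safeguard. The helper `sanitize_column_names` is hypothetical, and the header names in the example are made up for illustration.

```python
import re


def sanitize_column_names(names):
    """Hypothetical helper: turn arbitrary headers into lowercase snake_case names."""
    cleaned = []
    for name in names:
        # Replace every run of non-alphanumeric characters with a single underscore
        slug = re.sub(r"[^0-9A-Za-z]+", "_", str(name)).strip("_").lower()
        # Fall back to a placeholder if nothing usable is left
        cleaned.append(slug or "column")
    return cleaned


# Illustrative headers only; the real worksheet headers may differ
print(sanitize_column_names(["UK SIC 2007", "Activity"]))
# → ['uk_sic_2007', 'activity']
```

It could be applied with `sic_pd.columns = sanitize_column_names(sic_pd.columns)` before the pandas DataFrame is converted to Spark. The sketch does not deduplicate names, so two headers that collapse to the same slug would still need manual handling.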