
  • Using AWK to process Hive table metadata and Hadoop file listing output

    Following on from the previous article on Companies House data, this article shows how to extract metadata from Hive tables and the Hadoop filesystem using commands such as hdfs, awk and grep.

    Commands

    • SQL – show table extended from {database} like '{tablename}'
    • HADOOP – hdfs dfs -ls -R {dirname} | grep {searchstr} | awk '{if($5 > 0) print $6,$7,$8}' (see the Python sketch below)
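
    The grep/awk step keeps entries whose size (field 5) is greater than zero and prints the modification date, time and path (fields 6-8). Below is a minimal Python sketch of the same filtering done via subprocess; the warehouse path and search string are placeholders rather than values from the article.

    import subprocess

    # recursive HDFS listing; the directory is a placeholder
    listing = subprocess.run(
        ["hdfs", "dfs", "-ls", "-R", "/user/hive/warehouse"],
        capture_output=True, text=True, check=True
    ).stdout

    for line in listing.splitlines():
        fields = line.split()
        # -ls columns: perms, replication, owner, group, size, date, time, path
        if len(fields) >= 8 and fields[4].isdigit() and int(fields[4]) > 0 and "companies" in fields[7]:
            print(fields[5], fields[6], fields[7])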

    Two examples to get you started

    $ hdfs dfs -ls -R ~/spark-warehouse/ | grep "companies" | awk '{if($5>0) print $6,$7,$8}'
    df3 = spark.sql("show table extended in default like '*companies*'")
    df3.select('information').show(truncate=False)
    
    Database: default
    Table: companieshouse_etl_log
    Owner: aliakkas
    Created Time: Tue Jan 18 01:02:10 GMT 2022
    Last Access: UNKNOWN
    Created By: Spark 3.1.2
    Type: EXTERNAL
    Provider: hive
    Table Properties: [transient_lastDdlTime=1642467730]
    Statistics: 369 bytes
    Location: file:/Users/aliakkas/apps/log
    Serde Library: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
    InputFormat: org.apache.hadoop.mapred.TextInputFormat
    OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
    Storage Properties: [serialization.format=^, field.delim=^]
    Partition Provider: Catalog
    Schema: root
     |-- logdate: string (nullable = true)
     |-- logtype: string (nullable = true)
     |-- logmessage: string (nullable = true)
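
    The information column comes back as one block of text per table, so it can be useful to split it into key/value pairs. The sketch below is not from the article; it parses each line of the information string and pulls out the table name, location and statistics.

    rows = spark.sql("show table extended in default like '*companies*'").collect()

    for row in rows:
        meta = {}
        for line in row["information"].splitlines():
            # lines look like "Key: value"; schema lines ("|-- col: type") get picked up too but are ignored here
            if ": " in line:
                key, value = line.split(": ", 1)
                meta[key.strip()] = value.strip()
        print(meta.get("Table"), meta.get("Location"), meta.get("Statistics"))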

    For more articles and other useful information, see https://gitlab.com/akkasali910/companieshouse-data and read related articles at https://broadoakdata.uk/?s=Companies+House