
  • Using AWK to process Hive table metadata and Hadoop file listing output

    Following on from the previous article on Companies House data, this article shows how to extract metadata from Hive tables and the Hadoop filesystem using commands such as hdfs, awk and grep.

    Commands

    • SQL – show table extended from {database} like '{tablename}'
    • HADOOP – hdfs dfs -ls -R {dirname} | grep {searchstr} | awk '{if($5 > 0) print $6,$7,$8}' (see the Python sketch below)
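
    The grep/awk step keeps entries whose size (field 5) is greater than zero and prints the modification date, time and path (fields 6-8). Below is a minimal Python sketch of the same filtering done via subprocess; the warehouse path and search string are placeholders rather than values from the article.

    import subprocess

    # recursive HDFS listing; the directory is a placeholder
    listing = subprocess.run(
        ["hdfs", "dfs", "-ls", "-R", "/user/hive/warehouse"],
        capture_output=True, text=True, check=True
    ).stdout

    for line in listing.splitlines():
        fields = line.split()
        # -ls columns: perms, replication, owner, group, size, date, time, path
        if len(fields) >= 8 and fields[4].isdigit() and int(fields[4]) > 0 and "companies" in fields[7]:
            print(fields[5], fields[6], fields[7])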

    Two examples to get you started

    $ hdfs dfs -ls -R ~/spark-warehouse/ | grep "companies" | awk '{if($5>0) print $6,$7,$8}'
    df3 = spark.sql("show table extended in default like '*companies*'")
    df3.select('information').show(truncate=False)
    
    Database: default
    Table: companieshouse_etl_log
    Owner: aliakkas
    Created Time: Tue Jan 18 01:02:10 GMT 2022
    Last Access: UNKNOWN
    Created By: Spark 3.1.2
    Type: EXTERNAL
    Provider: hive
    Table Properties: [transient_lastDdlTime=1642467730]
    Statistics: 369 bytes
    Location: file:/Users/aliakkas/apps/log
    Serde Library: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
    InputFormat: org.apache.hadoop.mapred.TextInputFormat
    OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
    Storage Properties: [serialization.format=^, field.delim=^]
    Partition Provider: Catalog
    Schema: root
     |-- logdate: string (nullable = true)
     |-- logtype: string (nullable = true)
     |-- logmessage: string (nullable = true)
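
    The information column comes back as one block of text per table, so it can be useful to split it into key/value pairs. The sketch below is not from the article; it parses each line of the information string and pulls out the table name, location and statistics.

    rows = spark.sql("show table extended in default like '*companies*'").collect()

    for row in rows:
        meta = {}
        for line in row["information"].splitlines():
            # lines look like "Key: value"; schema lines ("|-- col: type") get picked up too but are ignored here
            if ": " in line:
                key, value = line.split(": ", 1)
                meta[key.strip()] = value.strip()
        print(meta.get("Table"), meta.get("Location"), meta.get("Statistics"))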

    For more articles and other useful information, see https://gitlab.com/akkasali910/companieshouse-data and read related articles at https://broadoakdata.uk/?s=Companies+House