Following on from the previous article on Companies House Data, this article shows how to extract metadata from Hive tables and the Hadoop filesystem using commands such as hdfs, awk and grep.
Commands
- SQL – show table extended in {database} like {tablename}
- HADOOP – hdfs dfs -ls -R {dirname} | grep {searchstr} | awk '{if($5 > 0) print $6,$7,$8}'
Two examples to get you started:
$ hdfs dfs -ls -R ~/spark-warehouse/ | grep "companies" | awk '{if($5>0) print $6,$7,$8}'
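In the hdfs listing, field 5 is the file size and fields 6 to 8 are the modification date, time and path, so the awk filter prints the date, time and path of every non-empty entry. If you would rather stay in Python, here is a minimal sketch of the same filter (the list_nonempty function name and the use of subprocess are illustrative, not from the article):

import subprocess

# Sketch: recursively list a directory with "hdfs dfs -ls -R" and print the
# date, time and path of non-empty entries whose path contains searchstr.
# Assumes the usual 8-field listing: perms, replication, owner, group,
# size, date, time, path.
def list_nonempty(dirname, searchstr):
    out = subprocess.run(
        ["hdfs", "dfs", "-ls", "-R", dirname],
        capture_output=True, text=True, check=True,
    ).stdout
    for line in out.splitlines():
        fields = line.split()
        if len(fields) >= 8 and searchstr in fields[7] and int(fields[4]) > 0:
            print(fields[5], fields[6], fields[7])

list_nonempty("spark-warehouse", "companies")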
df3 = spark.sql("show table extended in default like '*companies*'")
df3.select('information').show(truncate=False)

Output:

Database: default
Table: companieshouse_etl_log
Owner: aliakkas
Created Time: Tue Jan 18 01:02:10 GMT 2022
Last Access: UNKNOWN
Created By: Spark 3.1.2
Type: EXTERNAL
Provider: hive
Table Properties: [transient_lastDdlTime=1642467730]
Statistics: 369 bytes
Location: file:/Users/aliakkas/apps/log
Serde Library: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat: org.apache.hadoop.mapred.TextInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Storage Properties: [serialization.format=^, field.delim=^]
Partition Provider: Catalog
Schema: root
 |-- logdate: string (nullable = true)
 |-- logtype: string (nullable = true)
 |-- logmessage: string (nullable = true)
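Because show table extended returns all of the metadata as one free-text information column, it can be handy to split it into key/value pairs rather than reading the console output. A small sketch of one way to do that (it assumes an active SparkSession named spark; the info dict and the key names simply mirror the output shown above):

rows = spark.sql("show table extended in default like '*companies*'").collect()
for row in rows:
    info = {}
    for line in row["information"].splitlines():
        # skip the schema lines such as " |-- logdate: string ..."
        if ":" in line and not line.lstrip().startswith("|--"):
            key, _, value = line.partition(":")
            info[key.strip()] = value.strip()
    print(info.get("Table"), info.get("Location"), info.get("Statistics"))

This gives you, for example, a table's on-disk location and size programmatically, which is useful when checking many tables at once.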
For more articles and other useful information, see https://gitlab.com/akkasali910/companieshouse-data and read related articles at https://broadoakdata.uk/?s=Companies+House