Generating synthetic data using HIVE SQL

UPDATE – 07/10/2022 using sparkContext.range()

You can use sparkContext.range() function to generate rows and then use withColumn for adding variables to a dataframe. It generate ‘a column with name ‘id’. Drop it you do not need it using df.drop(‘id’).

In this article, we will show way of generating synthetic data using HIVE SQL. You can also use Python packages like Faker or Mimesis. However we decided to use SQL.

You have to take reasonable steps to ensure that the datasets are synthetic. Carry out basic data analysis and understand the values for each variables in your datasets. Use SQL functions such as listed below to populate your variables:

  • RAND() for number – i.e floor(rand() * 10) – a random number between 1 and 10
  • java.util.util.RandomUIID for text
    • regexp_replace(upper,(reflect(“java.util.UUID”, “randomUUID”)),”-“,”)
    • concat
    • substr
  • use six or eight known postcodes and use CASE WHEN rand() > x THEN y to populate values for postcode variable
  • date variables – ensure startdate is less than enddate
    • startdate = from_unixtime(unix_timestamp(date_add(current_date(), floor(rand() * -30))), “yyyy-MM-dd”)
    • endate = from_unixtime(unix_timestamp(date_add(current_date(), floor(rand() * 30))), “yyyy-MM-dd”)
  • use posexplode to generate rows
    • for example select current_date() as date from (select 9 as start_r, 0 as end_r) r lateral view posexplode(split(space(start_r – end_r),’ ‘)) pe as i,s – this will generate 10 rows

PySpark code snippets for generating synthetic data using HIVE SQL

pastdate = """from_unixtime(unix_timestamp(date_add(current_date(), cast(floor(rand() *  -30) as int))),  "yyyy-MM-dd")"""
date = """from_unixtime(unix_timestamp(date_add(current_date(), cast(floor(rand() *  30) as int))),  "yyyy-MM-dd")"""
crn = "floor(rand() * 1111111 + 90000000)"
randomtext = """regexp_replace(upper(reflect("java.util.UUID", "randomUUID")),"-",'')"""

spark.sql(f"""
select 
{crn} as companynumber,
{randomtext} as companyname,
{pastdate} as incorporationdate,
CASE
  WHEN rand() > 0.1 THEN NULL
  ELSE {date}
END as dissolveddate
from (select 9 as start_r, 0 as end_r) r 
lateral view posexplode(split(space(start_r - end_r),' ')) pe as i,s
""").show(truncate=False)
Synthetic  data
Output of synthetic data