UPDATE – 07/10/2022 using spark.range()
You can use the spark.range() function to generate rows and then use withColumn() to add variables to a DataFrame. It generates a column named 'id'; drop it with df.drop('id') if you do not need it.
In this article, we will show a way of generating synthetic data using Hive SQL. You could also use Python packages such as Faker or Mimesis; however, we decided to use SQL.
You have to take reasonable steps to ensure that the datasets are synthetic. Carry out basic data analysis and understand the values for each variable in your datasets. Use SQL functions such as those listed below to populate your variables:
- rand() for numbers – e.g. floor(rand() * 10) gives a random integer between 0 and 9 (use floor(rand() * 10) + 1 for 1 to 10)
- reflect("java.util.UUID", "randomUUID") for random text
- regexp_replace(upper(reflect("java.util.UUID", "randomUUID")), "-", "") to upper-case the UUID and strip its hyphens
- concat
- substr
- use six or eight known postcodes and use CASE WHEN rand() > x THEN y to populate values for postcode variable
- date variables – ensure startdate is less than enddate
- startdate = from_unixtime(unix_timestamp(date_add(current_date(), cast(floor(rand() * -30) as int))), "yyyy-MM-dd")
- enddate = from_unixtime(unix_timestamp(date_add(current_date(), cast(floor(rand() * 30) as int))), "yyyy-MM-dd")
- use posexplode to generate rows
- for example: select current_date() as date from (select 9 as start_r, 0 as end_r) r lateral view posexplode(split(space(start_r - end_r), ' ')) pe as i, s – this generates 10 rows, because space(9) produces a string of 9 spaces that splits into 10 elements
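As a quick sanity check, the value-generation logic above can be mirrored in plain Python (this illustrates only the logic, not the Hive functions themselves; the variable names are illustrative):

```python
import math
import random
import uuid
from datetime import date, timedelta

# floor(rand() * 10) -> random integer between 0 and 9
number = math.floor(random.random() * 10)

# regexp_replace(upper(reflect("java.util.UUID", "randomUUID")), "-", "")
random_text = str(uuid.uuid4()).upper().replace("-", "")

# startdate up to 30 days in the past, enddate up to 30 days in the future,
# so startdate never exceeds enddate
startdate = date.today() + timedelta(days=math.floor(random.random() * -30))
enddate = date.today() + timedelta(days=math.floor(random.random() * 30))

print(number, random_text, startdate.isoformat(), enddate.isoformat())
```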
PySpark code snippets for generating synthetic data using Hive SQL
# incorporation date: up to 30 days in the past
pastdate = """from_unixtime(unix_timestamp(date_add(current_date(), cast(floor(rand() * -30) as int))), "yyyy-MM-dd")"""
# dissolved date: up to 30 days in the future
date = """from_unixtime(unix_timestamp(date_add(current_date(), cast(floor(rand() * 30) as int))), "yyyy-MM-dd")"""
# 8-digit company reference number in the range 90000000 to 91111110
crn = "floor(rand() * 1111111 + 90000000)"
# upper-cased UUID with the hyphens stripped
randomtext = """regexp_replace(upper(reflect("java.util.UUID", "randomUUID")),"-",'')"""
spark.sql(f"""
select
{crn} as companynumber,
{randomtext} as companyname,
{pastdate} as incorporationdate,
CASE
WHEN rand() > 0.1 THEN NULL
ELSE {date}
END as dissolveddate
from (select 9 as start_r, 0 as end_r) r
lateral view posexplode(split(space(start_r - end_r),' ')) pe as i,s
""").show(truncate=False)