UPDATE – 07/10/2022 using spark.range()
You can use the spark.range() function to generate rows and then use withColumn() to add variables to a DataFrame. It generates a column named 'id'; drop it with df.drop('id') if you do not need it.
In this article, we will show a way of generating synthetic data using Hive SQL. You could also use Python packages such as Faker or Mimesis; however, we decided to use SQL.
You have to take reasonable steps to ensure that the datasets are synthetic. Carry out basic data analysis and understand the values for each variable in your datasets. Use SQL functions such as those listed below to populate your variables:
- rand() for numbers – e.g. floor(rand() * 10) gives a random integer between 0 and 9 (use floor(rand() * 10) + 1 for 1 to 10)
- reflect("java.util.UUID", "randomUUID") for random text
- regexp_replace(upper(reflect("java.util.UUID", "randomUUID")), "-", "") to upper-case the UUID and strip its hyphens
- concat
- substr
- use six or eight known postcodes and use CASE WHEN rand() > x THEN y to populate values for postcode variable
- date variables – ensure startdate is less than enddate
- startdate = from_unixtime(unix_timestamp(date_add(current_date(), cast(floor(rand() * -30) as int))), "yyyy-MM-dd")
- enddate = from_unixtime(unix_timestamp(date_add(current_date(), cast(floor(rand() * 30) as int))), "yyyy-MM-dd")
- use posexplode to generate rows
- for example: select current_date() as date from (select 9 as start_r, 0 as end_r) r lateral view posexplode(split(space(start_r - end_r), ' ')) pe as i, s – this generates 10 rows, because space(9) produces a string of 9 spaces that splits into 10 elements
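As a quick sanity check, the value-generation logic above can be mirrored in plain Python (this illustrates only the logic, not the Hive functions themselves; the variable names are illustrative):

```python
import math
import random
import uuid
from datetime import date, timedelta

# floor(rand() * 10) -> random integer between 0 and 9
number = math.floor(random.random() * 10)

# regexp_replace(upper(reflect("java.util.UUID", "randomUUID")), "-", "")
random_text = str(uuid.uuid4()).upper().replace("-", "")

# startdate up to 30 days in the past, enddate up to 30 days in the future,
# so startdate never exceeds enddate
startdate = date.today() + timedelta(days=math.floor(random.random() * -30))
enddate = date.today() + timedelta(days=math.floor(random.random() * 30))

print(number, random_text, startdate.isoformat(), enddate.isoformat())
```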
PySpark code snippets for generating synthetic data using Hive SQL
# incorporation date: up to 30 days in the past
pastdate = """from_unixtime(unix_timestamp(date_add(current_date(), cast(floor(rand() * -30) as int))), "yyyy-MM-dd")"""
# dissolved date: up to 30 days in the future
date = """from_unixtime(unix_timestamp(date_add(current_date(), cast(floor(rand() * 30) as int))), "yyyy-MM-dd")"""
# 8-digit company reference number in the range 90000000 to 91111110
crn = "floor(rand() * 1111111 + 90000000)"
# upper-cased UUID with the hyphens stripped
randomtext = """regexp_replace(upper(reflect("java.util.UUID", "randomUUID")),"-",'')"""
spark.sql(f"""
select
{crn} as companynumber,
{randomtext} as companyname,
{pastdate} as incorporationdate,
CASE
WHEN rand() > 0.1 THEN NULL
ELSE {date}
END as dissolveddate
from (select 9 as start_r, 0 as end_r) r
lateral view posexplode(split(space(start_r - end_r),' ')) pe as i,s
""").show(truncate=False)