In the previous article we described how to generate synthetic data using Hive SQL. In this article we use the Python package mimesis together with PySpark to generate synthetic data.
Prerequisites for generating synthetic data
- mimesis
- pyspark
Create dataframe with 10 rows
As a first example, we will create a dataframe with 10 rows. The snippet below assumes an active SparkSession named spark, as provided by the pyspark shell:
df = spark.range(10)
df.show()
Synthetic data using mimesis
from pyspark.sql import SparkSession
from mimesis import Generic
import itertools
# taken from farsante package
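def pyspark_df(funs, num_rows, spark = SparkSession.builder.getOrCreate()):
    # Call every generator function once to produce a single row
    def functions():
        return tuple(map(lambda f: f(), funs))
    # Column names are taken from the names of the generator functions
    cols = list(map(lambda f: f.__name__, funs))
    data = []
    # Build num_rows rows on the driver, then hand them to Spark
    for _ in itertools.repeat(None, num_rows):
        data.append(functions())
    return spark.createDataFrame(data, cols)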
# generate data
fake = Generic()
df_test = pyspark_df([fake.person.full_name, fake.finance.company, fake.address.street_name], 10)
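
To see what pyspark_df assembles before it passes the data to Spark, the row-building step can be tried on its own. The following is a minimal sketch that needs neither Spark nor mimesis: full_name and company are hypothetical stand-in functions (not the mimesis providers), and build_rows is an illustrative helper name that does not exist in farsante.

```python
import itertools
import random

# Hypothetical stand-ins for mimesis providers such as fake.person.full_name
def full_name():
    return random.choice(["Ada Lovelace", "Alan Turing", "Grace Hopper"])

def company():
    return random.choice(["Acme", "Globex", "Initech"])

def build_rows(funs, num_rows):
    """Return (column_names, rows) built the same way as in pyspark_df."""
    # Column names come from the function names
    cols = [f.__name__ for f in funs]
    # Each row is one call to every generator function
    rows = [tuple(f() for f in funs) for _ in itertools.repeat(None, num_rows)]
    return cols, rows

cols, rows = build_rows([full_name, company], 10)
print(cols)
print(rows[0])
```

Because the column names are derived from __name__, passing fake.person.full_name, fake.finance.company and fake.address.street_name to pyspark_df should give the columns full_name, company and street_name. Note that all rows are built on the driver first, so this approach is best suited to small test datasets.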