Synthetic data creation using mimesis and PySpark

In the previous article, we described how to generate synthetic data using Hive SQL. In this article, we use the Python package mimesis together with PySpark to generate synthetic data.

Prerequisites for generating synthetic data

  • mimesis
  • pyspark

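Create a dataframe with 10 rows

As a first example, we create a dataframe with 10 rows. spark.range(10) returns a dataframe with a single column named id that holds the values 0 through 9. See the code snippet below: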

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
df = spark.range(10)
df.show()

Synthetic data using mimesis

from pyspark.sql import SparkSession
from mimesis import Generic

# adapted from the farsante package
def pyspark_df(funs, num_rows, spark=None):
    # create (or reuse) the session at call time, not when the function is defined
    if spark is None:
        spark = SparkSession.builder.getOrCreate()
    # column names are taken from the generator function names
    cols = [f.__name__ for f in funs]
    # each row calls every generator function once
    data = [tuple(f() for f in funs) for _ in range(num_rows)]
    return spark.createDataFrame(data, cols)

# generate data
fake = Generic()
df_test = pyspark_df([fake.person.full_name, fake.finance.company, fake.address.street_name], 10)
df_test.show()
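
The row-building step inside pyspark_df does not depend on Spark, so it can be tried on its own. The sketch below is only an illustration: full_name and company are hypothetical stand-ins written with the standard library's random module (they are not mimesis providers), and build_rows is a made-up helper name that mirrors what pyspark_df does before it calls spark.createDataFrame.

```python
import random

# Hypothetical stand-ins for mimesis providers (assumption: not part of mimesis)
def full_name():
    return random.choice(["Ada Lovelace", "Alan Turing", "Grace Hopper"])

def company():
    return random.choice(["Acme", "Globex", "Initech"])

def build_rows(funs, num_rows):
    # Column names come from the function names, as in pyspark_df
    cols = [f.__name__ for f in funs]
    # Each row is a tuple with one freshly generated value per function
    data = [tuple(f() for f in funs) for _ in range(num_rows)]
    return cols, data

cols, data = build_rows([full_name, company], 10)
print(cols)     # column names derived from the function names
print(data[0])  # one randomly generated row
```

Because the column names are derived from each function's __name__, passing named functions or bound methods (such as fake.person.full_name) gives readable headers, while passing lambdas would name every column "<lambda>".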