Tag: PySpark

  • Synthetic data creation using mimesis and PySpark

    In the previous article we described how to generate synthetic data using Hive SQL. In this article we use the Python package mimesis together with PySpark to generate synthetic data.

    Prerequisites for generating synthetic data

    • mimesis
    • pyspark

    Create dataframe with 10 rows

    As a first example, we create a dataframe with 10 rows. spark.range(10) returns a dataframe with a single column, id, holding the values 0 to 9. See the code snippet below:

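    from pyspark.sql import SparkSession
    spark = SparkSession.builder.getOrCreate()  # reuse the active session or start one
    df = spark.range(10)
    df.show()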

    Synthetic data using mimesis

    from pyspark.sql import SparkSession
    from mimesis import Generic
    import itertools
    
    # taken from farsante package
    def pyspark_df(funs, num_rows, spark = SparkSession.builder.getOrCreate()):
        def functions():
            return tuple(map(lambda f: f(), funs))
        cols = list(map(lambda f: f.__name__, funs))
        data = []
        for _ in itertools.repeat(None, num_rows):
            data.append(functions())
        return spark.createDataFrame(data, cols)
    
    # generate data
    fake = Generic()
    df_test = pyspark_df([fake.person.full_name, fake.finance.company, fake.address.street_name], 10)
    df_test.show()
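To see what pyspark_df hands to spark.createDataFrame, the row-building step can be sketched in plain Python without Spark or mimesis. This is only an illustration: full_name and company below are hypothetical stand-ins for the mimesis provider methods, and build_rows is a name introduced here, not part of farsante or mimesis.

```python
import itertools
import random

# Hypothetical stand-ins for mimesis providers: zero-argument callables
# whose __name__ becomes the column name.
def full_name():
    return random.choice(["Ada Lovelace", "Alan Turing", "Grace Hopper"])

def company():
    return random.choice(["Acme", "Globex", "Initech"])

def build_rows(funs, num_rows):
    """Assemble column names and row tuples the same way pyspark_df does."""
    cols = [f.__name__ for f in funs]
    # Call every generator once per row; each row is a tuple of fresh values.
    rows = [tuple(f() for f in funs) for _ in itertools.repeat(None, num_rows)]
    return rows, cols

rows, cols = build_rows([full_name, company], 3)
print(cols)
for row in rows:
    print(row)
```

Each function is called once per row, so every row gets freshly generated values, and the list of tuples plus the list of column names is the shape that spark.createDataFrame(data, cols) accepts.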