Tag: aws

  • AWS Data Wrangler (AWS SDK for Pandas)

    AWS Data Wrangler (awswrangler), now called AWS SDK for pandas, is a Python library that makes it easier to integrate AWS services into ETL (Extract, Transform, Load) work.

    I will use Google Colab and AWS SDK for pandas to create:

    • a database in the AWS Glue Data Catalog
    • tables

    Prerequisites

    • An S3 bucket with data already in it (I am using charity data)
    • The necessary permissions to use AWS services such as AWS Glue
    • Access to Google Colab

    Steps

    Let’s get started.

    • install awswrangler (pip install awswrangler) and set up credentials, as sketched below
    • create a database in the AWS Glue Data Catalog – wr.catalog.create_database('database_name')
    • create a table – wr.s3.store_parquet_metadata
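
    Running in Colab means there is no ambient AWS profile, so credentials have to be supplied explicitly. A minimal setup sketch, assuming you pass IAM keys through environment variables (the key values and the region below are placeholders you must replace):

    import os

    # placeholder credentials: replace with your own IAM user's keys
    os.environ['AWS_ACCESS_KEY_ID'] = 'YOUR_ACCESS_KEY_ID'
    os.environ['AWS_SECRET_ACCESS_KEY'] = 'YOUR_SECRET_ACCESS_KEY'
    os.environ['AWS_DEFAULT_REGION'] = 'us-east-1'  # assumed region

    import awswrangler as wr  # installed with: pip install awswrangler

    With the session in place, creating the database and the table looks like this: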
    # create the database before calling store_parquet_metadata, otherwise it fails with:
    # EntityNotFoundException: An error occurred (EntityNotFoundException) when calling the CreateTable operation: Database ellodb not found.
    wr.catalog.create_database('database_name')

    # create a table from data already sitting in S3
    # very fast!!!
    columns_types, partitions_types, partitions_values = wr.s3.store_parquet_metadata(
        path='s3://{bucket_name}/data/charity/',
        database='database_name',
        table='charity',
        dataset=True
    )
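
    The call also returns the schema it inferred, so you can sanity-check what was registered before opening the console:

    # columns_types is a dict of column name -> Glue/Athena type;
    # the partition values are only populated for partitioned datasets
    print(columns_types)
    print(partitions_types)
    print(partitions_values)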

    Verify your table

    Go to the AWS Console and check your table metadata under the Glue Data Catalog.

    (Screenshots: the AWS Data Catalog and the schema of the charity table)
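
    You can also verify without leaving the notebook. A sketch using the catalog helpers (both calls return pandas DataFrames, as far as I recall):

    # list the tables registered in the database
    print(wr.catalog.tables(database='database_name'))

    # show the schema of the charity table
    print(wr.catalog.table(database='database_name', table='charity'))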

    Read data from S3 bucket

    df = wr.s3.read_parquet("s3://{bucket_name}/data/charity/")
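
    Because the table is now registered in the Data Catalog, you can also query it through Athena instead of reading the Parquet files directly. A sketch (note this runs a real Athena query, with the usual per-scan cost; awswrangler uses a default S3 output location for the results if you don't pass one):

    # query the cataloged table with Athena; returns a pandas DataFrame
    df = wr.athena.read_sql_query(
        'SELECT * FROM charity LIMIT 10',
        database='database_name'
    )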

    Create a table using a pandas DataFrame

    You can also create a table using the wr.s3.to_parquet method:

    # create a table from a pandas DataFrame
    # https://aws-data-wrangler.readthedocs.io/en/stable/stubs/awswrangler.s3.to_parquet.html
    wr.s3.to_parquet(
        df=df_date,
        path="s3://{bucket_name}/data/charity/",
        dataset=True,
        database="database_name",
        table="table_name",
        partition_cols=["column_name"]
    )
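
    If I remember the API correctly, to_parquet appends to an existing dataset by default; pass mode="overwrite" (or mode="overwrite_partitions") to replace the data instead.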

    Read data from a database

    Use the pandas library (with SQLAlchemy) to read from a relational database instead of using data from the Data Catalog:

    import pandas as pd
    from sqlalchemy import create_engine, text

    # postgres connection details
    user = 'user'
    password = 'password'
    database = 'database_name'
    url = f'postgresql://{user}:{password}@xxxxx.aws.neon.tech/{database}'
    pgengine = create_engine(url)
    pgconn = pgengine.connect()

    # get data using a query
    query = text('SELECT version();')
    df = pd.read_sql_query(query, pgconn)
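
    awswrangler ships its own PostgreSQL module as well. A sketch of the equivalent read, assuming a Glue Catalog connection has already been created for the database (the connection name below is a placeholder):

    # connect through a Glue Catalog connection and run the same query
    con = wr.postgresql.connect(connection='my-postgres-connection')
    df = wr.postgresql.read_sql_query('SELECT version();', con=con)
    con.close()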

    AWS SDK for pandas – Tutorials

    Go to the AWS SDK for pandas tutorials site.