Another article compiled with the help of Amazon Q, this time focusing on football ticket sales and trend analysis. The intention is to get some help understanding what dataset is required to build a quick dashboard using QuickSight.
Prompt: football ticket sales and trend analysis using fake data, Amazon Q, QuickSight, SQLAlchemy 2.0 and PostgreSQL
Response from Q:
Certainly! I’ll provide you with a comprehensive solution for football ticket sales and trend analysis using fake data, Amazon QuickSight for visualization, SQLAlchemy 2.0 for database operations, and PostgreSQL as the database. We’ll use Python to generate fake data and perform initial analysis, then use QuickSight for more in-depth visualizations.
Here’s a step-by-step approach:
1. First, let’s set up our environment and create our database model:
```python
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
from sqlalchemy import create_engine, Column, Integer, Float, Date, Boolean, String, text
from sqlalchemy.orm import declarative_base, Session
from sqlalchemy.exc import SQLAlchemyError
import random

# Create a base class for declarative models
Base = declarative_base()

# Define the FootballTicketSales model
class FootballTicketSales(Base):
    __tablename__ = 'football_ticket_sales'

    id = Column(Integer, primary_key=True)
    date = Column(Date, nullable=False)
    tickets_sold = Column(Integer, nullable=False)
    average_ticket_price = Column(Float, nullable=False)
    opponent_team = Column(String, nullable=False)
    day_of_week = Column(String, nullable=False)
    is_home_game = Column(Boolean, nullable=False)
    weather_condition = Column(String, nullable=False)
    total_revenue = Column(Float, nullable=False)

# Create an engine and create the table
engine = create_engine('postgresql://username:password@localhost/dbname')
Base.metadata.create_all(engine)
```
2. Now, let’s generate fake data and insert it into our database:
```python
def generate_fake_data(start_date, end_date, num_teams=10):
    date_range = pd.date_range(start=start_date, end=end_date)
    teams = [f"Team {chr(65 + i)}" for i in range(num_teams)]
    weather_conditions = ['Sunny', 'Rainy', 'Cloudy', 'Snowy']

    data = []
    for date in date_range:
        if random.random() < 0.3:  # Simulate games occurring on 30% of days
            record = {
                'date': date,
                'tickets_sold': np.random.randint(5000, 50000),
                'average_ticket_price': round(np.random.uniform(50, 200), 2),
                'opponent_team': random.choice(teams),
                'day_of_week': date.strftime('%A'),
                'is_home_game': random.choice([True, False]),
                'weather_condition': random.choice(weather_conditions)
            }
            record['total_revenue'] = record['tickets_sold'] * record['average_ticket_price']
            data.append(record)
    return pd.DataFrame(data)

# Generate fake data
start_date = datetime(2020, 1, 1)
end_date = datetime(2023, 12, 31)
df = generate_fake_data(start_date, end_date)

# Insert data into the database
with Session(engine) as session:
    try:
        for _, row in df.iterrows():
            ticket_sale = FootballTicketSales(**row.to_dict())
            session.add(ticket_sale)
        session.commit()
        print("Data inserted successfully")
    except SQLAlchemyError as e:
        session.rollback()
        print(f"Error inserting data: {str(e)}")
```
3. Now that we have data in our database, let’s perform some initial analysis:
```python
def perform_analysis():
    with Session(engine) as session:
        # Total revenue by year
        yearly_revenue = session.execute(text("""
            SELECT EXTRACT(YEAR FROM date) as year, SUM(total_revenue) as total_revenue
            FROM football_ticket_sales
            GROUP BY year
            ORDER BY year
        """)).fetchall()

        # Average ticket price by opponent team
        avg_price_by_team = session.execute(text("""
            SELECT opponent_team, AVG(average_ticket_price) as avg_price
            FROM football_ticket_sales
            GROUP BY opponent_team
            ORDER BY avg_price DESC
        """)).fetchall()

        # Ticket sales by day of week
        sales_by_day = session.execute(text("""
            SELECT day_of_week, AVG(tickets_sold) as avg_tickets_sold
            FROM football_ticket_sales
            GROUP BY day_of_week
            ORDER BY avg_tickets_sold DESC
        """)).fetchall()

        # Home vs away game comparison
        home_away_comparison = session.execute(text("""
            SELECT is_home_game, AVG(tickets_sold) as avg_tickets_sold, AVG(total_revenue) as avg_revenue
            FROM football_ticket_sales
            GROUP BY is_home_game
        """)).fetchall()

        # Weather impact on ticket sales
        weather_impact = session.execute(text("""
            SELECT weather_condition, AVG(tickets_sold) as avg_tickets_sold
            FROM football_ticket_sales
            GROUP BY weather_condition
            ORDER BY avg_tickets_sold DESC
        """)).fetchall()

    return {
        'yearly_revenue': yearly_revenue,
        'avg_price_by_team': avg_price_by_team,
        'sales_by_day': sales_by_day,
        'home_away_comparison': home_away_comparison,
        'weather_impact': weather_impact
    }

# Perform analysis
analysis_results = perform_analysis()

# Print results
for key, value in analysis_results.items():
    print(f"\n{key.replace('_', ' ').title()}:")
    for row in value:
        print(row)
```
4. To use Amazon QuickSight for visualization, you’ll need to make your PostgreSQL database accessible to QuickSight. Here are the general steps:
   a. Ensure your PostgreSQL database is hosted on Amazon RDS or is otherwise accessible from AWS.
   b. Set up the necessary security groups and network access to allow QuickSight to connect to your database.
   c. In the QuickSight console, create a new data source and choose PostgreSQL.
   d. Enter your database connection details and test the connection.
   e. Once connected, you can create a new analysis based on your football_ticket_sales table.
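If you would rather script steps (c) and (d) than click through the console, a rough boto3 sketch could look like the following. The account ID, region, IDs and connection details are all placeholders, and QuickSight still needs the network access described above:

```python
# Hypothetical sketch: register the PostgreSQL database as a QuickSight data source.
# All IDs, the region and the connection details below are placeholders.
import boto3

quicksight = boto3.client('quicksight', region_name='us-east-1')

response = quicksight.create_data_source(
    AwsAccountId='123456789012',               # placeholder account ID
    DataSourceId='football-ticket-sales-pg',   # arbitrary unique ID
    Name='Football Ticket Sales (PostgreSQL)',
    Type='POSTGRESQL',
    DataSourceParameters={
        'PostgreSqlParameters': {
            'Host': 'your-rds-endpoint.amazonaws.com',
            'Port': 5432,
            'Database': 'dbname',
        }
    },
    Credentials={
        'CredentialPair': {
            'Username': 'username',
            'Password': 'password',
        }
    },
)
print(response['CreationStatus'])
```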
5. In QuickSight, you can create various visualizations based on the data:
- Line chart of ticket sales and revenue over time
- Bar chart of average ticket price by opponent team
- Pie chart of ticket sales by day of week
- Comparison bar chart of home vs away game performance
- Stacked bar chart of weather impact on ticket sales
- Heatmap of ticket sales by month and year (a query sketch for this follows the list)
- Scatter plot of tickets sold vs average ticket price
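Most of these visuals can be built by dragging fields directly onto a sheet, but the heatmap needs a month-by-year breakdown. A minimal sketch, assuming the SQLAlchemy engine defined earlier, that pulls (and lets you cross-check) that breakdown with pandas; the same SELECT could also serve as a custom SQL dataset in QuickSight:

```python
# Aggregate ticket sales by month and year, then pivot into a heatmap-style table
# (rows = month, columns = year). Reuses the SQLAlchemy engine defined earlier.
with engine.connect() as connection:
    monthly_sales = pd.read_sql(
        text("""
            SELECT EXTRACT(YEAR FROM date)::int AS year,
                   EXTRACT(MONTH FROM date)::int AS month,
                   SUM(tickets_sold) AS tickets_sold
            FROM football_ticket_sales
            GROUP BY year, month
            ORDER BY year, month
        """),
        connection,
    )

heatmap_table = monthly_sales.pivot(index='month', columns='year', values='tickets_sold')
print(heatmap_table)
```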
6. To create more advanced analyses in QuickSight:
- Use the “Forecast” feature to predict future ticket sales
- Create calculated fields for metrics like “Revenue per Ticket” (see the sketch after this list)
- Set up anomaly detection to automatically identify unusual sales patterns
- Create a dashboard combining multiple visualizations for a comprehensive view of your ticket sales data
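For example, a “Revenue per Ticket” calculated field can be expressed in QuickSight roughly as sum(total_revenue) / sum(tickets_sold) (adjust to your actual field names). A small sketch for cross-checking that number against the database, reusing the engine from the script above:

```python
# Compute revenue per ticket directly in PostgreSQL so the QuickSight
# calculated field can be validated against a known value.
with engine.connect() as connection:
    revenue_per_ticket = connection.execute(text("""
        SELECT SUM(total_revenue) / SUM(tickets_sold) AS revenue_per_ticket
        FROM football_ticket_sales
    """)).scalar()

print(f"Revenue per ticket: {revenue_per_ticket:.2f}")
```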
Remember to regularly refresh your QuickSight datasets to ensure you’re working with the most up-to-date data from your PostgreSQL database.
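If the dataset is imported into SPICE, that refresh can also be triggered programmatically; here is a hedged boto3 sketch, with placeholder account and dataset IDs:

```python
# Hypothetical sketch: kick off a SPICE refresh for a QuickSight dataset.
# Not needed for Direct Query datasets; the IDs below are placeholders.
import uuid
import boto3

quicksight = boto3.client('quicksight', region_name='us-east-1')

quicksight.create_ingestion(
    AwsAccountId='123456789012',                 # placeholder account ID
    DataSetId='football-ticket-sales-dataset',   # placeholder dataset ID
    IngestionId=f'refresh-{uuid.uuid4()}',       # unique ID for this refresh run
)
```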
This solution provides a comprehensive approach to analyzing football ticket sales data, from data generation and storage to analysis and visualization. The combination of SQLAlchemy for database operations and QuickSight for visualization allows for powerful, flexible analysis of your ticket sales trends.
Follow-up question
update python script to insert large datasets
Certainly! To efficiently insert large datasets into your PostgreSQL database using SQLAlchemy 2.0, we’ll make some modifications to optimize performance. We’ll use batch inserts and the more efficient execute() method instead of individual ORM object creation. Here’s an updated version of the script that can handle large datasets more efficiently:
```python
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
from sqlalchemy import create_engine, Column, Integer, Float, Date, Boolean, String, text, insert
from sqlalchemy.orm import declarative_base, Session
from sqlalchemy.exc import SQLAlchemyError
import random
from tqdm import tqdm

# Create a base class for declarative models
Base = declarative_base()

# Define the FootballTicketSales model
class FootballTicketSales(Base):
    __tablename__ = 'football_ticket_sales'

    id = Column(Integer, primary_key=True)
    date = Column(Date, nullable=False)
    tickets_sold = Column(Integer, nullable=False)
    average_ticket_price = Column(Float, nullable=False)
    opponent_team = Column(String, nullable=False)
    day_of_week = Column(String, nullable=False)
    is_home_game = Column(Boolean, nullable=False)
    weather_condition = Column(String, nullable=False)
    total_revenue = Column(Float, nullable=False)

# Create an engine and create the table
engine = create_engine('postgresql://username:password@localhost/dbname')
Base.metadata.create_all(engine)
```
```python
def generate_fake_data(start_date, end_date, num_teams=10, batch_size=10000):
    date_range = pd.date_range(start=start_date, end=end_date)
    teams = [f"Team {chr(65 + i)}" for i in range(num_teams)]
    weather_conditions = ['Sunny', 'Rainy', 'Cloudy', 'Snowy']
    total_days = len(date_range)

    # Yield one DataFrame per batch of dates instead of building everything in memory
    for i in range(0, total_days, batch_size):
        batch_dates = date_range[i:i+batch_size]
        data = []
        for date in batch_dates:
            if random.random() < 0.3:  # Simulate games occurring on 30% of days
                record = {
                    'date': date,
                    'tickets_sold': np.random.randint(5000, 50000),
                    'average_ticket_price': round(np.random.uniform(50, 200), 2),
                    'opponent_team': random.choice(teams),
                    'day_of_week': date.strftime('%A'),
                    'is_home_game': random.choice([True, False]),
                    'weather_condition': random.choice(weather_conditions)
                }
                record['total_revenue'] = record['tickets_sold'] * record['average_ticket_price']
                data.append(record)
        yield pd.DataFrame(data)

def insert_data_in_batches(df_generator, batch_size=1000):
    total_inserted = 0
    with engine.connect() as connection:
        for df_batch in tqdm(df_generator, desc="Inserting batches"):
            data = df_batch.to_dict(orient='records')
            for i in range(0, len(data), batch_size):
                batch = data[i:i+batch_size]
                stmt = insert(FootballTicketSales).values(batch)
                connection.execute(stmt)
                connection.commit()  # commit after each batch to keep transactions small
            total_inserted += len(data)
    print(f"Total records inserted: {total_inserted}")

# Generate and insert fake data
start_date = datetime(2020, 1, 1)
end_date = datetime(2023, 12, 31)
df_generator = generate_fake_data(start_date, end_date, num_teams=20, batch_size=10000)

try:
    insert_data_in_batches(df_generator)
    print("Data inserted successfully")
except SQLAlchemyError as e:
    print(f"Error inserting data: {str(e)}")
```
```python
# The analysis queries are the same as before, but now run on a plain Connection
def perform_analysis():
    with engine.connect() as connection:
        # Total revenue by year
        yearly_revenue = connection.execute(text("""
            SELECT EXTRACT(YEAR FROM date) as year, SUM(total_revenue) as total_revenue
            FROM football_ticket_sales
            GROUP BY year
            ORDER BY year
        """)).fetchall()

        # Average ticket price by opponent team
        avg_price_by_team = connection.execute(text("""
            SELECT opponent_team, AVG(average_ticket_price) as avg_price
            FROM football_ticket_sales
            GROUP BY opponent_team
            ORDER BY avg_price DESC
        """)).fetchall()

        # Ticket sales by day of week
        sales_by_day = connection.execute(text("""
            SELECT day_of_week, AVG(tickets_sold) as avg_tickets_sold
            FROM football_ticket_sales
            GROUP BY day_of_week
            ORDER BY avg_tickets_sold DESC
        """)).fetchall()

        # Home vs away game comparison
        home_away_comparison = connection.execute(text("""
            SELECT is_home_game, AVG(tickets_sold) as avg_tickets_sold, AVG(total_revenue) as avg_revenue
            FROM football_ticket_sales
            GROUP BY is_home_game
        """)).fetchall()

        # Weather impact on ticket sales
        weather_impact = connection.execute(text("""
            SELECT weather_condition, AVG(tickets_sold) as avg_tickets_sold
            FROM football_ticket_sales
            GROUP BY weather_condition
            ORDER BY avg_tickets_sold DESC
        """)).fetchall()

    return {
        'yearly_revenue': yearly_revenue,
        'avg_price_by_team': avg_price_by_team,
        'sales_by_day': sales_by_day,
        'home_away_comparison': home_away_comparison,
        'weather_impact': weather_impact
    }

# Perform analysis
analysis_results = perform_analysis()

# Print results
for key, value in analysis_results.items():
    print(f"\n{key.replace('_', ' ').title()}:")
    for row in value:
        print(row)
```
Key changes and optimizations in this updated script:
- We’ve modified the `generate_fake_data` function to yield batches of data instead of generating all data at once. This helps manage memory usage for large datasets.
- We’ve introduced a new `insert_data_in_batches` function that efficiently inserts data in batches using SQLAlchemy’s `insert()` method.
- We’re using `engine.connect()` instead of creating a Session, which is more efficient for bulk inserts.
- We’ve added a progress bar using `tqdm` to show the progress of data insertion.
- The script now commits after each batch insertion, which helps manage transaction size and reduces the risk of transaction timeouts.
- We’ve increased the number of teams (`num_teams=20`) to generate a larger dataset.
- The analysis function now uses `engine.connect()` instead of creating a Session, which is more appropriate for read-only operations.

This updated script should be able to handle much larger datasets more efficiently. It generates data in batches and inserts them in smaller chunks, which helps manage memory usage and improves overall performance.
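As an alternative to the hand-rolled batch insert, pandas can write each generated DataFrame straight to the table. This is a sketch reusing `generate_fake_data` and `engine` from the script above; for very large one-off loads, PostgreSQL’s COPY command is usually faster still.

```python
# Alternative sketch: let pandas issue the batched INSERTs for each generated DataFrame.
# Reuses generate_fake_data(), engine, start_date and end_date from the script above.
for df_batch in generate_fake_data(start_date, end_date, num_teams=20, batch_size=10000):
    df_batch.to_sql(
        'football_ticket_sales',
        engine,
        if_exists='append',   # keep the existing table and append new rows
        index=False,          # don't write the DataFrame index as a column
        method='multi',       # multi-row INSERT statements
        chunksize=1000,       # rows per statement
    )
```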
To use this with Amazon QuickSight:
- After running this script, your PostgreSQL database will be populated with a large dataset of football ticket sales.
- In the Amazon QuickSight console, you can connect to your PostgreSQL database as described in the previous response.
- When creating your analysis in QuickSight, you may want to consider using “Direct Query” mode instead of SPICE, especially if your dataset is very large. This allows QuickSight to query your PostgreSQL database directly, which can be beneficial for large, frequently updated datasets (see the sketch after this list).
- You can then create the visualizations and dashboards as described earlier, leveraging QuickSight’s ability to handle large datasets efficiently.
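For reference, the Direct Query vs SPICE choice comes down to the ImportMode if you create the dataset through the API. A hedged boto3 sketch, where the ARN, IDs and account number are placeholders and the columns mirror the football_ticket_sales table:

```python
# Hypothetical sketch: create a QuickSight dataset over football_ticket_sales.
# ImportMode switches between 'DIRECT_QUERY' and 'SPICE'; IDs/ARNs are placeholders.
import boto3

quicksight = boto3.client('quicksight', region_name='us-east-1')

quicksight.create_data_set(
    AwsAccountId='123456789012',
    DataSetId='football-ticket-sales',
    Name='Football Ticket Sales',
    ImportMode='DIRECT_QUERY',   # use 'SPICE' to import the data instead
    PhysicalTableMap={
        'ticket-sales-table': {
            'RelationalTable': {
                'DataSourceArn': 'arn:aws:quicksight:us-east-1:123456789012:datasource/football-ticket-sales-pg',
                'Schema': 'public',
                'Name': 'football_ticket_sales',
                'InputColumns': [
                    {'Name': 'id', 'Type': 'INTEGER'},
                    {'Name': 'date', 'Type': 'DATETIME'},
                    {'Name': 'tickets_sold', 'Type': 'INTEGER'},
                    {'Name': 'average_ticket_price', 'Type': 'DECIMAL'},
                    {'Name': 'opponent_team', 'Type': 'STRING'},
                    {'Name': 'day_of_week', 'Type': 'STRING'},
                    {'Name': 'is_home_game', 'Type': 'BOOLEAN'},
                    {'Name': 'weather_condition', 'Type': 'STRING'},
                    {'Name': 'total_revenue', 'Type': 'DECIMAL'},
                ],
            }
        }
    },
)
```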
Remember to monitor your database performance and adjust your PostgreSQL configuration if needed when working with very large datasets. You might need to tune parameters like work_mem, maintenance_work_mem, and max_connections for optimal performance with large data volumes.
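The right values depend on your instance size and workload, so rather than suggesting numbers, here is a small sketch for checking the current settings through the same SQLAlchemy engine:

```python
# Inspect the PostgreSQL settings mentioned above before deciding whether to tune them.
with engine.connect() as connection:
    settings = connection.execute(text("""
        SELECT name, setting, unit
        FROM pg_settings
        WHERE name IN ('work_mem', 'maintenance_work_mem', 'max_connections')
    """)).fetchall()

for name, setting, unit in settings:
    print(f"{name}: {setting} {unit or ''}")
```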