Another article compiled with the help of Amazon Q, this time focusing on football ticket sales and trend analysis. The intention is to get some help understanding what dataset is required to build a quick dashboard using QuickSight.
Prompt: football ticket sales and trend analysis using fake data, Amazon Q, QuickSight, SQLAlchemy 2.0 and PostgreSQL
Response from Q:
Certainly! I’ll provide you with a comprehensive solution for football ticket sales and trend analysis using fake data, Amazon QuickSight for visualization, SQLAlchemy 2.0 for database operations, and PostgreSQL as the database. We’ll use Python to generate fake data and perform initial analysis, then use QuickSight for more in-depth visualizations.
Here’s a step-by-step approach:
1. First, let’s set up our environment and create our database model:
```python
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
from sqlalchemy import create_engine, Column, Integer, Float, Date, Boolean, String, text
from sqlalchemy.orm import declarative_base, Session
from sqlalchemy.exc import SQLAlchemyError
import random

# Create a base class for declarative models
Base = declarative_base()

# Define the FootballTicketSales model
class FootballTicketSales(Base):
    __tablename__ = 'football_ticket_sales'

    id = Column(Integer, primary_key=True)
    date = Column(Date, nullable=False)
    tickets_sold = Column(Integer, nullable=False)
    average_ticket_price = Column(Float, nullable=False)
    opponent_team = Column(String, nullable=False)
    day_of_week = Column(String, nullable=False)
    is_home_game = Column(Boolean, nullable=False)
    weather_condition = Column(String, nullable=False)
    total_revenue = Column(Float, nullable=False)

# Create an engine and create the table
engine = create_engine('postgresql://username:password@localhost/dbname')
Base.metadata.create_all(engine)
```
2. Now, let’s generate fake data and insert it into our database:
```python
def generate_fake_data(start_date, end_date, num_teams=10):
    date_range = pd.date_range(start=start_date, end=end_date)
    teams = [f"Team {chr(65 + i)}" for i in range(num_teams)]
    weather_conditions = ['Sunny', 'Rainy', 'Cloudy', 'Snowy']

    data = []
    for date in date_range:
        if random.random() < 0.3:  # Simulate games occurring on 30% of days
            record = {
                'date': date,
                'tickets_sold': np.random.randint(5000, 50000),
                'average_ticket_price': round(np.random.uniform(50, 200), 2),
                'opponent_team': random.choice(teams),
                'day_of_week': date.strftime('%A'),
                'is_home_game': random.choice([True, False]),
                'weather_condition': random.choice(weather_conditions)
            }
            record['total_revenue'] = record['tickets_sold'] * record['average_ticket_price']
            data.append(record)
    return pd.DataFrame(data)

# Generate fake data
start_date = datetime(2020, 1, 1)
end_date = datetime(2023, 12, 31)
df = generate_fake_data(start_date, end_date)

# Insert data into the database
with Session(engine) as session:
    try:
        for _, row in df.iterrows():
            ticket_sale = FootballTicketSales(**row.to_dict())
            session.add(ticket_sale)
        session.commit()
        print("Data inserted successfully")
    except SQLAlchemyError as e:
        session.rollback()
        print(f"Error inserting data: {str(e)}")
```
3. Now that we have data in our database, let’s perform some initial analysis:
```python
def perform_analysis():
    with Session(engine) as session:
        # Total revenue by year
        yearly_revenue = session.execute(text("""
            SELECT EXTRACT(YEAR FROM date) as year, SUM(total_revenue) as total_revenue
            FROM football_ticket_sales
            GROUP BY year
            ORDER BY year
        """)).fetchall()

        # Average ticket price by opponent team
        avg_price_by_team = session.execute(text("""
            SELECT opponent_team, AVG(average_ticket_price) as avg_price
            FROM football_ticket_sales
            GROUP BY opponent_team
            ORDER BY avg_price DESC
        """)).fetchall()

        # Ticket sales by day of week
        sales_by_day = session.execute(text("""
            SELECT day_of_week, AVG(tickets_sold) as avg_tickets_sold
            FROM football_ticket_sales
            GROUP BY day_of_week
            ORDER BY avg_tickets_sold DESC
        """)).fetchall()

        # Home vs away game comparison
        home_away_comparison = session.execute(text("""
            SELECT is_home_game, AVG(tickets_sold) as avg_tickets_sold, AVG(total_revenue) as avg_revenue
            FROM football_ticket_sales
            GROUP BY is_home_game
        """)).fetchall()

        # Weather impact on ticket sales
        weather_impact = session.execute(text("""
            SELECT weather_condition, AVG(tickets_sold) as avg_tickets_sold
            FROM football_ticket_sales
            GROUP BY weather_condition
            ORDER BY avg_tickets_sold DESC
        """)).fetchall()

    return {
        'yearly_revenue': yearly_revenue,
        'avg_price_by_team': avg_price_by_team,
        'sales_by_day': sales_by_day,
        'home_away_comparison': home_away_comparison,
        'weather_impact': weather_impact
    }

# Perform analysis
analysis_results = perform_analysis()

# Print results
for key, value in analysis_results.items():
    print(f"\n{key.replace('_', ' ').title()}:")
    for row in value:
        print(row)
```
4. To use Amazon QuickSight for visualization, you’ll need to make your PostgreSQL database accessible to QuickSight. Here are the general steps:
   a. Ensure your PostgreSQL database is hosted on Amazon RDS or is otherwise accessible from AWS.
   b. Set up the necessary security groups and network access to allow QuickSight to connect to your database.
   c. In the QuickSight console, create a new data source and choose PostgreSQL.
   d. Enter your database connection details and test the connection.
   e. Once connected, you can create a new analysis based on your football_ticket_sales table.
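If you would rather script steps (c) and (d) than click through the console, a rough boto3 sketch could look like the following. The account ID, region, IDs and connection details are all placeholders, and QuickSight still needs the network access described above:

```python
# Hypothetical sketch: register the PostgreSQL database as a QuickSight data source.
# All IDs, the region and the connection details below are placeholders.
import boto3

quicksight = boto3.client('quicksight', region_name='us-east-1')

response = quicksight.create_data_source(
    AwsAccountId='123456789012',               # placeholder account ID
    DataSourceId='football-ticket-sales-pg',   # arbitrary unique ID
    Name='Football Ticket Sales (PostgreSQL)',
    Type='POSTGRESQL',
    DataSourceParameters={
        'PostgreSqlParameters': {
            'Host': 'your-rds-endpoint.amazonaws.com',
            'Port': 5432,
            'Database': 'dbname',
        }
    },
    Credentials={
        'CredentialPair': {
            'Username': 'username',
            'Password': 'password',
        }
    },
)
print(response['CreationStatus'])
```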
5. In QuickSight, you can create various visualizations based on the data:
- Line chart of ticket sales and revenue over time
- Bar chart of average ticket price by opponent team
- Pie chart of ticket sales by day of week
- Comparison bar chart of home vs away game performance
- Stacked bar chart of weather impact on ticket sales
- Heatmap of ticket sales by month and year (a query sketch for this follows the list)
- Scatter plot of tickets sold vs average ticket price
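Most of these visuals can be built by dragging fields directly onto a sheet, but the heatmap needs a month-by-year breakdown. A minimal sketch, assuming the SQLAlchemy engine defined earlier, that pulls (and lets you cross-check) that breakdown with pandas; the same SELECT could also serve as a custom SQL dataset in QuickSight:

```python
# Aggregate ticket sales by month and year, then pivot into a heatmap-style table
# (rows = month, columns = year). Reuses the SQLAlchemy engine defined earlier.
with engine.connect() as connection:
    monthly_sales = pd.read_sql(
        text("""
            SELECT EXTRACT(YEAR FROM date)::int AS year,
                   EXTRACT(MONTH FROM date)::int AS month,
                   SUM(tickets_sold) AS tickets_sold
            FROM football_ticket_sales
            GROUP BY year, month
            ORDER BY year, month
        """),
        connection,
    )

heatmap_table = monthly_sales.pivot(index='month', columns='year', values='tickets_sold')
print(heatmap_table)
```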
6. To create more advanced analyses in QuickSight:
- Use the “Forecast” feature to predict future ticket sales
- Create calculated fields for metrics like “Revenue per Ticket” (see the sketch after this list)
- Set up anomaly detection to automatically identify unusual sales patterns
- Create a dashboard combining multiple visualizations for a comprehensive view of your ticket sales data
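For example, a “Revenue per Ticket” calculated field can be expressed in QuickSight roughly as sum(total_revenue) / sum(tickets_sold) (adjust to your actual field names). A small sketch for cross-checking that number against the database, reusing the engine from the script above:

```python
# Compute revenue per ticket directly in PostgreSQL so the QuickSight
# calculated field can be validated against a known value.
with engine.connect() as connection:
    revenue_per_ticket = connection.execute(text("""
        SELECT SUM(total_revenue) / SUM(tickets_sold) AS revenue_per_ticket
        FROM football_ticket_sales
    """)).scalar()

print(f"Revenue per ticket: {revenue_per_ticket:.2f}")
```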
Remember to regularly refresh your QuickSight datasets to ensure you’re working with the most up-to-date data from your PostgreSQL database.
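If the dataset is imported into SPICE, that refresh can also be triggered programmatically; here is a hedged boto3 sketch, with placeholder account and dataset IDs:

```python
# Hypothetical sketch: kick off a SPICE refresh for a QuickSight dataset.
# Not needed for Direct Query datasets; the IDs below are placeholders.
import uuid
import boto3

quicksight = boto3.client('quicksight', region_name='us-east-1')

quicksight.create_ingestion(
    AwsAccountId='123456789012',                 # placeholder account ID
    DataSetId='football-ticket-sales-dataset',   # placeholder dataset ID
    IngestionId=f'refresh-{uuid.uuid4()}',       # unique ID for this refresh run
)
```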
This solution provides a comprehensive approach to analyzing football ticket sales data, from data generation and storage to analysis and visualization. The combination of SQLAlchemy for database operations and QuickSight for visualization allows for powerful, flexible analysis of your ticket sales trends.
Follow-up question
update python script to insert large datasets
Certainly! To efficiently insert large datasets into your PostgreSQL database using SQLAlchemy 2.0, we’ll make some modifications to optimize performance. We’ll use batch inserts and the more efficient execute() method instead of individual ORM object creation. Here’s an updated version of the script that can handle large datasets more efficiently:
```python
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
from sqlalchemy import create_engine, Column, Integer, Float, Date, Boolean, String, text, insert
from sqlalchemy.orm import declarative_base, Session
from sqlalchemy.exc import SQLAlchemyError
import random
from tqdm import tqdm

# Create a base class for declarative models
Base = declarative_base()

# Define the FootballTicketSales model
class FootballTicketSales(Base):
    __tablename__ = 'football_ticket_sales'

    id = Column(Integer, primary_key=True)
    date = Column(Date, nullable=False)
    tickets_sold = Column(Integer, nullable=False)
    average_ticket_price = Column(Float, nullable=False)
    opponent_team = Column(String, nullable=False)
    day_of_week = Column(String, nullable=False)
    is_home_game = Column(Boolean, nullable=False)
    weather_condition = Column(String, nullable=False)
    total_revenue = Column(Float, nullable=False)

# Create an engine and create the table
engine = create_engine('postgresql://username:password@localhost/dbname')
Base.metadata.create_all(engine)
```
```python
def generate_fake_data(start_date, end_date, num_teams=10, batch_size=10000):
    date_range = pd.date_range(start=start_date, end=end_date)
    teams = [f"Team {chr(65 + i)}" for i in range(num_teams)]
    weather_conditions = ['Sunny', 'Rainy', 'Cloudy', 'Snowy']
    total_days = len(date_range)

    # Yield one DataFrame per batch of dates instead of building everything in memory
    for i in range(0, total_days, batch_size):
        batch_dates = date_range[i:i+batch_size]
        data = []
        for date in batch_dates:
            if random.random() < 0.3:  # Simulate games occurring on 30% of days
                record = {
                    'date': date,
                    'tickets_sold': np.random.randint(5000, 50000),
                    'average_ticket_price': round(np.random.uniform(50, 200), 2),
                    'opponent_team': random.choice(teams),
                    'day_of_week': date.strftime('%A'),
                    'is_home_game': random.choice([True, False]),
                    'weather_condition': random.choice(weather_conditions)
                }
                record['total_revenue'] = record['tickets_sold'] * record['average_ticket_price']
                data.append(record)
        yield pd.DataFrame(data)

def insert_data_in_batches(df_generator, batch_size=1000):
    total_inserted = 0
    with engine.connect() as connection:
        for df_batch in tqdm(df_generator, desc="Inserting batches"):
            data = df_batch.to_dict(orient='records')
            for i in range(0, len(data), batch_size):
                batch = data[i:i+batch_size]
                stmt = insert(FootballTicketSales).values(batch)
                connection.execute(stmt)
                connection.commit()  # commit after each batch to keep transactions small
            total_inserted += len(data)
    print(f"Total records inserted: {total_inserted}")

# Generate and insert fake data
start_date = datetime(2020, 1, 1)
end_date = datetime(2023, 12, 31)
df_generator = generate_fake_data(start_date, end_date, num_teams=20, batch_size=10000)

try:
    insert_data_in_batches(df_generator)
    print("Data inserted successfully")
except SQLAlchemyError as e:
    print(f"Error inserting data: {str(e)}")
```
```python
# The analysis queries are the same as before, but now run on a plain Connection
def perform_analysis():
    with engine.connect() as connection:
        # Total revenue by year
        yearly_revenue = connection.execute(text("""
            SELECT EXTRACT(YEAR FROM date) as year, SUM(total_revenue) as total_revenue
            FROM football_ticket_sales
            GROUP BY year
            ORDER BY year
        """)).fetchall()

        # Average ticket price by opponent team
        avg_price_by_team = connection.execute(text("""
            SELECT opponent_team, AVG(average_ticket_price) as avg_price
            FROM football_ticket_sales
            GROUP BY opponent_team
            ORDER BY avg_price DESC
        """)).fetchall()

        # Ticket sales by day of week
        sales_by_day = connection.execute(text("""
            SELECT day_of_week, AVG(tickets_sold) as avg_tickets_sold
            FROM football_ticket_sales
            GROUP BY day_of_week
            ORDER BY avg_tickets_sold DESC
        """)).fetchall()

        # Home vs away game comparison
        home_away_comparison = connection.execute(text("""
            SELECT is_home_game, AVG(tickets_sold) as avg_tickets_sold, AVG(total_revenue) as avg_revenue
            FROM football_ticket_sales
            GROUP BY is_home_game
        """)).fetchall()

        # Weather impact on ticket sales
        weather_impact = connection.execute(text("""
            SELECT weather_condition, AVG(tickets_sold) as avg_tickets_sold
            FROM football_ticket_sales
            GROUP BY weather_condition
            ORDER BY avg_tickets_sold DESC
        """)).fetchall()

    return {
        'yearly_revenue': yearly_revenue,
        'avg_price_by_team': avg_price_by_team,
        'sales_by_day': sales_by_day,
        'home_away_comparison': home_away_comparison,
        'weather_impact': weather_impact
    }

# Perform analysis
analysis_results = perform_analysis()

# Print results
for key, value in analysis_results.items():
    print(f"\n{key.replace('_', ' ').title()}:")
    for row in value:
        print(row)
```
Key changes and optimizations in this updated script:
- We’ve modified the `generate_fake_data` function to yield batches of data instead of generating all data at once. This helps manage memory usage for large datasets.
- We’ve introduced a new `insert_data_in_batches` function that efficiently inserts data in batches using SQLAlchemy’s `insert()` method.
- We’re using `engine.connect()` instead of creating a Session, which is more efficient for bulk inserts.
- We’ve added a progress bar using `tqdm` to show the progress of data insertion.
- The script now commits after each batch insertion, which helps manage transaction size and reduces the risk of transaction timeouts.
- We’ve increased the number of teams (`num_teams=20`) to generate a larger dataset.
- The analysis function now uses `engine.connect()` instead of creating a Session, which is more appropriate for read-only operations.

This updated script should be able to handle much larger datasets more efficiently. It generates data in batches and inserts them in smaller chunks, which helps manage memory usage and improves overall performance.
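As an alternative to the hand-rolled batch insert, pandas can write each generated DataFrame straight to the table. This is a sketch reusing `generate_fake_data` and `engine` from the script above; for very large one-off loads, PostgreSQL’s COPY command is usually faster still.

```python
# Alternative sketch: let pandas issue the batched INSERTs for each generated DataFrame.
# Reuses generate_fake_data(), engine, start_date and end_date from the script above.
for df_batch in generate_fake_data(start_date, end_date, num_teams=20, batch_size=10000):
    df_batch.to_sql(
        'football_ticket_sales',
        engine,
        if_exists='append',   # keep the existing table and append new rows
        index=False,          # don't write the DataFrame index as a column
        method='multi',       # multi-row INSERT statements
        chunksize=1000,       # rows per statement
    )
```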
To use this with Amazon QuickSight:
- After running this script, your PostgreSQL database will be populated with a large dataset of football ticket sales.
- In the Amazon QuickSight console, you can connect to your PostgreSQL database as described in the previous response.
- When creating your analysis in QuickSight, you may want to consider using “Direct Query” mode instead of SPICE, especially if your dataset is very large. This allows QuickSight to query your PostgreSQL database directly, which can be beneficial for large, frequently updated datasets (see the sketch after this list).
- You can then create the visualizations and dashboards as described earlier, leveraging QuickSight’s ability to handle large datasets efficiently.
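For reference, the Direct Query vs SPICE choice comes down to the ImportMode if you create the dataset through the API. A hedged boto3 sketch, where the ARN, IDs and account number are placeholders and the columns mirror the football_ticket_sales table:

```python
# Hypothetical sketch: create a QuickSight dataset over football_ticket_sales.
# ImportMode switches between 'DIRECT_QUERY' and 'SPICE'; IDs/ARNs are placeholders.
import boto3

quicksight = boto3.client('quicksight', region_name='us-east-1')

quicksight.create_data_set(
    AwsAccountId='123456789012',
    DataSetId='football-ticket-sales',
    Name='Football Ticket Sales',
    ImportMode='DIRECT_QUERY',   # use 'SPICE' to import the data instead
    PhysicalTableMap={
        'ticket-sales-table': {
            'RelationalTable': {
                'DataSourceArn': 'arn:aws:quicksight:us-east-1:123456789012:datasource/football-ticket-sales-pg',
                'Schema': 'public',
                'Name': 'football_ticket_sales',
                'InputColumns': [
                    {'Name': 'id', 'Type': 'INTEGER'},
                    {'Name': 'date', 'Type': 'DATETIME'},
                    {'Name': 'tickets_sold', 'Type': 'INTEGER'},
                    {'Name': 'average_ticket_price', 'Type': 'DECIMAL'},
                    {'Name': 'opponent_team', 'Type': 'STRING'},
                    {'Name': 'day_of_week', 'Type': 'STRING'},
                    {'Name': 'is_home_game', 'Type': 'BOOLEAN'},
                    {'Name': 'weather_condition', 'Type': 'STRING'},
                    {'Name': 'total_revenue', 'Type': 'DECIMAL'},
                ],
            }
        }
    },
)
```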
Remember to monitor your database performance and adjust your PostgreSQL configuration if needed when working with very large datasets. You might need to tune parameters like work_mem, maintenance_work_mem, and max_connections for optimal performance with large data volumes.
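The right values depend on your instance size and workload, so rather than suggesting numbers, here is a small sketch for checking the current settings through the same SQLAlchemy engine:

```python
# Inspect the PostgreSQL settings mentioned above before deciding whether to tune them.
with engine.connect() as connection:
    settings = connection.execute(text("""
        SELECT name, setting, unit
        FROM pg_settings
        WHERE name IN ('work_mem', 'maintenance_work_mem', 'max_connections')
    """)).fetchall()

for name, setting, unit in settings:
    print(f"{name}: {setting} {unit or ''}")
```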