Introduction
In the race to adopt Generative AI, developers often face a trade-off between latency/cost and intelligence. Pure cloud solutions (sending everything to OpenAI or AWS Bedrock) can become expensive and slow due to network latency. Pure local solutions (running LLMs on a CPU) lack the reasoning power of frontier models.
This article details the architecture of a Hybrid Agentic System that solves this problem. We built a predictive dashboard using Streamlit that orchestrates three specialized agents. It utilizes local Hugging Face embeddings for speed and zero cost, paired with AWS Bedrock (Claude 3 Sonnet) for high-level synthesis, all deployed on AWS Lambda via Docker containers.
The Architecture: “The Supervisor Pattern”
Instead of a single monolithic LLM prompt, we employed a Supervisor-Worker multi-agent architecture.
- The Supervisor (Orchestrator): A Python class that manages state, handles errors, and dictates the sequence of execution.
- Agent 1: The Forecaster (Deterministic): A specialized Python function that connects to an ML endpoint (or runs a simulation) to predict revenue based on market segments.
- Agent 2: The Accountant (Deterministic): A logic-gate agent that calculates ROI. Crucially, it has “Veto Power”—if ROI is negative, it halts the automated flow and triggers a “Human-in-the-Loop” requirement.
- Agent 3: The Synthesizer (Probabilistic): An LLM-backed agent that ingests the raw data from Agents 1 & 2, retrieves strategic context from a Vector Database, and writes an executive summary.
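A minimal sketch of this control flow, reusing the project's tool names but with simplified stand-in bodies (only the names mirror the real project):
Python
# Illustrative Supervisor pattern: deterministic agents run first, the LLM runs last.
# Bodies below are placeholders; only the tool names come from the article.

def run_pclo_forecast(segments: dict) -> float:
    """Agent 1 (deterministic): predicts revenue from market segments."""
    return sum(size * 0.12 for size in segments.values())  # stand-in for the ML endpoint

def calculate_roi(revenue: float, investment: float) -> float:
    """Agent 2 (deterministic): simple ROI calculation."""
    return (revenue - investment) / investment

def synthesize_narrative(revenue: float, roi: float) -> str:
    """Agent 3 (probabilistic): in the real app this calls Bedrock + RAG."""
    return f"Projected revenue {revenue:,.0f} at an ROI of {roi:.1%}."

def supervisor(segments: dict, investment: float) -> dict:
    """Orchestrator: fixed execution order, with Agent 2 holding veto power."""
    revenue = run_pclo_forecast(segments)
    roi = calculate_roi(revenue, investment)
    if roi < 0:
        # Veto: negative ROI halts automation and requires a human decision.
        return {"status": "HUMAN_REVIEW_REQUIRED", "revenue": revenue, "roi": roi}
    return {"status": "APPROVED", "roi": roi, "summary": synthesize_narrative(revenue, roi)}

print(supervisor({"enterprise": 500_000, "smb": 250_000}, investment=60_000))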
The Hybrid RAG Stack
We encountered a common hurdle: AWS Bedrock’s strict region-based model access (specifically the ValidationException for Titan Embeddings in London/eu-west-2).
The Solution: A Hybrid Approach.
- Vector Store: ChromaDB (running locally in-memory/disk).
- Embeddings: Hugging Face (BAAI/bge-small-en-v1.5). This runs entirely on the application’s CPU. It creates vectors without making a single API call, reducing latency and eliminating AWS permission headaches.
- LLM: AWS Bedrock (Claude 3 Sonnet). We reserve the API calls for where they matter most: complex reasoning and text generation.
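A minimal sketch of how these pieces might be wired together. The Dockerfile below imports LlamaIndex’s HuggingFaceEmbedding, so we assume the LlamaIndex integrations here; the Bedrock model ID, region, collection name, and corpus path are illustrative:
Python
# Hybrid wiring sketch: local embeddings + Bedrock LLM + local Chroma store.
import chromadb
from llama_index.core import Settings, SimpleDirectoryReader, StorageContext, VectorStoreIndex
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.llms.bedrock import Bedrock
from llama_index.vector_stores.chroma import ChromaVectorStore

# Embeddings run on the local CPU: no API call, no per-token cost.
Settings.embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")

# Reasoning and synthesis are delegated to Bedrock (Claude 3 Sonnet).
Settings.llm = Bedrock(
    model="anthropic.claude-3-sonnet-20240229-v1:0",
    region_name="eu-west-2",
)

# Vectors live in a local (or /tmp) Chroma store -- see the Lambda notes below.
client = chromadb.PersistentClient(path="./chroma_db_local")
collection = client.get_or_create_collection("strategy_docs")
vector_store = ChromaVectorStore(chroma_collection=collection)
storage_context = StorageContext.from_defaults(vector_store=vector_store)

# Index the strategic documents; retrieval stays local, generation goes to Claude.
docs = SimpleDirectoryReader("./strategy_docs").load_data()
index = VectorStoreIndex.from_documents(docs, storage_context=storage_context)
query_engine = index.as_query_engine()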
Deployment Guide: Serverless Containerization
Deploying a stateful RAG app on AWS Lambda (which is stateless, with a read-only filesystem outside /tmp) requires specific adaptations.
1. The Challenge: Read-Only Filesystem
Lambda only allows write access to /tmp. Standard ChromaDB and Hugging Face implementations try to write to the user’s home directory.
- Fix: We introduced environment detection logic.
Python
import os

# On Lambda, only /tmp is writable; locally, use the project folder.
IS_LAMBDA = os.getenv('LAMBDA_TASK_ROOT') is not None
CHROMA_PATH = "/tmp/chroma_db_local" if IS_LAMBDA else "./chroma_db_local"
2. The Solution: AWS Lambda Web Adapter
We use the AWS Lambda Web Adapter, a tool that allows standard web apps (Flask, Streamlit, FastAPI) to run on Lambda without changing the application code to handle Lambda Events.
3. The Dockerfile
This Dockerfile is critical: it pre-downloads the embedding model during the build phase, so the Lambda function doesn’t time out fetching 130MB of model weights on every cold start.
Dockerfile
FROM python:3.11-slim
# 1. Install AWS Lambda Web Adapter
COPY --from=public.ecr.aws/awsguru/aws-lambda-adapter:0.8.1 /lambda-adapter /opt/extensions/lambda-adapter
WORKDIR /app
COPY requirements.txt .
RUN pip install -r requirements.txt
# 2. PRE-BAKE THE MODEL (Critical Optimization)
# We run a script to download the HF model into the image
RUN python -c "from llama_index.embeddings.huggingface import HuggingFaceEmbedding; \
HuggingFaceEmbedding(model_name='BAAI/bge-small-en-v1.5', cache_folder='/app/model_cache')"
COPY . .
# 3. Streamlit Configuration
ENV PORT=8501
ENV AWS_LWA_INVOKE_MODE=response_stream
CMD ["streamlit", "run", "aws_app.py", "--server.port=8501", "--server.address=0.0.0.0"]
4. AWS Deployment Steps
- Build & Push: Build the Docker image and push it to Amazon Elastic Container Registry (ECR).
- Lambda Creation: Create a Lambda function using the Container Image option.
- Permissions: Attach an IAM policy allowing bedrock:InvokeModel.
- Configuration: Set Memory to 2048MB (required for vector operations) and Timeout to 3 minutes.
- Access: Enable “Function URL” for a public HTTPS endpoint.
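These console steps can also be scripted. A minimal boto3 sketch, assuming placeholder names, a pre-created execution role, and an image already pushed to ECR (none of these identifiers come from the project):
Python
# Illustrative boto3 equivalent of the console steps (all names/ARNs are placeholders).
import boto3

lambda_client = boto3.client("lambda", region_name="eu-west-2")

# Create the function from the container image in ECR.
lambda_client.create_function(
    FunctionName="hybrid-rag-dashboard",
    PackageType="Image",
    Code={"ImageUri": "123456789012.dkr.ecr.eu-west-2.amazonaws.com/hybrid-rag:latest"},
    Role="arn:aws:iam::123456789012:role/hybrid-rag-lambda-role",  # must allow bedrock:InvokeModel
    MemorySize=2048,  # headroom for vector operations
    Timeout=180,      # 3 minutes
)

# Public HTTPS endpoint; RESPONSE_STREAM matches AWS_LWA_INVOKE_MODE in the Dockerfile.
lambda_client.create_function_url_config(
    FunctionName="hybrid-rag-dashboard",
    AuthType="NONE",
    InvokeMode="RESPONSE_STREAM",
)
lambda_client.add_permission(
    FunctionName="hybrid-rag-dashboard",
    StatementId="public-function-url",
    Action="lambda:InvokeFunctionUrl",
    Principal="*",
    FunctionUrlAuthType="NONE",
)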
The Separation of Concerns Design Pattern
Here is the breakdown of how the responsibilities are divided:
1. aws_tools.py (The “Backend” / The Brains)
This file contains business logic and AI integration. It knows how to do things, but it doesn’t know about the User Interface.
- LLM Configuration: It sets up the connection to AWS Bedrock (Claude).
- Embedding Logic: It downloads and runs the Hugging Face model.
- Agent Functions: It defines the “Tools” (run_pclo_forecast, calculate_roi, synthesize_narrative).
- Database Management: It handles writing to and reading from ChromaDB.
Benefit: If you ever wanted to switch from Streamlit to a FastAPI backend or a CLI tool, you could keep this file exactly as it is.
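For illustration, here is the shape of a Bedrock call that would live in this file. The project may call Claude through LlamaIndex rather than boto3 directly; this sketch uses the boto3 Converse API, and the function signature, prompt, and region are assumptions:
Python
# Sketch of an Agent 3 call to Claude 3 Sonnet via the Bedrock Converse API.
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="eu-west-2")

def synthesize_narrative(revenue: float, roi: float, context: str) -> str:
    """Turns raw agent outputs plus retrieved context into an executive summary."""
    response = bedrock.converse(
        modelId="anthropic.claude-3-sonnet-20240229-v1:0",
        messages=[{
            "role": "user",
            "content": [{"text": (
                "Write a short executive summary for leadership.\n"
                f"Forecast revenue: {revenue:,.0f}\nROI: {roi:.1%}\nStrategic context:\n{context}"
            )}],
        }],
        inferenceConfig={"maxTokens": 800, "temperature": 0.2},
    )
    return response["output"]["message"]["content"][0]["text"]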
2. aws_app.py (The “Frontend” / The Orchestrator)
This file handles User Interaction and Workflow. It knows when to do things.
- UI Elements: Buttons, Sliders, Tabs, Metrics, JSON display.
- State Management: Remembering if a file is uploaded or if a button was clicked (st.session_state).
- Orchestration: The SupervisorAgentSimulator class lives here. It decides the order of operations (First Forecast -> Then ROI -> Then Check Decision -> Then Synthesis).
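A hedged sketch of this layer: how the UI calls the tools and keeps results across Streamlit reruns. Widget labels and keys are illustrative, the imported signatures are the simplified stand-ins from the earlier sketch, and the real SupervisorAgentSimulator is more involved:
Python
# aws_app.py sketch: UI + state only; the heavy lifting is imported from aws_tools.
import streamlit as st
from aws_tools import run_pclo_forecast, calculate_roi, synthesize_narrative  # hypothetical signatures

investment = st.slider("Investment (£)", 10_000, 500_000, 60_000, step=10_000)

if st.button("Run agents"):
    revenue = run_pclo_forecast({"enterprise": 500_000, "smb": 250_000})
    roi = calculate_roi(revenue, investment)
    st.session_state["roi"] = roi  # survives the next Streamlit rerun
    if roi < 0:
        st.warning("Negative ROI -- human-in-the-loop review required.")
    else:
        st.session_state["summary"] = synthesize_narrative(revenue, roi)

if "summary" in st.session_state:
    st.metric("ROI", f"{st.session_state['roi']:.1%}")
    st.markdown(st.session_state["summary"])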
Summary
- aws_tools.py = The Workers (doing the calculations and AI generation).
- aws_app.py = The Manager (telling the workers when to start and showing their results to the user).
This makes your code much easier to debug and deploy!
Conclusion
By decoupling the embeddings (Local) from the reasoning (Cloud), we built a system that is robust, cost-effective, and easier to deploy. The Supervisor pattern ensures that the LLM is grounded in hard data, providing the reliability of code with the flexibility of AI.
[Figure: Hybrid RAG Architecture]
