Building a Hybrid Agentic Dashboard: Local RAG + AWS Bedrock on Serverless Lambda

Introduction

In the race to adopt Generative AI, developers often face a trade-off between latency and cost on one side and intelligence on the other. Pure cloud solutions (sending every request to OpenAI or AWS Bedrock) become expensive and slow due to network latency and per-call pricing. Pure local solutions (running LLMs on a CPU) lack the reasoning power of frontier models.

This article details the architecture of a Hybrid Agentic System that resolves this trade-off. We built a predictive dashboard in Streamlit that orchestrates three specialized agents. It uses local Hugging Face embeddings for low latency and zero per-call cost, paired with AWS Bedrock (Claude 3 Sonnet) for high-level synthesis, all deployed on AWS Lambda via a Docker container.

The Architecture: “The Supervisor Pattern”

Instead of a single monolithic LLM prompt, we employed a Supervisor-Worker multi-agent architecture; a minimal code sketch of the flow follows the list below.

  1. The Supervisor (Orchestrator): A Python class that manages state, handles errors, and dictates the sequence of execution.
  2. Agent 1: The Forecaster (Deterministic): A specialized Python function that connects to an ML endpoint (or runs a simulation) to predict revenue based on market segments.
  3. Agent 2: The Accountant (Deterministic): A logic-gate agent that calculates ROI. Crucially, it has “Veto Power”—if ROI is negative, it halts the automated flow and triggers a “Human-in-the-Loop” requirement.
  4. Agent 3: The Synthesizer (Probabilistic): An LLM-backed agent that ingests the raw data from Agents 1 & 2, retrieves strategic context from a Vector Database, and writes an executive summary.
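
Here is that sketch, assuming the tool functions described later in aws_tools.py (run_pclo_forecast, calculate_roi, synthesize_narrative); the signatures and return shapes are illustrative rather than the exact production code.

Python

# Minimal sketch of the Supervisor-Worker flow; signatures are illustrative.
from aws_tools import run_pclo_forecast, calculate_roi, synthesize_narrative

class SupervisorAgentSimulator:
    """Manages state, dictates execution order, and enforces the ROI veto."""

    def run(self, market_segments: dict) -> dict:
        state = {"status": "running"}

        # Agent 1 (deterministic): revenue forecast
        state["forecast"] = run_pclo_forecast(market_segments)

        # Agent 2 (deterministic): ROI check with veto power
        state["roi"] = calculate_roi(state["forecast"])
        if state["roi"]["value"] < 0:
            state["status"] = "human_review_required"   # halt the automated flow
            return state

        # Agent 3 (probabilistic): LLM synthesis, grounded in the numbers above
        state["summary"] = synthesize_narrative(state["forecast"], state["roi"])
        state["status"] = "complete"
        return state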

The Hybrid RAG Stack

We encountered a common hurdle: AWS Bedrock’s strict region-based model access (specifically the ValidationException for Titan Embeddings in London/eu-west-2).

The Solution: A Hybrid Approach, wired together in the code sketch after the list below.

  • Vector Store: ChromaDB (running locally in-memory/disk).
  • Embeddings: Hugging Face (BAAI/bge-small-en-v1.5). This runs entirely on the application’s CPU. It creates vectors without making a single API call, reducing latency and eliminating AWS permission headaches.
  • LLM: AWS Bedrock (Claude 3 Sonnet). We reserve the API calls for where they matter most: complex reasoning and text generation.
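
The wiring sketch uses LlamaIndex (the library the Dockerfile below pre-bakes). The collection name, document folder, and query are placeholders; the Bedrock model ID shown is the standard identifier for Claude 3 Sonnet, and the region should be one where you have model access enabled.

Python

# Hybrid stack: local embeddings + local vector store + Bedrock for generation.
import chromadb
from llama_index.core import Settings, VectorStoreIndex, SimpleDirectoryReader, StorageContext
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.llms.bedrock import Bedrock
from llama_index.vector_stores.chroma import ChromaVectorStore

# Embeddings run on the local CPU -- no API call, no regional model-access headaches
Settings.embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")

# Reasoning and text generation stay on Bedrock (Claude 3 Sonnet)
Settings.llm = Bedrock(
    model="anthropic.claude-3-sonnet-20240229-v1:0",
    region_name="eu-west-2",  # any region where Claude 3 Sonnet access is enabled
)

# ChromaDB persists vectors locally (or under /tmp on Lambda, as shown later)
chroma_client = chromadb.PersistentClient(path="./chroma_db_local")
collection = chroma_client.get_or_create_collection("strategy_docs")
vector_store = ChromaVectorStore(chroma_collection=collection)
storage_context = StorageContext.from_defaults(vector_store=vector_store)

# Index the strategy documents and answer a question grounded in them
documents = SimpleDirectoryReader("./docs").load_data()
index = VectorStoreIndex.from_documents(documents, storage_context=storage_context)
answer = index.as_query_engine().query("Summarise our expansion strategy.")
print(answer)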

Deployment Guide: Serverless Containerization

Deploying stateful RAG apps on AWS Lambda (which is stateless and read-only) requires specific adaptations.

1. The Challenge: Read-Only Filesystem

Lambda only allows write access to /tmp. Standard ChromaDB and Hugging Face implementations try to write to the user’s home directory.

  • Fix: We introduced environment detection logic.

Python

import os

IS_LAMBDA = os.getenv('LAMBDA_TASK_ROOT') is not None

# If Lambda, save vectors to /tmp, otherwise use the local project folder
CHROMA_PATH = "/tmp/chroma_db_local" if IS_LAMBDA else "./chroma_db_local"

2. The Solution: AWS Lambda Web Adapter

We use the AWS Lambda Web Adapter, a tool that allows standard web apps (Flask, Streamlit, FastAPI) to run on Lambda without changing the application code to handle Lambda Events.

3. The Dockerfile

This build step is the critical optimization: it pre-downloads the embedding model during the image build, so the Lambda doesn't spend its cold start (and risk a timeout) fetching roughly 130 MB of model weights.

Dockerfile

FROM python:3.11-slim

# 1. Install AWS Lambda Web Adapter
COPY --from=public.ecr.aws/awsguru/aws-lambda-adapter:0.8.1 /lambda-adapter /opt/extensions/lambda-adapter

WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# 2. PRE-BAKE THE MODEL (Critical Optimization)
# We run a script to download the HF model into the image
RUN python -c "from llama_index.embeddings.huggingface import HuggingFaceEmbedding; \
    HuggingFaceEmbedding(model_name='BAAI/bge-small-en-v1.5', cache_folder='/app/model_cache')"

COPY . .

# 3. Streamlit Configuration
# The Web Adapter forwards requests to this port and streams responses back to the caller
ENV PORT=8501
ENV AWS_LWA_INVOKE_MODE=response_stream
CMD ["streamlit", "run", "aws_app.py", "--server.port=8501", "--server.address=0.0.0.0"]

4. AWS Deployment Steps

  1. Build & Push: Build the Docker image and push it to Amazon Elastic Container Registry (ECR).
  2. Lambda Creation: Create a Lambda function using the Container Image option.
  3. Permissions: Attach an IAM policy allowing bedrock:InvokeModel.
  4. Configuration: Set Memory to 2048MB (required for vector operations) and Timeout to 3 minutes.
  5. Access: Enable “Function URL” for a public HTTPS endpoint.
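
For those who prefer scripting these steps, here is a hedged boto3 sketch of steps 2, 4 and 5. The function name, role ARN, and image URI are placeholders, and the image is assumed to already be pushed to ECR.

Python

import boto3

lambda_client = boto3.client("lambda", region_name="eu-west-2")

# Steps 2 & 4: create the function from the container image with generous memory and timeout
lambda_client.create_function(
    FunctionName="hybrid-agentic-dashboard",   # placeholder name
    PackageType="Image",
    Code={"ImageUri": "<account-id>.dkr.ecr.eu-west-2.amazonaws.com/agentic-dashboard:latest"},
    Role="arn:aws:iam::<account-id>:role/lambda-bedrock-role",  # role allowing bedrock:InvokeModel
    MemorySize=2048,
    Timeout=180,
)

# Step 5: expose a public HTTPS endpoint via a Function URL
url = lambda_client.create_function_url_config(
    FunctionName="hybrid-agentic-dashboard",
    AuthType="NONE",
)["FunctionUrl"]

# With AuthType NONE the function also needs a public invoke permission
lambda_client.add_permission(
    FunctionName="hybrid-agentic-dashboard",
    StatementId="public-url-access",
    Action="lambda:InvokeFunctionUrl",
    Principal="*",
    FunctionUrlAuthType="NONE",
)
print(url)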

Design Pattern: Separation of Concerns

Here is the breakdown of how the responsibilities are divided:

1. aws_tools.py (The “Backend” / The Brains)

This file contains business logic and AI integration. It knows how to do things, but it doesn’t know about the User Interface.

  • LLM Configuration: It sets up the connection to AWS Bedrock (Claude).
  • Embedding Logic: It downloads and runs the Hugging Face model.
  • Agent Functions: It defines the “Tools” (run_pclo_forecast, calculate_roi, synthesize_narrative).
  • Database Management: It handles writing to and reading from ChromaDB.

Benefit: If you ever wanted to switch from Streamlit to a FastAPI backend or a CLI tool, you could keep this file exactly as it is.
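
A skeleton of aws_tools.py under these responsibilities; the function bodies are placeholder logic, and only the names and roles come from the description above.

Python

# aws_tools.py -- business logic and AI integration; no UI code lives here.
import boto3
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

bedrock = boto3.client("bedrock-runtime")                                  # LLM connection
embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")    # local embeddings

def run_pclo_forecast(segments: dict) -> dict:
    """Agent 1: deterministic revenue forecast per market segment (simulation placeholder)."""
    return {name: round(value * 1.1, 2) for name, value in segments.items()}

def calculate_roi(forecast: dict) -> dict:
    """Agent 2: deterministic ROI; a negative value triggers the veto in the supervisor."""
    investment = 100_000  # placeholder cost assumption
    revenue = sum(forecast.values())
    return {"value": (revenue - investment) / investment}

def synthesize_narrative(forecast: dict, roi: dict) -> str:
    """Agent 3: have Claude write an executive summary grounded in the computed numbers."""
    prompt = f"Forecast: {forecast}\nROI: {roi['value']:.2%}\nWrite a short executive summary."
    # Bedrock call omitted for brevity; see the LlamaIndex wiring sketch earlier.
    return prompt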

2. aws_app.py (The “Frontend” / The Orchestrator)

This file handles User Interaction and Workflow. It knows when to do things.

  • UI Elements: Buttons, Sliders, Tabs, Metrics, JSON display.
  • State Management: Remembering if a file is uploaded or if a button was clicked (st.session_state).
  • Orchestration: The SupervisorAgentSimulator class lives here. It decides the order of operations (Forecast -> ROI -> Decision Check -> Synthesis), as sketched below.
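
A corresponding sketch of the Streamlit side; the widget labels and numbers are placeholders, and in the full app the three agent calls live inside SupervisorAgentSimulator.

Python

# aws_app.py -- UI and workflow; delegates all real work to aws_tools.py.
import streamlit as st
from aws_tools import run_pclo_forecast, calculate_roi, synthesize_narrative

st.title("Hybrid Agentic Dashboard")
growth = st.slider("Assumed segment growth (%)", 0, 50, 10)

if st.button("Run analysis"):
    segments = {"SMB": 250_000 * (1 + growth / 100), "Enterprise": 400_000}
    forecast = run_pclo_forecast(segments)                    # Agent 1: forecast
    roi = calculate_roi(forecast)                             # Agent 2: ROI + veto check
    st.session_state["forecast"] = forecast                   # persists across Streamlit reruns

    if roi["value"] < 0:
        st.warning("Negative ROI -- automated flow halted, human review required.")
    else:
        st.metric("Projected ROI", f"{roi['value']:.1%}")
        st.write(synthesize_narrative(forecast, roi))         # Agent 3: executive summary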

Summary

  • aws_tools.py = The Workers (doing the calculations and AI generation).
  • aws_app.py = The Manager (telling the workers when to start and showing their results to the user).

This makes your code much easier to debug and deploy!

Conclusion

By decoupling the embeddings (Local) from the reasoning (Cloud), we built a system that is robust, cost-effective, and easier to deploy. The Supervisor pattern ensures that the LLM is grounded in hard data, providing the reliability of code with the flexibility of AI.

Figure: Hybrid RAG Architecture