This is a demonstration of an LLM application architecture that combines a "Head-Tail + Rolling Summary Memory" context strategy with Context-Isolated Subagents.
The system addresses the "Lost in the Middle" problem and mitigates context window bloat by keeping a small, active conversation window (The Tail) and periodically condensing older conversation turns into a flat memory text file (The Middle).
context_manager/
├── app/
│ ├── main.py # FastAPI application setup
│ ├── api/
│ │ ├── chat.py # Chat endpoint, handles handoffs & memory
│ │ └── business.py # Upload endpoints (/upload-docs)
│ ├── agents/
│ │ ├── orchestrator.py # Main router agent
│ │ ├── faq_agent.py # Subagent for business rules
│ │ └── accommodation.py # Subagent for special requests
│ ├── memory/
│ │ ├── context_manager.py # Assembles Head, Middle (retrieved), and Tail
│ │ ├── memory_updater.py # Rolling memory compression as background task
│ │ ├── vector_store.py # Vector DB interface (Chroma/Qdrant)
│ │ └── sql_db.py # Relational DB for booking state/Head context
│ └── schemas/
│ └── models.py # Pydantic models for API validation
├── requirements.txt # Python libraries
└── .env # API keys
Instead of running expensive vector database queries or generating text embeddings on every single message turn, the context window is constructed dynamically:
- The Head (Fixed/System Prompt): Contains instructions, the user's booking ID, and current relational state.
-
The Middle (Rolling Summary File): A static text file (
local_memory/{session_id}_memory.txt) updated every$N$ turns containing key compressed facts. -
The Tail (Unsummarized Messages): The last
$N$ raw, uncompressed turns.
- Extremely Low Latency: For 9 out of 10 messages, the system reads a plain text file. No database latency or embeddings generation blocks the chat loop.
- "Breathing" Context Window: Token consumption resembles a sawtooth wave. It grows slightly with each turn, then drops back to near-zero as soon as the threshold is hit and messages are compressed.
- Cheaper Models for Maintenance: The memory compaction background task uses a fast, low-cost model (like
gpt-4o-mini), leaving the smarter model (likegpt-4o) free to handle complex routing.
Instead of passing the entire conversation history down to subagents:
- The Orchestrator evaluates the query and returns a concise, single-sentence
task_summary. - The Context Manager intercepts the handoff and constructs a pristine, isolated prompt tailored specifically to the target subagent.
- The Subagents receive only the specific data or system parameters required for their specific actions, keeping confusion and latency to an absolute minimum.
This codebase acts as a structural prototype with mock integrations and stubs for standard database and vector store calls.
Install the required packages:
pip install -r requirements.txtLaunch the development server from the post_booking_agent folder:
uvicorn app.main:app --reloadThe server will start up on http://127.0.0.1:8000. You can explore the automated interactive documentation via Swagger at http://127.0.0.1:8000/docs.