# Context Window Management Skill

## When to Use This Skill

Use this skill when:
- Processing documents longer than the model context limit
- Building multi-turn conversational agents
- Implementing RAG systems with retrieved context
- Handling user inputs of unknown length
- Managing long-running conversations (customer support, assistants)
- Optimizing cost and latency for context-heavy applications

**When NOT to use:** Short, fixed-length inputs guaranteed to fit in context (e.g., tweet classification, short form filling).

## Core Principle

**Context is finite. Managing it is mandatory.**

LLM context windows have hard limits:
- GPT-3.5-turbo: 4k tokens (~3k words)
- GPT-3.5-turbo-16k: 16k tokens (~12k words)
- GPT-4: 8k tokens (~6k words)
- GPT-4-turbo: 128k tokens (~96k words)
- Claude 3 Sonnet: 200k tokens (~150k words)

Exceeding these limits causes the API to reject the request with an error; there is no graceful degradation. Token counting and management are not optional.

**Formula:** Token counting (prevent overflow) + budgeting (allocate efficiently) + management strategy (truncation/chunking/summarization) = robust context handling.

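
In practice this means gating every request behind a pre-flight check. A minimal sketch, assuming the `count_tokens` helper and `MODEL_LIMITS` table defined in Parts 1 and 2 below:

```python
def fits_in_context(prompt, model="gpt-3.5-turbo", output_tokens=500):
    """Return True if the prompt plus reserved output fits the model's window."""
    limit = MODEL_LIMITS.get(model, 4_096)
    return count_tokens(prompt, model) + output_tokens <= limit

# Usage sketch: decide on a strategy *before* calling the API
prompt = "..."  # whatever you are about to send
if not fits_in_context(prompt, model="gpt-3.5-turbo"):
    # Overflow: chunk, truncate, summarize, or switch to a larger model
    pass
```
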
## Context Management Framework

```
┌──────────────────────────────────────────────────┐
│ 1. Count Tokens                                  │
│    tiktoken, model-specific encoding             │
└────────────┬─────────────────────────────────────┘
             │
             ▼
┌──────────────────────────────────────────────────┐
│ 2. Check Against Limits                          │
│    Model-specific context windows                │
└────────────┬─────────────────────────────────────┘
             │
             ▼
┌──────────────────────────────────────────────────┐
│ 3. Token Budget Allocation                       │
│    System + Context + Query + Output             │
└────────────┬─────────────────────────────────────┘
             │
             ▼
        ┌────┴────┐
        │  Fits?  │
        └────┬────┘
      ┌──────┴──────┐
      │ Yes         │ No
      ▼             ▼
┌─────────┐   ┌─────────────────────┐
│ Proceed │   │ Choose Strategy:    │
└─────────┘   │  • Chunking         │
              │  • Truncation       │
              │  • Summarization    │
              │  • Larger model     │
              │  • Compression      │
              └─────────┬───────────┘
                        │
                        ▼
              ┌──────────────────┐
              │ Apply & Validate │
              └──────────────────┘
```

## Part 1: Token Counting

### Why Token Counting Matters

LLMs tokenize text (not characters or words). Token counts vary by:
- Language (English ~4 chars/token, Chinese ~2 chars/token)
- Content (code ~3 chars/token, prose ~4.5 chars/token)
- Model (different tokenizers)

**Character/word counts are unreliable estimates.**

### Tiktoken: OpenAI's Tokenizer

**Installation:**
```bash
pip install tiktoken
```

**Basic Usage:**

```python
import tiktoken

def count_tokens(text, model="gpt-3.5-turbo"):
    """
    Count tokens for given text and model.

    Args:
        text: String to tokenize
        model: Model name (determines tokenizer)

    Returns:
        Number of tokens
    """
    try:
        encoding = tiktoken.encoding_for_model(model)
    except KeyError:
        # Fallback for unknown models
        encoding = tiktoken.get_encoding("cl100k_base")  # GPT-4/3.5-turbo encoding

    return len(encoding.encode(text))

# Examples
text = "Hello, how are you today?"
print(f"Tokens: {count_tokens(text)}")  # Output: 7 tokens

document = "Large document with 10,000 words..."
tokens = count_tokens(document, model="gpt-4")
print(f"Document tokens: {tokens:,}")  # e.g. Document tokens: 13,421 for a real 10,000-word document
```

**Encoding Types by Model:**

| Model | Encoding | Notes |
|-------|----------|-------|
| gpt-3.5-turbo | cl100k_base | Default for GPT-3.5/4 |
| gpt-4 | cl100k_base | Same as GPT-3.5 |
| gpt-4-turbo | cl100k_base | Same as GPT-3.5 |
| text-davinci-003 | p50k_base | Legacy GPT-3 |
| code-davinci-002 | p50k_base | Codex |

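
The encoding name matters: the same text tokenizes differently under different encodings. A small check by encoding name, assuming tiktoken is installed (the counts printed will vary):

```python
import tiktoken

snippet = "def hello_world():\n    print('Hello!')"
for name in ["cl100k_base", "p50k_base"]:
    encoding = tiktoken.get_encoding(name)  # look up an encoding by name, not by model
    print(f"{name}: {len(encoding.encode(snippet))} tokens")
```
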
**Counting Chat Messages:**

```python
def count_message_tokens(messages, model="gpt-3.5-turbo"):
    """
    Count tokens in chat completion messages.

    Chat format adds overhead: role names, formatting tokens.
    """
    encoding = tiktoken.encoding_for_model(model)
    tokens = 0

    # Message formatting overhead (varies by model)
    tokens_per_message = 3  # Every message: <|im_start|>role\n ... <|im_end|>\n
    tokens_per_name = 1     # If name field present

    for message in messages:
        tokens += tokens_per_message
        for key, value in message.items():
            tokens += len(encoding.encode(value))
            if key == "name":
                tokens += tokens_per_name

    tokens += 3  # Every reply starts with an assistant message

    return tokens

# Example
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Tell me about Python."},
    {"role": "assistant", "content": "Python is a high-level programming language..."}
]

total_tokens = count_message_tokens(messages)
print(f"Total tokens: {total_tokens}")
```

**Token Estimation (Quick Approximation):**

```python
def estimate_tokens(text):
    """
    Quick estimation: ~4 characters per token for English prose.

    Not accurate for API calls! Use tiktoken for production.
    Useful for rough checks and dashboards.
    """
    return len(text) // 4

# Example
text = "This is a sample text for estimation."
estimated = estimate_tokens(text)
actual = count_tokens(text)
print(f"Estimated: {estimated}, Actual: {actual}")
# Output: Estimated: 9, Actual: 10 (close but not exact)
```

## Part 2: Model Context Limits and Budgeting

### Context Window Sizes

```python
MODEL_LIMITS = {
    # OpenAI GPT-3.5
    "gpt-3.5-turbo": 4_096,
    "gpt-3.5-turbo-16k": 16_384,

    # OpenAI GPT-4
    "gpt-4": 8_192,
    "gpt-4-32k": 32_768,
    "gpt-4-turbo": 128_000,
    "gpt-4-turbo-2024-04-09": 128_000,

    # Anthropic Claude
    "claude-3-opus": 200_000,
    "claude-3-sonnet": 200_000,
    "claude-3-haiku": 200_000,

    # Open source
    "llama-2-7b": 4_096,
    "llama-2-13b": 4_096,
    "llama-2-70b": 4_096,
    "mistral-7b": 8_192,
    "mixtral-8x7b": 32_768,
}

def get_context_limit(model):
    """Get context window size for model."""
    return MODEL_LIMITS.get(model, 4_096)  # Default: 4k
```

### Token Budget Allocation

For systems with multiple components (RAG, chat with history), allocate tokens:

```python
def calculate_token_budget(
    model="gpt-3.5-turbo",
    system_message_tokens=None,
    query_tokens=None,
    output_tokens=500,
    safety_margin=50
):
    """
    Calculate remaining budget for context (e.g., retrieved documents).

    Args:
        model: LLM model name
        system_message_tokens: Tokens in system message (if known)
        query_tokens: Tokens in user query (if known)
        output_tokens: Reserved tokens for model output
        safety_margin: Extra buffer to prevent edge cases

    Returns:
        Available tokens for context
    """
    total_limit = get_context_limit(model)  # Falls back to 4k for unknown models

    # Reserve tokens
    reserved = (
        (system_message_tokens or 100) +  # System message (estimate if unknown)
        (query_tokens or 100) +           # User query (estimate if unknown)
        output_tokens +                   # Model response
        safety_margin                     # Safety buffer
    )

    context_budget = total_limit - reserved

    return {
        'total_limit': total_limit,
        'context_budget': context_budget,
        'reserved_system': system_message_tokens or 100,
        'reserved_query': query_tokens or 100,
        'reserved_output': output_tokens,
        'safety_margin': safety_margin
    }

# Example
budget = calculate_token_budget(
    model="gpt-3.5-turbo",
    system_message_tokens=50,
    query_tokens=20,
    output_tokens=500
)

print(f"Total limit: {budget['total_limit']:,}")
print(f"Context budget: {budget['context_budget']:,}")
# Output:
# Total limit: 4,096
# Context budget: 3,476 (can use for retrieved docs, chat history, etc.)
```

**RAG Token Budgeting:**

```python
def budget_for_rag(
    query,
    system_message="You are a helpful assistant. Answer using the provided context.",
    model="gpt-3.5-turbo",
    output_tokens=500
):
    """Calculate available tokens for retrieved documents in RAG."""
    system_tokens = count_tokens(system_message, model)
    query_tokens = count_tokens(query, model)

    budget = calculate_token_budget(
        model=model,
        system_message_tokens=system_tokens,
        query_tokens=query_tokens,
        output_tokens=output_tokens
    )

    return budget['context_budget']

# Example
query = "What is the company's return policy for defective products?"
available_tokens = budget_for_rag(query, model="gpt-3.5-turbo")
print(f"Available tokens for retrieved documents: {available_tokens}")
# Output: roughly 3,500 (exact value depends on the tokenized lengths)

# This means we can retrieve ~3,500 tokens worth of documents.
# At ~500 tokens/chunk, that's about 7 document chunks.
```

## Part 3: Chunking Strategies

When a document exceeds the context limit, split it into chunks and process them separately.

### Fixed-Size Chunking

**Simple approach:** Split into equal-sized chunks.

```python
def chunk_by_tokens(text, chunk_size=1000, overlap=200, model="gpt-3.5-turbo"):
    """
    Split text into fixed-size token chunks with overlap.

    Args:
        text: Text to chunk
        chunk_size: Target tokens per chunk
        overlap: Overlapping tokens between chunks (for continuity)
        model: Model for tokenization

    Returns:
        List of text chunks
    """
    encoding = tiktoken.encoding_for_model(model)
    tokens = encoding.encode(text)

    chunks = []
    start = 0

    while start < len(tokens):
        end = start + chunk_size
        chunk_tokens = tokens[start:end]
        chunk_text = encoding.decode(chunk_tokens)
        chunks.append(chunk_text)

        start += chunk_size - overlap  # Overlap for continuity

    return chunks

# Example
document = "Very long document with 10,000 tokens..." * 1000
chunks = chunk_by_tokens(document, chunk_size=1000, overlap=200)
print(f"Split into {len(chunks)} chunks")
for i, chunk in enumerate(chunks[:3]):
    print(f"Chunk {i+1}: {count_tokens(chunk)} tokens")
```

**Pros:**
- Simple, predictable chunk sizes
- Works for any text

**Cons:**
- May split mid-sentence, mid-paragraph (poor semantic boundaries)
- Overlap creates redundancy
- No awareness of document structure

### Semantic Chunking

**Better approach:** Split at semantic boundaries (paragraphs, sections).

```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

def chunk_semantically(text, chunk_size=1000, overlap=200):
    """
    Split text at semantic boundaries (paragraphs, sentences).

    Uses LangChain's RecursiveCharacterTextSplitter, which tries:
    1. Split by paragraphs (\n\n)
    2. If chunk still too large, split by sentences (. )
    3. If sentence still too large, split by words
    4. Last resort: split by characters
    """
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,    # Measured in tokens because of length_function below
        chunk_overlap=overlap,
        separators=["\n\n", "\n", ". ", " ", ""],  # Priority order
        length_function=count_tokens,  # Use actual token count, not characters
    )

    chunks = splitter.split_text(text)
    return chunks

# Example
document = """
# Introduction

This is the introduction to the document.
It contains several paragraphs of introductory material.

## Methods

The methods section describes the experimental procedure.
We used a randomized controlled trial with 100 participants.

## Results

The results show significant improvements in...
"""

chunks = chunk_semantically(document, chunk_size=500, overlap=50)
for i, chunk in enumerate(chunks):
    print(f"Chunk {i+1} ({count_tokens(chunk)} tokens):\n{chunk[:100]}...\n")
```

**Pros:**
- Respects semantic boundaries (complete paragraphs, sentences)
- Better context preservation
- More readable chunks

**Cons:**
- Chunk sizes vary (some may still be too large; see the sketch below)
- More complex implementation

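
One way to handle the oversized-chunk case: validate chunk sizes after splitting and re-split anything that still exceeds the budget. A minimal sketch using the `count_tokens` and `chunk_by_tokens` helpers defined above (the `enforce_chunk_budget` name is illustrative):

```python
def enforce_chunk_budget(chunks, max_tokens=1000, model="gpt-3.5-turbo"):
    """Pass through chunks that fit; re-split any chunk that exceeds max_tokens."""
    bounded = []
    for chunk in chunks:
        if count_tokens(chunk, model) <= max_tokens:
            bounded.append(chunk)
        else:
            # Fall back to fixed-size token splitting for the oversized chunk
            bounded.extend(chunk_by_tokens(chunk, chunk_size=max_tokens, overlap=0, model=model))
    return bounded

chunks = enforce_chunk_budget(chunk_semantically(document, chunk_size=500), max_tokens=500)
```
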
### Hierarchical Chunking (Map-Reduce)

**Best for summarization:** Summarize chunks, then summarize summaries.

```python
def hierarchical_summarization(document, chunk_size=3000, model="gpt-3.5-turbo"):
    """
    Summarize a long document using a map-reduce approach.

    1. Split document into chunks (MAP)
    2. Summarize each chunk individually
    3. Combine chunk summaries (REDUCE)
    4. Generate final summary from combined summaries
    """
    import openai

    # Step 1: Chunk document
    chunks = chunk_semantically(document, chunk_size=chunk_size)
    print(f"Split into {len(chunks)} chunks")

    # Step 2: Summarize each chunk (MAP)
    chunk_summaries = []
    for i, chunk in enumerate(chunks):
        response = openai.ChatCompletion.create(
            model=model,
            messages=[
                {"role": "system", "content": "Summarize the following text concisely."},
                {"role": "user", "content": chunk}
            ],
            temperature=0
        )
        summary = response.choices[0].message.content
        chunk_summaries.append(summary)
        print(f"Chunk {i+1} summary: {summary[:100]}...")

    # Step 3: Combine summaries (REDUCE)
    combined_summaries = "\n\n".join(chunk_summaries)

    # Step 4: Generate final summary
    final_response = openai.ChatCompletion.create(
        model=model,
        messages=[
            {"role": "system", "content": "Synthesize the following summaries into a comprehensive final summary."},
            {"role": "user", "content": combined_summaries}
        ],
        temperature=0
    )

    final_summary = final_response.choices[0].message.content
    return final_summary

# Example
long_document = "Research paper with 50,000 tokens..." * 100
summary = hierarchical_summarization(long_document, chunk_size=3000)
print(f"Final summary:\n{summary}")
```

**Pros:**
- Handles arbitrarily long documents
- Preserves information across the entire document
- Parallelizable (summarize chunks concurrently; see the sketch below)

**Cons:**
- More API calls (higher cost)
- Information loss in successive summarizations
- Slower than single-pass

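
Since the MAP step is independent per chunk, the per-chunk calls can run concurrently. A sketch using a thread pool with the same legacy `openai.ChatCompletion` interface as the example above (mind your API rate limits; `max_workers` is an assumption):

```python
from concurrent.futures import ThreadPoolExecutor

import openai

def summarize_chunk(chunk, model="gpt-3.5-turbo"):
    """Summarize a single chunk (same prompt as the MAP step above)."""
    response = openai.ChatCompletion.create(
        model=model,
        messages=[
            {"role": "system", "content": "Summarize the following text concisely."},
            {"role": "user", "content": chunk},
        ],
        temperature=0,
    )
    return response.choices[0].message.content

def summarize_chunks_concurrently(chunks, model="gpt-3.5-turbo", max_workers=5):
    """Run the MAP step in parallel; results keep the original chunk order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda chunk: summarize_chunk(chunk, model), chunks))
```
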
## Part 4: Intelligent Truncation Strategies

When chunking isn't appropriate (e.g., single-pass QA), truncate intelligently.

### Strategy 1: Truncate from Middle (Preserve Intro + Conclusion)

```python
def truncate_middle(text, max_tokens=3500, model="gpt-3.5-turbo"):
    """
    Keep beginning and end, truncate the middle.

    Useful for documents with an important intro (context) and conclusion (findings).
    """
    encoding = tiktoken.encoding_for_model(model)
    tokens = encoding.encode(text)

    if len(tokens) <= max_tokens:
        return text  # Fits, no truncation needed

    # Allocate 40% of the budget to the beginning and 40% to the end;
    # the remainder is slack for the truncation marker. Everything else
    # in the middle is dropped.
    keep_start = int(max_tokens * 0.4)
    keep_end = int(max_tokens * 0.4)

    start_tokens = tokens[:keep_start]
    end_tokens = tokens[-keep_end:]

    # Add marker showing truncation
    truncation_marker = encoding.encode("\n\n[... middle section truncated ...]\n\n")

    truncated_tokens = start_tokens + truncation_marker + end_tokens
    return encoding.decode(truncated_tokens)

# Example
document = """
Introduction: This paper presents a new approach to X.
Our hypothesis is that Y improves performance by 30%.

[... 10,000 tokens of methods, experiments, detailed results ...]

Conclusion: We demonstrated that Y improves performance by 31%,
confirming our hypothesis. Future work will explore Z.
"""

truncated = truncate_middle(document, max_tokens=500)
print(truncated)
# Output:
# Introduction: This paper presents...
# [... middle section truncated ...]
# Conclusion: We demonstrated that Y improves...
```

### Strategy 2: Truncate from Beginning (Keep Recent Context)

```python
def truncate_from_start(text, max_tokens=3500, model="gpt-3.5-turbo"):
    """
    Keep end, discard beginning.

    Useful for logs and conversations where recent context is most important.
    """
    encoding = tiktoken.encoding_for_model(model)
    tokens = encoding.encode(text)

    if len(tokens) <= max_tokens:
        return text

    # Keep last N tokens
    truncated_tokens = tokens[-max_tokens:]
    return encoding.decode(truncated_tokens)

# Example: Chat logs
conversation = """
[Turn 1 - 2 hours ago] User: How do I reset my password?
[Turn 2] Bot: Go to Settings > Security > Reset Password.
[... 50 turns ...]
[Turn 51 - just now] User: What was that password reset link again?
"""

truncated = truncate_from_start(conversation, max_tokens=200)
print(truncated)
# Output: [Turn 48] ... [Turn 51 - just now] User: What was that password reset link again?
```

### Strategy 3: Extractive Truncation (Keep Most Relevant)

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

def extractive_truncation(document, query, max_tokens=3000, model="gpt-3.5-turbo"):
    """
    Keep the sentences most relevant to the query.

    Uses TF-IDF similarity to rank sentences by relevance to the query.
    """
    # Split into sentences
    sentences = document.split('. ')

    # Calculate TF-IDF similarity to query
    vectorizer = TfidfVectorizer()
    vectors = vectorizer.fit_transform([query] + sentences)
    query_vec = vectors[0]
    sentence_vecs = vectors[1:]

    # Similarity scores
    similarities = cosine_similarity(query_vec, sentence_vecs)[0]

    # Rank sentences by similarity
    ranked_indices = np.argsort(similarities)[::-1]

    # Select sentences until token budget exhausted
    selected_sentences = []
    token_count = 0
    encoding = tiktoken.encoding_for_model(model)

    for idx in ranked_indices:
        sentence = sentences[idx] + '. '
        sentence_tokens = len(encoding.encode(sentence))

        if token_count + sentence_tokens <= max_tokens:
            selected_sentences.append((idx, sentence))
            token_count += sentence_tokens
        else:
            break

    # Sort selected sentences back into original order (maintain flow)
    selected_sentences.sort(key=lambda x: x[0])

    return ''.join([sent for _, sent in selected_sentences])

# Example
document = """
The company was founded in 1995 in Seattle.
Our return policy allows returns within 30 days of purchase.
Products must be in original condition with tags attached.
Refunds are processed within 5-7 business days.
We offer free shipping on orders over $50.
The company has 500 employees worldwide.
"""

query = "What is the return policy?"

truncated = extractive_truncation(document, query, max_tokens=150)
print(truncated)
# Output (illustrative): the return-policy, condition, and refund sentences are kept;
# unrelated sentences about company history and shipping are dropped.
```

## Part 5: Conversation Context Management

Multi-turn conversations require active context management to prevent unbounded growth.

### Strategy 1: Sliding Window

**Keep last N turns.**

```python
class SlidingWindowChatbot:
    def __init__(self, model="gpt-3.5-turbo", max_history=10):
        """
        Chatbot with sliding window context.

        Args:
            model: LLM model
            max_history: Maximum conversation turns to keep (user+assistant pairs)
        """
        self.model = model
        self.max_history = max_history
        self.system_message = {"role": "system", "content": "You are a helpful assistant."}
        self.messages = [self.system_message]

    def chat(self, user_message):
        """Add message, generate response, manage context."""
        import openai

        # Add user message
        self.messages.append({"role": "user", "content": user_message})

        # Apply sliding window (keep system + last N*2 messages)
        if len(self.messages) > (self.max_history * 2 + 1):  # +1 for system message
            self.messages = [self.system_message] + self.messages[-(self.max_history * 2):]

        # Generate response
        response = openai.ChatCompletion.create(
            model=self.model,
            messages=self.messages
        )

        assistant_message = response.choices[0].message.content
        self.messages.append({"role": "assistant", "content": assistant_message})

        return assistant_message

# Example
bot = SlidingWindowChatbot(max_history=5)  # Keep last 5 turns

for turn in range(20):
    user_msg = input("You: ")
    response = bot.chat(user_msg)
    print(f"Bot: {response}")

# Context automatically managed: always ≤ 11 messages (1 system + 5*2 user/assistant)
```

**Pros:**
- Simple, predictable
- Constant memory/cost
- Recent context preserved

**Cons:**
- Loses old context (user may reference earlier conversation)
- Fixed window may be too small or too large

### Strategy 2: Token-Based Truncation

**Keep messages until token budget exhausted.**

```python
class TokenBudgetChatbot:
    def __init__(self, model="gpt-3.5-turbo", max_tokens=3000):
        """
        Chatbot with token-based context management.

        Keeps messages until token budget exhausted (newest to oldest).
        """
        self.model = model
        self.max_tokens = max_tokens
        self.system_message = {"role": "system", "content": "You are a helpful assistant."}
        self.messages = [self.system_message]

    def chat(self, user_message):
        import openai

        # Add user message
        self.messages.append({"role": "user", "content": user_message})

        # Token management: keep system + recent messages within budget
        total_tokens = count_message_tokens(self.messages, self.model)

        while total_tokens > self.max_tokens and len(self.messages) > 2:
            # Remove oldest message (after system message)
            self.messages.pop(1)
            total_tokens = count_message_tokens(self.messages, self.model)

        # Generate response
        response = openai.ChatCompletion.create(
            model=self.model,
            messages=self.messages
        )

        assistant_message = response.choices[0].message.content
        self.messages.append({"role": "assistant", "content": assistant_message})

        return assistant_message

# Example
bot = TokenBudgetChatbot(max_tokens=2000)

for turn in range(20):
    user_msg = input("You: ")
    response = bot.chat(user_msg)
    print(f"Bot: {response}")
    print(f"Context tokens: {count_message_tokens(bot.messages)}")
```

**Pros:**
- Adaptive to message length (long messages = fewer kept, short messages = more kept)
- Precise budget control

**Cons:**
- Removes from beginning (loses early context)

### Strategy 3: Summarization + Sliding Window

**Best of both: Summarize old context, keep recent verbatim.**

```python
class SummarizingChatbot:
    def __init__(self, model="gpt-3.5-turbo", max_recent=5, summarize_threshold=10):
        """
        Chatbot with summarization + sliding window.

        When conversation exceeds threshold, summarize old turns and keep recent verbatim.

        Args:
            model: LLM model
            max_recent: Recent turns to keep verbatim
            summarize_threshold: Turns before summarizing old context
        """
        self.model = model
        self.max_recent = max_recent
        self.summarize_threshold = summarize_threshold
        self.system_message = {"role": "system", "content": "You are a helpful assistant."}
        self.messages = [self.system_message]
        self.summary = None  # Stores summary of old context

    def summarize_old_context(self):
        """Summarize older messages (beyond recent window)."""
        import openai

        # Messages to summarize: after system, before recent window
        num_messages = len(self.messages) - 1  # Exclude system message
        if num_messages <= self.summarize_threshold:
            return  # Not enough history yet

        # Split: old (to summarize) vs recent (keep verbatim)
        old_messages = self.messages[1:-(self.max_recent*2)]  # Exclude system + recent

        if not old_messages:
            return

        # Format for summarization
        conversation_text = "\n".join([
            f"{msg['role']}: {msg['content']}" for msg in old_messages
        ])

        # Generate summary
        response = openai.ChatCompletion.create(
            model=self.model,
            messages=[
                {"role": "system", "content": "Summarize the following conversation concisely, capturing key information, user goals, and important context."},
                {"role": "user", "content": conversation_text}
            ],
            temperature=0
        )

        self.summary = response.choices[0].message.content

        # Update messages: system + summary + recent
        recent_messages = self.messages[-(self.max_recent*2):]
        summary_message = {
            "role": "system",
            "content": f"Previous conversation summary: {self.summary}"
        }

        self.messages = [self.system_message, summary_message] + recent_messages

    def chat(self, user_message):
        import openai

        # Add user message
        self.messages.append({"role": "user", "content": user_message})

        # Check if summarization needed
        num_turns = (len(self.messages) - 1) // 2  # Exclude system message
        if num_turns >= self.summarize_threshold:
            self.summarize_old_context()

        # Generate response
        response = openai.ChatCompletion.create(
            model=self.model,
            messages=self.messages
        )

        assistant_message = response.choices[0].message.content
        self.messages.append({"role": "assistant", "content": assistant_message})

        return assistant_message

# Example
bot = SummarizingChatbot(max_recent=5, summarize_threshold=10)

# Long conversation
for turn in range(25):
    user_msg = input("You: ")
    response = bot.chat(user_msg)
    print(f"Bot: {response}")

# After turn 10, old context (turns 1-5) is summarized; turns 6-10+ are kept verbatim
```

**Pros:**
- Preserves full conversation history (in summary form)
- Recent context verbatim (maintains fluency)
- Bounded token usage

**Cons:**
- Extra API call for summarization (cost)
- Information loss in summary
- More complex

## Part 6: RAG Context Management

RAG systems retrieve documents and include them in context. Token budgeting is critical.

### Dynamic Document Retrieval (Budget-Aware)

```python
def retrieve_with_token_budget(
    query,
    documents,
    embeddings,
    model="gpt-3.5-turbo",
    output_tokens=500,
    max_docs=20
):
    """
    Retrieve documents dynamically based on token budget.

    Args:
        query: User query
        documents: List of document dicts [{"id": ..., "content": ...}, ...]
        embeddings: Pre-computed document embeddings
        model: LLM model
        output_tokens: Reserved for output
        max_docs: Maximum documents to consider

    Returns:
        Selected documents within token budget
    """
    from sentence_transformers import SentenceTransformer, util

    # Calculate available token budget
    available_tokens = budget_for_rag(query, model=model, output_tokens=output_tokens)

    # Retrieve top-k relevant documents (semantic search)
    query_embedding = SentenceTransformer('all-MiniLM-L6-v2').encode(query)
    similarities = util.cos_sim(query_embedding, embeddings)[0]
    top_indices = similarities.argsort(descending=True)[:max_docs]

    # Select documents until budget exhausted
    selected_docs = []
    token_count = 0

    for idx in top_indices:
        doc = documents[idx]
        doc_tokens = count_tokens(doc['content'], model)

        if token_count + doc_tokens <= available_tokens:
            selected_docs.append(doc)
            token_count += doc_tokens
        else:
            # Budget exhausted
            break

    return selected_docs, token_count

# Example
query = "What is our return policy?"
documents = [
    {"id": 1, "content": "Our return policy allows returns within 30 days..."},
    {"id": 2, "content": "Shipping is free on orders over $50..."},
    # ... 100 more documents
]
# embeddings: pre-computed once, e.g. SentenceTransformer('all-MiniLM-L6-v2').encode([d['content'] for d in documents])

selected, tokens_used = retrieve_with_token_budget(
    query, documents, embeddings, model="gpt-3.5-turbo"
)

print(f"Selected {len(selected)} documents using {tokens_used} tokens")
# Output: Selected 7 documents using 3,280 tokens (within budget)
```

### Chunk Re-Ranking with Token Budget

```python
def rerank_and_budget(query, chunks, model="gpt-3.5-turbo", max_tokens=3000):
    """
    Over-retrieve, re-rank, then select top chunks within the token budget.

    1. Retrieve k=20 candidates (coarse retrieval)
    2. Re-rank with cross-encoder (fine-grained scoring)
    3. Select top chunks until budget exhausted
    """
    from sentence_transformers import CrossEncoder

    # Re-rank with cross-encoder
    cross_encoder = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')
    pairs = [[query, chunk['content']] for chunk in chunks]
    scores = cross_encoder.predict(pairs)

    # Sort by relevance
    ranked_chunks = sorted(
        zip(chunks, scores),
        key=lambda x: x[1],
        reverse=True
    )

    # Select until budget exhausted
    selected_chunks = []
    token_count = 0

    for chunk, score in ranked_chunks:
        chunk_tokens = count_tokens(chunk['content'], model)

        if token_count + chunk_tokens <= max_tokens:
            selected_chunks.append((chunk, score))
            token_count += chunk_tokens
        else:
            break

    return selected_chunks, token_count

# Example
chunks = [
    {"id": 1, "content": "Return policy: 30 days with receipt..."},
    {"id": 2, "content": "Shipping: Free over $50..."},
    # ... 18 more chunks
]

selected, tokens = rerank_and_budget(query, chunks, max_tokens=3000)
print(f"Selected {len(selected)} chunks, {tokens} tokens")
```

## Part 7: Cost and Performance Optimization

Context management affects cost and latency.

### Cost Optimization

```python
def calculate_cost(tokens, model="gpt-3.5-turbo"):
    """
    Estimate API cost for a token count, priced at the input-token rate.

    Pricing (as of 2024, per 1k tokens):
    - GPT-3.5-turbo: $0.0015 input / $0.002 output
    - GPT-4: $0.03 input / $0.06 output
    - GPT-4-turbo: $0.01 input / $0.03 output
    """
    pricing = {
        "gpt-3.5-turbo": {"input": 0.0015, "output": 0.002},
        "gpt-3.5-turbo-16k": {"input": 0.003, "output": 0.004},
        "gpt-4": {"input": 0.03, "output": 0.06},
        "gpt-4-turbo": {"input": 0.01, "output": 0.03},
    }

    rates = pricing.get(model, {"input": 0.002, "output": 0.002})
    input_cost = (tokens / 1000) * rates["input"]

    return input_cost

# Example: Cost comparison
conversation_tokens = 3500
print(f"GPT-3.5: ${calculate_cost(conversation_tokens, 'gpt-3.5-turbo'):.4f}")
print(f"GPT-4: ${calculate_cost(conversation_tokens, 'gpt-4'):.4f}")
# Output:
# GPT-3.5: $0.0053
# GPT-4: $0.1050 (20× more expensive!)
```

**Cost optimization strategies:**
1. **Compression:** Summarize old context (reduce tokens)
2. **Smaller model:** Use GPT-3.5 instead of GPT-4 when possible
3. **Efficient retrieval:** Retrieve fewer, more relevant docs
4. **Caching:** Cache embeddings to avoid re-encoding (see the sketch below)

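
For strategy 4, even a process-local cache avoids re-encoding repeated queries and documents. A minimal sketch using `functools.lru_cache` around the same sentence-transformers model used elsewhere in this skill; a shared cache (e.g., a key-value store) is the natural next step for multi-process deployments:

```python
from functools import lru_cache

from sentence_transformers import SentenceTransformer

_embedder = SentenceTransformer("all-MiniLM-L6-v2")

@lru_cache(maxsize=10_000)
def cached_embedding(text: str):
    """Embed a string once; repeated calls with the same text hit the cache."""
    return _embedder.encode(text)
```
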
### Latency Optimization

```python
# Latency increases with context length

def measure_latency(context_tokens, model="gpt-3.5-turbo"):
    """
    Rough latency estimates (actual varies by API load).

    Latency = Fixed overhead + (tokens × per-token time)
    """
    fixed_overhead_ms = 500  # API call, network
    time_per_token_ms = {
        "gpt-3.5-turbo": 0.3,  # ~300ms per 1k tokens
        "gpt-4": 1.0,          # ~1s per 1k tokens (slower)
    }

    per_token = time_per_token_ms.get(model, 0.5)
    latency_ms = fixed_overhead_ms + (context_tokens * per_token)

    return latency_ms

# Example
for tokens in [500, 2000, 5000, 10000]:
    latency = measure_latency(tokens, "gpt-3.5-turbo")
    print(f"{tokens:,} tokens: {latency:.0f}ms ({latency/1000:.1f}s)")
# Output:
# 500 tokens: 650ms (0.7s)
# 2,000 tokens: 1,100ms (1.1s)
# 5,000 tokens: 2,000ms (2.0s)
# 10,000 tokens: 3,500ms (3.5s)
```

**Latency optimization strategies:**
1. **Reduce context:** Keep only essential information
2. **Parallel processing:** Process chunks concurrently (map-reduce)
3. **Streaming:** Stream responses to reduce perceived latency (see the sketch below)
4. **Caching:** Cache frequent queries

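
For strategy 3, streaming does not shorten total generation time, but the first tokens appear almost immediately. A sketch using the same legacy `openai.ChatCompletion` interface as the rest of this skill (the chunk format shown is the pre-1.0 openai-python one; adjust if you use a newer client):

```python
import openai

def stream_response(messages, model="gpt-3.5-turbo"):
    """Print tokens as they arrive and return the assembled response."""
    response = openai.ChatCompletion.create(
        model=model,
        messages=messages,
        stream=True,  # yields incremental deltas instead of one final message
    )
    parts = []
    for chunk in response:
        delta = chunk["choices"][0]["delta"].get("content", "")
        print(delta, end="", flush=True)
        parts.append(delta)
    return "".join(parts)
```
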
## Part 8: Complete Implementation Example

**RAG System with Full Context Management:**

```python
import openai
import tiktoken
from sentence_transformers import SentenceTransformer, util

class ManagedRAGSystem:
    def __init__(
        self,
        model="gpt-3.5-turbo",
        embedding_model="all-MiniLM-L6-v2",
        max_docs=20,
        output_tokens=500
    ):
        self.model = model
        self.embedding_model = SentenceTransformer(embedding_model)
        self.max_docs = max_docs
        self.output_tokens = output_tokens

    def query(self, question, documents):
        """
        Query the RAG system with full context management.

        Steps:
        1. Calculate token budget
        2. Retrieve relevant documents within budget
        3. Build context
        4. Generate response
        5. Return response with metadata
        """
        # Step 1: Calculate token budget
        system_message = "Answer the question using only the provided context."
        budget = calculate_token_budget(
            model=self.model,
            system_message_tokens=count_tokens(system_message, self.model),
            query_tokens=count_tokens(question, self.model),
            output_tokens=self.output_tokens
        )
        context_budget = budget['context_budget']

        # Step 2: Retrieve documents within budget
        query_embedding = self.embedding_model.encode(question)
        doc_embeddings = self.embedding_model.encode([doc['content'] for doc in documents])
        similarities = util.cos_sim(query_embedding, doc_embeddings)[0]
        top_indices = similarities.argsort(descending=True)[:self.max_docs]

        selected_docs = []
        token_count = 0

        for idx in top_indices:
            doc = documents[idx]
            doc_tokens = count_tokens(doc['content'], self.model)

            if token_count + doc_tokens <= context_budget:
                selected_docs.append(doc)
                token_count += doc_tokens
            else:
                break

        # Step 3: Build context
        context = "\n\n".join([doc['content'] for doc in selected_docs])

        # Step 4: Generate response
        messages = [
            {"role": "system", "content": system_message},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"}
        ]

        response = openai.ChatCompletion.create(
            model=self.model,
            messages=messages,
            temperature=0
        )

        answer = response.choices[0].message.content

        # Step 5: Return with metadata
        return {
            'answer': answer,
            'num_docs_retrieved': len(selected_docs),
            'context_tokens': token_count,
            'total_tokens': response.usage.total_tokens,
            # Approximation: prices all tokens at the input rate
            'cost': calculate_cost(response.usage.total_tokens, self.model)
        }

# Example usage
rag = ManagedRAGSystem(model="gpt-3.5-turbo")

documents = [
    {"id": 1, "content": "Our return policy allows returns within 30 days of purchase with receipt."},
    {"id": 2, "content": "Refunds are processed within 5-7 business days."},
    # ... more documents
]

result = rag.query("What is the return policy?", documents)

print(f"Answer: {result['answer']}")
print(f"Retrieved: {result['num_docs_retrieved']} documents")
print(f"Context tokens: {result['context_tokens']}")
print(f"Total tokens: {result['total_tokens']}")
print(f"Cost: ${result['cost']:.4f}")
```

## Summary

**Context window management is mandatory for production LLM systems.**

**Core strategies:**
1. **Token counting:** Always count tokens before API calls (tiktoken)
2. **Budgeting:** Allocate tokens to system, context, query, and output
3. **Chunking:** Fixed-size, semantic, or hierarchical for long documents
4. **Truncation:** Middle-out, recency-based, or extractive
5. **Conversation management:** Sliding window, token-based, or summarization
6. **RAG budgeting:** Dynamic retrieval, re-ranking with budget constraints

**Optimization:**
- Cost: Compression, smaller models, efficient retrieval
- Latency: Reduce context, parallel processing, streaming

**Implementation checklist:**
1. ✓ Count tokens with tiktoken (not character/word counts)
2. ✓ Check against model-specific limits
3. ✓ Allocate a token budget for multi-component systems
4. ✓ Choose an appropriate strategy (chunking, truncation, summarization)
5. ✓ Manage conversation context proactively
6. ✓ Monitor token usage, cost, and latency
7. ✓ Test with realistic data (long documents, long conversations)

Context is finite. Manage it deliberately, or face failed requests, quality degradation, and cost overruns.