# Context Window Management Skill

## When to Use This Skill

Use this skill when:
- Processing documents longer than the model context limit
- Building multi-turn conversational agents
- Implementing RAG systems with retrieved context
- Handling user inputs of unknown length
- Managing long-running conversations (customer support, assistants)
- Optimizing cost and latency for context-heavy applications

**When NOT to use:** Short, fixed-length inputs guaranteed to fit in context (e.g., tweet classification, short form filling).

## Core Principle

**Context is finite. Managing it is mandatory.**

LLM context windows have hard limits:
- GPT-3.5-turbo: 4k tokens (~3k words)
- GPT-3.5-turbo-16k: 16k tokens (~12k words)
- GPT-4: 8k tokens (~6k words)
- GPT-4-turbo: 128k tokens (~96k words)
- Claude 3 Sonnet: 200k tokens (~150k words)

Exceeding these limits causes the API to reject the request with an error; there is no graceful degradation. Token counting and management are not optional.

**Formula:** Token counting (prevent overflow) + budgeting (allocate efficiently) + management strategy (truncation/chunking/summarization) = robust context handling.

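
In practice this means gating every request behind a pre-flight check. A minimal sketch, assuming the `count_tokens` helper and `MODEL_LIMITS` table defined in Parts 1 and 2 below:

```python
def fits_in_context(prompt, model="gpt-3.5-turbo", output_tokens=500):
    """Return True if the prompt plus reserved output fits the model's window."""
    limit = MODEL_LIMITS.get(model, 4_096)
    return count_tokens(prompt, model) + output_tokens <= limit

# Usage sketch: decide on a strategy *before* calling the API
prompt = "..."  # whatever you are about to send
if not fits_in_context(prompt, model="gpt-3.5-turbo"):
    # Overflow: chunk, truncate, summarize, or switch to a larger model
    pass
```
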
## Context Management Framework

```
┌──────────────────────────────────────────────────┐
│ 1. Count Tokens                                  │
│    tiktoken, model-specific encoding             │
└────────────┬─────────────────────────────────────┘
             │
             ▼
┌──────────────────────────────────────────────────┐
│ 2. Check Against Limits                          │
│    Model-specific context windows                │
└────────────┬─────────────────────────────────────┘
             │
             ▼
┌──────────────────────────────────────────────────┐
│ 3. Token Budget Allocation                       │
│    System + Context + Query + Output             │
└────────────┬─────────────────────────────────────┘
             │
             ▼
        ┌────┴────┐
        │  Fits?  │
        └────┬────┘
      ┌──────┴──────┐
      │ Yes         │ No
      ▼             ▼
┌─────────┐   ┌─────────────────────┐
│ Proceed │   │ Choose Strategy:    │
└─────────┘   │  • Chunking         │
              │  • Truncation       │
              │  • Summarization    │
              │  • Larger model     │
              │  • Compression      │
              └─────────┬───────────┘
                        │
                        ▼
              ┌──────────────────┐
              │ Apply & Validate │
              └──────────────────┘
```

## Part 1: Token Counting

### Why Token Counting Matters

LLMs tokenize text (not characters or words). Token counts vary by:
- Language (English ~4 chars/token, Chinese ~2 chars/token)
- Content (code ~3 chars/token, prose ~4.5 chars/token)
- Model (different tokenizers)

**Character/word counts are unreliable estimates.**

### Tiktoken: OpenAI's Tokenizer

**Installation:**
```bash
pip install tiktoken
```

**Basic Usage:**

```python
import tiktoken

def count_tokens(text, model="gpt-3.5-turbo"):
    """
    Count tokens for given text and model.

    Args:
        text: String to tokenize
        model: Model name (determines tokenizer)

    Returns:
        Number of tokens
    """
    try:
        encoding = tiktoken.encoding_for_model(model)
    except KeyError:
        # Fallback for unknown models
        encoding = tiktoken.get_encoding("cl100k_base")  # GPT-4/3.5-turbo encoding

    return len(encoding.encode(text))

# Examples
text = "Hello, how are you today?"
print(f"Tokens: {count_tokens(text)}")  # Output: 7 tokens

document = "Large document with 10,000 words..."
tokens = count_tokens(document, model="gpt-4")
print(f"Document tokens: {tokens:,}")  # e.g. Document tokens: 13,421 for a real 10,000-word document
```

**Encoding Types by Model:**

| Model | Encoding | Notes |
|-------|----------|-------|
| gpt-3.5-turbo | cl100k_base | Default for GPT-3.5/4 |
| gpt-4 | cl100k_base | Same as GPT-3.5 |
| gpt-4-turbo | cl100k_base | Same as GPT-3.5 |
| text-davinci-003 | p50k_base | Legacy GPT-3 |
| code-davinci-002 | p50k_base | Codex |

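
The encoding name matters: the same text tokenizes differently under different encodings. A small check by encoding name, assuming tiktoken is installed (the counts printed will vary):

```python
import tiktoken

snippet = "def hello_world():\n    print('Hello!')"
for name in ["cl100k_base", "p50k_base"]:
    encoding = tiktoken.get_encoding(name)  # look up an encoding by name, not by model
    print(f"{name}: {len(encoding.encode(snippet))} tokens")
```
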
**Counting Chat Messages:**

```python
def count_message_tokens(messages, model="gpt-3.5-turbo"):
    """
    Count tokens in chat completion messages.

    Chat format adds overhead: role names, formatting tokens.
    """
    encoding = tiktoken.encoding_for_model(model)
    tokens = 0

    # Message formatting overhead (varies by model)
    tokens_per_message = 3  # Every message: <|im_start|>role\n ... <|im_end|>\n
    tokens_per_name = 1     # If name field present

    for message in messages:
        tokens += tokens_per_message
        for key, value in message.items():
            tokens += len(encoding.encode(value))
            if key == "name":
                tokens += tokens_per_name

    tokens += 3  # Every reply starts with an assistant message

    return tokens

# Example
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Tell me about Python."},
    {"role": "assistant", "content": "Python is a high-level programming language..."}
]

total_tokens = count_message_tokens(messages)
print(f"Total tokens: {total_tokens}")
```

**Token Estimation (Quick Approximation):**

```python
def estimate_tokens(text):
    """
    Quick estimation: ~4 characters per token for English prose.

    Not accurate for API calls! Use tiktoken for production.
    Useful for rough checks and dashboards.
    """
    return len(text) // 4

# Example
text = "This is a sample text for estimation."
estimated = estimate_tokens(text)
actual = count_tokens(text)
print(f"Estimated: {estimated}, Actual: {actual}")
# Output: Estimated: 9, Actual: 10 (close but not exact)
```

## Part 2: Model Context Limits and Budgeting

### Context Window Sizes

```python
MODEL_LIMITS = {
    # OpenAI GPT-3.5
    "gpt-3.5-turbo": 4_096,
    "gpt-3.5-turbo-16k": 16_384,

    # OpenAI GPT-4
    "gpt-4": 8_192,
    "gpt-4-32k": 32_768,
    "gpt-4-turbo": 128_000,
    "gpt-4-turbo-2024-04-09": 128_000,

    # Anthropic Claude
    "claude-3-opus": 200_000,
    "claude-3-sonnet": 200_000,
    "claude-3-haiku": 200_000,

    # Open source
    "llama-2-7b": 4_096,
    "llama-2-13b": 4_096,
    "llama-2-70b": 4_096,
    "mistral-7b": 8_192,
    "mixtral-8x7b": 32_768,
}

def get_context_limit(model):
    """Get context window size for model."""
    return MODEL_LIMITS.get(model, 4_096)  # Default: 4k
```

### Token Budget Allocation

For systems with multiple components (RAG, chat with history), allocate tokens:

```python
def calculate_token_budget(
    model="gpt-3.5-turbo",
    system_message_tokens=None,
    query_tokens=None,
    output_tokens=500,
    safety_margin=50
):
    """
    Calculate remaining budget for context (e.g., retrieved documents).

    Args:
        model: LLM model name
        system_message_tokens: Tokens in system message (if known)
        query_tokens: Tokens in user query (if known)
        output_tokens: Reserved tokens for model output
        safety_margin: Extra buffer to prevent edge cases

    Returns:
        Available tokens for context
    """
    total_limit = get_context_limit(model)  # Falls back to 4k for unknown models

    # Reserve tokens
    reserved = (
        (system_message_tokens or 100) +  # System message (estimate if unknown)
        (query_tokens or 100) +           # User query (estimate if unknown)
        output_tokens +                   # Model response
        safety_margin                     # Safety buffer
    )

    context_budget = total_limit - reserved

    return {
        'total_limit': total_limit,
        'context_budget': context_budget,
        'reserved_system': system_message_tokens or 100,
        'reserved_query': query_tokens or 100,
        'reserved_output': output_tokens,
        'safety_margin': safety_margin
    }

# Example
budget = calculate_token_budget(
    model="gpt-3.5-turbo",
    system_message_tokens=50,
    query_tokens=20,
    output_tokens=500
)

print(f"Total limit: {budget['total_limit']:,}")
print(f"Context budget: {budget['context_budget']:,}")
# Output:
# Total limit: 4,096
# Context budget: 3,476 (can use for retrieved docs, chat history, etc.)
```

**RAG Token Budgeting:**

```python
def budget_for_rag(
    query,
    system_message="You are a helpful assistant. Answer using the provided context.",
    model="gpt-3.5-turbo",
    output_tokens=500
):
    """Calculate available tokens for retrieved documents in RAG."""
    system_tokens = count_tokens(system_message, model)
    query_tokens = count_tokens(query, model)

    budget = calculate_token_budget(
        model=model,
        system_message_tokens=system_tokens,
        query_tokens=query_tokens,
        output_tokens=output_tokens
    )

    return budget['context_budget']

# Example
query = "What is the company's return policy for defective products?"
available_tokens = budget_for_rag(query, model="gpt-3.5-turbo")
print(f"Available tokens for retrieved documents: {available_tokens}")
# Output: roughly 3,500 (exact value depends on the tokenized lengths)

# This means we can retrieve ~3,500 tokens worth of documents.
# At ~500 tokens/chunk, that's about 7 document chunks.
```

## Part 3: Chunking Strategies

When a document exceeds the context limit, split it into chunks and process them separately.

### Fixed-Size Chunking

**Simple approach:** Split into equal-sized chunks.

```python
def chunk_by_tokens(text, chunk_size=1000, overlap=200, model="gpt-3.5-turbo"):
    """
    Split text into fixed-size token chunks with overlap.

    Args:
        text: Text to chunk
        chunk_size: Target tokens per chunk
        overlap: Overlapping tokens between chunks (for continuity)
        model: Model for tokenization

    Returns:
        List of text chunks
    """
    encoding = tiktoken.encoding_for_model(model)
    tokens = encoding.encode(text)

    chunks = []
    start = 0

    while start < len(tokens):
        end = start + chunk_size
        chunk_tokens = tokens[start:end]
        chunk_text = encoding.decode(chunk_tokens)
        chunks.append(chunk_text)

        start += chunk_size - overlap  # Overlap for continuity

    return chunks

# Example
document = "Very long document with 10,000 tokens..." * 1000
chunks = chunk_by_tokens(document, chunk_size=1000, overlap=200)
print(f"Split into {len(chunks)} chunks")
for i, chunk in enumerate(chunks[:3]):
    print(f"Chunk {i+1}: {count_tokens(chunk)} tokens")
```

**Pros:**
- Simple, predictable chunk sizes
- Works for any text

**Cons:**
- May split mid-sentence, mid-paragraph (poor semantic boundaries)
- Overlap creates redundancy
- No awareness of document structure

### Semantic Chunking

**Better approach:** Split at semantic boundaries (paragraphs, sections).

```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

def chunk_semantically(text, chunk_size=1000, overlap=200):
    """
    Split text at semantic boundaries (paragraphs, sentences).

    Uses LangChain's RecursiveCharacterTextSplitter, which tries:
    1. Split by paragraphs (\n\n)
    2. If chunk still too large, split by sentences (. )
    3. If sentence still too large, split by words
    4. Last resort: split by characters
    """
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,    # Measured in tokens because of length_function below
        chunk_overlap=overlap,
        separators=["\n\n", "\n", ". ", " ", ""],  # Priority order
        length_function=count_tokens,  # Use actual token count, not characters
    )

    chunks = splitter.split_text(text)
    return chunks

# Example
document = """
# Introduction

This is the introduction to the document.
It contains several paragraphs of introductory material.

## Methods

The methods section describes the experimental procedure.
We used a randomized controlled trial with 100 participants.

## Results

The results show significant improvements in...
"""

chunks = chunk_semantically(document, chunk_size=500, overlap=50)
for i, chunk in enumerate(chunks):
    print(f"Chunk {i+1} ({count_tokens(chunk)} tokens):\n{chunk[:100]}...\n")
```

**Pros:**
- Respects semantic boundaries (complete paragraphs, sentences)
- Better context preservation
- More readable chunks

**Cons:**
- Chunk sizes vary (some may still be too large; see the sketch below)
- More complex implementation

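
One way to handle the oversized-chunk case: validate chunk sizes after splitting and re-split anything that still exceeds the budget. A minimal sketch using the `count_tokens` and `chunk_by_tokens` helpers defined above (the `enforce_chunk_budget` name is illustrative):

```python
def enforce_chunk_budget(chunks, max_tokens=1000, model="gpt-3.5-turbo"):
    """Pass through chunks that fit; re-split any chunk that exceeds max_tokens."""
    bounded = []
    for chunk in chunks:
        if count_tokens(chunk, model) <= max_tokens:
            bounded.append(chunk)
        else:
            # Fall back to fixed-size token splitting for the oversized chunk
            bounded.extend(chunk_by_tokens(chunk, chunk_size=max_tokens, overlap=0, model=model))
    return bounded

chunks = enforce_chunk_budget(chunk_semantically(document, chunk_size=500), max_tokens=500)
```
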
### Hierarchical Chunking (Map-Reduce)

**Best for summarization:** Summarize chunks, then summarize summaries.

```python
def hierarchical_summarization(document, chunk_size=3000, model="gpt-3.5-turbo"):
    """
    Summarize a long document using a map-reduce approach.

    1. Split document into chunks (MAP)
    2. Summarize each chunk individually
    3. Combine chunk summaries (REDUCE)
    4. Generate final summary from combined summaries
    """
    import openai

    # Step 1: Chunk document
    chunks = chunk_semantically(document, chunk_size=chunk_size)
    print(f"Split into {len(chunks)} chunks")

    # Step 2: Summarize each chunk (MAP)
    chunk_summaries = []
    for i, chunk in enumerate(chunks):
        response = openai.ChatCompletion.create(
            model=model,
            messages=[
                {"role": "system", "content": "Summarize the following text concisely."},
                {"role": "user", "content": chunk}
            ],
            temperature=0
        )
        summary = response.choices[0].message.content
        chunk_summaries.append(summary)
        print(f"Chunk {i+1} summary: {summary[:100]}...")

    # Step 3: Combine summaries (REDUCE)
    combined_summaries = "\n\n".join(chunk_summaries)

    # Step 4: Generate final summary
    final_response = openai.ChatCompletion.create(
        model=model,
        messages=[
            {"role": "system", "content": "Synthesize the following summaries into a comprehensive final summary."},
            {"role": "user", "content": combined_summaries}
        ],
        temperature=0
    )

    final_summary = final_response.choices[0].message.content
    return final_summary

# Example
long_document = "Research paper with 50,000 tokens..." * 100
summary = hierarchical_summarization(long_document, chunk_size=3000)
print(f"Final summary:\n{summary}")
```

**Pros:**
- Handles arbitrarily long documents
- Preserves information across the entire document
- Parallelizable (summarize chunks concurrently; see the sketch below)

**Cons:**
- More API calls (higher cost)
- Information loss in successive summarizations
- Slower than single-pass

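
Since the MAP step is independent per chunk, the per-chunk calls can run concurrently. A sketch using a thread pool with the same legacy `openai.ChatCompletion` interface as the example above (mind your API rate limits; `max_workers` is an assumption):

```python
from concurrent.futures import ThreadPoolExecutor

import openai

def summarize_chunk(chunk, model="gpt-3.5-turbo"):
    """Summarize a single chunk (same prompt as the MAP step above)."""
    response = openai.ChatCompletion.create(
        model=model,
        messages=[
            {"role": "system", "content": "Summarize the following text concisely."},
            {"role": "user", "content": chunk},
        ],
        temperature=0,
    )
    return response.choices[0].message.content

def summarize_chunks_concurrently(chunks, model="gpt-3.5-turbo", max_workers=5):
    """Run the MAP step in parallel; results keep the original chunk order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda chunk: summarize_chunk(chunk, model), chunks))
```
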
## Part 4: Intelligent Truncation Strategies

When chunking isn't appropriate (e.g., single-pass QA), truncate intelligently.

### Strategy 1: Truncate from Middle (Preserve Intro + Conclusion)

```python
def truncate_middle(text, max_tokens=3500, model="gpt-3.5-turbo"):
    """
    Keep beginning and end, truncate the middle.

    Useful for documents with an important intro (context) and conclusion (findings).
    """
    encoding = tiktoken.encoding_for_model(model)
    tokens = encoding.encode(text)

    if len(tokens) <= max_tokens:
        return text  # Fits, no truncation needed

    # Allocate 40% of the budget to the beginning and 40% to the end;
    # the remainder is slack for the truncation marker. Everything else
    # in the middle is dropped.
    keep_start = int(max_tokens * 0.4)
    keep_end = int(max_tokens * 0.4)

    start_tokens = tokens[:keep_start]
    end_tokens = tokens[-keep_end:]

    # Add marker showing truncation
    truncation_marker = encoding.encode("\n\n[... middle section truncated ...]\n\n")

    truncated_tokens = start_tokens + truncation_marker + end_tokens
    return encoding.decode(truncated_tokens)

# Example
document = """
Introduction: This paper presents a new approach to X.
Our hypothesis is that Y improves performance by 30%.

[... 10,000 tokens of methods, experiments, detailed results ...]

Conclusion: We demonstrated that Y improves performance by 31%,
confirming our hypothesis. Future work will explore Z.
"""

truncated = truncate_middle(document, max_tokens=500)
print(truncated)
# Output:
# Introduction: This paper presents...
# [... middle section truncated ...]
# Conclusion: We demonstrated that Y improves...
```

### Strategy 2: Truncate from Beginning (Keep Recent Context)

```python
def truncate_from_start(text, max_tokens=3500, model="gpt-3.5-turbo"):
    """
    Keep end, discard beginning.

    Useful for logs and conversations where recent context is most important.
    """
    encoding = tiktoken.encoding_for_model(model)
    tokens = encoding.encode(text)

    if len(tokens) <= max_tokens:
        return text

    # Keep last N tokens
    truncated_tokens = tokens[-max_tokens:]
    return encoding.decode(truncated_tokens)

# Example: Chat logs
conversation = """
[Turn 1 - 2 hours ago] User: How do I reset my password?
[Turn 2] Bot: Go to Settings > Security > Reset Password.
[... 50 turns ...]
[Turn 51 - just now] User: What was that password reset link again?
"""

truncated = truncate_from_start(conversation, max_tokens=200)
print(truncated)
# Output: [Turn 48] ... [Turn 51 - just now] User: What was that password reset link again?
```

### Strategy 3: Extractive Truncation (Keep Most Relevant)

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

def extractive_truncation(document, query, max_tokens=3000, model="gpt-3.5-turbo"):
    """
    Keep the sentences most relevant to the query.

    Uses TF-IDF similarity to rank sentences by relevance to the query.
    """
    # Split into sentences
    sentences = document.split('. ')

    # Calculate TF-IDF similarity to query
    vectorizer = TfidfVectorizer()
    vectors = vectorizer.fit_transform([query] + sentences)
    query_vec = vectors[0]
    sentence_vecs = vectors[1:]

    # Similarity scores
    similarities = cosine_similarity(query_vec, sentence_vecs)[0]

    # Rank sentences by similarity
    ranked_indices = np.argsort(similarities)[::-1]

    # Select sentences until token budget exhausted
    selected_sentences = []
    token_count = 0
    encoding = tiktoken.encoding_for_model(model)

    for idx in ranked_indices:
        sentence = sentences[idx] + '. '
        sentence_tokens = len(encoding.encode(sentence))

        if token_count + sentence_tokens <= max_tokens:
            selected_sentences.append((idx, sentence))
            token_count += sentence_tokens
        else:
            break

    # Sort selected sentences back into original order (maintain flow)
    selected_sentences.sort(key=lambda x: x[0])

    return ''.join([sent for _, sent in selected_sentences])

# Example
document = """
The company was founded in 1995 in Seattle.
Our return policy allows returns within 30 days of purchase.
Products must be in original condition with tags attached.
Refunds are processed within 5-7 business days.
We offer free shipping on orders over $50.
The company has 500 employees worldwide.
"""

query = "What is the return policy?"

truncated = extractive_truncation(document, query, max_tokens=150)
print(truncated)
# Output (illustrative): the return-policy, condition, and refund sentences are kept;
# unrelated sentences about company history and shipping are dropped.
```

## Part 5: Conversation Context Management

Multi-turn conversations require active context management to prevent unbounded growth.

### Strategy 1: Sliding Window

**Keep last N turns.**

```python
class SlidingWindowChatbot:
    def __init__(self, model="gpt-3.5-turbo", max_history=10):
        """
        Chatbot with sliding window context.

        Args:
            model: LLM model
            max_history: Maximum conversation turns to keep (user+assistant pairs)
        """
        self.model = model
        self.max_history = max_history
        self.system_message = {"role": "system", "content": "You are a helpful assistant."}
        self.messages = [self.system_message]

    def chat(self, user_message):
        """Add message, generate response, manage context."""
        import openai

        # Add user message
        self.messages.append({"role": "user", "content": user_message})

        # Apply sliding window (keep system + last N*2 messages)
        if len(self.messages) > (self.max_history * 2 + 1):  # +1 for system message
            self.messages = [self.system_message] + self.messages[-(self.max_history * 2):]

        # Generate response
        response = openai.ChatCompletion.create(
            model=self.model,
            messages=self.messages
        )

        assistant_message = response.choices[0].message.content
        self.messages.append({"role": "assistant", "content": assistant_message})

        return assistant_message

# Example
bot = SlidingWindowChatbot(max_history=5)  # Keep last 5 turns

for turn in range(20):
    user_msg = input("You: ")
    response = bot.chat(user_msg)
    print(f"Bot: {response}")

# Context automatically managed: always ≤ 11 messages (1 system + 5*2 user/assistant)
```

**Pros:**
- Simple, predictable
- Constant memory/cost
- Recent context preserved

**Cons:**
- Loses old context (user may reference earlier conversation)
- Fixed window may be too small or too large

### Strategy 2: Token-Based Truncation

**Keep messages until token budget exhausted.**

```python
class TokenBudgetChatbot:
    def __init__(self, model="gpt-3.5-turbo", max_tokens=3000):
        """
        Chatbot with token-based context management.

        Keeps messages until token budget exhausted (newest to oldest).
        """
        self.model = model
        self.max_tokens = max_tokens
        self.system_message = {"role": "system", "content": "You are a helpful assistant."}
        self.messages = [self.system_message]

    def chat(self, user_message):
        import openai

        # Add user message
        self.messages.append({"role": "user", "content": user_message})

        # Token management: keep system + recent messages within budget
        total_tokens = count_message_tokens(self.messages, self.model)

        while total_tokens > self.max_tokens and len(self.messages) > 2:
            # Remove oldest message (after system message)
            self.messages.pop(1)
            total_tokens = count_message_tokens(self.messages, self.model)

        # Generate response
        response = openai.ChatCompletion.create(
            model=self.model,
            messages=self.messages
        )

        assistant_message = response.choices[0].message.content
        self.messages.append({"role": "assistant", "content": assistant_message})

        return assistant_message

# Example
bot = TokenBudgetChatbot(max_tokens=2000)

for turn in range(20):
    user_msg = input("You: ")
    response = bot.chat(user_msg)
    print(f"Bot: {response}")
    print(f"Context tokens: {count_message_tokens(bot.messages)}")
```

**Pros:**
- Adaptive to message length (long messages = fewer kept, short messages = more kept)
- Precise budget control

**Cons:**
- Removes from beginning (loses early context)

### Strategy 3: Summarization + Sliding Window

**Best of both: Summarize old context, keep recent verbatim.**

```python
class SummarizingChatbot:
    def __init__(self, model="gpt-3.5-turbo", max_recent=5, summarize_threshold=10):
        """
        Chatbot with summarization + sliding window.

        When conversation exceeds threshold, summarize old turns and keep recent verbatim.

        Args:
            model: LLM model
            max_recent: Recent turns to keep verbatim
            summarize_threshold: Turns before summarizing old context
        """
        self.model = model
        self.max_recent = max_recent
        self.summarize_threshold = summarize_threshold
        self.system_message = {"role": "system", "content": "You are a helpful assistant."}
        self.messages = [self.system_message]
        self.summary = None  # Stores summary of old context

    def summarize_old_context(self):
        """Summarize older messages (beyond recent window)."""
        import openai

        # Messages to summarize: after system, before recent window
        num_messages = len(self.messages) - 1  # Exclude system message
        if num_messages <= self.summarize_threshold:
            return  # Not enough history yet

        # Split: old (to summarize) vs recent (keep verbatim)
        old_messages = self.messages[1:-(self.max_recent*2)]  # Exclude system + recent

        if not old_messages:
            return

        # Format for summarization
        conversation_text = "\n".join([
            f"{msg['role']}: {msg['content']}" for msg in old_messages
        ])

        # Generate summary
        response = openai.ChatCompletion.create(
            model=self.model,
            messages=[
                {"role": "system", "content": "Summarize the following conversation concisely, capturing key information, user goals, and important context."},
                {"role": "user", "content": conversation_text}
            ],
            temperature=0
        )

        self.summary = response.choices[0].message.content

        # Update messages: system + summary + recent
        recent_messages = self.messages[-(self.max_recent*2):]
        summary_message = {
            "role": "system",
            "content": f"Previous conversation summary: {self.summary}"
        }

        self.messages = [self.system_message, summary_message] + recent_messages

    def chat(self, user_message):
        import openai

        # Add user message
        self.messages.append({"role": "user", "content": user_message})

        # Check if summarization needed
        num_turns = (len(self.messages) - 1) // 2  # Exclude system message
        if num_turns >= self.summarize_threshold:
            self.summarize_old_context()

        # Generate response
        response = openai.ChatCompletion.create(
            model=self.model,
            messages=self.messages
        )

        assistant_message = response.choices[0].message.content
        self.messages.append({"role": "assistant", "content": assistant_message})

        return assistant_message

# Example
bot = SummarizingChatbot(max_recent=5, summarize_threshold=10)

# Long conversation
for turn in range(25):
    user_msg = input("You: ")
    response = bot.chat(user_msg)
    print(f"Bot: {response}")

# After turn 10, old context (turns 1-5) is summarized; turns 6-10+ are kept verbatim
```

**Pros:**
- Preserves full conversation history (in summary form)
- Recent context verbatim (maintains fluency)
- Bounded token usage

**Cons:**
- Extra API call for summarization (cost)
- Information loss in summary
- More complex

## Part 6: RAG Context Management

RAG systems retrieve documents and include them in context. Token budgeting is critical.

### Dynamic Document Retrieval (Budget-Aware)

```python
def retrieve_with_token_budget(
    query,
    documents,
    embeddings,
    model="gpt-3.5-turbo",
    output_tokens=500,
    max_docs=20
):
    """
    Retrieve documents dynamically based on token budget.

    Args:
        query: User query
        documents: List of document dicts [{"id": ..., "content": ...}, ...]
        embeddings: Pre-computed document embeddings
        model: LLM model
        output_tokens: Reserved for output
        max_docs: Maximum documents to consider

    Returns:
        Selected documents within token budget
    """
    from sentence_transformers import SentenceTransformer, util

    # Calculate available token budget
    available_tokens = budget_for_rag(query, model=model, output_tokens=output_tokens)

    # Retrieve top-k relevant documents (semantic search)
    query_embedding = SentenceTransformer('all-MiniLM-L6-v2').encode(query)
    similarities = util.cos_sim(query_embedding, embeddings)[0]
    top_indices = similarities.argsort(descending=True)[:max_docs]

    # Select documents until budget exhausted
    selected_docs = []
    token_count = 0

    for idx in top_indices:
        doc = documents[idx]
        doc_tokens = count_tokens(doc['content'], model)

        if token_count + doc_tokens <= available_tokens:
            selected_docs.append(doc)
            token_count += doc_tokens
        else:
            # Budget exhausted
            break

    return selected_docs, token_count

# Example
query = "What is our return policy?"
documents = [
    {"id": 1, "content": "Our return policy allows returns within 30 days..."},
    {"id": 2, "content": "Shipping is free on orders over $50..."},
    # ... 100 more documents
]
# embeddings: pre-computed once, e.g. SentenceTransformer('all-MiniLM-L6-v2').encode([d['content'] for d in documents])

selected, tokens_used = retrieve_with_token_budget(
    query, documents, embeddings, model="gpt-3.5-turbo"
)

print(f"Selected {len(selected)} documents using {tokens_used} tokens")
# Output: Selected 7 documents using 3,280 tokens (within budget)
```

### Chunk Re-Ranking with Token Budget

```python
def rerank_and_budget(query, chunks, model="gpt-3.5-turbo", max_tokens=3000):
    """
    Over-retrieve, re-rank, then select top chunks within the token budget.

    1. Retrieve k=20 candidates (coarse retrieval)
    2. Re-rank with cross-encoder (fine-grained scoring)
    3. Select top chunks until budget exhausted
    """
    from sentence_transformers import CrossEncoder

    # Re-rank with cross-encoder
    cross_encoder = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')
    pairs = [[query, chunk['content']] for chunk in chunks]
    scores = cross_encoder.predict(pairs)

    # Sort by relevance
    ranked_chunks = sorted(
        zip(chunks, scores),
        key=lambda x: x[1],
        reverse=True
    )

    # Select until budget exhausted
    selected_chunks = []
    token_count = 0

    for chunk, score in ranked_chunks:
        chunk_tokens = count_tokens(chunk['content'], model)

        if token_count + chunk_tokens <= max_tokens:
            selected_chunks.append((chunk, score))
            token_count += chunk_tokens
        else:
            break

    return selected_chunks, token_count

# Example
chunks = [
    {"id": 1, "content": "Return policy: 30 days with receipt..."},
    {"id": 2, "content": "Shipping: Free over $50..."},
    # ... 18 more chunks
]

selected, tokens = rerank_and_budget(query, chunks, max_tokens=3000)
print(f"Selected {len(selected)} chunks, {tokens} tokens")
```

## Part 7: Cost and Performance Optimization

Context management affects cost and latency.

### Cost Optimization

```python
def calculate_cost(tokens, model="gpt-3.5-turbo"):
    """
    Estimate API cost for a token count, priced at the input-token rate.

    Pricing (as of 2024, per 1k tokens):
    - GPT-3.5-turbo: $0.0015 input / $0.002 output
    - GPT-4: $0.03 input / $0.06 output
    - GPT-4-turbo: $0.01 input / $0.03 output
    """
    pricing = {
        "gpt-3.5-turbo": {"input": 0.0015, "output": 0.002},
        "gpt-3.5-turbo-16k": {"input": 0.003, "output": 0.004},
        "gpt-4": {"input": 0.03, "output": 0.06},
        "gpt-4-turbo": {"input": 0.01, "output": 0.03},
    }

    rates = pricing.get(model, {"input": 0.002, "output": 0.002})
    input_cost = (tokens / 1000) * rates["input"]

    return input_cost

# Example: Cost comparison
conversation_tokens = 3500
print(f"GPT-3.5: ${calculate_cost(conversation_tokens, 'gpt-3.5-turbo'):.4f}")
print(f"GPT-4: ${calculate_cost(conversation_tokens, 'gpt-4'):.4f}")
# Output:
# GPT-3.5: $0.0053
# GPT-4: $0.1050 (20× more expensive!)
```

**Cost optimization strategies:**
1. **Compression:** Summarize old context (reduce tokens)
2. **Smaller model:** Use GPT-3.5 instead of GPT-4 when possible
3. **Efficient retrieval:** Retrieve fewer, more relevant docs
4. **Caching:** Cache embeddings to avoid re-encoding (see the sketch below)

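
For strategy 4, even a process-local cache avoids re-encoding repeated queries and documents. A minimal sketch using `functools.lru_cache` around the same sentence-transformers model used elsewhere in this skill; a shared cache (e.g., a key-value store) is the natural next step for multi-process deployments:

```python
from functools import lru_cache

from sentence_transformers import SentenceTransformer

_embedder = SentenceTransformer("all-MiniLM-L6-v2")

@lru_cache(maxsize=10_000)
def cached_embedding(text: str):
    """Embed a string once; repeated calls with the same text hit the cache."""
    return _embedder.encode(text)
```
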
### Latency Optimization

```python
# Latency increases with context length

def measure_latency(context_tokens, model="gpt-3.5-turbo"):
    """
    Rough latency estimates (actual varies by API load).

    Latency = Fixed overhead + (tokens × per-token time)
    """
    fixed_overhead_ms = 500  # API call, network
    time_per_token_ms = {
        "gpt-3.5-turbo": 0.3,  # ~300ms per 1k tokens
        "gpt-4": 1.0,          # ~1s per 1k tokens (slower)
    }

    per_token = time_per_token_ms.get(model, 0.5)
    latency_ms = fixed_overhead_ms + (context_tokens * per_token)

    return latency_ms

# Example
for tokens in [500, 2000, 5000, 10000]:
    latency = measure_latency(tokens, "gpt-3.5-turbo")
    print(f"{tokens:,} tokens: {latency:.0f}ms ({latency/1000:.1f}s)")
# Output:
# 500 tokens: 650ms (0.7s)
# 2,000 tokens: 1,100ms (1.1s)
# 5,000 tokens: 2,000ms (2.0s)
# 10,000 tokens: 3,500ms (3.5s)
```

**Latency optimization strategies:**
1. **Reduce context:** Keep only essential information
2. **Parallel processing:** Process chunks concurrently (map-reduce)
3. **Streaming:** Stream responses to reduce perceived latency (see the sketch below)
4. **Caching:** Cache frequent queries

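
For strategy 3, streaming does not shorten total generation time, but the first tokens appear almost immediately. A sketch using the same legacy `openai.ChatCompletion` interface as the rest of this skill (the chunk format shown is the pre-1.0 openai-python one; adjust if you use a newer client):

```python
import openai

def stream_response(messages, model="gpt-3.5-turbo"):
    """Print tokens as they arrive and return the assembled response."""
    response = openai.ChatCompletion.create(
        model=model,
        messages=messages,
        stream=True,  # yields incremental deltas instead of one final message
    )
    parts = []
    for chunk in response:
        delta = chunk["choices"][0]["delta"].get("content", "")
        print(delta, end="", flush=True)
        parts.append(delta)
    return "".join(parts)
```
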
## Part 8: Complete Implementation Example

**RAG System with Full Context Management:**

```python
import openai
import tiktoken
from sentence_transformers import SentenceTransformer, util

class ManagedRAGSystem:
    def __init__(
        self,
        model="gpt-3.5-turbo",
        embedding_model="all-MiniLM-L6-v2",
        max_docs=20,
        output_tokens=500
    ):
        self.model = model
        self.embedding_model = SentenceTransformer(embedding_model)
        self.max_docs = max_docs
        self.output_tokens = output_tokens

    def query(self, question, documents):
        """
        Query the RAG system with full context management.

        Steps:
        1. Calculate token budget
        2. Retrieve relevant documents within budget
        3. Build context
        4. Generate response
        5. Return response with metadata
        """
        # Step 1: Calculate token budget
        system_message = "Answer the question using only the provided context."
        budget = calculate_token_budget(
            model=self.model,
            system_message_tokens=count_tokens(system_message, self.model),
            query_tokens=count_tokens(question, self.model),
            output_tokens=self.output_tokens
        )
        context_budget = budget['context_budget']

        # Step 2: Retrieve documents within budget
        query_embedding = self.embedding_model.encode(question)
        doc_embeddings = self.embedding_model.encode([doc['content'] for doc in documents])
        similarities = util.cos_sim(query_embedding, doc_embeddings)[0]
        top_indices = similarities.argsort(descending=True)[:self.max_docs]

        selected_docs = []
        token_count = 0

        for idx in top_indices:
            doc = documents[idx]
            doc_tokens = count_tokens(doc['content'], self.model)

            if token_count + doc_tokens <= context_budget:
                selected_docs.append(doc)
                token_count += doc_tokens
            else:
                break

        # Step 3: Build context
        context = "\n\n".join([doc['content'] for doc in selected_docs])

        # Step 4: Generate response
        messages = [
            {"role": "system", "content": system_message},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"}
        ]

        response = openai.ChatCompletion.create(
            model=self.model,
            messages=messages,
            temperature=0
        )

        answer = response.choices[0].message.content

        # Step 5: Return with metadata
        return {
            'answer': answer,
            'num_docs_retrieved': len(selected_docs),
            'context_tokens': token_count,
            'total_tokens': response.usage.total_tokens,
            # Approximation: prices all tokens at the input rate
            'cost': calculate_cost(response.usage.total_tokens, self.model)
        }

# Example usage
rag = ManagedRAGSystem(model="gpt-3.5-turbo")

documents = [
    {"id": 1, "content": "Our return policy allows returns within 30 days of purchase with receipt."},
    {"id": 2, "content": "Refunds are processed within 5-7 business days."},
    # ... more documents
]

result = rag.query("What is the return policy?", documents)

print(f"Answer: {result['answer']}")
print(f"Retrieved: {result['num_docs_retrieved']} documents")
print(f"Context tokens: {result['context_tokens']}")
print(f"Total tokens: {result['total_tokens']}")
print(f"Cost: ${result['cost']:.4f}")
```

## Summary

**Context window management is mandatory for production LLM systems.**

**Core strategies:**
1. **Token counting:** Always count tokens before API calls (tiktoken)
2. **Budgeting:** Allocate tokens to system, context, query, and output
3. **Chunking:** Fixed-size, semantic, or hierarchical for long documents
4. **Truncation:** Middle-out, recency-based, or extractive
5. **Conversation management:** Sliding window, token-based, or summarization
6. **RAG budgeting:** Dynamic retrieval, re-ranking with budget constraints

**Optimization:**
- Cost: Compression, smaller models, efficient retrieval
- Latency: Reduce context, parallel processing, streaming

**Implementation checklist:**
1. ✓ Count tokens with tiktoken (not character/word counts)
2. ✓ Check against model-specific limits
3. ✓ Allocate a token budget for multi-component systems
4. ✓ Choose an appropriate strategy (chunking, truncation, summarization)
5. ✓ Manage conversation context proactively
6. ✓ Monitor token usage, cost, and latency
7. ✓ Test with realistic data (long documents, long conversations)

Context is finite. Manage it deliberately, or face failed requests, quality degradation, and cost overruns.