# Phase 3: Iterative Improvement Examples
Examples of before/after prompt comparisons and result reports.
**📋 Related Documentation**: [Examples Home](./examples.md) | [Workflow Phase 3](./workflow_phase3.md) | [Prompt Optimization](./prompt_optimization.md)
---
### Example 3.1: Before/After Prompt Comparison
**Node**: analyze_intent
#### Before (Baseline)
```python
from langchain_anthropic import ChatAnthropic
from langchain_core.messages import SystemMessage, HumanMessage


# GraphState is the graph's TypedDict state, defined elsewhere in this workflow.
def analyze_intent(state: GraphState) -> GraphState:
    llm = ChatAnthropic(
        model="claude-3-5-sonnet-20241022",
        temperature=1.0
    )

    messages = [
        SystemMessage(content="You are an intent analyzer. Analyze user input."),
        HumanMessage(content=f"Analyze: {state['user_input']}")
    ]

    response = llm.invoke(messages)
    state["intent"] = response.content
    return state
```
**Issues**:

- Ambiguous instructions (no classification categories defined)
- No few-shot examples
- Free-text output (no structured format to parse downstream)
- High temperature (1.0) makes classifications inconsistent

**Result**: Accuracy 75%
#### After (Iteration 1)
```python
import json

from langchain_anthropic import ChatAnthropic
from langchain_core.messages import SystemMessage, HumanMessage


def analyze_intent(state: GraphState) -> GraphState:
    llm = ChatAnthropic(
        model="claude-3-5-sonnet-20241022",
        temperature=0.3  # Lower temperature for classification tasks
    )

    # Clear classification categories and few-shot examples
    system_prompt = """You are an intent classifier for a customer support chatbot.

Classify user input into one of these categories:
- "product_inquiry": Questions about products or services
- "technical_support": Technical issues or troubleshooting
- "billing": Payment, invoicing, or billing questions
- "general": General questions or chitchat

Output ONLY a valid JSON object with this structure:
{
  "intent": "<category>",
  "confidence": <0.0-1.0>,
  "reasoning": "<brief explanation>"
}

Examples:

Input: "How much does the premium plan cost?"
Output: {"intent": "product_inquiry", "confidence": 0.95, "reasoning": "Question about product pricing"}

Input: "I can't log into my account"
Output: {"intent": "technical_support", "confidence": 0.9, "reasoning": "Authentication issue"}

Input: "Why was I charged twice?"
Output: {"intent": "billing", "confidence": 0.95, "reasoning": "Question about billing charges"}

Input: "Hello, how are you?"
Output: {"intent": "general", "confidence": 0.85, "reasoning": "General greeting"}

Input: "What's the return policy?"
Output: {"intent": "product_inquiry", "confidence": 0.9, "reasoning": "Question about product policy"}
"""

    messages = [
        SystemMessage(content=system_prompt),
        HumanMessage(content=f"Input: {state['user_input']}\nOutput:")
    ]

    response = llm.invoke(messages)

    # JSON parsing (with error handling)
    try:
        intent_data = json.loads(response.content)
        state["intent"] = intent_data["intent"]
        state["confidence"] = intent_data["confidence"]
    except json.JSONDecodeError:
        # Fallback: default to "general" with low confidence
        state["intent"] = "general"
        state["confidence"] = 0.5

    return state
```
**Improvements**:

- ✅ temperature: 1.0 → 0.3
- ✅ Clear classification categories (4 intents)
- ✅ Few-shot examples (5 added)
- ✅ JSON output format (structured output)
- ✅ Error handling (fallback for JSON parsing failures)

**Result**: Accuracy 86% (+11%)
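
Accuracy numbers like these come from re-running a labeled test set against each prompt version. A minimal harness sketch; the test cases below are hypothetical and deliberately avoid the prompt's own few-shot examples so they don't inflate the score:

```python
def evaluate_intent_accuracy(test_cases: list[tuple[str, str]]) -> float:
    """Fraction of cases where analyze_intent returns the expected label."""
    correct = 0
    for user_input, expected in test_cases:
        state = analyze_intent({"user_input": user_input})
        correct += state["intent"] == expected
    return correct / len(test_cases)

# Hypothetical labeled cases, distinct from the few-shot examples in the prompt.
test_cases = [
    ("Do you ship to Canada?", "product_inquiry"),
    ("The app crashes every time I open it", "technical_support"),
    ("I need a copy of last month's invoice", "billing"),
]
print(f"Accuracy: {evaluate_intent_accuracy(test_cases):.1%}")
```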
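Beyond the `try`/`except` fallback, a further hardening option (not part of the iteration above) is LangChain's structured-output support, which validates the reply against a schema instead of hand-parsing JSON. A sketch, assuming a recent langchain-anthropic; the schema fields mirror the prompt's JSON contract:

```python
from langchain_anthropic import ChatAnthropic
from pydantic import BaseModel, Field

class IntentResult(BaseModel):
    intent: str = Field(description="One of: product_inquiry, technical_support, billing, general")
    confidence: float = Field(ge=0.0, le=1.0)
    reasoning: str

# with_structured_output returns parsed IntentResult objects directly,
# removing the need for json.loads and its fallback path.
structured_llm = ChatAnthropic(
    model="claude-3-5-sonnet-20241022",
    temperature=0.3,
).with_structured_output(IntentResult)

result = structured_llm.invoke("Input: I can't log into my account\nOutput:")
```
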
### Example 3.2: Prioritization Matrix
```markdown
## Improvement Prioritization Matrix

| Node              | Impact     | Feasibility | Implementation Cost | Total Score | Priority |
| ----------------- | ---------- | ----------- | ------------------- | ----------- | -------- |
| analyze_intent    | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐  | ⭐⭐⭐⭐            | 14/15       | 1st      |
| generate_response | ⭐⭐⭐⭐   | ⭐⭐⭐⭐    | ⭐⭐⭐⭐            | 12/15       | 2nd      |
| retrieve_context  | ⭐⭐       | ⭐⭐⭐      | ⭐⭐⭐              | 8/15        | 3rd      |

### Detailed Analysis

#### 1st: analyze_intent Node

- **Impact**: ⭐⭐⭐⭐⭐
  - Direct impact on Accuracy (accounts for 60% of the -15% gap)
  - Also affects downstream nodes (misclassifications cascade into later steps)
- **Feasibility**: ⭐⭐⭐⭐⭐
  - Improvement expected from few-shot examples
  - Similar cases show +10-15% improvement
- **Implementation Cost**: ⭐⭐⭐⭐
  - Implementation time: 30-60 minutes
  - Testing time: 30 minutes
  - Risk: Low

**Iteration 1 target**: analyze_intent node

#### 2nd: generate_response Node

- **Impact**: ⭐⭐⭐⭐
  - Main contributor to Latency and Cost (over 70% of the total)
  - Small direct impact on Accuracy
- **Feasibility**: ⭐⭐⭐⭐
  - A max_tokens cap guarantees a reduction in output length
  - Quality can be maintained with conciseness instructions
- **Implementation Cost**: ⭐⭐⭐⭐
  - Implementation time: 20-30 minutes
  - Testing time: 30 minutes
  - Risk: Low

**Iteration 2 target**: generate_response node
```
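For reference, the Total Score column is just the sum of the three axes, each rated 1-5 stars (more Implementation Cost stars reads as cheaper, since higher totals rank higher). A throwaway tally sketch:

```python
# Each axis is a 1-5 star rating; Total Score is their sum (max 15).
def total_score(impact: int, feasibility: int, cost: int) -> int:
    return impact + feasibility + cost

candidates = {
    "analyze_intent": total_score(5, 5, 4),     # 14/15 -> 1st
    "generate_response": total_score(4, 4, 4),  # 12/15 -> 2nd
    "retrieve_context": total_score(2, 3, 3),   # 8/15  -> 3rd
}
priority = sorted(candidates, key=candidates.get, reverse=True)
```
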
### Example 3.3: Iteration Results Report
```markdown
# Iteration 1 Evaluation Results

Execution Date/Time: 2024-11-24 12:00:00
Changes: analyze_intent node optimization

## Result Comparison

| Metric       | Baseline | Iteration 1 | Change     | Change Rate | Target | Achievement |
| ------------ | -------- | ----------- | ---------- | ----------- | ------ | ----------- |
| **Accuracy** | 75.0%    | **86.0%**   | **+11.0%** | +14.7%      | 90.0%  | 95.6%       |
| **Latency**  | 2.5s     | 2.4s        | -0.1s      | -4.0%       | 2.0s   | 83.3%       |
| **Cost/req** | $0.015   | $0.014      | -$0.001    | -6.7%       | $0.010 | 71.4%       |

## Detailed Analysis

### Accuracy Improvement

- **Improvement**: +11.0% (75.0% → 86.0%)
- **Remaining gap**: 4.0% (target: 90.0%)
- **Improved cases**: Intent classification errors reduced from 12 to 3 cases
- **Still needs improvement**: Context understanding cases (5 cases)

### Slight Latency Improvement

- **Improvement**: -0.1s (2.5s → 2.4s)
- **Main factor**: analyze_intent output became more concise at the lower temperature
- **Remaining bottleneck**: generate_response (average 1.8s)

### Slight Cost Reduction

- **Reduction**: -$0.001 (6.7% reduction)
- **Factor**: Fewer output tokens from analyze_intent
- **Main cost**: generate_response still accounts for 73%

## Statistical Significance

- **t-test**: p < 0.01 ✅ (statistically significant)
- **Effect size**: Cohen's d = 2.3 (large effect)
- **Confidence interval**: [83.9%, 88.1%] (95% CI)

## Next Iteration Strategy

### Priority 1: Optimize generate_response

- **Goal**: Latency 1.8s → 1.4s, Cost $0.011 → $0.007
- **Approach**:
  1. Add conciseness instructions
  2. Limit max_tokens to 500
  3. Adjust temperature from 0.7 → 0.5

### Priority 2: Final 4% Accuracy improvement

- **Goal**: 86.0% → 90.0% or higher
- **Approach**: Improve context understanding (retrieve_context node)

## Decision

✅ **Continue** → Proceed to Iteration 2

Reasons:

- Accuracy improved significantly but has not yet reached the target
- Latency and Cost still have room for improvement
- A clear improvement strategy is in place
```
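
The significance figures in the report can be reproduced with standard tooling. One common way to compute them, sketched with stand-in data: per-case scores (1 = correct, 0 = incorrect) from both runs over the same test set, a paired t-test, Cohen's d on the paired differences, and a normal-approximation 95% CI. Real runs would load the recorded results instead of the random placeholders here:

```python
import numpy as np
from scipy import stats

# Stand-in per-case scores (1 = correct, 0 = incorrect) over the same
# 100-case test set; substitute the recorded results from each run.
rng = np.random.default_rng(seed=42)
baseline_scores = rng.binomial(1, 0.75, size=100)
iter1_scores = rng.binomial(1, 0.86, size=100)

# Paired t-test: both prompt versions were run on the same test cases.
t_stat, p_value = stats.ttest_rel(iter1_scores, baseline_scores)

# Cohen's d for paired samples: mean of the differences over their std dev.
diff = iter1_scores - baseline_scores
cohens_d = diff.mean() / diff.std(ddof=1)

# 95% confidence interval for Iteration 1 accuracy (normal approximation).
mean = iter1_scores.mean()
se = iter1_scores.std(ddof=1) / np.sqrt(len(iter1_scores))
ci_low, ci_high = mean - 1.96 * se, mean + 1.96 * se

print(f"p={p_value:.4f}, d={cohens_d:.2f}, 95% CI=[{ci_low:.3f}, {ci_high:.3f}]")
```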
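Translated into code, the Priority 1 approach from the report might look like the following. A sketch only: the actual generate_response node is not shown in this document, so its shape, the `state["response"]` key, and the prompt wording are assumptions modeled on analyze_intent above:

```python
from langchain_anthropic import ChatAnthropic
from langchain_core.messages import SystemMessage, HumanMessage


def generate_response(state: GraphState) -> GraphState:
    llm = ChatAnthropic(
        model="claude-3-5-sonnet-20241022",
        temperature=0.5,  # was 0.7
        max_tokens=500,   # hard cap on output length
    )

    # Conciseness instruction targets the Latency/Cost bottleneck.
    system_prompt = (
        "You are a customer support assistant. "
        "Answer in at most 3 short sentences. "
        "Be direct; omit pleasantries and repetition."
    )

    messages = [
        SystemMessage(content=system_prompt),
        HumanMessage(content=state["user_input"]),
    ]
    state["response"] = llm.invoke(messages).content  # assumed state key
    return state
```
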
---