Initial commit

Zhongwei Li
2025-11-29 18:45:53 +08:00
commit bf626a95e2
68 changed files with 15159 additions and 0 deletions


@@ -0,0 +1,83 @@
# LangGraph Fine-Tune Skill
A comprehensive skill for iteratively optimizing prompts and processing logic in LangGraph applications based on evaluation criteria.
## Overview
The fine-tune skill helps you improve the performance of existing LangGraph applications through systematic prompt optimization without modifying the graph structure (nodes, edges configuration).
## Key Features
- **Iterative Optimization**: Data-driven improvement cycles with measurable results
- **Graph Structure Preservation**: Only optimize prompts and node logic, not the graph architecture
- **Statistical Evaluation**: Multiple runs with statistical analysis for reliable results
- **MCP Integration**: Leverages Serena MCP for codebase analysis and target identification
## When to Use
- LLM output quality needs improvement
- Response latency is too high
- Cost optimization is required
- Error rates need reduction
- Prompt engineering improvements are expected to help
## 4-Phase Workflow
### Phase 1: Preparation and Analysis
Understand optimization targets and current state.
- Load objectives from `.langgraph-master/fine-tune.md`
- Identify optimization targets using Serena MCP
- Create prioritized optimization target list
### Phase 2: Baseline Evaluation
Quantitatively measure current performance.
- Prepare evaluation environment (test cases, scripts)
- Measure baseline (3-5 runs recommended)
- Analyze results and identify problems
### Phase 3: Iterative Improvement
Data-driven incremental improvement cycle.
- Prioritize improvement areas by impact
- Implement prompt optimizations
- Re-evaluate under same conditions
- Compare results and decide next steps
- Repeat until goals are achieved
### Phase 4: Completion and Documentation
Record achievements and provide recommendations.
- Create final evaluation report
- Commit code changes
- Update documentation
## Key Optimization Techniques
| Technique | Expected Impact |
| --------------------------------- | --------------------------- |
| Few-Shot Examples | Accuracy +10-20% |
| Structured Output Format | Parsing errors -90% |
| Temperature/Max Tokens Adjustment | Cost -20-40% |
| Model Selection Optimization | Cost -40-60% |
| Prompt Caching | Cost -50-90% (on cache hit) |
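As a rough illustration of the parameter-level rows in the table (temperature/max tokens adjustment and model selection), the sketch below assumes the `langchain_anthropic` client used elsewhere in this skill; the node name and prompt are hypothetical, not taken from any target codebase.
```python
from langchain_anthropic import ChatAnthropic
from langchain_core.messages import HumanMessage, SystemMessage
# Hypothetical classification node: a smaller model, lower temperature, and a
# max_tokens cap trade a little flexibility for lower cost and latency
llm = ChatAnthropic(
    model="claude-3-5-haiku-20241022",
    temperature=0.3,
    max_tokens=200,
)
def classify_intent(user_input: str) -> str:
    messages = [
        SystemMessage(content="Classify the intent as product_inquiry, technical_support, billing, or general."),
        HumanMessage(content=user_input),
    ]
    return llm.invoke(messages).content
```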
## Best Practices
1. **Start Small**: Begin with the most impactful node
2. **Measurement-Driven**: Always quantify before and after improvements
3. **Incremental Changes**: Validate one change at a time
4. **Document Everything**: Record reasons and results for each change
5. **Iterate**: Continue improving until goals are achieved
## Important Constraints
- **Preserve Graph Structure**: Do not add/remove nodes or edges
- **Maintain Data Flow**: Do not change data flow between nodes
- **Keep State Schema**: Maintain the existing state schema
- **Evaluation Consistency**: Use same test cases and metrics throughout

skills/fine-tune/SKILL.md

@@ -0,0 +1,153 @@
---
name: fine-tune
description: Use when you need to fine-tune and optimize LangGraph applications based on evaluation criteria. This skill performs iterative prompt optimization for LangGraph nodes without changing the graph structure.
---
# LangGraph Application Fine-Tuning Skill
A skill for iteratively optimizing prompts and processing logic in each node of a LangGraph application based on evaluation criteria.
## 📋 Overview
This skill executes the following process to improve the performance of existing LangGraph applications:
1. **Load Objectives**: Retrieve optimization goals and evaluation criteria from `.langgraph-master/fine-tune.md` (if this file doesn't exist, help the user create it based on their requirements)
2. **Identify Optimization Targets**: Extract nodes containing LLM prompts using Serena MCP (if Serena MCP is unavailable, investigate the codebase using ls, read, etc.)
3. **Baseline Evaluation**: Measure current performance through multiple runs
4. **Implement Improvements**: Identify the most effective improvement areas and optimize prompts and processing logic
5. **Re-evaluation**: Measure performance after improvements
6. **Iteration**: Repeat steps 4-5 until goals are achieved
**Important Constraint**: Only optimize prompts and processing logic within each node without modifying the graph structure (nodes, edges configuration).
## 🎯 When to Use This Skill
Use this skill in the following situations:
1. **When performance improvement of existing applications is needed**
- Want to improve LLM output quality
- Want to improve response speed
- Want to reduce error rate
2. **When evaluation criteria are clear**
- Optimization goals are defined in `.langgraph-master/fine-tune.md`
- Quantitative evaluation methods are established
3. **When improvements through prompt engineering are expected**
- Improvements are likely with clearer LLM instructions
- Adding few-shot examples would be effective
- Output format adjustment is needed
## 📖 Fine-Tuning Workflow Overview
### Phase 1: Preparation and Analysis
**Purpose**: Understand optimization targets and current state
**Main Steps**:
1. Load objective setting file (`.langgraph-master/fine-tune.md`)
2. Identify optimization targets (Serena MCP or manual code investigation)
3. Create optimization target list (evaluate improvement potential for each node)
→ See [workflow.md](workflow.md#phase-1-preparation-and-analysis) for details
### Phase 2: Baseline Evaluation
**Purpose**: Quantitatively measure current performance
**Main Steps**:
4. Prepare evaluation environment (test cases, evaluation scripts)
5. Baseline measurement (recommended: 3-5 runs)
6. Analyze baseline results (identify problems)
**Important**: When evaluation programs are needed, create evaluation code in a specific subdirectory (users may specify the directory).
→ See [workflow.md](workflow.md#phase-2-baseline-evaluation) and [evaluation.md](evaluation.md) for details
### Phase 3: Iterative Improvement
**Purpose**: Data-driven incremental improvement
**Main Steps**:
7. Prioritization (select the most impactful improvement area)
8. Implement improvements (prompt optimization, parameter tuning)
9. Post-improvement evaluation (re-evaluate under the same conditions)
10. Compare and analyze results (measure improvement effects)
11. Decide whether to continue iteration (repeat until goals are achieved)
→ See [workflow.md](workflow.md#phase-3-iterative-improvement) and [prompt_optimization.md](prompt_optimization.md) for details
### Phase 4: Completion and Documentation
**Purpose**: Record achievements and provide future recommendations
**Main Steps**:
12. Create final evaluation report (improvement content, results, recommendations)
13. Code commit and documentation update
→ See [workflow.md](workflow.md#phase-4-completion-and-documentation) for details
## 🔧 Tools and Technologies Used
### MCP Server Utilization
- **Serena MCP**: Codebase analysis and optimization target identification
- `find_symbol`: Search for LLM clients
- `find_referencing_symbols`: Identify prompt construction locations
- `get_symbols_overview`: Understand node structure
- **Sequential MCP**: Complex analysis and decision making
- Determine improvement priorities
- Analyze evaluation results
- Plan next actions
### Key Optimization Techniques
1. **Few-Shot Examples**: Accuracy +10-20%
2. **Structured Output Format**: Parsing errors -90%
3. **Temperature/Max Tokens Adjustment**: Cost -20-40%
4. **Model Selection Optimization**: Cost -40-60%
5. **Prompt Caching**: Cost -50-90% (on cache hit)
→ See [prompt_optimization.md](prompt_optimization.md) for details
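The sketch below illustrates techniques 1 and 2 (few-shot examples plus a structured output format) using the `ChatPromptTemplate` style that appears in [examples_phase1.md](examples_phase1.md); the categories and inputs borrow from the test cases in [evaluation_testcases.md](evaluation_testcases.md) and are illustrative only.
```python
from langchain_core.prompts import ChatPromptTemplate
# Few-shot examples plus an explicit JSON output format make the expected
# behavior concrete and downstream parsing deterministic
intent_prompt = ChatPromptTemplate.from_messages([
    ("system",
     "Classify the user's intent as one of: product_inquiry, technical_support, billing, general.\n"
     'Respond with JSON only, e.g. {{"intent": "product_inquiry"}}.\n\n'
     "Examples:\n"
     "Input: How much does the premium plan cost?\n"
     'Output: {{"intent": "product_inquiry"}}\n'
     "Input: I can't seem to log into my account even after resetting my password\n"
     'Output: {{"intent": "technical_support"}}'),
    ("human", "{user_input}"),
])
messages = intent_prompt.format_messages(user_input="How much does the premium plan cost?")
```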
## 📚 Related Documentation
Detailed guidelines and best practices:
- **[workflow.md](workflow.md)** - Fine-tuning workflow details (execution procedures and code examples for each phase)
- **[evaluation.md](evaluation.md)** - Evaluation methods and best practices (metric calculation, statistical analysis, test case design)
- **[prompt_optimization.md](prompt_optimization.md)** - Prompt optimization techniques (10 practical methods and priorities)
- **[examples.md](examples.md)** - Practical examples collection (copy-and-paste ready code examples and template collection)
## ⚠️ Important Notes
1. **Preserve Graph Structure**
- Do not add or remove nodes or edges
- Do not change data flow between nodes
- Maintain state schema
2. **Evaluation Consistency**
- Use the same test cases
- Measure with the same evaluation metrics
- Run multiple times to confirm statistically significant improvements
3. **Cost Management**
- Consider evaluation execution costs
- Adjust sample size as needed
- Be mindful of API rate limits
4. **Version Control**
- Git commit each iteration's changes
- Maintain rollback-capable state
- Record evaluation results
## 🎓 Fine-Tuning Best Practices
1. **Start Small**: Optimize from the most impactful node
2. **Measurement-Driven**: Always perform quantitative evaluation before and after improvements
3. **Incremental Improvement**: Validate one change at a time, not multiple simultaneously
4. **Documentation**: Record reasons and results for each change
5. **Iteration**: Continuously improve until goals are achieved
## 🔗 Reference Links
- [LangGraph Official Documentation](https://docs.langchain.com/oss/python/langgraph/overview)
- [Prompt Engineering Guide](https://www.promptingguide.ai/)


@@ -0,0 +1,80 @@
# Evaluation Methods and Best Practices
Evaluation strategies, metrics, and best practices for fine-tuning LangGraph applications.
**💡 Tip**: For practical evaluation scripts and report templates, see [examples.md](examples.md#phase-2-baseline-evaluation-examples).
## 📚 Table of Contents
This guide is divided into the following sections:
### 1. [Evaluation Metrics Design](./evaluation_metrics.md)
Learn how to define and calculate metrics used for evaluation.
### 2. [Test Case Design](./evaluation_testcases.md)
Understand test case structure, coverage, and design principles.
### 3. [Statistical Significance Testing](./evaluation_statistics.md)
Master methods for multiple runs and statistical analysis.
### 4. [Evaluation Best Practices](./evaluation_practices.md)
Provides practical evaluation guidelines.
## 🎯 Quick Start
### For First-Time Evaluation
1. **[Understand Evaluation Metrics](./evaluation_metrics.md)** - Which metrics to measure
2. **[Design Test Cases](./evaluation_testcases.md)** - Create representative cases
3. **[Learn Statistical Methods](./evaluation_statistics.md)** - Importance of multiple runs
4. **[Follow Best Practices](./evaluation_practices.md)** - Effective evaluation implementation
### Improving Existing Evaluations
1. **[Add Metrics](./evaluation_metrics.md)** - More comprehensive evaluation
2. **[Improve Coverage](./evaluation_testcases.md)** - Enhance test cases
3. **[Strengthen Statistical Validation](./evaluation_statistics.md)** - Improve reliability
4. **[Introduce Automation](./evaluation_practices.md)** - Continuous evaluation pipeline
## 📖 Importance of Evaluation
In fine-tuning, evaluation provides:
- **Quantifying Improvements**: Objective progress measurement
- **Basis for Decision-Making**: Data-driven prioritization
- **Quality Assurance**: Prevention of regressions
- **ROI Demonstration**: Visualization of business value
## 💡 Basic Principles of Evaluation
For effective evaluation:
1. **Multiple Metrics**: Comprehensive assessment of quality, performance, cost, and reliability
2. **Statistical Validation**: Confirm significance through multiple runs
3. **Consistency**: Evaluate with the same test cases under the same conditions
4. **Visualization**: Track improvements with graphs and tables
5. **Documentation**: Record evaluation results and analysis
## 🔍 Troubleshooting
### Large Variance in Evaluation Results
→ Check [Statistical Significance Testing](./evaluation_statistics.md#outlier-detection-and-handling)
### Evaluation Takes Too Long
→ Implement staged evaluation in [Best Practices](./evaluation_practices.md#troubleshooting)
### Unclear Which Metrics to Measure
→ Check [Evaluation Metrics Design](./evaluation_metrics.md) for purpose and use cases of each metric
### Insufficient Test Cases
→ Refer to coverage analysis in [Test Case Design](./evaluation_testcases.md#test-case-design-principles)
## 📋 Related Documentation
- **[Prompt Optimization](./prompt_optimization.md)** - Techniques for prompt improvement
- **[Examples Collection](./examples.md)** - Samples of evaluation scripts and reports
- **[Workflow](./workflow.md)** - Overall fine-tuning flow including evaluation
- **[SKILL.md](./SKILL.md)** - Overview of the fine-tune skill
---
**💡 Tip**: For practical evaluation scripts and templates, see [examples.md](examples.md#phase-2-baseline-evaluation-examples).


@@ -0,0 +1,340 @@
# Evaluation Metrics Design
Definitions and calculation methods for evaluation metrics in LangGraph application fine-tuning.
**💡 Tip**: For practical evaluation scripts and report templates, see [examples.md](examples.md#phase-2-baseline-evaluation-examples).
## 📊 Importance of Evaluation
In fine-tuning, evaluation provides:
- **Quantifying Improvements**: Objective progress measurement
- **Basis for Decision-Making**: Data-driven prioritization
- **Quality Assurance**: Prevention of regressions
- **ROI Demonstration**: Visualization of business value
## 🎯 Evaluation Metric Categories
### 1. Quality Metrics
#### Accuracy
```python
def calculate_accuracy(predictions: List, ground_truth: List) -> float:
"""Calculate accuracy"""
correct = sum(p == g for p, g in zip(predictions, ground_truth))
return (correct / len(predictions)) * 100
# Example
predictions = ["product", "technical", "billing", "general"]
ground_truth = ["product", "billing", "billing", "general"]
accuracy = calculate_accuracy(predictions, ground_truth)
# => 75.0% (3/4 correct)
```
#### F1 Score (Multi-class Classification)
```python
from sklearn.metrics import f1_score, classification_report
def calculate_f1(predictions: List, ground_truth: List, average='weighted') -> float:
"""Calculate F1 score (multi-class support)"""
return f1_score(ground_truth, predictions, average=average)
# Detailed report
report = classification_report(ground_truth, predictions)
print(report)
"""
              precision    recall  f1-score   support

     product       1.00      1.00      1.00         1
   technical       0.00      0.00      0.00         1
     billing       0.50      1.00      0.67         1
     general       1.00      1.00      1.00         1

    accuracy                           0.75         4
   macro avg       0.62      0.75      0.67         4
weighted avg       0.62      0.75      0.67         4
"""
```
#### Semantic Similarity
```python
from sentence_transformers import SentenceTransformer, util
def calculate_semantic_similarity(
generated: str,
reference: str,
model_name: str = "all-MiniLM-L6-v2"
) -> float:
"""Calculate semantic similarity between generated and reference text"""
model = SentenceTransformer(model_name)
embeddings = model.encode([generated, reference], convert_to_tensor=True)
similarity = util.pytorch_cos_sim(embeddings[0], embeddings[1])
return similarity.item()
# Example
generated = "Our premium plan costs $49 per month."
reference = "The premium subscription is $49/month."
similarity = calculate_semantic_similarity(generated, reference)
# => 0.87 (high similarity)
```
#### BLEU Score (Text Generation Quality)
```python
from nltk.translate.bleu_score import sentence_bleu
def calculate_bleu(generated: str, reference: str) -> float:
"""Calculate BLEU score"""
reference_tokens = [reference.split()]
generated_tokens = generated.split()
return sentence_bleu(reference_tokens, generated_tokens)
# Example
generated = "The product costs forty nine dollars"
reference = "The product costs $49"
bleu = calculate_bleu(generated, reference)
# => 0.45
```
### 2. Performance Metrics
#### Latency (Response Time)
```python
import time
from typing import Dict, List
def measure_latency(test_cases: List[Dict]) -> Dict:
"""Measure latency for each node and total"""
results = {
"total": [],
"by_node": {}
}
for case in test_cases:
start_time = time.time()
# Measurement by node
node_times = {}
# Node 1: analyze_intent
node_start = time.time()
analyze_result = analyze_intent(case["input"])
node_times["analyze_intent"] = time.time() - node_start
# Node 2: retrieve_context
node_start = time.time()
context = retrieve_context(analyze_result)
node_times["retrieve_context"] = time.time() - node_start
# Node 3: generate_response
node_start = time.time()
response = generate_response(context, case["input"])
node_times["generate_response"] = time.time() - node_start
total_time = time.time() - start_time
results["total"].append(total_time)
for node, duration in node_times.items():
if node not in results["by_node"]:
results["by_node"][node] = []
results["by_node"][node].append(duration)
# Statistical calculation
import numpy as np
summary = {
"total": {
"mean": np.mean(results["total"]),
"p50": np.percentile(results["total"], 50),
"p95": np.percentile(results["total"], 95),
"p99": np.percentile(results["total"], 99),
}
}
for node, times in results["by_node"].items():
summary[node] = {
"mean": np.mean(times),
"p50": np.percentile(times, 50),
"p95": np.percentile(times, 95),
}
return summary
# Usage example
latency_results = measure_latency(test_cases)
print(f"Mean latency: {latency_results['total']['mean']:.2f}s")
print(f"P95 latency: {latency_results['total']['p95']:.2f}s")
```
#### Throughput
```python
import concurrent.futures
from typing import List, Dict
def measure_throughput(
test_cases: List[Dict],
max_workers: int = 10,
duration_seconds: int = 60
) -> Dict:
"""Measure number of requests processed within a given time"""
start_time = time.time()
completed = 0
errors = 0
def process_case(case):
try:
result = run_langgraph_app(case["input"])
return True
except Exception:
return False
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        while time.time() - start_time < duration_seconds:
            # Submit a batch of test cases so requests actually run in parallel
            # (up to max_workers), then collect the batch's results as they finish
            futures = [executor.submit(process_case, case) for case in test_cases]
            for future in concurrent.futures.as_completed(futures):
                if future.result():
                    completed += 1
                else:
                    errors += 1
elapsed = time.time() - start_time
return {
"completed": completed,
"errors": errors,
"elapsed": elapsed,
"throughput": completed / elapsed, # requests per second
"error_rate": errors / (completed + errors) if (completed + errors) > 0 else 0
}
# Usage example
throughput = measure_throughput(test_cases, max_workers=5, duration_seconds=30)
print(f"Throughput: {throughput['throughput']:.2f} req/s")
print(f"Error rate: {throughput['error_rate']*100:.2f}%")
```
### 3. Cost Metrics
#### Token Usage and Cost
```python
from typing import Dict
# Pricing table by model (as of November 2024)
PRICING = {
"claude-3-5-sonnet-20241022": {
"input": 3.0 / 1_000_000, # $3.00 per 1M input tokens
"output": 15.0 / 1_000_000, # $15.00 per 1M output tokens
},
"claude-3-5-haiku-20241022": {
"input": 0.8 / 1_000_000, # $0.80 per 1M input tokens
"output": 4.0 / 1_000_000, # $4.00 per 1M output tokens
}
}
def calculate_cost(token_usage: Dict, model: str) -> Dict:
"""Calculate cost from token usage"""
pricing = PRICING.get(model, PRICING["claude-3-5-sonnet-20241022"])
input_cost = token_usage["input_tokens"] * pricing["input"]
output_cost = token_usage["output_tokens"] * pricing["output"]
total_cost = input_cost + output_cost
return {
"input_tokens": token_usage["input_tokens"],
"output_tokens": token_usage["output_tokens"],
"total_tokens": token_usage["input_tokens"] + token_usage["output_tokens"],
"input_cost": input_cost,
"output_cost": output_cost,
"total_cost": total_cost,
"cost_breakdown": {
"input_pct": (input_cost / total_cost * 100) if total_cost > 0 else 0,
"output_pct": (output_cost / total_cost * 100) if total_cost > 0 else 0
}
}
# Usage example
token_usage = {"input_tokens": 1500, "output_tokens": 800}
cost = calculate_cost(token_usage, "claude-3-5-sonnet-20241022")
print(f"Total cost: ${cost['total_cost']:.4f}")
print(f"Input: ${cost['input_cost']:.4f} ({cost['cost_breakdown']['input_pct']:.1f}%)")
print(f"Output: ${cost['output_cost']:.4f} ({cost['cost_breakdown']['output_pct']:.1f}%)")
```
#### Cost per Request
```python
def calculate_cost_per_request(
test_results: List[Dict],
model: str
) -> Dict:
"""Calculate cost per request"""
total_cost = 0
total_input_tokens = 0
total_output_tokens = 0
for result in test_results:
cost = calculate_cost(result["token_usage"], model)
total_cost += cost["total_cost"]
total_input_tokens += result["token_usage"]["input_tokens"]
total_output_tokens += result["token_usage"]["output_tokens"]
num_requests = len(test_results)
return {
"total_requests": num_requests,
"total_cost": total_cost,
"cost_per_request": total_cost / num_requests,
"avg_input_tokens": total_input_tokens / num_requests,
"avg_output_tokens": total_output_tokens / num_requests,
"total_tokens": total_input_tokens + total_output_tokens
}
```
### 4. Reliability Metrics
#### Error Rate
```python
def calculate_error_rate(results: List[Dict]) -> Dict:
"""Analyze error rate and error types"""
total = len(results)
errors = [r for r in results if r.get("error")]
error_types = {}
for error in errors:
error_type = error["error"]["type"]
if error_type not in error_types:
error_types[error_type] = 0
error_types[error_type] += 1
return {
"total_requests": total,
"total_errors": len(errors),
"error_rate": len(errors) / total if total > 0 else 0,
"error_types": error_types,
"success_rate": (total - len(errors)) / total if total > 0 else 0
}
```
#### Retry Rate
```python
def calculate_retry_rate(results: List[Dict]) -> Dict:
"""Proportion of cases that required retries"""
total = len(results)
retried = [r for r in results if r.get("retry_count", 0) > 0]
return {
"total_requests": total,
"retried_requests": len(retried),
"retry_rate": len(retried) / total if total > 0 else 0,
"avg_retries": sum(r.get("retry_count", 0) for r in retried) / len(retried) if retried else 0
}
```
## 📋 Related Documentation
- [Test Case Design](./evaluation_testcases.md) - Test case structure and coverage
- [Statistical Significance Testing](./evaluation_statistics.md) - Multiple runs and statistical analysis
- [Evaluation Best Practices](./evaluation_practices.md) - Consistency, visualization, reporting


@@ -0,0 +1,324 @@
# Evaluation Best Practices
Practical guidelines for effective evaluation of LangGraph applications.
## 🎯 Evaluation Best Practices
### 1. Ensuring Consistency
#### Evaluation Under Same Conditions
```python
class EvaluationConfig:
"""Fix evaluation settings to ensure consistency"""
def __init__(self):
self.test_cases_path = "tests/evaluation/test_cases.json"
self.seed = 42 # For reproducibility
self.iterations = 5
self.timeout = 30 # seconds
self.model = "claude-3-5-sonnet-20241022"
def load_test_cases(self) -> List[Dict]:
"""Load the same test cases"""
with open(self.test_cases_path) as f:
data = json.load(f)
return data["test_cases"]
# Usage
config = EvaluationConfig()
test_cases = config.load_test_cases()
# Use the same test cases for all evaluations
```
### 2. Staged Evaluation
#### Start Small and Gradually Expand
```python
# Phase 1: Quick check (3 cases, 1 iteration)
quick_results = evaluate(test_cases[:3], iterations=1)
if quick_results["accuracy"] > baseline["accuracy"]:
# Phase 2: Medium check (10 cases, 3 iterations)
medium_results = evaluate(test_cases[:10], iterations=3)
if medium_results["accuracy"] > baseline["accuracy"]:
# Phase 3: Full evaluation (all cases, 5 iterations)
full_results = evaluate(test_cases, iterations=5)
```
### 3. Recording Evaluation Results
#### Structured Logging
```python
import json
from datetime import datetime
from pathlib import Path
def save_evaluation_result(
results: Dict,
version: str,
output_dir: Path = Path("evaluation_results")
):
"""Save evaluation results"""
output_dir.mkdir(exist_ok=True)
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
filename = f"{version}_{timestamp}.json"
full_results = {
"version": version,
"timestamp": timestamp,
"metrics": results,
"config": {
"model": "claude-3-5-sonnet-20241022",
"test_cases": len(test_cases),
"iterations": 5
}
}
with open(output_dir / filename, "w") as f:
json.dump(full_results, f, indent=2)
print(f"Results saved to: {output_dir / filename}")
# Usage
save_evaluation_result(results, version="baseline")
save_evaluation_result(results, version="iteration_1")
```
### 4. Visualization
#### Visualizing Results
```python
import matplotlib.pyplot as plt
def visualize_improvement(
baseline: Dict,
iterations: List[Dict],
metrics: List[str] = ["accuracy", "latency", "cost"]
):
"""Visualize improvement progress"""
fig, axes = plt.subplots(1, len(metrics), figsize=(15, 5))
for idx, metric in enumerate(metrics):
ax = axes[idx]
# Prepare data
x = ["Baseline"] + [f"Iter {i+1}" for i in range(len(iterations))]
y = [baseline[metric]] + [it[metric] for it in iterations]
# Plot
ax.plot(x, y, marker='o', linewidth=2)
ax.set_title(f"{metric.capitalize()} Progress")
ax.set_ylabel(metric.capitalize())
ax.grid(True, alpha=0.3)
# Goal line
if metric in baseline.get("goals", {}):
goal = baseline["goals"][metric]
ax.axhline(y=goal, color='r', linestyle='--', label='Goal')
ax.legend()
plt.tight_layout()
plt.savefig("evaluation_results/improvement_progress.png")
print("Visualization saved to: evaluation_results/improvement_progress.png")
```
## 📋 Evaluation Report Template
### Standard Report Format
```markdown
# Evaluation Report - [Version/Iteration]
Execution Date: 2024-11-24 12:00:00
Executed by: Claude Code (fine-tune skill)
## Configuration
- **Model**: claude-3-5-sonnet-20241022
- **Number of Test Cases**: 20
- **Number of Runs**: 5
- **Evaluation Duration**: 10 minutes
## Results Summary
| Metric | Mean | Std Dev | 95% CI | Goal | Achievement |
|--------|------|---------|--------|------|-------------|
| Accuracy | 86.0% | 2.1% | [83.9%, 88.1%] | 90.0% | 95.6% |
| Latency | 2.4s | 0.3s | [2.1s, 2.7s] | 2.0s | 83.3% |
| Cost | $0.014 | $0.001 | [$0.013, $0.015] | $0.010 | 71.4% |
## Detailed Analysis
### Accuracy
- **Improvement**: +11.0% (75.0% → 86.0%)
- **Statistical Significance**: p < 0.01 ✅
- **Effect Size**: Cohen's d = 2.3 (large)
### Latency
- **Improvement**: -0.1s (2.5s → 2.4s)
- **Statistical Significance**: p = 0.12 ❌ (not significant)
- **Effect Size**: Cohen's d = 0.3 (small)
## Error Analysis
- **Total Errors**: 0
- **Error Rate**: 0.0%
- **Retry Rate**: 0.0%
## Next Actions
1. ✅ Accuracy significantly improved → Continue
2. ⚠️ Latency improvement is small → Focus in next iteration
3. ⚠️ Cost target not yet met → Consider max_tokens limit
```
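For reference, the "Achievement" column above can be computed with a small helper like the one below (the name is illustrative): for metrics where higher is better (accuracy) achievement is actual/goal, and for metrics where lower is better (latency, cost) it is goal/actual.
```python
def achievement_pct(actual: float, goal: float, higher_is_better: bool) -> float:
    """Return goal achievement as a percentage."""
    ratio = actual / goal if higher_is_better else goal / actual
    return ratio * 100
print(f"{achievement_pct(86.0, 90.0, True):.1f}%")     # => 95.6%
print(f"{achievement_pct(2.4, 2.0, False):.1f}%")      # => 83.3%
print(f"{achievement_pct(0.014, 0.010, False):.1f}%")  # => 71.4%
```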
## 🔍 Troubleshooting
### Common Problems and Solutions
#### 1. Large Variance in Evaluation Results
**Symptom**: Standard deviation > 20% of mean
**Causes**:
- LLM temperature is too high
- Test cases are uneven
- Network latency effects
**Solutions**:
```python
# Lower temperature
llm = ChatAnthropic(
model="claude-3-5-sonnet-20241022",
temperature=0.3 # Set lower
)
# Increase number of runs
iterations = 10 # 5 → 10
# Remove outliers
results_clean = remove_outliers(results)
```
#### 2. Evaluation Takes Too Long
**Symptom**: Evaluation takes over 1 hour
**Causes**:
- Too many test cases
- Not running in parallel
- Timeout setting too long
**Solutions**:
```python
# Subset evaluation
quick_test_cases = test_cases[:10] # First 10 cases only
# Parallel execution
import concurrent.futures
with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
futures = [executor.submit(evaluate_case, case) for case in test_cases]
results = [f.result() for f in futures]
# Timeout setting
timeout = 10 # 30s → 10s
```
#### 3. No Statistical Significance
**Symptom**: p-value ≥ 0.05
**Causes**:
- Improvement effect is small
- Insufficient sample size
- High data variance
**Solutions**:
```python
# Aim for larger improvements
# - Apply multiple optimizations simultaneously
# - Choose more effective techniques
# Increase sample size
iterations = 20 # 5 → 20
# Reduce variance
# - Lower temperature
# - Stabilize evaluation environment
```
## 📊 Continuous Evaluation
### Scheduled Evaluation
```yaml
evaluation_schedule:
daily:
- quick_check: 3 test cases, 1 iteration
- purpose: Detect major regressions
weekly:
- medium_check: 10 test cases, 3 iterations
- purpose: Continuous quality monitoring
before_release:
- full_evaluation: all test cases, 5-10 iterations
- purpose: Release quality assurance
after_major_changes:
- comprehensive_evaluation: all test cases, 10+ iterations
- purpose: Impact assessment of major changes
```
### Automated Evaluation Pipeline
```bash
#!/bin/bash
# continuous_evaluation.sh
# Daily evaluation script
DATE=$(date +%Y%m%d)
RESULTS_DIR="evaluation_results/continuous/$DATE"
mkdir -p $RESULTS_DIR
# Quick check
echo "Running quick evaluation..."
uv run python -m tests.evaluation.evaluator \
--test-cases 3 \
--iterations 1 \
--output "$RESULTS_DIR/quick.json"
# Compare with previous results
uv run python -m tests.evaluation.compare \
--baseline "evaluation_results/baseline/summary.json" \
--current "$RESULTS_DIR/quick.json" \
--threshold 0.05
# Notify if regression detected
if [ $? -ne 0 ]; then
echo "⚠️ Regression detected! Sending notification..."
# Notification process (Slack, Email, etc.)
fi
```
## Summary
For effective evaluation:
- **Multiple Metrics**: Quality, performance, cost, reliability
- **Statistical Validation**: Multiple runs and significance testing
- **Consistency**: Same test cases, same conditions
- **Visualization**: Track improvements with graphs and tables
- **Documentation**: Record evaluation results and analysis
## 📋 Related Documentation
- [Evaluation Metrics](./evaluation_metrics.md) - Metric definitions and calculation methods
- [Test Case Design](./evaluation_testcases.md) - Test case structure
- [Statistical Significance](./evaluation_statistics.md) - Statistical analysis methods


@@ -0,0 +1,315 @@
# Statistical Significance Testing
Statistical approaches and significance testing in LangGraph application evaluation.
## 📈 Importance of Multiple Runs
### Why Multiple Runs Are Necessary
1. **Account for Randomness**: LLM outputs have probabilistic variation
2. **Detect Outliers**: Eliminate effects like temporary network latency
3. **Calculate Confidence Intervals**: Determine if improvements are statistically significant
### Recommended Number of Runs
| Phase | Runs | Purpose |
|-------|------|---------|
| **During Development** | 3 | Quick feedback |
| **During Evaluation** | 5 | Balanced reliability |
| **Before Production** | 10-20 | High statistical confidence |
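A minimal sketch of collecting one metric value per run for the analysis below, assuming an `evaluate_test_cases()` entry point like the one shown in [examples_phase2.md](./examples_phase2.md); both names are placeholders for your own evaluation harness.
```python
from typing import Dict, List
def run_repeated_evaluation(test_cases: List[Dict], iterations: int = 5) -> List[float]:
    """Run the same evaluation several times and keep one accuracy value per run."""
    accuracies = []
    for i in range(iterations):
        results = evaluate_test_cases(test_cases)  # assumed evaluation entry point
        accuracies.append(results["accuracy"])
        print(f"Run {i + 1}/{iterations}: accuracy = {results['accuracy']:.1f}%")
    return accuracies
```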
## 📊 Statistical Analysis
### Basic Statistical Calculations
```python
import numpy as np
from scipy import stats
def statistical_analysis(
baseline_results: List[float],
improved_results: List[float],
alpha: float = 0.05
) -> Dict:
"""Statistical comparison of baseline and improved versions"""
# Basic statistics
baseline_stats = {
"mean": np.mean(baseline_results),
"std": np.std(baseline_results),
"median": np.median(baseline_results),
"min": np.min(baseline_results),
"max": np.max(baseline_results)
}
improved_stats = {
"mean": np.mean(improved_results),
"std": np.std(improved_results),
"median": np.median(improved_results),
"min": np.min(improved_results),
"max": np.max(improved_results)
}
# Independent t-test
t_statistic, p_value = stats.ttest_ind(improved_results, baseline_results)
# Effect size (Cohen's d)
pooled_std = np.sqrt(
((len(baseline_results) - 1) * baseline_stats["std"]**2 +
(len(improved_results) - 1) * improved_stats["std"]**2) /
(len(baseline_results) + len(improved_results) - 2)
)
cohens_d = (improved_stats["mean"] - baseline_stats["mean"]) / pooled_std
# Improvement percentage
improvement_pct = (
(improved_stats["mean"] - baseline_stats["mean"]) /
baseline_stats["mean"] * 100
)
# Confidence intervals (95%)
ci_baseline = stats.t.interval(
0.95,
len(baseline_results) - 1,
loc=baseline_stats["mean"],
scale=stats.sem(baseline_results)
)
ci_improved = stats.t.interval(
0.95,
len(improved_results) - 1,
loc=improved_stats["mean"],
scale=stats.sem(improved_results)
)
# Determine statistical significance
is_significant = p_value < alpha
# Interpret effect size
effect_size_interpretation = (
"small" if abs(cohens_d) < 0.5 else
"medium" if abs(cohens_d) < 0.8 else
"large"
)
return {
"baseline": baseline_stats,
"improved": improved_stats,
"comparison": {
"improvement_pct": improvement_pct,
"t_statistic": t_statistic,
"p_value": p_value,
"is_significant": is_significant,
"cohens_d": cohens_d,
"effect_size": effect_size_interpretation
},
"confidence_intervals": {
"baseline": ci_baseline,
"improved": ci_improved
}
}
# Usage example
baseline_accuracy = [73.0, 75.0, 77.0, 74.0, 76.0] # 5 run results
improved_accuracy = [85.0, 87.0, 86.0, 88.0, 84.0] # 5 run results after improvement
analysis = statistical_analysis(baseline_accuracy, improved_accuracy)
print(f"Improvement: {analysis['comparison']['improvement_pct']:.1f}%")
print(f"P-value: {analysis['comparison']['p_value']:.4f}")
print(f"Significant: {analysis['comparison']['is_significant']}")
print(f"Effect size: {analysis['comparison']['effect_size']}")
```
## 🎯 Interpreting Statistical Significance
### P-value Interpretation
| P-value | Interpretation | Action |
|---------|---------------|--------|
| p < 0.01 | **Highly significant** | Adopt improvement with confidence |
| p < 0.05 | **Significant** | Can adopt as improvement |
| p < 0.10 | **Marginally significant** | Consider additional validation |
| p ≥ 0.10 | **Not significant** | Judge as no improvement effect |
### Effect Size (Cohen's d) Interpretation
| Cohen's d | Effect Size | Meaning |
|-----------|------------|---------|
| d < 0.2 | **Negligible** | No substantial improvement |
| 0.2 ≤ d < 0.5 | **Small** | Slight improvement |
| 0.5 ≤ d < 0.8 | **Medium** | Clear improvement |
| d ≥ 0.8 | **Large** | Significant improvement |
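The two tables can be combined into a simple go/no-go recommendation; the helper below mirrors their thresholds and is only a sketch (the function name is illustrative).
```python
def interpret_comparison(p_value: float, cohens_d: float) -> str:
    """Combine significance and effect size into a simple recommendation."""
    if p_value < 0.05 and abs(cohens_d) >= 0.5:
        return "Adopt the improvement (significant, medium-to-large effect)"
    if p_value < 0.05:
        return "Significant but small effect; consider further optimization"
    if p_value < 0.10:
        return "Marginally significant; run additional iterations before deciding"
    return "No demonstrated improvement; keep the baseline"
print(interpret_comparison(p_value=0.0001, cohens_d=2.3))
# => Adopt the improvement (significant, medium-to-large effect)
```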
## 📉 Outlier Detection and Handling
### Outlier Detection
```python
def detect_outliers(data: List[float], method: str = "iqr") -> List[int]:
"""Detect outlier indices"""
data_array = np.array(data)
if method == "iqr":
# IQR method (Interquartile Range)
q1 = np.percentile(data_array, 25)
q3 = np.percentile(data_array, 75)
iqr = q3 - q1
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr
outliers = [
i for i, val in enumerate(data)
if val < lower_bound or val > upper_bound
]
elif method == "zscore":
# Z-score method
mean = np.mean(data_array)
std = np.std(data_array)
z_scores = np.abs((data_array - mean) / std)
outliers = [i for i, z in enumerate(z_scores) if z > 3]
return outliers
# Usage example
results = [75.0, 76.0, 74.0, 77.0, 95.0] # 95.0 may be an outlier
outliers = detect_outliers(results, method="iqr")
print(f"Outlier indices: {outliers}") # => [4]
```
### Outlier Handling Policy
1. **Investigation**: Identify why outliers occurred
2. **Removal Decision**:
- Clear errors (network failure, etc.) → Remove
- Actual performance variation → Keep
3. **Documentation**: Document cause and handling of outliers
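The troubleshooting section in [evaluation_practices.md](./evaluation_practices.md) calls a `remove_outliers()` helper; a minimal sketch built on `detect_outliers` above is shown here. Silently dropping detected points is an assumption for illustration; per the policy above, investigate and document every removal.
```python
from typing import List
def remove_outliers(data: List[float], method: str = "iqr") -> List[float]:
    """Return the data with detected outliers removed (document each removal)."""
    outlier_indices = set(detect_outliers(data, method=method))
    return [value for i, value in enumerate(data) if i not in outlier_indices]
# Usage example
results = [75.0, 76.0, 74.0, 77.0, 95.0]
print(remove_outliers(results))  # => [75.0, 76.0, 74.0, 77.0]
```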
## 🔄 Considerations for Repeated Measurements
### Sample Size Calculation
```python
from scipy.stats import ttest_ind_from_stats
def required_sample_size(
baseline_mean: float,
baseline_std: float,
expected_improvement_pct: float,
alpha: float = 0.05,
power: float = 0.8
) -> int:
"""Estimate required sample size"""
improved_mean = baseline_mean * (1 + expected_improvement_pct / 100)
# Calculate effect size
effect_size = abs(improved_mean - baseline_mean) / baseline_std
# Simple estimation (use statsmodels.stats.power for more accuracy)
if effect_size < 0.2:
return 100 # Small effect requires many samples
elif effect_size < 0.5:
return 50
elif effect_size < 0.8:
return 30
else:
return 20
# Usage example
sample_size = required_sample_size(
baseline_mean=75.0,
baseline_std=3.0,
expected_improvement_pct=10.0
)
print(f"Required sample size: {sample_size}")
```
## 📊 Visualizing Confidence Intervals
```python
import matplotlib.pyplot as plt
def plot_confidence_intervals(
baseline_results: List[float],
improved_results: List[float],
labels: List[str] = ["Baseline", "Improved"]
):
"""Plot confidence intervals"""
fig, ax = plt.subplots(figsize=(10, 6))
# Statistical calculations
baseline_mean = np.mean(baseline_results)
baseline_ci = stats.t.interval(
0.95,
len(baseline_results) - 1,
loc=baseline_mean,
scale=stats.sem(baseline_results)
)
improved_mean = np.mean(improved_results)
improved_ci = stats.t.interval(
0.95,
len(improved_results) - 1,
loc=improved_mean,
scale=stats.sem(improved_results)
)
# Plot
positions = [1, 2]
means = [baseline_mean, improved_mean]
cis = [
(baseline_mean - baseline_ci[0], baseline_ci[1] - baseline_mean),
(improved_mean - improved_ci[0], improved_ci[1] - improved_mean)
]
ax.errorbar(positions, means, yerr=np.array(cis).T, fmt='o', markersize=10, capsize=10)
ax.set_xticks(positions)
ax.set_xticklabels(labels)
ax.set_ylabel("Metric Value")
ax.set_title("Comparison with 95% Confidence Intervals")
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig("confidence_intervals.png")
print("Plot saved: confidence_intervals.png")
```
## 📋 Statistical Report Template
```markdown
## Statistical Analysis Results
### Basic Statistics
| Metric | Baseline | Improved | Improvement |
|--------|----------|----------|-------------|
| Mean | 75.0% | 86.0% | +11.0% |
| Std Dev | 3.2% | 2.1% | -1.1% |
| Median | 75.0% | 86.0% | +11.0% |
| Min | 70.0% | 84.0% | +14.0% |
| Max | 80.0% | 88.0% | +8.0% |
### Statistical Tests
- **t-statistic**: 8.45
- **P-value**: 0.0001 (p < 0.01)
- **Statistical Significance**: ✅ Highly significant
- **Effect Size (Cohen's d)**: 2.3 (large)
### Confidence Intervals (95%)
- **Baseline**: [72.8%, 77.2%]
- **Improved**: [84.9%, 87.1%]
### Conclusion
The improvement is statistically highly significant (p < 0.01), with a large effect size (Cohen's d = 2.3).
There is no overlap in confidence intervals, confirming the improvement effect is certain.
```
## 📋 Related Documentation
- [Evaluation Metrics](./evaluation_metrics.md) - Metric definitions and calculation methods
- [Test Case Design](./evaluation_testcases.md) - Test case structure
- [Best Practices](./evaluation_practices.md) - Practical evaluation guide


@@ -0,0 +1,279 @@
# Test Case Design
Structure, coverage, and design principles for test cases used in LangGraph application evaluation.
## 🧪 Test Case Structure
### Representative Test Case Structure
```json
{
"test_cases": [
{
"id": "TC001",
"category": "product_inquiry",
"difficulty": "easy",
"input": "How much does the premium plan cost?",
"expected_intent": "product_inquiry",
"expected_answer": "The premium plan costs $49 per month.",
"expected_answer_semantic": ["premium", "plan", "$49", "month"],
"metadata": {
"user_type": "new",
"context_required": false
}
},
{
"id": "TC002",
"category": "technical_support",
"difficulty": "medium",
"input": "I can't seem to log into my account even after resetting my password",
"expected_intent": "technical_support",
"expected_answer": "Let me help you troubleshoot the login issue. First, please clear your browser cache and cookies, then try logging in again.",
"expected_answer_semantic": ["troubleshoot", "clear cache", "cookies", "try again"],
"metadata": {
"user_type": "existing",
"context_required": true,
"requires_escalation": false
}
},
{
"id": "TC003",
"category": "edge_case",
"difficulty": "hard",
"input": "yo whats the deal with my bill being so high lol",
"expected_intent": "billing",
"expected_answer": "I understand you have concerns about your bill. Let me review your account to identify any unexpected charges.",
"expected_answer_semantic": ["concerns", "bill", "review", "charges"],
"metadata": {
"user_type": "existing",
"context_required": true,
"tone": "informal",
"requires_empathy": true
}
}
]
}
```
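The `expected_answer_semantic` field above suggests a keyword-coverage check in addition to exact matching; the sketch below is one possible way to score it (an assumption for illustration, since the evaluator in [examples_phase2.md](./examples_phase2.md) compares against `expected_answer` directly).
```python
from typing import Dict
def keyword_coverage(answer: str, case: Dict) -> float:
    """Fraction of expected keywords that appear in the generated answer."""
    keywords = case.get("expected_answer_semantic", [])
    if not keywords:
        return 1.0
    answer_lower = answer.lower()
    hits = sum(1 for kw in keywords if kw.lower() in answer_lower)
    return hits / len(keywords)
# Usage example
case = {"expected_answer_semantic": ["premium", "plan", "$49", "month"]}
print(keyword_coverage("The premium plan costs $49 per month.", case))  # => 1.0
```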
## 📊 Test Case Coverage
### Balance by Category
```python
def analyze_test_coverage(test_cases: List[Dict]) -> Dict:
"""Analyze test case coverage"""
categories = {}
difficulties = {}
for case in test_cases:
# Category
cat = case.get("category", "unknown")
categories[cat] = categories.get(cat, 0) + 1
# Difficulty
diff = case.get("difficulty", "unknown")
difficulties[diff] = difficulties.get(diff, 0) + 1
total = len(test_cases)
return {
"total_cases": total,
"by_category": {
cat: {"count": count, "percentage": count/total*100}
for cat, count in categories.items()
},
"by_difficulty": {
diff: {"count": count, "percentage": count/total*100}
for diff, count in difficulties.items()
}
}
```
### Recommended Balance
```yaml
category_balance:
description: "Recommended distribution by category"
recommendations:
- main_categories: "20-30% (evenly distributed)"
- edge_cases: "10-15% (sufficient abnormal case coverage)"
difficulty_balance:
description: "Recommended distribution by difficulty"
recommendations:
- easy: "40-50% (basic functionality verification)"
- medium: "30-40% (practical cases)"
- hard: "10-20% (edge cases and complex scenarios)"
```
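A small sketch that checks a suite against the recommended difficulty balance above, reusing `analyze_test_coverage` from the previous section; the ranges are copied from the YAML and the function name is illustrative.
```python
from typing import Dict, List
DIFFICULTY_RANGES = {"easy": (40, 50), "medium": (30, 40), "hard": (10, 20)}
def check_difficulty_balance(test_cases: List[Dict]) -> List[str]:
    """Return warnings for difficulty levels outside the recommended ranges."""
    coverage = analyze_test_coverage(test_cases)  # defined above
    warnings = []
    for difficulty, (low, high) in DIFFICULTY_RANGES.items():
        pct = coverage["by_difficulty"].get(difficulty, {}).get("percentage", 0.0)
        if not low <= pct <= high:
            warnings.append(f"{difficulty}: {pct:.1f}% of cases (recommended {low}-{high}%)")
    return warnings
```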
## 🎯 Test Case Design Principles
### 1. Representativeness
- **Reflect Real Use Cases**: Cover actual user input patterns
- **Weight by Frequency**: Include more common cases
### 2. Diversity
- **Comprehensive Categories**: Cover all major categories
- **Difficulty Variation**: From easy to hard
- **Edge Cases**: Abnormal cases, ambiguous cases, boundary values
### 3. Clarity
- **Clear Expectations**: Be specific with expected_answer
- **Explicit Criteria**: Clearly define correctness criteria
### 4. Maintainability
- **ID-based Tracking**: Unique ID for each test case
- **Rich Metadata**: Category, difficulty, and other attributes
## 📝 Test Case Templates
### Basic Template
```json
{
"id": "TC[number]",
"category": "[category name]",
"difficulty": "easy|medium|hard",
"input": "[user input]",
"expected_intent": "[expected intent]",
"expected_answer": "[expected answer]",
"expected_answer_semantic": ["keyword1", "keyword2"],
"metadata": {
"user_type": "new|existing",
"context_required": true|false,
"specific_flag": true|false
}
}
```
### Templates by Category
#### Product Inquiry
```json
{
"id": "TC_PRODUCT_001",
"category": "product_inquiry",
"difficulty": "easy",
"input": "Question about product",
"expected_intent": "product_inquiry",
"expected_answer": "Answer including product information",
"metadata": {
"product_type": "premium|basic|enterprise",
"question_type": "pricing|features|comparison"
}
}
```
#### Technical Support
```json
{
"id": "TC_TECH_001",
"category": "technical_support",
"difficulty": "medium",
"input": "Technical problem report",
"expected_intent": "technical_support",
"expected_answer": "Troubleshooting steps",
"metadata": {
"issue_type": "login|performance|bug",
"requires_escalation": false,
"urgency": "low|medium|high"
}
}
```
#### Billing
```json
{
"id": "TC_BILLING_001",
"category": "billing",
"difficulty": "medium",
"input": "Billing question",
"expected_intent": "billing",
"expected_answer": "Billing explanation and next steps",
"metadata": {
"billing_type": "charge|refund|subscription",
"requires_account_access": true
}
}
```
#### Edge Cases
```json
{
"id": "TC_EDGE_001",
"category": "edge_case",
"difficulty": "hard",
"input": "Ambiguous, non-standard, or unexpected input",
"expected_intent": "appropriate fallback",
"expected_answer": "Polite clarification request",
"metadata": {
"edge_type": "ambiguous|off_topic|malformed",
"requires_empathy": true
}
}
```
## 🔍 Test Case Evaluation
### Quality Checklist
```python
def validate_test_case(test_case: Dict) -> List[str]:
"""Check test case quality"""
issues = []
# Check required fields
required_fields = ["id", "category", "difficulty", "input", "expected_intent"]
for field in required_fields:
if field not in test_case:
issues.append(f"Missing required field: {field}")
# ID uniqueness (requires separate check)
# Input length check
if len(test_case.get("input", "")) < 5:
issues.append("Input too short (minimum 5 characters)")
# Category validity
valid_categories = ["product_inquiry", "technical_support", "billing", "general", "edge_case"]
if test_case.get("category") not in valid_categories:
issues.append(f"Invalid category: {test_case.get('category')}")
# Difficulty validity
valid_difficulties = ["easy", "medium", "hard"]
if test_case.get("difficulty") not in valid_difficulties:
issues.append(f"Invalid difficulty: {test_case.get('difficulty')}")
return issues
```
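The comment in `validate_test_case` notes that ID uniqueness needs a separate check; a minimal sketch of that check is below (the helper name is illustrative).
```python
from collections import Counter
from typing import Dict, List
def find_duplicate_ids(test_cases: List[Dict]) -> List[str]:
    """Return test case IDs that appear more than once."""
    counts = Counter(case.get("id", "") for case in test_cases)
    return [case_id for case_id, count in counts.items() if count > 1]
```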
## 📈 Coverage Report
### Coverage Analysis Script
```python
def generate_coverage_report(test_cases: List[Dict]) -> str:
"""Generate test case coverage report"""
coverage = analyze_test_coverage(test_cases)
report = f"""# Test Case Coverage Report
## Summary
- **Total Test Cases**: {coverage['total_cases']}
## By Category
"""
for cat, data in coverage['by_category'].items():
report += f"- **{cat}**: {data['count']} cases ({data['percentage']:.1f}%)\n"
report += "\n## By Difficulty\n"
for diff, data in coverage['by_difficulty'].items():
report += f"- **{diff}**: {data['count']} cases ({data['percentage']:.1f}%)\n"
return report
```
## 📋 Related Documentation
- [Evaluation Metrics](./evaluation_metrics.md) - Metric definitions and calculation methods
- [Statistical Significance](./evaluation_statistics.md) - Multiple runs and statistical analysis
- [Best Practices](./evaluation_practices.md) - Practical evaluation guide


@@ -0,0 +1,119 @@
# Fine-Tuning Practical Examples Collection
A collection of specific code examples and markdown templates used for LangGraph application fine-tuning.
## 📋 Table of Contents
This guide is divided by Phase:
### [Phase 1: Preparation and Analysis Examples](./examples_phase1.md)
Templates and code examples used in the optimization preparation phase:
- **Example 1.1**: fine-tune.md structure example
- **Example 1.2**: Optimization target list example
- **Example 1.3**: Code search example with Serena MCP
**Estimated Time**: 30 minutes - 1 hour
### [Phase 2: Baseline Evaluation Examples](./examples_phase2.md)
Scripts and report examples used for current performance measurement:
- **Example 2.1**: Evaluation script (evaluator.py)
- **Example 2.2**: Baseline measurement script (baseline_evaluation.sh)
- **Example 2.3**: Baseline results report
**Estimated Time**: 1-2 hours
### [Phase 3: Iterative Improvement Examples](./examples_phase3.md)
Practical examples of prompt optimization and result comparison:
- **Example 3.1**: Before/After prompt comparison
- **Example 3.2**: Prioritization matrix
- **Example 3.3**: Iteration results report
**Estimated Time**: 1-2 hours per iteration × number of iterations
### [Phase 4: Completion and Documentation Examples](./examples_phase4.md)
Examples of recording final results and version control:
- **Example 4.1**: Final evaluation report (complete version)
- **Example 4.2**: Git commit message examples
**Estimated Time**: 30 minutes - 1 hour
## 🎯 How to Use
### For First-Time Implementation
1. **Start with [Phase 1 examples](./examples_phase1.md)** - Copy and use templates
2. **Set up [Phase 2 evaluation scripts](./examples_phase2.md)** - Customize for your environment
3. **Iterate using [Phase 3 comparison examples](./examples_phase3.md)** - Record Before/After
4. **Document with [Phase 4 report](./examples_phase4.md)** - Summarize final results
### Copy & Paste Ready
Each example includes complete code and templates:
- Python scripts → Ready to execute as-is
- Bash scripts → Set environment variables and run
- Markdown templates → Fill in content and use
- JSON structures → Templates for test cases and reports
## 📊 Types of Examples
### Code Scripts
- **Evaluation scripts** (Phase 2): evaluator.py, aggregate_results.py
- **Measurement scripts** (Phase 2): baseline_evaluation.sh
- **Analysis scripts** (Phase 1): Serena MCP search examples
### Markdown Templates
- **fine-tune.md** (Phase 1): Goal setting
- **Optimization target list** (Phase 1): Organizing improvement targets
- **Baseline results report** (Phase 2): Current state analysis
- **Iteration results report** (Phase 3): Improvement effect measurement
- **Final evaluation report** (Phase 4): Overall summary
### Comparison Examples
- **Before/After prompts** (Phase 3): Specific improvement examples
- **Prioritization matrix** (Phase 3): Decision-making records
## 🔍 Finding Examples
### By Purpose
| Purpose | Phase | Example |
|---------|-------|---------|
| Set goals | Phase 1 | [Example 1.1](./examples_phase1.md#example-11-fine-tunemd-structure-example) |
| Find optimization targets | Phase 1 | [Example 1.3](./examples_phase1.md#example-13-code-search-example-with-serena-mcp) |
| Create evaluation scripts | Phase 2 | [Example 2.1](./examples_phase2.md#example-21-evaluation-script) |
| Measure baseline | Phase 2 | [Example 2.2](./examples_phase2.md#example-22-baseline-measurement-script) |
| Improve prompts | Phase 3 | [Example 3.1](./examples_phase3.md#example-31-beforeafter-prompt-comparison) |
| Determine priorities | Phase 3 | [Example 3.2](./examples_phase3.md#example-32-prioritization-matrix) |
| Write final report | Phase 4 | [Example 4.1](./examples_phase4.md#example-41-final-evaluation-report) |
| Git commit | Phase 4 | [Example 4.2](./examples_phase4.md#example-42-git-commit-message-examples) |
## 🔗 Related Documentation
- **[Workflow](./workflow.md)** - Detailed procedures for each Phase
- **[Evaluation Methods](./evaluation.md)** - Evaluation metrics and statistical analysis
- **[Prompt Optimization](./prompt_optimization.md)** - Detailed optimization techniques
- **[SKILL.md](./SKILL.md)** - Overview of the Fine-tune skill
## 💡 Tips
### Customization Points
1. **Number of test cases**: Examples use 20 cases, but adjust according to your project
2. **Number of runs**: 3-5 runs recommended for baseline measurement, but adjust based on time constraints
3. **Target values**: Set Accuracy, Latency, and Cost targets according to project requirements
4. **Model**: Adjust pricing if using models other than Claude 3.5 Sonnet
### Frequently Asked Questions
**Q: Can I use the example code as-is?**
A: Yes, it's executable once you set environment variables (API keys, etc.).
**Q: Can I edit the templates?**
A: Yes, please customize freely according to your project.
**Q: Can I skip phases?**
A: We recommend executing all phases on the first run. From the second run onward, you can start from Phase 2.
---
**💡 Tip**: For detailed procedures of each Phase, refer to the [Workflow](./workflow.md).


@@ -0,0 +1,174 @@
# Phase 1: Preparation and Analysis Examples
Practical code examples and templates.
**📋 Related Documentation**: [Examples Home](./examples.md) | [Workflow Phase 1](./workflow_phase1.md)
---
## Phase 1: Preparation and Analysis Examples
### Example 1.1: fine-tune.md Structure Example
**File**: `.langgraph-master/fine-tune.md`
```markdown
# Fine-Tuning Goals
## Optimization Objectives
- **Accuracy**: Improve user intent classification accuracy to 90% or higher
- **Latency**: Reduce response time to 2.0 seconds or less
- **Cost**: Reduce cost per request to $0.010 or less
## Evaluation Method
### Test Cases
- **Dataset**: tests/evaluation/test_cases.json (20 cases)
- **Execution Command**: uv run python -m src.evaluate
- **Evaluation Script**: tests/evaluation/evaluator.py
### Evaluation Metrics
#### Accuracy (Correctness Rate)
- **Calculation Method**: (Number of correct answers / Total cases) × 100
- **Target Value**: 90% or higher
#### Latency (Response Time)
- **Calculation Method**: Average time of each execution
- **Target Value**: 2.0 seconds or less
#### Cost
- **Calculation Method**: Total API cost / Total number of requests
- **Target Value**: $0.010 or less
## Pass Criteria
All evaluation metrics must achieve their target values.
```
### Example 1.2: Optimization Target List Example
```markdown
# Optimization Target Nodes
## Node: analyze_intent
### Basic Information
- **File**: src/nodes/analyzer.py:25-45
- **Role**: Classify user input intent
- **LLM Model**: claude-3-5-sonnet-20241022
- **Current Parameters**: temperature=1.0, max_tokens=default
### Current Prompt
\```python
SystemMessage(content="You are an intent analyzer. Analyze user input.")
HumanMessage(content=f"Analyze: {user_input}")
\```
### Issues
1. **Ambiguous instructions**: Specific criteria for "Analyze" are unclear
2. **No few-shot examples**: No expected output examples
3. **Undefined output format**: Free text, not structured
4. **High temperature**: 1.0 is too high for classification tasks
### Improvement Proposals
1. Specify concrete classification categories
2. Add 3-5 few-shot examples
3. Specify JSON output format
4. Lower temperature to 0.3-0.5
### Estimated Improvement Effect
- **Accuracy**: +10-15% (Current misclassification 20% → 5-10%)
- **Latency**: ±0 (no change)
- **Cost**: ±0 (no change)
### Priority
⭐⭐⭐⭐⭐ (Highest priority) - Direct impact on accuracy improvement
---
## Node: generate_response
### Basic Information
- **File**: src/nodes/generator.py:45-68
- **Role**: Generate final user-facing response
- **LLM Model**: claude-3-5-sonnet-20241022
- **Current Parameters**: temperature=0.7, max_tokens=default
### Current Prompt
\```python
ChatPromptTemplate.from_messages([
("system", "Generate helpful response based on context."),
("human", "{context}\n\nQuestion: {question}")
])
\```
### Issues
1. **No redundancy control**: No instructions for conciseness
2. **max_tokens not set**: Possibility of unnecessarily long output
3. **Response style undefined**: No specification of tone or style
### Improvement Proposals
1. Add length instructions such as "concisely" or "in 2-3 sentences"
2. Limit max_tokens to 500
3. Clarify the response style ("friendly", "professional", etc.)
### Estimated Improvement Effect
- **Accuracy**: ±0 (no change)
- **Latency**: -0.3-0.5s (due to reduced output tokens)
- **Cost**: -20-30% (due to reduced token count)
### Priority
⭐⭐⭐ (Medium) - Improvement in latency and cost
```
### Example 1.3: Code Search Example with Serena MCP
```python
# Search for LLM client
from mcp_serena import find_symbol, find_referencing_symbols
# Step 1: Search for ChatAnthropic usage locations
chat_anthropic_usages = find_symbol(
name_path="ChatAnthropic",
substring_matching=True,
include_body=False
)
print(f"Found {len(chat_anthropic_usages)} ChatAnthropic usages")
# Step 2: Investigate details of each usage location
for usage in chat_anthropic_usages:
print(f"\nFile: {usage.relative_path}:{usage.line_start}")
print(f"Context: {usage.name_path}")
# Identify prompt construction locations
references = find_referencing_symbols(
name_path=usage.name,
relative_path=usage.relative_path
)
# Display locations that may contain prompts
for ref in references:
if "message" in ref.name.lower() or "prompt" in ref.name.lower():
print(f" - Potential prompt location: {ref.name_path}")
```
---


@@ -0,0 +1,194 @@
# Phase 2: Baseline Evaluation Examples
Examples of evaluation scripts and result reports.
**📋 Related Documentation**: [Examples Home](./examples.md) | [Workflow Phase 2](./workflow_phase2.md) | [Evaluation Methods](./evaluation.md)
---
## Phase 2: Baseline Evaluation Examples
### Example 2.1: Evaluation Script
**File**: `tests/evaluation/evaluator.py`
```python
import json
import time
from pathlib import Path
from typing import Dict, List
def evaluate_test_cases(test_cases: List[Dict]) -> Dict:
"""Evaluate test cases"""
results = {
"total_cases": len(test_cases),
"correct": 0,
"total_latency": 0.0,
"total_cost": 0.0,
"case_results": []
}
for case in test_cases:
start_time = time.time()
# Run LangGraph application
output = run_langgraph_app(case["input"])
latency = time.time() - start_time
# Correctness judgment
is_correct = output["answer"] == case["expected_answer"]
if is_correct:
results["correct"] += 1
# Cost calculation (from token usage)
cost = calculate_cost(output["token_usage"])
results["total_latency"] += latency
results["total_cost"] += cost
results["case_results"].append({
"case_id": case["id"],
"correct": is_correct,
"latency": latency,
"cost": cost
})
# Calculate metrics
results["accuracy"] = (results["correct"] / results["total_cases"]) * 100
results["avg_latency"] = results["total_latency"] / results["total_cases"]
results["avg_cost"] = results["total_cost"] / results["total_cases"]
return results
def calculate_cost(token_usage: Dict) -> float:
"""Calculate cost from token usage"""
# Claude 3.5 Sonnet pricing
INPUT_COST_PER_1M = 3.0 # $3.00 per 1M input tokens
OUTPUT_COST_PER_1M = 15.0 # $15.00 per 1M output tokens
input_cost = (token_usage["input_tokens"] / 1_000_000) * INPUT_COST_PER_1M
output_cost = (token_usage["output_tokens"] / 1_000_000) * OUTPUT_COST_PER_1M
return input_cost + output_cost
if __name__ == "__main__":
# Load test cases
with open("tests/evaluation/test_cases.json") as f:
test_cases = json.load(f)["test_cases"]
# Execute evaluation
results = evaluate_test_cases(test_cases)
# Save results
with open("evaluation_results/baseline_run.json", "w") as f:
json.dump(results, f, indent=2)
print(f"Accuracy: {results['accuracy']:.1f}%")
print(f"Avg Latency: {results['avg_latency']:.2f}s")
print(f"Avg Cost: ${results['avg_cost']:.4f}")
```
### Example 2.2: Baseline Measurement Script
**File**: `scripts/baseline_evaluation.sh`
```bash
#!/bin/bash
ITERATIONS=5
RESULTS_DIR="evaluation_results/baseline"
mkdir -p $RESULTS_DIR
echo "Starting baseline evaluation: $ITERATIONS iterations"
for i in $(seq 1 $ITERATIONS); do
echo "----------------------------------------"
echo "Iteration $i/$ITERATIONS"
echo "----------------------------------------"
uv run python -m tests.evaluation.evaluator \
--output "$RESULTS_DIR/run_$i.json" \
--verbose
echo "Completed iteration $i"
# API rate limit mitigation
if [ $i -lt $ITERATIONS ]; then
echo "Waiting 5 seconds before next iteration..."
sleep 5
fi
done
echo ""
echo "All iterations completed. Aggregating results..."
# Aggregate results
uv run python -m tests.evaluation.aggregate \
--input-dir "$RESULTS_DIR" \
--output "$RESULTS_DIR/summary.json"
echo "Baseline evaluation complete!"
echo "Results saved to: $RESULTS_DIR/summary.json"
```
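The script above passes `--output` and `--verbose` to the evaluator module, which the snippet in Example 2.1 does not yet parse. A minimal command-line entry point is sketched below; only the flag names come from the script, everything else is an assumption.

```python
import argparse
import json
from pathlib import Path

def main() -> None:
    parser = argparse.ArgumentParser(description="Run the evaluation test cases")
    parser.add_argument("--output", required=True, help="Path for the result JSON file")
    parser.add_argument("--verbose", action="store_true", help="Print per-case results")
    args = parser.parse_args()

    with open("tests/evaluation/test_cases.json") as f:
        test_cases = json.load(f)["test_cases"]

    results = evaluate_test_cases(test_cases)  # defined in Example 2.1

    output_path = Path(args.output)
    output_path.parent.mkdir(parents=True, exist_ok=True)
    with open(output_path, "w") as f:
        json.dump(results, f, indent=2)

    if args.verbose:
        for case in results["case_results"]:
            print(f"{case['case_id']}: correct={case['correct']}, latency={case['latency']:.2f}s")
    print(f"Accuracy: {results['accuracy']:.1f}%")

if __name__ == "__main__":
    main()
```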
### Example 2.3: Baseline Results Report
```markdown
# Baseline Evaluation Results
Execution Date/Time: 2024-11-24 10:00:00
Number of Runs: 5
Number of Test Cases: 20
## Evaluation Metrics Summary
| Metric | Average | Std Dev | Min | Max | Target | Gap |
| -------- | ------- | ------- | ------ | ------ | ------ | ---------- |
| Accuracy | 75.0% | 3.2% | 70.0% | 80.0% | 90.0% | **-15.0%** |
| Latency | 2.5s | 0.4s | 2.1s | 3.2s | 2.0s | **+0.5s** |
| Cost/req | $0.015 | $0.002 | $0.013 | $0.018 | $0.010 | **+$0.005** |
## Detailed Analysis
### Accuracy Issues
- **Current**: 75.0% (Target: 90.0%)
- **Main incorrect answer patterns**:
1. Intent classification errors: 12 cases (60% of errors)
2. Insufficient context understanding: 5 cases (25% of errors)
3. Ambiguous question handling: 3 cases (15% of errors)
### Latency Issues
- **Current**: 2.5s (Target: 2.0s)
- **Bottlenecks**:
1. generate_response node: Average 1.8s (72% of total)
2. analyze_intent node: Average 0.5s (20% of total)
3. Other: Average 0.2s (8% of total)
### Cost Issues
- **Current**: $0.015/req (Target: $0.010/req)
- **Cost breakdown**:
1. generate_response: $0.011 (73%)
2. analyze_intent: $0.003 (20%)
3. Other: $0.001 (7%)
- **Main factor**: High output token count (average 800 tokens)
## Improvement Directions
### Priority 1: Improve analyze_intent accuracy
- **Impact**: Direct impact on Accuracy (accounts for 60% of the -15% gap)
- **Improvement measures**: Few-shot examples, clear classification criteria, JSON output format
- **Estimated effect**: +10-12% accuracy
### Priority 2: Optimize generate_response efficiency
- **Impact**: Affects both Latency and Cost
- **Improvement measures**: Conciseness instructions, max_tokens limit, temperature adjustment
- **Estimated effect**: -0.4s latency, -$0.004 cost
```
---

View File

@@ -0,0 +1,230 @@
# Phase 3: Iterative Improvement Examples
Examples of before/after prompt comparisons and result reports.
**📋 Related Documentation**: [Examples Home](./examples.md) | [Workflow Phase 3](./workflow_phase3.md) | [Prompt Optimization](./prompt_optimization.md)
---
## Phase 3: Iterative Improvement Examples
### Example 3.1: Before/After Prompt Comparison
**Node**: analyze_intent
#### Before (Baseline)
```python
def analyze_intent(state: GraphState) -> GraphState:
llm = ChatAnthropic(
model="claude-3-5-sonnet-20241022",
temperature=1.0
)
messages = [
SystemMessage(content="You are an intent analyzer. Analyze user input."),
HumanMessage(content=f"Analyze: {state['user_input']}")
]
response = llm.invoke(messages)
state["intent"] = response.content
return state
```
**Issues**:
- Ambiguous instructions
- No few-shot examples
- Free text output
- High temperature
**Result**: Accuracy 75%
#### After (Iteration 1)
```python
import json  # needed to parse the structured JSON output below

def analyze_intent(state: GraphState) -> GraphState:
llm = ChatAnthropic(
model="claude-3-5-sonnet-20241022",
temperature=0.3 # Lower temperature for classification tasks
)
# Clear classification categories and few-shot examples
system_prompt = """You are an intent classifier for a customer support chatbot.
Classify user input into one of these categories:
- "product_inquiry": Questions about products or services
- "technical_support": Technical issues or troubleshooting
- "billing": Payment, invoicing, or billing questions
- "general": General questions or chitchat
Output ONLY a valid JSON object with this structure:
{
"intent": "<category>",
"confidence": <0.0-1.0>,
"reasoning": "<brief explanation>"
}
Examples:
Input: "How much does the premium plan cost?"
Output: {"intent": "product_inquiry", "confidence": 0.95, "reasoning": "Question about product pricing"}
Input: "I can't log into my account"
Output: {"intent": "technical_support", "confidence": 0.9, "reasoning": "Authentication issue"}
Input: "Why was I charged twice?"
Output: {"intent": "billing", "confidence": 0.95, "reasoning": "Question about billing charges"}
Input: "Hello, how are you?"
Output: {"intent": "general", "confidence": 0.85, "reasoning": "General greeting"}
Input: "What's the return policy?"
Output: {"intent": "product_inquiry", "confidence": 0.9, "reasoning": "Question about product policy"}
"""
messages = [
SystemMessage(content=system_prompt),
HumanMessage(content=f"Input: {state['user_input']}\nOutput:")
]
response = llm.invoke(messages)
# JSON parsing (with error handling)
try:
intent_data = json.loads(response.content)
state["intent"] = intent_data["intent"]
state["confidence"] = intent_data["confidence"]
except json.JSONDecodeError:
# Fallback
state["intent"] = "general"
state["confidence"] = 0.5
return state
```
**Improvements**:
- ✅ temperature: 1.0 → 0.3
- ✅ Clear classification categories (4 intents)
- ✅ Few-shot examples (5 added)
- ✅ JSON output format (structured output)
- ✅ Error handling (fallback for JSON parsing failures)
**Result**: Accuracy 86% (+11%)
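The new fallback path can be exercised without calling the API by stubbing the LLM response. The sketch below uses `unittest.mock`; the module path `src.nodes.analyzer` follows the file layout used in these examples and is an assumption.

```python
from unittest.mock import MagicMock, patch

def test_analyze_intent_falls_back_on_invalid_json():
    """If the model returns non-JSON text, the node should fall back to 'general'."""
    fake_response = MagicMock()
    fake_response.content = "Sorry, I cannot classify this."  # not valid JSON

    # Patch the LLM class where the node looks it up (module path is an assumption)
    with patch("src.nodes.analyzer.ChatAnthropic") as mock_llm_cls:
        mock_llm_cls.return_value.invoke.return_value = fake_response
        from src.nodes.analyzer import analyze_intent

        result = analyze_intent({"user_input": "???"})

    assert result["intent"] == "general"
    assert result["confidence"] == 0.5
```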
### Example 3.2: Prioritization Matrix
```markdown
## Improvement Prioritization Matrix
| Node | Impact | Feasibility | Implementation Cost | Total Score | Priority |
| ----------------- | ------------ | ------------ | ------------------- | ----------- | -------- |
| analyze_intent | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ | 14/15 | 1st |
| generate_response | ⭐⭐⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐⭐⭐ | 12/15 | 2nd |
| retrieve_context | ⭐⭐ | ⭐⭐⭐ | ⭐⭐⭐ | 8/15 | 3rd |
### Detailed Analysis
#### 1st: analyze_intent Node
- **Impact**: ⭐⭐⭐⭐⭐
- Direct impact on Accuracy (accounts for 60% of -15% gap)
- Also affects downstream nodes (chain errors from misclassification)
- **Feasibility**: ⭐⭐⭐⭐⭐
- Improvement expected from few-shot examples
- Similar cases show +10-15% improvement
- **Implementation Cost**: ⭐⭐⭐⭐
- Implementation time: 30-60 minutes
- Testing time: 30 minutes
- Risk: Low
**Iteration 1 target**: analyze_intent node
#### 2nd: generate_response Node
- **Impact**: ⭐⭐⭐⭐
- Main contributor to Latency and Cost (over 70% of total)
- Small direct impact on Accuracy
- **Feasibility**: ⭐⭐⭐⭐
- max_tokens limit ensures improvement
- Quality can be maintained with conciseness instructions
- **Implementation Cost**: ⭐⭐⭐⭐
- Implementation time: 20-30 minutes
- Testing time: 30 minutes
- Risk: Low
**Iteration 2 target**: generate_response node
```
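The total score column is simply the sum of the three star ratings (out of 5 each, 15 maximum). A small sketch for tallying and ranking candidates, using the ratings from the table above:

```python
def total_score(impact: int, feasibility: int, implementation_cost: int) -> int:
    """Sum of the three 1-5 star ratings (higher is better for each)."""
    return impact + feasibility + implementation_cost

candidates = {
    "analyze_intent": (5, 5, 4),
    "generate_response": (4, 4, 4),
    "retrieve_context": (2, 3, 3),
}

ranking = sorted(candidates.items(), key=lambda kv: total_score(*kv[1]), reverse=True)
for rank, (node, scores) in enumerate(ranking, start=1):
    print(f"{rank}. {node}: {total_score(*scores)}/15")
# -> 1. analyze_intent: 14/15, 2. generate_response: 12/15, 3. retrieve_context: 8/15
```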
### Example 3.3: Iteration Results Report
```markdown
# Iteration 1 Evaluation Results
Execution Date/Time: 2024-11-24 12:00:00
Changes: analyze_intent node optimization
## Result Comparison
| Metric | Baseline | Iteration 1 | Change | Change Rate | Target | Achievement |
| ------------ | -------- | ----------- | ---------- | ----------- | ------ | ----------- |
| **Accuracy** | 75.0% | **86.0%** | **+11.0%** | +14.7% | 90.0% | 95.6% |
| **Latency** | 2.5s | 2.4s | -0.1s | -4.0% | 2.0s | 80.0% |
| **Cost/req** | $0.015 | $0.014 | -$0.001 | -6.7% | $0.010 | 71.4% |
## Detailed Analysis
### Accuracy Improvement
- **Improvement**: +11.0% (75.0% → 86.0%)
- **Remaining gap**: 4.0% (Target 90.0%)
- **Improved cases**: Intent classification errors reduced from 12 → 3 cases
- **Still needs improvement**: Context understanding cases (5 cases)
### Slight Latency Improvement
- **Improvement**: -0.1s (2.5s → 2.4s)
- **Main factor**: analyze_intent output became more concise due to lower temperature
- **Remaining bottleneck**: generate_response (average 1.8s)
### Slight Cost Reduction
- **Reduction**: -$0.001 (6.7% reduction)
- **Factor**: analyze_intent output token reduction
- **Main cost**: generate_response still accounts for 73%
## Statistical Significance
- **t-test**: p < 0.01 ✅ (statistically significant)
- **Effect size**: Cohen's d = 2.3 (large effect)
- **Confidence interval**: [83.9%, 88.1%] (95% CI)
## Next Iteration Strategy
### Priority 1: Optimize generate_response
- **Goal**: Latency from 1.8s → 1.4s, Cost from $0.011 → $0.007
- **Approach**:
1. Add conciseness instructions
2. Limit max_tokens to 500
3. Adjust temperature from 0.7 → 0.5
### Priority 2: Final 4% Accuracy improvement
- **Goal**: 86.0% → 90.0% or higher
- **Approach**: Improve context understanding (retrieve_context node)
## Decision
**Continue** → Proceed to Iteration 2
Reasons:
- Accuracy improved significantly but still hasn't reached target
- Latency and Cost still have room for improvement
- Clear improvement strategy is in place
```
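The significance figures in this report (t-test, Cohen's d, 95% CI) can be computed from the per-run accuracy values. The sketch below assumes the five baseline runs and five Iteration 1 runs are available as arrays; the numbers shown are illustrative, not the report's actual data.

```python
import numpy as np
from scipy import stats

# Per-run accuracy (%) for each condition -- illustrative values, not the report's data
baseline = np.array([72.0, 74.0, 75.0, 76.0, 78.0])
iteration1 = np.array([84.0, 85.0, 86.0, 87.0, 88.0])

# Welch's t-test: is the accuracy improvement statistically significant?
t_stat, p_value = stats.ttest_ind(iteration1, baseline, equal_var=False)

# Cohen's d with a pooled standard deviation
pooled_std = np.sqrt((baseline.var(ddof=1) + iteration1.var(ddof=1)) / 2)
cohens_d = (iteration1.mean() - baseline.mean()) / pooled_std

# 95% confidence interval for the Iteration 1 mean (t distribution)
ci_low, ci_high = stats.t.interval(
    0.95, df=len(iteration1) - 1, loc=iteration1.mean(), scale=stats.sem(iteration1)
)

print(f"p-value: {p_value:.4f}, Cohen's d: {cohens_d:.2f}")
print(f"95% CI for Iteration 1 accuracy: [{ci_low:.1f}%, {ci_high:.1f}%]")
```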
---

View File

@@ -0,0 +1,288 @@
# Phase 4: Completion and Documentation Examples
Examples of final reports and Git commits.
**📋 Related Documentation**: [Examples Home](./examples.md) | [Workflow Phase 4](./workflow_phase4.md)
---
## Phase 4: Completion and Documentation Examples
### Example 4.1: Final Evaluation Report
```markdown
# LangGraph Application Fine-Tuning Completion Report
Project: Customer Support Chatbot
Implementation Period: 2024-11-24 10:00 - 2024-11-24 15:00 (5 hours)
Implementer: Claude Code (fine-tune skill)
## 🎯 Executive Summary
This fine-tuning project optimized the prompts for the LangGraph chatbot application and achieved the following results:
- ✅ **Accuracy**: 75.0% → 92.0% (+17.0%, target 90% achieved)
- ✅ **Latency**: 2.5s → 1.9s (-24.0%, target 2.0s achieved)
- ⚠️ **Cost**: $0.015 → $0.011 (-26.7%, target $0.010 not achieved)
A total of 3 iterations were conducted, achieving targets for 2 out of 3 metrics.
## 📊 Implementation Summary
### Number of Iterations and Execution Time
- **Total Iterations**: 3
- **Number of Nodes Optimized**: 2 (analyze_intent, generate_response)
- **Number of Evaluation Runs**: 20 times (Baseline 5 times + 5 times after each iteration × 3)
- **Total Execution Time**: Approximately 5 hours
### Final Results
| Metric | Initial | Final | Improvement | Improvement Rate | Target | Achievement Status |
| -------- | ------- | ------ | ----------- | ---------------- | ------ | ------------------ |
| Accuracy | 75.0% | 92.0% | +17.0% | +22.7% | 90.0% | ✅ 102.2% |
| Latency | 2.5s | 1.9s | -0.6s | -24.0% | 2.0s | ✅ 95.0% |
| Cost/req | $0.015 | $0.011 | -$0.004 | -26.7% | $0.010 | ⚠️ 90.9% |
## 📝 Details by Iteration
### Iteration 1: Optimize analyze_intent Node
**Implementation Date/Time**: 2024-11-24 11:00
**Target Node**: src/nodes/analyzer.py:25-45
**Changes**:
1. temperature: 1.0 → 0.3
2. Added 5 few-shot examples
3. Structured into JSON output format
4. Defined clear classification categories (4 categories)
**Results**:
- Accuracy: 75.0% → 86.0% (+11.0%)
- Latency: 2.5s → 2.4s (-0.1s)
- Cost: $0.015 → $0.014 (-$0.001)
**Learnings**: Few-shot examples and clear output format are most effective for accuracy improvement
---
### Iteration 2: Optimize generate_response Node
**Implementation Date/Time**: 2024-11-24 13:00
**Target Node**: src/nodes/generator.py:45-68
**Changes**:
1. Added conciseness instructions ("respond in 2-3 sentences")
2. max_tokens: unlimited → 500
3. temperature: 0.7 → 0.5
4. Clarified response style
**Results**:
- Accuracy: 86.0% → 88.0% (+2.0%)
- Latency: 2.4s → 2.0s (-0.4s)
- Cost: $0.014 → $0.011 (-$0.003)
**Learnings**: max_tokens limit significantly contributes to latency and cost reduction
---
### Iteration 3: Additional Improvements to analyze_intent
**Implementation Date/Time**: 2024-11-24 14:30
**Target Node**: src/nodes/analyzer.py:25-45
**Changes**:
1. Increased few-shot examples from 5 → 10
2. Added edge case handling
3. Reclassification logic based on confidence threshold
**Results**:
- Accuracy: 88.0% → 92.0% (+4.0%)
- Latency: 2.0s → 1.9s (-0.1s)
- Cost: $0.011 → $0.011 (±0)
**Learnings**: Additional few-shot examples broke through the final accuracy barrier
## 🔧 Final Changes Summary
### src/nodes/analyzer.py
**Changed Lines**: 25-45
**Main Changes**:
- temperature: 1.0 → 0.3
- Few-shot examples: 0 → 10
- Output: Free text → JSON
- Added fallback based on confidence threshold
---
### src/nodes/generator.py
**Changed Lines**: 45-68
**Main Changes**:
- temperature: 0.7 → 0.5
- max_tokens: unlimited → 500
- Clear conciseness instructions ("2-3 sentences")
- Added response style guidelines
## 📈 Detailed Evaluation Results
### Improvement Status by Test Case
| Case ID | Category | Before | After | Improvement |
| ------- | --------- | ----------- | ----------- | ----------- |
| TC001 | Product | ❌ Wrong | ✅ Correct | ✅ |
| TC002 | Technical | ❌ Wrong | ✅ Correct | ✅ |
| TC003 | Billing | ✅ Correct | ✅ Correct | - |
| ... | ... | ... | ... | ... |
| TC020 | Technical | ✅ Correct | ✅ Correct | - |
**Improved Cases**: 15/20 (75%)
**Maintained Cases**: 5/20 (25%)
**Degraded Cases**: 0/20 (0%)
### Latency Breakdown
| Node | Before | After | Change | Change Rate |
| ----------------- | ------ | ----- | ------ | ----------- |
| analyze_intent | 0.5s | 0.4s | -0.1s | -20% |
| retrieve_context | 0.2s | 0.2s | ±0s | 0% |
| generate_response | 1.8s | 1.3s | -0.5s | -28% |
| **Total** | **2.5s** | **1.9s** | **-0.6s** | **-24%** |
### Cost Breakdown
| Node | Before | After | Change | Change Rate |
| ----------------- | ------- | ------- | -------- | ----------- |
| analyze_intent | $0.003 | $0.003 | ±$0 | 0% |
| retrieve_context | $0.001 | $0.001 | ±$0 | 0% |
| generate_response | $0.011 | $0.007 | -$0.004 | -36% |
| **Total** | **$0.015** | **$0.011** | **-$0.004** | **-27%** |
## 💡 Future Recommendations
### Short-term (1-2 weeks)
1. **Achieve Cost Target**: $0.011 → $0.010
- Approach: Consider partial migration to Claude 3.5 Haiku
- Estimated effect: -$0.002-0.003/req
2. **Further Accuracy Improvement**: 92.0% → 95.0%
- Approach: Analyze error cases and add few-shot examples
- Estimated effect: +3.0%
### Mid-term (1-2 months)
1. **Model Optimization**
- Use Haiku for simple intent classification
- Use Sonnet only for complex response generation
- Estimated effect: -30-40% cost, minimal impact on latency
2. **Utilize Prompt Caching**
- Cache system prompts and few-shot examples
- Estimated effect: -50% cost (when cache hits)
### Long-term (3-6 months)
1. **Consider Fine-tuned Models**
- Model fine-tuning with proprietary data
- Concise prompts without few-shot examples
- Estimated effect: -60% cost, +5% accuracy
## 🎓 Conclusion
This project achieved the following through fine-tuning the LangGraph application:
**Successes**:
1. Significant accuracy improvement (+22.7%) - Exceeded target by 2.2%
2. Notable latency improvement (-24.0%) - Exceeded target by 5%
3. Cost reduction (-26.7%) - 9.1% away from target
⚠️ **Challenges**:
1. Cost target not achieved ($0.011 vs $0.010 target) - Can be addressed by migrating to lighter models
📈 **Business Impact**:
- Improved user satisfaction (due to accuracy improvement)
- Reduced operational costs (due to latency and cost reduction)
- Improved scalability (efficient resource usage)
🎯 **Next Steps**:
1. Verify migration to lighter models for cost reduction
2. Continuous monitoring and evaluation
3. Expand to new use cases
---
Created Date/Time: 2024-11-24 15:00:00
Creator: Claude Code (fine-tune skill)
```
### Example 4.2: Git Commit Message Examples
```bash
# Iteration 1 commit
git commit -m "feat(nodes): optimize analyze_intent prompt for accuracy
- Add temperature control (1.0 -> 0.3) for deterministic classification
- Add 5 few-shot examples for intent categories
- Implement JSON structured output format
- Add error handling for JSON parsing failures
Results:
- Accuracy: 75.0% -> 86.0% (+11.0%)
- Latency: 2.5s -> 2.4s (-0.1s)
- Cost: \$0.015 -> \$0.014 (-\$0.001)
Related: fine-tune iteration 1
See: evaluation_results/iteration_1/"
# Iteration 2 commit
git commit -m "feat(nodes): optimize generate_response for latency and cost
- Add conciseness guidelines (2-3 sentences)
- Set max_tokens limit to 500
- Adjust temperature (0.7 -> 0.5) for consistency
- Define response style and tone
Results:
- Accuracy: 86.0% -> 88.0% (+2.0%)
- Latency: 2.4s -> 2.0s (-0.4s, -17%)
- Cost: \$0.014 -> \$0.011 (-\$0.003, -21%)
Related: fine-tune iteration 2
See: evaluation_results/iteration_2/"
# Final commit
git commit -m "feat(nodes): finalize fine-tuning with additional improvements
Complete fine-tuning process with 3 iterations:
- analyze_intent: 10 few-shot examples, confidence threshold
- generate_response: conciseness and style optimization
Final Results:
- Accuracy: 75.0% -> 92.0% (+17.0%, goal 90% ✅)
- Latency: 2.5s -> 1.9s (-0.6s, -24%, goal 2.0s ✅)
- Cost: \$0.015 -> \$0.011 (-\$0.004, -27%, goal \$0.010 ⚠️)
Related: fine-tune completion
See: evaluation_results/final_report.md"
# Evaluation results commit
git commit -m "docs: add fine-tuning evaluation results and final report
- Baseline evaluation (5 iterations)
- Iteration 1-3 results
- Final comprehensive report
- Statistical analysis and recommendations"
```
---
## 📚 Related Documentation
- [SKILL.md](SKILL.md) - Skill overview
- [workflow.md](workflow.md) - Workflow details
- [evaluation.md](evaluation.md) - Evaluation methods
- [prompt_optimization.md](prompt_optimization.md) - Optimization techniques

View File

@@ -0,0 +1,65 @@
# Prompt Optimization Guide
A comprehensive guide for effectively optimizing prompts in LangGraph nodes.
## 📚 Table of Contents
This guide is divided into the following sections:
### 1. [Prompt Optimization Principles](./prompt_principles.md)
Learn the fundamental principles for designing prompts.
### 2. [Prompt Optimization Techniques](./prompt_techniques.md)
Provides a collection of practical optimization techniques (10 techniques).
### 3. [Optimization Priorities](./prompt_priorities.md)
Explains how to apply optimization techniques in order of improvement impact.
## 🎯 Quick Start
### First-Time Optimization
1. **[Understand the Principles](./prompt_principles.md)** - Learn the basics of clarity, structure, and specificity
2. **[Start with High-Impact Techniques](./prompt_priorities.md)** - Few-Shot Examples, output format structuring, parameter tuning
3. **[Review Technique Details](./prompt_techniques.md)** - Implementation methods and effects of each technique
### Improving Existing Prompts
1. **Measure Baseline** - Record current performance
2. **[Refer to Priority Guide](./prompt_priorities.md)** - Select the most impactful improvements
3. **[Apply Techniques](./prompt_techniques.md)** - Implement one at a time and measure effects
4. **Iterate** - Repeat the cycle of measure, implement, validate
## 📖 Related Documentation
- **[Prompt Optimization Examples](./examples.md)** - Before/After comparison examples and code templates
- **[SKILL.md](./SKILL.md)** - Overview and usage of the Fine-tune skill
- **[evaluation.md](./evaluation.md)** - Evaluation criteria design and measurement methods
## 💡 Best Practices
For effective prompt optimization:
1. **Measurement-Driven**: Evaluate all changes quantitatively
2. **Incremental Improvement**: One change at a time, measure, validate
3. **Cost-Conscious**: Optimize with model selection, caching, max_tokens
4. **Task-Appropriate**: Select techniques based on task complexity
5. **Iterative Approach**: Maintain continuous improvement cycles
## 🔍 Troubleshooting
### Low Prompt Quality
→ Review [Prompt Optimization Principles](./prompt_principles.md)
### Insufficient Accuracy
→ Apply [Few-Shot Examples](./prompt_techniques.md#technique-1-few-shot-examples) or [Chain-of-Thought](./prompt_techniques.md#technique-2-chain-of-thought)
### High Latency
→ Implement [Temperature/Max Tokens Adjustment](./prompt_techniques.md#technique-4-temperature-and-max-tokens-adjustment) or [Output Format Structuring](./prompt_techniques.md#technique-3-output-format-structuring)
### High Cost
→ Introduce [Model Selection Optimization](./prompt_techniques.md#technique-10-model-selection) or [Prompt Caching](./prompt_techniques.md#technique-6-prompt-caching)
---
**💡 Tip**: For before/after prompt comparison examples and code templates, refer to [examples.md](examples.md#phase-3-iterative-improvement-examples).

View File

@@ -0,0 +1,84 @@
# Prompt Optimization Principles
Fundamental principles for designing prompts in LangGraph nodes.
## 🎯 Prompt Optimization Principles
### 1. Clarity
**Bad Example**:
```python
SystemMessage(content="Analyze the input.")
```
**Good Example**:
```python
SystemMessage(content="""You are an intent classifier for customer support.
Task: Classify user input into one of these categories:
- product_inquiry: Questions about products or services
- technical_support: Technical issues or troubleshooting
- billing: Payment or billing questions
- general: General questions or greetings
Output only the category name.""")
```
**Improvements**:
- ✅ Clearly defined role
- ✅ Specific task description
- ✅ Enumerated categories
- ✅ Specified output format
### 2. Structure
**Bad Example**:
```python
prompt = f"Answer this: {question}"
```
**Good Example**:
```python
prompt = f"""Context:
{context}
Question:
{question}
Instructions:
1. Base your answer on the provided context
2. Be concise (2-3 sentences maximum)
3. If the answer is not in the context, say "I don't have enough information"
Answer:"""
```
**Improvements**:
- ✅ Sectioned (Context, Question, Instructions, Answer)
- ✅ Sequential instructions
- ✅ Clear separators
### 3. Specificity
**Bad Example**:
```python
"Be helpful and friendly."
```
**Good Example**:
```python
"""Tone and Style:
- Use a warm, professional tone
- Address the customer by name if available
- Acknowledge their concern explicitly
- Provide actionable next steps
Example:
"Hi Sarah, I understand your concern about the billing charge. Let me review your account and get back to you within 24 hours with a detailed explanation."
"""
```
**Improvements**:
- ✅ Specific guidelines
- ✅ Concrete examples provided
- ✅ Measurable criteria

View File

@@ -0,0 +1,87 @@
# Prompt Optimization Priorities
A priority guide for applying optimization techniques in order of improvement impact.
## 📊 Optimization Priorities
In order of improvement impact:
### 1. Adding Few-Shot Examples (High Impact, Low Cost)
- **Improvement**: Accuracy +10-20%
- **Cost**: +5-10% (increased input tokens)
- **Implementation Time**: 30 minutes - 1 hour
- **Recommended**: ⭐⭐⭐⭐⭐
### 2. Output Format Structuring (High Impact, Low Cost)
- **Improvement**: Latency -10-20%, Parsing errors -90%
- **Cost**: ±0%
- **Implementation Time**: 15-30 minutes
- **Recommended**: ⭐⭐⭐⭐⭐
### 3. Temperature/Max Tokens Adjustment (Medium Impact, Zero Cost)
- **Improvement**: Latency -10-30%, Cost -20-40%
- **Cost**: Reduction
- **Implementation Time**: 10-15 minutes
- **Recommended**: ⭐⭐⭐⭐⭐
### 4. Clear Instructions and Guidelines (Medium Impact, Low Cost)
- **Improvement**: Accuracy +5-10%, Quality +15-25%
- **Cost**: +2-5%
- **Implementation Time**: 30 minutes - 1 hour
- **Recommended**: ⭐⭐⭐⭐
### 5. Model Selection Optimization (High Impact, Requires Validation)
- **Improvement**: Cost -40-60%
- **Risk**: Accuracy -2-5%
- **Implementation Time**: 2-4 hours (including validation)
- **Recommended**: ⭐⭐⭐⭐
### 6. Prompt Caching (High Impact, Medium Cost)
- **Improvement**: Cost -50-90% (on cache hit)
- **Complexity**: Medium (implementation and monitoring)
- **Implementation Time**: 1-2 hours
- **Recommended**: ⭐⭐⭐⭐
### 7. Chain-of-Thought (High Impact for Specific Tasks)
- **Improvement**: Accuracy +15-30% for complex tasks
- **Cost**: +20-40%
- **Implementation Time**: 1-2 hours
- **Recommended**: ⭐⭐⭐ (complex tasks only)
### 8. Self-Consistency (Limited Use)
- **Improvement**: Accuracy +10-20%
- **Cost**: +200-300%
- **Implementation Time**: 2-3 hours
- **Recommended**: ⭐⭐ (critical decisions only)
## 🔄 Iterative Optimization Process
```
1. Measure baseline
2. Select the most impactful improvement
3. Implement (one change only)
4. Evaluate (with same test cases)
5. Is improvement confirmed?
├─ Yes → Keep change, go to step 2
└─ No → Rollback change, try different improvement
6. Goal achieved?
├─ Yes → Complete
└─ No → Go to step 2
```
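Expressed as code, the loop might look like the sketch below. `measure`, the improvement callables, `rollback`, `improved`, and `goal_met` are hypothetical stand-ins for the evaluation script and the prompt edits.

```python
from typing import Callable, Dict, List

Metrics = Dict[str, float]

def optimization_loop(
    measure: Callable[[], Metrics],            # runs the fixed test suite
    improvements: List[Callable[[], None]],    # candidate changes, ordered by expected impact
    rollback: Callable[[], None],              # reverts the most recent change
    improved: Callable[[Metrics, Metrics], bool],
    goal_met: Callable[[Metrics], bool],
) -> Metrics:
    best = measure()                           # 1. measure baseline
    for apply_change in improvements:          # 2. most impactful improvement first
        apply_change()                         # 3. implement (one change only)
        current = measure()                    # 4. evaluate with the same test cases
        if improved(current, best):            # 5. improvement confirmed?
            best = current                     #    yes -> keep the change
        else:
            rollback()                         #    no -> roll back, try a different one
        if goal_met(best):                     # 6. goal achieved?
            return best
    return best
```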
## Summary
For effective prompt optimization:
1. **Clarity**: Clear role, task, and output format
2. **Few-Shot Examples**: 3-7 high-quality examples
3. **Structuring**: Structured output like JSON
4. **Parameter Tuning**: Task-appropriate temperature/max_tokens
5. **Incremental Improvement**: One change at a time, measure, validate
6. **Cost-Conscious**: Model selection, caching, max_tokens
7. **Measurement-Driven**: Evaluate all changes quantitatively

View File

@@ -0,0 +1,425 @@
# Prompt Optimization Techniques
A collection of practical techniques for effectively optimizing prompts in LangGraph nodes.
**💡 Tip**: For before/after prompt comparison examples and code templates, refer to [examples.md](examples.md#phase-3-iterative-improvement-examples).
## 🔧 Practical Optimization Techniques
### Technique 1: Few-Shot Examples
**Effect**: Accuracy +10-20%
**Before (Zero-shot)**:
```python
system_prompt = """Classify user input into: product_inquiry, technical_support, billing, or general."""
# Accuracy: ~70%
```
**After (Few-shot)**:
```python
system_prompt = """Classify user input into: product_inquiry, technical_support, billing, or general.
Examples:
Input: "How much does the premium plan cost?"
Output: product_inquiry
Input: "I can't log into my account"
Output: technical_support
Input: "Why was I charged twice this month?"
Output: billing
Input: "Hello, how are you today?"
Output: general
Input: "What features are included in the basic plan?"
Output: product_inquiry"""
# Accuracy: ~85-90%
```
**Best Practices**:
- **Number of Examples**: 3-7 (diminishing returns beyond this)
- **Diversity**: At least one from each category, including edge cases
- **Quality**: Select clear and unambiguous examples
- **Format**: Consistent Input/Output format
### Technique 2: Chain-of-Thought
**Effect**: Accuracy +15-30% for complex reasoning tasks
**Before (Direct answer)**:
```python
prompt = f"""Question: {question}
Answer:"""
# Many incorrect answers for complex questions
```
**After (Chain-of-Thought)**:
```python
prompt = f"""Question: {question}
Think through this step by step:
1. First, identify the key information needed
2. Then, analyze the context for relevant details
3. Finally, formulate a clear answer
Reasoning:"""
# Logical answers even for complex questions
```
**Application Scenarios**:
- ✅ Tasks requiring multi-step reasoning
- ✅ Complex decision making
- ✅ Resolving contradictions
- ❌ Simple classification tasks (overhead)
### Technique 3: Output Format Structuring
**Effect**: Latency -10-20%, Parsing errors -90%
**Before (Free text)**:
```python
prompt = "Classify the intent and explain why."
# Output: "This looks like a technical support question because the user is having trouble logging in..."
# Problems: Hard to parse, verbose, inconsistent
```
**After (JSON structured)**:
```python
prompt = """Classify the intent.
Output ONLY a valid JSON object:
{
"intent": "<category>",
"confidence": <0.0-1.0>,
"reasoning": "<brief explanation in one sentence>"
}
Example output:
{"intent": "technical_support", "confidence": 0.95, "reasoning": "User reports authentication issue"}"""
# Output: {"intent": "technical_support", "confidence": 0.95, "reasoning": "User reports authentication issue"}
# Benefits: Easy to parse, concise, consistent
```
**JSON Parsing Error Handling**:
```python
import json
import re
def parse_llm_json_output(output: str) -> dict:
"""Robustly parse LLM JSON output"""
try:
# Parse as JSON directly
return json.loads(output)
except json.JSONDecodeError:
# Extract JSON only (from markdown code blocks, etc.)
json_match = re.search(r'\{[^}]+\}', output)
if json_match:
try:
return json.loads(json_match.group())
except json.JSONDecodeError:
pass
# Fallback
return {
"intent": "general",
"confidence": 0.5,
"reasoning": "Failed to parse LLM output"
}
```
### Technique 4: Temperature and Max Tokens Adjustment
**Temperature Effects**:
| Task Type | Recommended Temperature | Reason |
|-----------|------------------------|--------|
| Classification/Extraction | 0.0 - 0.3 | Deterministic output desired |
| Summarization/Transformation | 0.3 - 0.5 | Some flexibility needed |
| Creative/Generation | 0.7 - 1.0 | Diversity and creativity important |
**Before (Default settings)**:
```python
llm = ChatAnthropic(
model="claude-3-5-sonnet-20241022",
temperature=1.0 # Default, used for all tasks
)
# Unstable results for classification tasks
```
**After (Optimized per task)**:
```python
# Intent classification: Low temperature
intent_llm = ChatAnthropic(
model="claude-3-5-sonnet-20241022",
temperature=0.3 # Emphasize consistency
)
# Response generation: Medium temperature
response_llm = ChatAnthropic(
model="claude-3-5-sonnet-20241022",
temperature=0.5, # Balance flexibility
max_tokens=500 # Enforce conciseness
)
```
**Max Tokens Effects**:
```python
# Before: No limit
llm = ChatAnthropic(model="claude-3-5-sonnet-20241022")
# Average output: 800 tokens, Cost: $0.012/req, Latency: 3.2s
# After: Appropriate limit
llm = ChatAnthropic(
model="claude-3-5-sonnet-20241022",
max_tokens=500 # Necessary and sufficient length
)
# Average output: 450 tokens, Cost: $0.007/req (-42%), Latency: 1.8s (-44%)
```
### Technique 5: System Message vs Human Message Usage
**System Message**:
- **Use**: Role, guidelines, constraints
- **Characteristics**: Context applied to entire task
- **Caching**: Effective (doesn't change frequently)
**Human Message**:
- **Use**: Specific input, questions
- **Characteristics**: Changes per request
- **Caching**: Less effective
**Good Structure**:
```python
messages = [
SystemMessage(content="""You are a customer support assistant.
Guidelines:
- Be concise: 2-3 sentences maximum
- Be empathetic: Acknowledge customer concerns
- Be actionable: Provide clear next steps
Response format:
1. Acknowledgment
2. Answer or solution
3. Next steps (if applicable)"""),
HumanMessage(content=f"""Customer question: {user_input}
Context: {context}
Generate a helpful response:""")
]
```
### Technique 6: Prompt Caching
**Effect**: Cost -50-90% (on cache hit)
Leverage Anthropic Claude's prompt caching:
```python
from anthropic import Anthropic
client = Anthropic()
# Large cacheable system prompt
CACHED_SYSTEM_PROMPT = """You are an expert customer support agent...
[Long guidelines, examples, and context - 1000+ tokens]
Examples:
[50 few-shot examples]
"""
# Use cache
message = client.messages.create(
model="claude-3-5-sonnet-20241022",
max_tokens=500,
system=[
{
"type": "text",
"text": CACHED_SYSTEM_PROMPT,
"cache_control": {"type": "ephemeral"} # Enable caching
}
],
messages=[
{"role": "user", "content": user_input}
]
)
# First time: Full cost
# 2nd+ time (within 5 minutes): Input tokens -90% discount
```
**Caching Strategy**:
- ✅ Large system prompts (>1024 tokens)
- ✅ Sets of few-shot examples
- ✅ Long context (RAG documents)
- ❌ Frequently changing content
- ❌ Small prompts (<1024 tokens)
### Technique 7: Progressive Refinement
Break complex tasks into multiple steps:
**Before (1 step)**:
```python
# Execute everything in one node
prompt = f"""Analyze user input, retrieve relevant info, and generate response.
Input: {user_input}"""
# Problems: Too complex, low quality, hard to debug
```
**After (Multiple steps)**:
```python
# Step 1: Intent classification
intent = classify_intent(user_input)
# Step 2: Information retrieval (based on intent)
context = retrieve_context(intent, user_input)
# Step 3: Response generation (using intent and context)
response = generate_response(intent, context, user_input)
# Benefits: Each step optimizable, easy to debug, improved quality
```
### Technique 8: Negative Instructions
**Effect**: Edge case errors -30-50%
```python
prompt = """Generate a customer support response.
DO:
- Be concise (2-3 sentences)
- Acknowledge the customer's concern
- Provide actionable next steps
DO NOT:
- Apologize excessively (one apology maximum)
- Make promises you can't keep (e.g., "immediate resolution")
- Use technical jargon without explanation
- Provide information not in the context
- Generate placeholder text like "XXX" or "[insert here]"
Customer question: {question}
Context: {context}
Response:"""
```
### Technique 9: Self-Consistency
**Effect**: Accuracy +10-20% for complex reasoning, Cost +200-300%
Generate multiple reasoning paths and use majority voting:
```python
def self_consistency_reasoning(question: str, num_samples: int = 5) -> str:
"""Generate multiple reasoning paths and select the most consistent answer"""
llm = ChatAnthropic(
model="claude-3-5-sonnet-20241022",
temperature=0.7 # Higher temperature for diversity
)
prompt = f"""Question: {question}
Think through this step by step and provide your reasoning:
Reasoning:"""
# Generate multiple reasoning paths
responses = []
for _ in range(num_samples):
response = llm.invoke([HumanMessage(content=prompt)])
responses.append(response.content)
    # Extract the most consistent answer (simplified);
    # in practice, extract the final answer from each response and use majority voting
    # (extract_final_answer below is a task-specific helper, not shown here)
    from collections import Counter
    final_answers = [extract_final_answer(r) for r in responses]
most_common = Counter(final_answers).most_common(1)[0][0]
return most_common
# Trade-offs:
# - Accuracy: +10-20%
# - Cost: +200-300% (5x API calls)
# - Latency: +200-300% (if not parallelized)
# Use: Critical decisions only
```
### Technique 10: Model Selection
**Model Selection Based on Task Complexity**:
| Task Type | Recommended Model | Reason |
|-----------|------------------|--------|
| Simple classification | Claude 3.5 Haiku | Fast, low cost, sufficient accuracy |
| Complex reasoning | Claude 3.5 Sonnet | Balanced performance |
| Highly complex tasks | Claude Opus | Best performance (high cost) |
```python
# Select optimal model per task
class LLMSelector:
def __init__(self):
self.haiku = ChatAnthropic(model="claude-3-5-haiku-20241022")
self.sonnet = ChatAnthropic(model="claude-3-5-sonnet-20241022")
        self.opus = ChatAnthropic(model="claude-3-opus-20240229")
def get_llm(self, task_complexity: str):
if task_complexity == "simple":
return self.haiku # ~$0.001/req
elif task_complexity == "complex":
return self.sonnet # ~$0.005/req
else: # very_complex
return self.opus # ~$0.015/req
# Usage example
selector = LLMSelector()
# Simple intent classification → Haiku
intent_llm = selector.get_llm("simple")
# Complex response generation → Sonnet
response_llm = selector.get_llm("complex")
```
**Hybrid Approach**:
```python
def hybrid_classification(user_input: str) -> dict:
"""Try Haiku first, use Sonnet if confidence is low"""
# Step 1: Classify with Haiku
haiku_result = classify_with_haiku(user_input)
if haiku_result["confidence"] >= 0.8:
# High confidence → Use Haiku result
return haiku_result
else:
# Low confidence → Re-classify with Sonnet
sonnet_result = classify_with_sonnet(user_input)
return sonnet_result
# Effects:
# - 80% of cases use Haiku (low cost)
# - 20% of cases use Sonnet (high accuracy)
# - Average cost: -60%
# - Average accuracy: -2% (acceptable range)
```
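`classify_with_haiku` and `classify_with_sonnet` above are placeholders. A sketch of one of them, reusing the structured-output prompt idea and the `parse_llm_json_output` helper from Technique 3 (the prompt text itself is an assumption):

```python
from langchain_anthropic import ChatAnthropic
from langchain_core.messages import SystemMessage, HumanMessage

CLASSIFY_SYSTEM_PROMPT = """Classify the intent.
Output ONLY a valid JSON object:
{"intent": "<category>", "confidence": <0.0-1.0>, "reasoning": "<one sentence>"}"""

def classify_with_haiku(user_input: str) -> dict:
    """Cheap first-pass classification with the smaller model."""
    llm = ChatAnthropic(
        model="claude-3-5-haiku-20241022",
        temperature=0.3,
        max_tokens=200,
    )
    response = llm.invoke([
        SystemMessage(content=CLASSIFY_SYSTEM_PROMPT),
        HumanMessage(content=f"Input: {user_input}\nOutput:"),
    ])
    # parse_llm_json_output is the robust parser shown in Technique 3
    return parse_llm_json_output(response.content)
```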

View File

@@ -0,0 +1,127 @@
# Fine-Tuning Workflow Details
Detailed workflow and practical guidelines for executing fine-tuning of LangGraph applications.
**💡 Tip**: For concrete code examples and templates you can copy and paste, refer to [examples.md](examples.md).
## 📋 Workflow Overview
```
┌─────────────────────────────────────────────────────────────┐
│ Phase 1: Preparation and Analysis │
├─────────────────────────────────────────────────────────────┤
│ 1. Read fine-tune.md → Understand goals and criteria │
│ 2. Identify optimization targets with Serena → List LLM nodes│
│ 3. Create optimization list → Assess improvement potential │
└─────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────┐
│ Phase 2: Baseline Evaluation │
├─────────────────────────────────────────────────────────────┤
│ 4. Prepare evaluation environment → Test cases, scripts │
│ 5. Measure baseline → Run 3-5 times, collect statistics │
│ 6. Analyze results → Identify issues, assess improvement │
└─────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────┐
│ Phase 3: Iterative Improvement (Iteration Loop) │
├─────────────────────────────────────────────────────────────┤
│ 7. Prioritize → Select most effective improvement area │
│ 8. Implement improvements → Optimize prompts, adjust params │
│ 9. Post-improvement evaluation → Re-evaluate same conditions│
│ 10. Compare results → Measure improvement, decide next step │
│ 11. Continue decision → Goal met? Yes → Phase 4 / No → Next │
└─────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────┐
│ Phase 4: Completion and Documentation │
├─────────────────────────────────────────────────────────────┤
│ 12. Create final evaluation report → Summary of improvements│
│ 13. Commit code → Version control and documentation update │
└─────────────────────────────────────────────────────────────┘
```
## 📚 Phase-by-Phase Detailed Guide
### [Phase 1: Preparation and Analysis](./workflow_phase1.md)
Clarify optimization direction and identify targets for improvement:
- **Step 1**: Read and understand fine-tune.md
- **Step 2**: Identify optimization targets with Serena MCP
- **Step 3**: Create optimization target list
**Time Required**: 30 minutes - 1 hour
### [Phase 2: Baseline Evaluation](./workflow_phase2.md)
Quantitatively measure current performance:
- **Step 4**: Prepare evaluation environment
- **Step 5**: Measure baseline (3-5 runs)
- **Step 6**: Analyze baseline results
**Time Required**: 1-2 hours
### [Phase 3: Iterative Improvement](./workflow_phase3.md)
Data-driven, incremental prompt optimization:
- **Step 7**: Prioritization
- **Step 8**: Implement improvements
- **Step 9**: Post-improvement evaluation
- **Step 10**: Compare results
- **Step 11**: Continue decision
**Time Required**: 1-2 hours per iteration × number of iterations (typically 3-5)
### [Phase 4: Completion and Documentation](./workflow_phase4.md)
Record final results and commit code:
- **Step 12**: Create final evaluation report
- **Step 13**: Commit code and update documentation
**Time Required**: 30 minutes - 1 hour
## 🎯 Workflow Execution Points
### For First-Time Fine-Tuning
1. **Start from Phase 1 in order**: Execute all phases without skipping
2. **Create documentation**: Record results from each phase
3. **Start small**: Experiment with a small number of test cases initially
### Continuous Fine-Tuning
1. **Start from Phase 2**: Measure new baseline
2. **Repeat Phase 3**: Continuous improvement cycle
3. **Consider automation**: Build evaluation pipeline
## 📊 Principles for Success
1. **Data-Driven**: Base all decisions on measurement results
2. **Incremental Improvement**: One change at a time, measure, verify
3. **Documentation**: Record results and learnings from each phase
4. **Statistical Verification**: Run multiple times to confirm significance
## 🔗 Related Documents
- **[Example Collection](./examples.md)** - Code examples and templates for each phase
- **[Evaluation Methods](./evaluation.md)** - Details on evaluation metrics and statistical analysis
- **[Prompt Optimization](./prompt_optimization.md)** - Detailed optimization techniques
- **[SKILL.md](./SKILL.md)** - Overview of the Fine-tune skill
## 💡 Troubleshooting
### Cannot find optimization targets in Phase 1
→ Check search patterns in [workflow_phase1.md#step-2](./workflow_phase1.md#step-2-identify-optimization-targets-with-serena-mcp)
### Evaluation script fails in Phase 2
→ Check checklist in [workflow_phase2.md#step-4](./workflow_phase2.md#step-4-prepare-evaluation-environment)
### No improvement effect in Phase 3
→ Review priority matrix in [workflow_phase3.md#step-7](./workflow_phase3.md#step-7-prioritization)
### Report creation takes too long in Phase 4
→ Utilize templates in [workflow_phase4.md#step-12](./workflow_phase4.md#step-12-create-final-evaluation-report)
---
Following this workflow enables:
- ✅ Systematic fine-tuning process execution
- ✅ Data-driven decision making
- ✅ Continuous improvement and verification
- ✅ Complete documentation and traceability

View File

@@ -0,0 +1,229 @@
# Phase 1: Preparation and Analysis
Preparation phase to clarify optimization direction and identify targets for improvement.
**Time Required**: 30 minutes - 1 hour
**📋 Related Documents**: [Overall Workflow](./workflow.md) | [Practical Examples](./examples.md)
---
## Phase 1: Preparation and Analysis
### Step 1: Read and Understand fine-tune.md
**Purpose**: Clarify optimization direction
**Execution**:
```python
# Read .langgraph-master/fine-tune.md
file_path = ".langgraph-master/fine-tune.md"
with open(file_path, "r") as f:
fine_tune_spec = f.read()
# Extract the following information:
# - Optimization goals (accuracy, latency, cost, etc.)
# - Evaluation methods (test cases, metrics, calculation methods)
# - Passing criteria (target values for each metric)
# - Test data location
```
**Typical fine-tune.md structure**:
```markdown
# Fine-Tuning Goals
## Optimization Objectives
- **Accuracy**: Improve user intent classification accuracy to 90% or higher
- **Latency**: Reduce response time to 2.0 seconds or less
- **Cost**: Reduce cost per request to $0.010 or less
## Evaluation Methods
- **Test Cases**: tests/evaluation/test_cases.json (20 cases)
- **Execution Command**: uv run python -m src.evaluate
- **Evaluation Script**: tests/evaluation/evaluator.py
## Evaluation Metrics
### Accuracy
- Calculation method: (Correct count / Total cases) × 100
- Target value: 90% or higher
### Latency
- Calculation method: Average time per execution
- Target value: 2.0 seconds or less
### Cost
- Calculation method: Total API cost / Total requests
- Target value: $0.010 or less
## Passing Criteria
All evaluation metrics must achieve their target values
```
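Given a fine-tune.md shaped like the example above, the numeric targets can be extracted programmatically. The sketch below is a rough helper; the regular expressions assume the exact wording shown and should be adapted to the real file.

```python
import re
from pathlib import Path

def extract_targets(spec_path: str = ".langgraph-master/fine-tune.md") -> dict:
    """Pull numeric targets out of fine-tune.md (patterns assume the example wording above)."""
    text = Path(spec_path).read_text()
    targets = {}

    accuracy = re.search(r"Accuracy.*?(\d+(?:\.\d+)?)\s*%", text, re.IGNORECASE | re.DOTALL)
    latency = re.search(r"Latency.*?(\d+(?:\.\d+)?)\s*seconds", text, re.IGNORECASE | re.DOTALL)
    cost = re.search(r"Cost.*?\$(\d+(?:\.\d+)?)", text, re.IGNORECASE | re.DOTALL)

    if accuracy:
        targets["accuracy"] = float(accuracy.group(1))  # e.g. 90.0 (%)
    if latency:
        targets["latency"] = float(latency.group(1))    # e.g. 2.0 (seconds)
    if cost:
        targets["cost"] = float(cost.group(1))          # e.g. 0.010 ($ per request)
    return targets
```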
### Step 2: Identify Optimization Targets with Serena MCP
**Purpose**: Comprehensively identify nodes calling LLMs
**Execution Steps**:
1. **Search for LLM clients**
```python
# Use Serena MCP: find_symbol
# Search for ChatAnthropic, ChatOpenAI, ChatGoogleGenerativeAI, etc.
patterns = [
"ChatAnthropic",
"ChatOpenAI",
"ChatGoogleGenerativeAI",
"ChatVertexAI"
]
llm_usages = []
for pattern in patterns:
results = serena.find_symbol(
name_path=pattern,
substring_matching=True,
include_body=False
)
llm_usages.extend(results)
```
2. **Identify prompt construction locations**
```python
# For each LLM call, investigate how prompts are constructed
for usage in llm_usages:
# Get surrounding context with find_referencing_symbols
context = serena.find_referencing_symbols(
name_path=usage.name,
relative_path=usage.file_path
)
# Identify prompt templates and message construction logic
# - Use of ChatPromptTemplate
# - SystemMessage, HumanMessage definitions
# - Prompt construction with f-strings or format()
```
3. **Per-node analysis**
```python
# Analyze LLM usage patterns within each node function
# - Prompt clarity
# - Presence of few-shot examples
# - Structured output format
# - Parameter settings (temperature, max_tokens, etc.)
```
**Example Output**:
```markdown
## LLM Call Location Analysis
### 1. analyze_intent node
- **File**: src/nodes/analyzer.py
- **Line numbers**: 25-45
- **LLM**: ChatAnthropic(model="claude-3-5-sonnet-20241022")
- **Prompt structure**:
```python
SystemMessage: "You are an intent analyzer..."
HumanMessage: f"Analyze: {user_input}"
```
- **Improvement potential**: ⭐⭐⭐⭐⭐ (High)
- Prompt is vague ("Analyze" criteria unclear)
- No few-shot examples
- Output format is free text
- **Estimated improvement effect**: Accuracy +10-15%
### 2. generate_response node
- **File**: src/nodes/generator.py
- **Line numbers**: 45-68
- **LLM**: ChatAnthropic(model="claude-3-5-sonnet-20241022")
- **Prompt structure**:
```python
ChatPromptTemplate.from_messages([
("system", "Generate helpful response..."),
("human", "{context}\n\nQuestion: {question}")
])
```
- **Improvement potential**: ⭐⭐⭐ (Medium)
- Prompt is structured but lacks conciseness instructions
- No max_tokens limit → possibility of verbose output
- **Estimated improvement effect**: Latency -0.3-0.5s, Cost -20-30%
```
### Step 3: Create Optimization Target List
**Purpose**: Organize information to determine improvement priorities
**List Creation Template**:
```markdown
# Optimization Target List
## Node: analyze_intent
### Basic Information
- **File**: src/nodes/analyzer.py:25-45
- **Role**: Classify user input intent
- **LLM Model**: claude-3-5-sonnet-20241022
- **Current Parameters**: temperature=1.0, max_tokens=default
### Current Prompt
```python
SystemMessage(content="You are an intent analyzer. Analyze user input.")
HumanMessage(content=f"Analyze: {user_input}")
```
### Issues
1. **Vague instructions**: Specific criteria for "Analyze" unclear
2. **No few-shot**: No expected output examples
3. **Undefined output format**: Unstructured free text
4. **High temperature**: 1.0 is too high for classification tasks
### Improvement Ideas
1. Specify concrete classification categories
2. Add 3-5 few-shot examples
3. Specify JSON output format
4. Lower temperature to 0.3-0.5
### Estimated Improvement Effect
- **Accuracy**: +10-15% (Current misclassification 20% → 5-10%)
- **Latency**: ±0 (No change)
- **Cost**: ±0 (No change)
### Priority
⭐⭐⭐⭐⭐ (Highest) - Direct impact on accuracy improvement
---
## Node: generate_response
### Basic Information
- **File**: src/nodes/generator.py:45-68
- **Role**: Generate final user-facing response
- **LLM Model**: claude-3-5-sonnet-20241022
- **Current Parameters**: temperature=0.7, max_tokens=default
### Current Prompt
```python
ChatPromptTemplate.from_messages([
("system", "Generate helpful response based on context."),
("human", "{context}\n\nQuestion: {question}")
])
```
### Issues
1. **No verbosity control**: No conciseness instructions
2. **max_tokens not set**: Possibility of unnecessarily long output
3. **Undefined response style**: No tone or style specifications
### Improvement Ideas
1. Add length instructions such as "be concise" or "in 2-3 sentences"
2. Limit max_tokens to 500
3. Clarify the response style ("friendly", "professional", etc.)
### Estimated Improvement Effect
- **Accuracy**: ±0 (No change)
- **Latency**: -0.3-0.5s (Due to reduced output tokens)
- **Cost**: -20-30% (Due to reduced token count)
### Priority
⭐⭐⭐ (Medium) - Improvement in latency and cost
```

View File

@@ -0,0 +1,222 @@
# Phase 2: Baseline Evaluation
Phase to quantitatively measure current performance.
**Time Required**: 1-2 hours
**📋 Related Documents**: [Overall Workflow](./workflow.md) | [Evaluation Methods](./evaluation.md)
---
## Phase 2: Baseline Evaluation
### Step 4: Prepare Evaluation Environment
**Checklist**:
- [ ] Test case files exist
- [ ] Evaluation script is executable
- [ ] Environment variables (API keys, etc.) are set
- [ ] Dependency packages are installed
**Execution Example**:
```bash
# Check test cases
cat tests/evaluation/test_cases.json
# Verify evaluation script works
uv run python -m src.evaluate --dry-run
# Verify environment variables
echo $ANTHROPIC_API_KEY
```
### Step 5: Measure Baseline
**Recommended Run Count**: 3-5 times (for statistical reliability)
**Execution Script Example**:
```bash
#!/bin/bash
# baseline_evaluation.sh
ITERATIONS=5
RESULTS_DIR="evaluation_results/baseline"
mkdir -p $RESULTS_DIR
for i in $(seq 1 $ITERATIONS); do
echo "Running baseline evaluation: iteration $i/$ITERATIONS"
uv run python -m src.evaluate \
--output "$RESULTS_DIR/run_$i.json" \
--verbose
# API rate limit countermeasure
sleep 5
done
# Aggregate results
uv run python -m src.aggregate_results \
--input-dir "$RESULTS_DIR" \
--output "$RESULTS_DIR/summary.json"
```
**Evaluation Script Example** (`src/evaluate.py`):
```python
import json
import time
from pathlib import Path
from typing import Dict, List
def evaluate_test_cases(test_cases: List[Dict]) -> Dict:
"""Evaluate test cases"""
results = {
"total_cases": len(test_cases),
"correct": 0,
"total_latency": 0.0,
"total_cost": 0.0,
"case_results": []
}
for case in test_cases:
start_time = time.time()
# Execute LangGraph application
output = run_langgraph_app(case["input"])
latency = time.time() - start_time
# Correct answer judgment
is_correct = output["answer"] == case["expected_answer"]
if is_correct:
results["correct"] += 1
# Cost calculation (from token usage)
cost = calculate_cost(output["token_usage"])
results["total_latency"] += latency
results["total_cost"] += cost
results["case_results"].append({
"case_id": case["id"],
"correct": is_correct,
"latency": latency,
"cost": cost
})
# Calculate metrics
results["accuracy"] = (results["correct"] / results["total_cases"]) * 100
results["avg_latency"] = results["total_latency"] / results["total_cases"]
results["avg_cost"] = results["total_cost"] / results["total_cases"]
return results
def calculate_cost(token_usage: Dict) -> float:
"""Calculate cost from token usage"""
# Claude 3.5 Sonnet pricing
INPUT_COST_PER_1M = 3.0 # $3.00 per 1M input tokens
OUTPUT_COST_PER_1M = 15.0 # $15.00 per 1M output tokens
input_cost = (token_usage["input_tokens"] / 1_000_000) * INPUT_COST_PER_1M
output_cost = (token_usage["output_tokens"] / 1_000_000) * OUTPUT_COST_PER_1M
return input_cost + output_cost
```
### Step 6: Analyze Baseline Results
**Aggregation Script Example** (`src/aggregate_results.py`):
```python
import json
import numpy as np
from pathlib import Path
from typing import List, Dict
def aggregate_results(results_dir: Path) -> Dict:
"""Aggregate multiple execution results"""
all_results = []
for result_file in sorted(results_dir.glob("run_*.json")):
with open(result_file) as f:
all_results.append(json.load(f))
# Calculate statistics for each metric
accuracies = [r["accuracy"] for r in all_results]
latencies = [r["avg_latency"] for r in all_results]
costs = [r["avg_cost"] for r in all_results]
summary = {
"iterations": len(all_results),
"accuracy": {
"mean": np.mean(accuracies),
"std": np.std(accuracies),
"min": np.min(accuracies),
"max": np.max(accuracies)
},
"latency": {
"mean": np.mean(latencies),
"std": np.std(latencies),
"min": np.min(latencies),
"max": np.max(latencies)
},
"cost": {
"mean": np.mean(costs),
"std": np.std(costs),
"min": np.min(costs),
"max": np.max(costs)
}
}
return summary
```
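The baseline script calls this module with `--input-dir` and `--output`. A minimal entry point matching those flags could be appended to `src/aggregate_results.py` as sketched below (only the flag names come from the script; the rest is an assumption).

```python
import argparse  # in addition to the imports already at the top of the module

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Aggregate baseline evaluation runs")
    parser.add_argument("--input-dir", required=True, help="Directory containing run_*.json files")
    parser.add_argument("--output", required=True, help="Path for the summary JSON file")
    args = parser.parse_args()

    summary = aggregate_results(Path(args.input_dir))

    output_path = Path(args.output)
    output_path.parent.mkdir(parents=True, exist_ok=True)
    with open(output_path, "w") as f:
        # default=float guards against numpy scalar types in the summary
        json.dump(summary, f, indent=2, default=float)

    print(f"Aggregated {summary['iterations']} runs -> {args.output}")
```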
**Results Report Example**:
```markdown
# Baseline Evaluation Results
Execution Date: 2024-11-24 10:00:00
Run Count: 5
Test Case Count: 20
## Evaluation Metrics Summary
| Metric | Mean | Std Dev | Min | Max | Target | Gap |
|--------|------|---------|-----|-----|--------|-----|
| Accuracy | 75.0% | 3.2% | 70.0% | 80.0% | 90.0% | **-15.0%** |
| Latency | 2.5s | 0.4s | 2.1s | 3.2s | 2.0s | **+0.5s** |
| Cost/req | $0.015 | $0.002 | $0.013 | $0.018 | $0.010 | **+$0.005** |
## Detailed Analysis
### Accuracy Issues
- **Current**: 75.0% (Target: 90.0%)
- **Main error patterns**:
1. Intent classification errors: 12 cases (60% of errors)
2. Context understanding deficiency: 5 cases (25% of errors)
3. Handling ambiguous questions: 3 cases (15% of errors)
### Latency Issues
- **Current**: 2.5s (Target: 2.0s)
- **Bottlenecks**:
1. generate_response node: avg 1.8s (72% of total)
2. analyze_intent node: avg 0.5s (20% of total)
3. Other: avg 0.2s (8% of total)
### Cost Issues
- **Current**: $0.015/req (Target: $0.010/req)
- **Cost breakdown**:
1. generate_response: $0.011 (73%)
2. analyze_intent: $0.003 (20%)
3. Other: $0.001 (7%)
- **Main factor**: High output token count (avg 800 tokens)
## Improvement Directions
### Priority 1: Improve analyze_intent accuracy
- **Impact**: Direct impact on accuracy (accounts for 60% of -15% gap)
- **Improvements**: Few-shot examples, clear classification criteria, JSON output format
- **Estimated effect**: +10-12% accuracy
### Priority 2: Optimize generate_response efficiency
- **Impact**: Affects both latency and cost
- **Improvements**: Conciseness instructions, max_tokens limit, temperature adjustment
- **Estimated effect**: -0.4s latency, -$0.004 cost
```

View File

@@ -0,0 +1,225 @@
# Phase 3: Iterative Improvement
Phase for data-driven, incremental prompt optimization.
**Time Required**: 1-2 hours per iteration × number of iterations (typically 3-5)
**📋 Related Documents**: [Overall Workflow](./workflow.md) | [Prompt Optimization](./prompt_optimization.md)
---
## Phase 3: Iterative Improvement
### Iteration Cycle
Execute the following in each iteration:
1. **Prioritization** (Step 7)
2. **Implement Improvements** (Step 8)
3. **Post-Improvement Evaluation** (Step 9)
4. **Compare Results** (Step 10)
5. **Continue Decision** (Step 11)
### Step 7: Prioritization
**Decision Criteria**:
1. **Impact on goal achievement**
2. **Feasibility of improvement**
3. **Implementation cost**
**Priority Matrix**:
```markdown
## Improvement Priority Matrix
| Node | Impact | Feasibility | Impl Cost | Total Score | Priority |
|------|--------|-------------|-----------|-------------|----------|
| analyze_intent | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ | 14/15 | 1st |
| generate_response | ⭐⭐⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐⭐⭐ | 12/15 | 2nd |
| retrieve_context | ⭐⭐ | ⭐⭐⭐ | ⭐⭐⭐ | 8/15 | 3rd |
**Iteration 1 Target**: analyze_intent node
```
### Step 8: Implement Improvements
**Pre-Improvement Prompt** (`src/nodes/analyzer.py`):
```python
# Before
def analyze_intent(state: GraphState) -> GraphState:
llm = ChatAnthropic(
model="claude-3-5-sonnet-20241022",
temperature=1.0
)
messages = [
SystemMessage(content="You are an intent analyzer. Analyze user input."),
HumanMessage(content=f"Analyze: {state['user_input']}")
]
response = llm.invoke(messages)
state["intent"] = response.content
return state
```
**Post-Improvement Prompt**:
```python
# After - Iteration 1
import json  # needed to parse the structured JSON output below

def analyze_intent(state: GraphState) -> GraphState:
llm = ChatAnthropic(
model="claude-3-5-sonnet-20241022",
temperature=0.3 # Lower temperature for classification tasks
)
# Clear classification categories and few-shot examples
system_prompt = """You are an intent classifier for a customer support chatbot.
Classify user input into one of these categories:
- "product_inquiry": Questions about products or services
- "technical_support": Technical issues or troubleshooting
- "billing": Payment, invoicing, or billing questions
- "general": General questions or chitchat
Output ONLY a valid JSON object with this structure:
{
"intent": "<category>",
"confidence": <0.0-1.0>,
"reasoning": "<brief explanation>"
}
Examples:
Input: "How much does the premium plan cost?"
Output: {"intent": "product_inquiry", "confidence": 0.95, "reasoning": "Question about product pricing"}
Input: "I can't log into my account"
Output: {"intent": "technical_support", "confidence": 0.9, "reasoning": "Authentication issue"}
Input: "Why was I charged twice?"
Output: {"intent": "billing", "confidence": 0.95, "reasoning": "Question about billing charges"}
Input: "Hello, how are you?"
Output: {"intent": "general", "confidence": 0.85, "reasoning": "General greeting"}
Input: "What's the return policy?"
Output: {"intent": "product_inquiry", "confidence": 0.9, "reasoning": "Question about product policy"}
"""
messages = [
SystemMessage(content=system_prompt),
HumanMessage(content=f"Input: {state['user_input']}\nOutput:")
]
response = llm.invoke(messages)
# JSON parsing (with error handling)
try:
intent_data = json.loads(response.content)
state["intent"] = intent_data["intent"]
state["confidence"] = intent_data["confidence"]
except json.JSONDecodeError:
# Fallback
state["intent"] = "general"
state["confidence"] = 0.5
return state
```
**Summary of Changes**:
1. ✅ temperature: 1.0 → 0.3 (appropriate for classification tasks)
2. ✅ Clear classification categories (4 intents)
3. ✅ Few-shot examples (added 5)
4. ✅ JSON output format (structured output)
5. ✅ Error handling (fallback for JSON parse failures)
### Step 9: Post-Improvement Evaluation
**Execution**:
```bash
# Execute post-improvement evaluation under same conditions
./evaluation_after_iteration1.sh
```
### Step 10: Compare Results
**Comparison Report Example**:
```markdown
# Iteration 1 Evaluation Results
Execution Date: 2024-11-24 12:00:00
Changes: Optimization of analyze_intent node
## Results Comparison
| Metric | Baseline | Iteration 1 | Change | % Change | Target | Achievement |
|--------|----------|-------------|--------|----------|--------|-------------|
| **Accuracy** | 75.0% | **86.0%** | **+11.0%** | +14.7% | 90.0% | 95.6% |
| **Latency** | 2.5s | 2.4s | -0.1s | -4.0% | 2.0s | 83.3% |
| **Cost/req** | $0.015 | $0.014 | -$0.001 | -6.7% | $0.010 | 71.4% |
## Detailed Analysis
### Accuracy Improvement
- **Improvement**: +11.0% (75.0% → 86.0%)
- **Remaining gap**: 4.0% (target 90.0%)
- **Improved cases**: Intent classification errors reduced from 12 → 3 cases
- **Still needs improvement**: Context understanding deficiency cases (5 cases)
### Slight Latency Improvement
- **Improvement**: -0.1s (2.5s → 2.4s)
- **Main factor**: Lower temperature in analyze_intent made output more concise
- **Remaining bottleneck**: generate_response (avg 1.8s)
### Slight Cost Reduction
- **Reduction**: -$0.001 (6.7% reduction)
- **Factor**: Reduced output tokens in analyze_intent
- **Main cost**: generate_response still accounts for 73%
## Next Iteration Strategy
### Priority 1: Optimize generate_response
- **Goal**: Latency 1.8s → 1.4s, Cost $0.011 → $0.007
- **Approach**:
1. Add conciseness instructions
2. Limit max_tokens to 500
3. Adjust temperature from 0.7 → 0.5
### Priority 2: Final 4% accuracy improvement
- **Goal**: 86.0% → 90.0% or higher
- **Approach**: Improve context understanding (retrieve_context node)
## Decision
✅ Continue → Proceed to Iteration 2
```
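The change, % change, and achievement columns above follow a simple convention: achievement is actual/target for higher-is-better metrics (accuracy) and target/actual for lower-is-better ones (latency, cost). A minimal sketch that reproduces those columns from the baseline and Iteration 1 values:

```python
# Sketch: reproduce the comparison columns used in the report above.
# higher_is_better decides whether achievement is actual/target or target/actual.
METRICS = {
    # name: (baseline, current, target, higher_is_better)
    "accuracy": (75.0, 86.0, 90.0, True),
    "latency": (2.5, 2.4, 2.0, False),
    "cost": (0.015, 0.014, 0.010, False),
}

for name, (baseline, current, target, higher_is_better) in METRICS.items():
    change = current - baseline
    pct_change = change / baseline * 100
    achievement = (current / target if higher_is_better else target / current) * 100
    print(f"{name}: change {change:+.3f} ({pct_change:+.1f}%), achievement {achievement:.1f}%")
```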
### Step 11: Continue Decision
**Decision Criteria**:
```python
from typing import Dict

def should_continue_iteration(results: Dict, goals: Dict) -> bool:
"""Determine if iteration should continue"""
all_goals_met = True
for metric, goal in goals.items():
if metric == "accuracy":
if results[metric] < goal:
all_goals_met = False
elif metric in ["latency", "cost"]:
if results[metric] > goal:
all_goals_met = False
return not all_goals_met
# Example
goals = {"accuracy": 90.0, "latency": 2.0, "cost": 0.010}
results = {"accuracy": 86.0, "latency": 2.4, "cost": 0.014}
if should_continue_iteration(results, goals):
print("Proceed to next iteration")
else:
print("Goals achieved - Move to Phase 4")
```
**Iteration Limit**:
- **Recommended**: 3-5 iterations (the sketch below folds this cap into the continue decision)
- **Reason**: Beyond this point, returns typically diminish relative to the effort invested
- **Exception**: Critical applications may justify 10+ iterations
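As referenced above, one way to combine the goal check with the recommended cap (a sketch; the default of 5 reflects the guideline above, not a hard rule):

```python
# Sketch: stop when all goals are met or when the iteration cap is reached.
def should_continue(results: dict, goals: dict, iteration: int, max_iterations: int = 5) -> bool:
    if iteration >= max_iterations:
        return False  # diminishing returns expected beyond the recommended cap
    accuracy_ok = results["accuracy"] >= goals["accuracy"]
    latency_ok = results["latency"] <= goals["latency"]
    cost_ok = results["cost"] <= goals["cost"]
    return not (accuracy_ok and latency_ok and cost_ok)

goals = {"accuracy": 90.0, "latency": 2.0, "cost": 0.010}
results = {"accuracy": 86.0, "latency": 2.4, "cost": 0.014}
print(should_continue(results, goals, iteration=1))  # True -> proceed to Iteration 2
```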

# Phase 4: Completion and Documentation
Phase to record final results and commit code.
**Time Required**: 30 minutes - 1 hour
**📋 Related Documents**: [Overall Workflow](./workflow.md) | [Practical Examples](./examples.md)
---
## Phase 4: Completion and Documentation
### Step 12: Create Final Evaluation Report
**Report Template**:
```markdown
# LangGraph Application Fine-Tuning Completion Report
Project: [Project Name]
Implementation Period: 2024-11-24 10:00 - 2024-11-24 15:00 (5 hours)
Implementer: Claude Code with fine-tune skill
## Executive Summary
This fine-tuning project executed prompt optimization for a LangGraph chatbot application and achieved the following results:
- ✅ **Accuracy**: 75.0% → 92.0% (+17.0%, achieved 90% target)
- ✅ **Latency**: 2.5s → 1.9s (-24.0%, achieved 2.0s target)
- ⚠️ **Cost**: $0.015 → $0.011 (-26.7%, target $0.010 not met)
A total of 3 iterations were executed, achieving 2 out of 3 metric targets.
## Implementation Summary
### Iteration Count and Execution Time
- **Total Iterations**: 3
- **Optimized Nodes**: 2 (analyze_intent, generate_response)
- **Evaluation Run Count**: 20 times (baseline 5 times + 5 times × 3 post-iteration)
- **Total Execution Time**: Approximately 5 hours
### Final Results
| Metric | Initial | Final | Improvement | % Change | Target | Achievement |
|--------|---------|-------|-------------|----------|--------|-------------|
| Accuracy | 75.0% | 92.0% | +17.0% | +22.7% | 90.0% | ✅ 102.2% achieved |
| Latency | 2.5s | 1.9s | -0.6s | -24.0% | 2.0s | ✅ 105.3% achieved |
| Cost/req | $0.015 | $0.011 | -$0.004 | -26.7% | $0.010 | ⚠️ 90.9% achieved |
## Iteration Details
### Iteration 1: Optimization of analyze_intent node
**Date/Time**: 2024-11-24 11:00
**Target Node**: src/nodes/analyzer.py:25-45
**Changes**:
1. temperature: 1.0 → 0.3
2. Added 5 few-shot examples
3. Structured JSON output format
4. Defined clear classification categories (4)
**Results**:
- Accuracy: 75.0% → 86.0% (+11.0%)
- Latency: 2.5s → 2.4s (-0.1s)
- Cost: $0.015 → $0.014 (-$0.001)
**Learning**: Few-shot examples and clear output format most effective for accuracy improvement
---
### Iteration 2: Optimization of generate_response node
**Date/Time**: 2024-11-24 13:00
**Target Node**: src/nodes/generator.py:45-68
**Changes**:
1. Added conciseness instructions ("answer in 2-3 sentences")
2. max_tokens: unlimited → 500
3. temperature: 0.7 → 0.5
4. Clarified response style
**Results**:
- Accuracy: 86.0% → 88.0% (+2.0%)
- Latency: 2.4s → 2.0s (-0.4s)
- Cost: $0.014 → $0.011 (-$0.003)
**Learning**: max_tokens limit contributed significantly to latency and cost reduction
---
### Iteration 3: Additional improvement of analyze_intent
**Date/Time**: 2024-11-24 14:30
**Target Node**: src/nodes/analyzer.py:25-45
**Changes**:
1. Increased few-shot examples from 5 → 10
2. Added edge case handling
3. Re-classification logic with confidence threshold
**Results**:
- Accuracy: 88.0% → 92.0% (+4.0%)
- Latency: 2.0s → 1.9s (-0.1s)
- Cost: $0.011 → $0.011 (±0)
**Learning**: Additional few-shot examples broke through final accuracy barrier
## Final Changes
### src/nodes/analyzer.py (analyze_intent node)
#### Before
```python
def analyze_intent(state: GraphState) -> GraphState:
llm = ChatAnthropic(model="claude-3-5-sonnet-20241022", temperature=1.0)
messages = [
SystemMessage(content="You are an intent analyzer. Analyze user input."),
HumanMessage(content=f"Analyze: {state['user_input']}")
]
response = llm.invoke(messages)
state["intent"] = response.content
return state
```
#### After
```python
import json

def analyze_intent(state: GraphState) -> GraphState:
llm = ChatAnthropic(model="claude-3-5-sonnet-20241022", temperature=0.3)
system_prompt = """You are an intent classifier for a customer support chatbot.
Classify user input into: product_inquiry, technical_support, billing, or general.
Output JSON: {"intent": "<category>", "confidence": <0.0-1.0>, "reasoning": "<explanation>"}
[10 few-shot examples...]
"""
messages = [
SystemMessage(content=system_prompt),
HumanMessage(content=f"Input: {state['user_input']}\nOutput:")
]
response = llm.invoke(messages)
intent_data = json.loads(response.content)
# Low confidence → re-classify as general
if intent_data["confidence"] < 0.7:
intent_data["intent"] = "general"
state["intent"] = intent_data["intent"]
state["confidence"] = intent_data["confidence"]
return state
```
**Key Changes**:
- temperature: 1.0 → 0.3
- Few-shot examples: 0 → 10
- Output: free text → JSON
- Added confidence threshold fallback
---
### src/nodes/generator.py (generate_response node)
#### Before
```python
def generate_response(state: GraphState) -> GraphState:
llm = ChatAnthropic(model="claude-3-5-sonnet-20241022", temperature=0.7)
prompt = ChatPromptTemplate.from_messages([
("system", "Generate helpful response based on context."),
("human", "{context}\n\nQuestion: {question}")
])
chain = prompt | llm
response = chain.invoke({"context": state["context"], "question": state["user_input"]})
state["response"] = response.content
return state
```
#### After
```python
def generate_response(state: GraphState) -> GraphState:
llm = ChatAnthropic(
model="claude-3-5-sonnet-20241022",
temperature=0.5,
max_tokens=500 # Output length limit
)
system_prompt = """You are a helpful customer support assistant.
Guidelines:
- Be concise: Answer in 2-3 sentences
- Be friendly: Use a warm, professional tone
- Be accurate: Base your answer on the provided context
- If uncertain: Acknowledge and offer to escalate
Format: Direct answer followed by one optional clarifying sentence.
"""
prompt = ChatPromptTemplate.from_messages([
("system", system_prompt),
("human", "Context: {context}\n\nQuestion: {question}\n\nAnswer:")
])
chain = prompt | llm
response = chain.invoke({"context": state["context"], "question": state["user_input"]})
state["response"] = response.content
return state
```
**Key Changes**:
- temperature: 0.7 → 0.5
- max_tokens: unlimited → 500
- Clear conciseness instruction ("2-3 sentences")
- Added response style guidelines
## Detailed Evaluation Results
### Improvement Status by Test Case
| Case ID | Category | Before | After | Improved |
|---------|----------|--------|-------|----------|
| TC001 | Product | ❌ Wrong | ✅ Correct | ✅ |
| TC002 | Technical | ❌ Wrong | ✅ Correct | ✅ |
| TC003 | Billing | ✅ Correct | ✅ Correct | - |
| TC004 | General | ✅ Correct | ✅ Correct | - |
| TC005 | Product | ❌ Wrong | ✅ Correct | ✅ |
| ... | ... | ... | ... | ... |
| TC020 | Technical | ✅ Correct | ✅ Correct | - |
**Cases with a higher pass rate across the 5 runs**: 15/20 (75%)
**Cases unchanged**: 5/20 (25%)
**Cases with a lower pass rate (regressions)**: 0/20 (0%)
### Latency Breakdown
| Node | Before | After | Change | % Change |
|------|--------|-------|--------|----------|
| analyze_intent | 0.5s | 0.4s | -0.1s | -20% |
| retrieve_context | 0.2s | 0.2s | ±0s | 0% |
| generate_response | 1.8s | 1.3s | -0.5s | -28% |
| **Total** | **2.5s** | **1.9s** | **-0.6s** | **-24%** |
### Cost Breakdown
| Node | Before | After | Change | % Change |
|------|--------|-------|--------|----------|
| analyze_intent | $0.003 | $0.003 | ±$0 | 0% |
| retrieve_context | $0.001 | $0.001 | ±$0 | 0% |
| generate_response | $0.011 | $0.007 | -$0.004 | -36% |
| **Total** | **$0.015** | **$0.011** | **-$0.004** | **-27%** |
## Future Recommendations
### Short-term (1-2 weeks)
1. **Achieve cost target**: $0.011 → $0.010
- Approach: Consider partial migration to Claude 3.5 Haiku
- Estimated effect: -$0.002-0.003/req
2. **Further accuracy improvement**: 92.0% → 95.0%
- Approach: Analyze error cases and add few-shot examples
- Estimated effect: +3.0%
### Mid-term (1-2 months)
1. **Model optimization**
- Use Haiku for simple intent classification
- Use Sonnet only for complex response generation
- Estimated effect: -30-40% cost, minimal latency impact
2. **Leverage prompt caching**
- Cache system prompts and few-shot examples
- Estimated effect: -50% cost (when cache hits)
### Long-term (3-6 months)
1. **Consider fine-tuned models**
- Model fine-tuning with proprietary data
- No need for few-shot examples, more concise prompts
- Estimated effect: -60% cost, +5% accuracy
## Conclusion
This project achieved the following through fine-tuning of the LangGraph application:
✅ **Successes**:
1. Significant accuracy improvement (+22.7%) - exceeded target by 2.2%
2. Notable latency improvement (-24.0%) - exceeded target by 5%
3. Cost reduction (-26.7%) - 9.1% away from target
⚠️ **Challenges**:
1. Cost target not met ($0.011 vs $0.010 target) - addressable through migration to lighter models
📈 **Business Impact**:
- Improved user satisfaction (through accuracy improvement)
- Reduced operational costs (through latency and cost reduction)
- Improved scalability (through efficient resource usage)
🎯 **Next Steps**:
1. Validate migration to lighter models for cost reduction
2. Continuous monitoring and evaluation
3. Expansion to new use cases
---
Created: 2024-11-24 15:00:00
Creator: Claude Code (fine-tune skill)
```
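The mid-term recommendation in the report (a lighter model for intent classification, the larger model only for response generation) can be prototyped with a simple per-node model map, without touching the graph structure. A minimal sketch; the Haiku model ID is an assumption, so verify it against the current Anthropic model list before relying on it.

```python
# Sketch: route each node to a model tier without changing nodes or edges.
# The Haiku model ID below is an assumption; confirm current model names before use.
from langchain_anthropic import ChatAnthropic

NODE_MODELS = {
    "analyze_intent": "claude-3-5-haiku-20241022",      # cheap, fast classification
    "generate_response": "claude-3-5-sonnet-20241022",  # higher-quality generation
}

def llm_for(node_name: str, **kwargs) -> ChatAnthropic:
    """Return an LLM client configured with the model tier chosen for the node."""
    return ChatAnthropic(model=NODE_MODELS[node_name], **kwargs)

# Usage inside a node implementation:
# llm = llm_for("analyze_intent", temperature=0.3)
```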
### Step 13: Commit Code and Update Documentation
**Git Commit Example**:
```bash
# Commit changes
git add src/nodes/analyzer.py src/nodes/generator.py
git commit -m "feat: optimize LangGraph prompts for accuracy and latency
Iteration 1-3 of fine-tuning process:
- analyze_intent: added few-shot examples, JSON output, lower temperature
- generate_response: added conciseness guidelines, max_tokens limit
Results:
- Accuracy: 75.0% → 92.0% (+17.0%, goal 90% ✅)
- Latency: 2.5s → 1.9s (-0.6s, goal 2.0s ✅)
- Cost: $0.015 → $0.011 (-$0.004, goal $0.010 ⚠️)
Full report: evaluation_results/final_report.md"
# Commit evaluation results
git add evaluation_results/
git commit -m "docs: add fine-tuning evaluation results and final report"
# Add tag
git tag -a fine-tune-v1.0 -m "Fine-tuning completed: 92% accuracy achieved"
```
## Summary
Following this workflow enables:
- ✅ Systematic fine-tuning process execution
- ✅ Data-driven decision making
- ✅ Continuous improvement and verification
- ✅ Complete documentation and traceability