# Phase 1: Preparation and Analysis

A preparation phase that clarifies the optimization direction and identifies targets for improvement.

**Time Required**: 30 minutes to 1 hour

📋 **Related Documents**: Overall Workflow | Practical Examples
### Step 1: Read and Understand fine-tune.md

**Purpose**: Clarify the optimization direction

**Execution**:
```python
# Read .langgraph-master/fine-tune.md
file_path = ".langgraph-master/fine-tune.md"
with open(file_path, "r") as f:
    fine_tune_spec = f.read()

# Extract the following information:
# - Optimization goals (accuracy, latency, cost, etc.)
# - Evaluation methods (test cases, metrics, calculation methods)
# - Passing criteria (target values for each metric)
# - Test data location
```
**Typical fine-tune.md structure**:
```markdown
# Fine-Tuning Goals

## Optimization Objectives
- **Accuracy**: Improve user intent classification accuracy to 90% or higher
- **Latency**: Reduce response time to 2.0 seconds or less
- **Cost**: Reduce cost per request to $0.010 or less

## Evaluation Methods
- **Test Cases**: tests/evaluation/test_cases.json (20 cases)
- **Execution Command**: uv run python -m src.evaluate
- **Evaluation Script**: tests/evaluation/evaluator.py

## Evaluation Metrics

### Accuracy
- Calculation method: (Correct count / Total cases) × 100
- Target value: 90% or higher

### Latency
- Calculation method: Average time per execution
- Target value: 2.0 seconds or less

### Cost
- Calculation method: Total API cost / Total requests
- Target value: $0.010 or less

## Passing Criteria
All evaluation metrics must achieve their target values
```
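Once the spec has been read, the target values can be lifted into code for later comparison against evaluation results. The following is a minimal sketch that assumes the fine-tune.md structure shown above; the `TargetMetrics` container and the regexes are illustrative, not part of any library:

```python
import re
from dataclasses import dataclass

@dataclass
class TargetMetrics:
    """Illustrative container for targets parsed from fine-tune.md (assumption)."""
    accuracy_pct: float | None = None   # e.g. 90.0
    latency_s: float | None = None      # e.g. 2.0
    cost_usd: float | None = None       # e.g. 0.010

def parse_targets(spec: str) -> TargetMetrics:
    # These regexes assume the "Target value" lines shown above.
    targets = TargetMetrics()
    if m := re.search(r"Accuracy.*?Target value:\s*([\d.]+)%", spec, re.S):
        targets.accuracy_pct = float(m.group(1))
    if m := re.search(r"Latency.*?Target value:\s*([\d.]+)\s*seconds", spec, re.S):
        targets.latency_s = float(m.group(1))
    if m := re.search(r"Cost.*?Target value:\s*\$([\d.]+)", spec, re.S):
        targets.cost_usd = float(m.group(1))
    return targets
```

With the targets in hand, evaluation results from later phases can be checked mechanically against the passing criteria instead of by eye.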
### Step 2: Identify Optimization Targets with Serena MCP

**Purpose**: Comprehensively identify nodes calling LLMs

**Execution Steps**:
- **Search for LLM clients**

```python
# Use Serena MCP: find_symbol
# Search for ChatAnthropic, ChatOpenAI, ChatGoogleGenerativeAI, etc.
patterns = [
    "ChatAnthropic",
    "ChatOpenAI",
    "ChatGoogleGenerativeAI",
    "ChatVertexAI",
]

llm_usages = []
for pattern in patterns:
    results = serena.find_symbol(
        name_path=pattern,
        substring_matching=True,
        include_body=False,
    )
    llm_usages.extend(results)
```
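If Serena MCP is not available in your environment, a plain text scan over the source tree gives a rough approximation of the same search. This fallback is a sketch only: the `src` root directory is an assumption, and it reuses the `patterns` list from the block above:

```python
from pathlib import Path

# Fallback sketch: approximate find_symbol with a plain text scan.
# Assumes project sources live under src/ and reuses `patterns` from above.
def grep_llm_usages(root: str = "src") -> list[tuple[str, int, str]]:
    hits = []
    for path in Path(root).rglob("*.py"):
        for lineno, line in enumerate(path.read_text().splitlines(), start=1):
            if any(p in line for p in patterns):
                hits.append((str(path), lineno, line.strip()))
    return hits
```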
- **Identify prompt construction locations**

```python
# For each LLM call, investigate how prompts are constructed
for usage in llm_usages:
    # Get surrounding context with find_referencing_symbols
    context = serena.find_referencing_symbols(
        name_path=usage.name,
        relative_path=usage.file_path,
    )
    # Identify prompt templates and message construction logic:
    # - Use of ChatPromptTemplate
    # - SystemMessage, HumanMessage definitions
    # - Prompt construction with f-strings or format()
```
- **Per-node analysis** (a heuristic sketch of these checks follows this list)

```python
# Analyze LLM usage patterns within each node function:
# - Prompt clarity
# - Presence of few-shot examples
# - Structured output format
# - Parameter settings (temperature, max_tokens, etc.)
```
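The per-node checks can be roughed out with simple textual heuristics before a manual read-through. The `PromptAudit` record and the string checks below are hypothetical illustrations of the criteria above, not a reliable analyzer:

```python
from dataclasses import dataclass

@dataclass
class PromptAudit:
    """Hypothetical record of the per-node criteria listed above."""
    node: str
    has_few_shot: bool
    has_structured_output: bool
    sets_temperature: bool
    sets_max_tokens: bool

def audit_node_source(node_name: str, source: str) -> PromptAudit:
    # Crude textual heuristics; a real review should still read the code.
    return PromptAudit(
        node=node_name,
        has_few_shot="Example" in source or "few-shot" in source.lower(),
        has_structured_output="with_structured_output" in source
        or "JSON" in source,
        sets_temperature="temperature=" in source,
        sets_max_tokens="max_tokens=" in source,
    )
```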
**Example Output**:
```markdown
## LLM Call Location Analysis

### 1. analyze_intent node
- **File**: src/nodes/analyzer.py
- **Line numbers**: 25-45
- **LLM**: ChatAnthropic(model="claude-3-5-sonnet-20241022")
- **Prompt structure**:
    SystemMessage: "You are an intent analyzer..."
    HumanMessage: f"Analyze: {user_input}"
- **Improvement potential**: ⭐⭐⭐⭐⭐ (High)
  - Prompt is vague (criteria for "Analyze" are unclear)
  - No few-shot examples
  - Output format is free text
- **Estimated improvement effect**: Accuracy +10-15%

### 2. generate_response node
- **File**: src/nodes/generator.py
- **Line numbers**: 45-68
- **LLM**: ChatAnthropic(model="claude-3-5-sonnet-20241022")
- **Prompt structure**:
    ChatPromptTemplate.from_messages([
        ("system", "Generate helpful response..."),
        ("human", "{context}\n\nQuestion: {question}")
    ])
- **Improvement potential**: ⭐⭐⭐ (Medium)
  - Prompt is structured but lacks conciseness instructions
  - No max_tokens limit → risk of verbose output
- **Estimated improvement effect**: Latency -0.3-0.5s, Cost -20-30%
```
### Step 3: Create Optimization Target List
**Purpose**: Organize information to determine improvement priorities
**List Creation Template**:
```markdown
# Optimization Target List

## Node: analyze_intent

### Basic Information
- **File**: src/nodes/analyzer.py:25-45
- **Role**: Classify user input intent
- **LLM Model**: claude-3-5-sonnet-20241022
- **Current Parameters**: temperature=1.0, max_tokens=default

### Current Prompt
    SystemMessage(content="You are an intent analyzer. Analyze user input.")
    HumanMessage(content=f"Analyze: {user_input}")

### Issues
- Vague instructions: specific criteria for "Analyze" are unclear
- No few-shot examples: no expected output examples provided
- Undefined output format: unstructured free text
- High temperature: 1.0 is too high for a classification task

### Improvement Ideas
- Specify concrete classification categories
- Add 3-5 few-shot examples
- Specify a JSON output format
- Lower temperature to 0.3-0.5

### Estimated Improvement Effect
- Accuracy: +10-15% (current misclassification 20% → 5-10%)
- Latency: ±0 (no change)
- Cost: ±0 (no change)

### Priority
⭐⭐⭐⭐⭐ (Highest) - Direct impact on accuracy improvement

## Node: generate_response

### Basic Information
- **File**: src/nodes/generator.py:45-68
- **Role**: Generate the final user-facing response
- **LLM Model**: claude-3-5-sonnet-20241022
- **Current Parameters**: temperature=0.7, max_tokens=default

### Current Prompt
    ChatPromptTemplate.from_messages([
        ("system", "Generate helpful response based on context."),
        ("human", "{context}\n\nQuestion: {question}")
    ])

### Issues
- No verbosity control: no conciseness instructions
- max_tokens not set: risk of unnecessarily long output
- Undefined response style: no tone or style specifications

### Improvement Ideas
- Add length instructions such as "be concise" or "in 2-3 sentences"
- Limit max_tokens to 500
- Clarify the response style ("friendly", "professional", etc.)

### Estimated Improvement Effect
- Accuracy: ±0 (no change)
- Latency: -0.3-0.5s (due to reduced output tokens)
- Cost: -20-30% (due to reduced token count)

### Priority
⭐⭐⭐ (Medium) - Improves latency and cost
```
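To keep entries consistent across many nodes, the list can also be generated from a small record type. The `OptimizationTarget` dataclass and `render_target` helper below are a hypothetical sketch whose fields mirror the template above:

```python
from dataclasses import dataclass, field

@dataclass
class OptimizationTarget:
    """Hypothetical record mirroring the template fields above."""
    node: str
    location: str            # e.g. "src/nodes/analyzer.py:25-45"
    role: str
    model: str
    issues: list[str] = field(default_factory=list)
    ideas: list[str] = field(default_factory=list)
    priority: int = 3        # 1 (low) to 5 (highest)

def render_target(t: OptimizationTarget) -> str:
    """Render one entry of the optimization target list as markdown."""
    lines = [
        f"## Node: {t.node}",
        "### Basic Information",
        f"- **File**: {t.location}",
        f"- **Role**: {t.role}",
        f"- **LLM Model**: {t.model}",
        "### Issues",
        *(f"- {issue}" for issue in t.issues),
        "### Improvement Ideas",
        *(f"- {idea}" for idea in t.ideas),
        "### Priority",
        "⭐" * t.priority,
    ]
    return "\n".join(lines)
```

Sorting the resulting entries by priority gives a natural improvement order for the later phases of the workflow.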