# Phase 1: Preparation and Analysis

This preparation phase clarifies the optimization direction and identifies the targets for improvement.

**Time Required**: 30 minutes to 1 hour

**📋 Related Documents**: [Overall Workflow](./workflow.md) | [Practical Examples](./examples.md)

---

## Phase 1: Preparation and Analysis

### Step 1: Read and Understand fine-tune.md

**Purpose**: Clarify the optimization direction

**Execution**:
```python
# Read .langgraph-master/fine-tune.md
file_path = ".langgraph-master/fine-tune.md"
with open(file_path, "r") as f:
    fine_tune_spec = f.read()

# Extract the following information:
# - Optimization goals (accuracy, latency, cost, etc.)
# - Evaluation methods (test cases, metrics, calculation methods)
# - Passing criteria (target values for each metric)
# - Test data location
```

**Typical fine-tune.md structure**:
```markdown
# Fine-Tuning Goals

## Optimization Objectives
- **Accuracy**: Improve user intent classification accuracy to 90% or higher
- **Latency**: Reduce response time to 2.0 seconds or less
- **Cost**: Reduce cost per request to $0.010 or less

## Evaluation Methods
- **Test Cases**: tests/evaluation/test_cases.json (20 cases)
- **Execution Command**: uv run python -m src.evaluate
- **Evaluation Script**: tests/evaluation/evaluator.py

## Evaluation Metrics

### Accuracy
- Calculation method: (Correct count / Total cases) × 100
- Target value: 90% or higher

### Latency
- Calculation method: Average time per execution
- Target value: 2.0 seconds or less

### Cost
- Calculation method: Total API cost / Total requests
- Target value: $0.010 or less

## Passing Criteria
All evaluation metrics must achieve their target values
```
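
To turn the spec into something later phases can check against, it can help to pull the target values into a small dict. A minimal sketch, reusing the `fine_tune_spec` string read above and assuming the spec follows the example layout (the regular expressions are illustrative, not part of the skill):

```python
import re

# A minimal sketch tied to the example layout above; the patterns
# are illustrative and would need adjusting for a different spec.
def extract_targets(spec: str) -> dict:
    targets = {}
    if m := re.search(r"Accuracy.*?(\d+)%", spec):
        targets["accuracy_pct"] = float(m.group(1))
    if m := re.search(r"Latency.*?([\d.]+) seconds", spec):
        targets["latency_s"] = float(m.group(1))
    if m := re.search(r"Cost.*?\$([\d.]+)", spec):
        targets["cost_usd"] = float(m.group(1))
    return targets

print(extract_targets(fine_tune_spec))
# e.g. {'accuracy_pct': 90.0, 'latency_s': 2.0, 'cost_usd': 0.01}
```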

### Step 2: Identify Optimization Targets with Serena MCP

**Purpose**: Identify every node that calls an LLM

**Execution Steps**:

1. **Search for LLM clients**

   ```python
   # Use Serena MCP: find_symbol
   # Search for ChatAnthropic, ChatOpenAI, ChatGoogleGenerativeAI, etc.

   patterns = [
       "ChatAnthropic",
       "ChatOpenAI",
       "ChatGoogleGenerativeAI",
       "ChatVertexAI",
   ]

   llm_usages = []
   for pattern in patterns:
       results = serena.find_symbol(
           name_path=pattern,
           substring_matching=True,
           include_body=False
       )
       llm_usages.extend(results)
   ```

2. **Identify prompt construction locations**

   ```python
   # For each LLM call, investigate how the prompt is constructed
   for usage in llm_usages:
       # Get the surrounding context with find_referencing_symbols
       context = serena.find_referencing_symbols(
           name_path=usage.name,
           relative_path=usage.file_path
       )

       # Identify prompt templates and message construction logic:
       # - Use of ChatPromptTemplate
       # - SystemMessage / HumanMessage definitions
       # - Prompt construction with f-strings or format()
   ```

3. **Per-node analysis** (see the sketch after this list)

   ```python
   # Analyze the LLM usage pattern within each node function:
   # - Prompt clarity
   # - Presence of few-shot examples
   # - Structured output format
   # - Parameter settings (temperature, max_tokens, etc.)
   ```
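
For the parameter check in particular, a lightweight scan of a node's source can complement the symbol-level queries. A rough sketch, assuming node bodies live in plain `.py` files; it only catches explicitly set keyword arguments, so library defaults remain invisible:

```python
import re
from pathlib import Path

# Heuristic pass over a node's source; it only flags explicitly set
# keyword arguments, so library defaults stay invisible to it.
def scan_llm_params(path: str) -> dict:
    source = Path(path).read_text()
    params = {}
    for name in ("temperature", "max_tokens", "top_p"):
        m = re.search(rf"{name}\s*=\s*([^,)\s]+)", source)
        params[name] = m.group(1) if m else "default"
    return params

print(scan_llm_params("src/nodes/analyzer.py"))
# e.g. {'temperature': '1.0', 'max_tokens': 'default', 'top_p': 'default'}
```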

**Example Output**:
````markdown
## LLM Call Location Analysis

### 1. analyze_intent node
- **File**: src/nodes/analyzer.py
- **Line numbers**: 25-45
- **LLM**: ChatAnthropic(model="claude-3-5-sonnet-20241022")
- **Prompt structure**:
  ```python
  SystemMessage: "You are an intent analyzer..."
  HumanMessage: f"Analyze: {user_input}"
  ```
- **Improvement potential**: ⭐⭐⭐⭐⭐ (High)
  - Prompt is vague (criteria for "Analyze" are unclear)
  - No few-shot examples
  - Output format is free text
- **Estimated improvement effect**: Accuracy +10-15%

### 2. generate_response node
- **File**: src/nodes/generator.py
- **Line numbers**: 45-68
- **LLM**: ChatAnthropic(model="claude-3-5-sonnet-20241022")
- **Prompt structure**:
  ```python
  ChatPromptTemplate.from_messages([
      ("system", "Generate helpful response..."),
      ("human", "{context}\n\nQuestion: {question}")
  ])
  ```
- **Improvement potential**: ⭐⭐⭐ (Medium)
  - Prompt is structured but lacks conciseness instructions
  - No max_tokens limit → risk of verbose output
- **Estimated improvement effect**: Latency -0.3-0.5s, Cost -20-30%
````
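
One way to keep these findings machine-readable before writing the list in Step 3 is to record each node as a small structure and sort by priority. A sketch; the `Finding` fields and scores are assumptions for this example, not a prescribed schema:

```python
from dataclasses import dataclass, field

# Assumed record shape for one finding; not a prescribed schema.
@dataclass
class Finding:
    node: str
    file: str
    issues: list[str] = field(default_factory=list)
    priority: int = 0  # 1 (low) to 5 (high)

findings = [
    Finding("analyze_intent", "src/nodes/analyzer.py",
            ["vague prompt", "no few-shot examples", "free-text output"], 5),
    Finding("generate_response", "src/nodes/generator.py",
            ["no verbosity control", "max_tokens unset"], 3),
]

# Tackle the highest-priority nodes first in Step 3.
for finding in sorted(findings, key=lambda f: f.priority, reverse=True):
    print(finding.node, finding.priority, finding.issues)
```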

### Step 3: Create Optimization Target List

**Purpose**: Organize the information needed to determine improvement priorities

**List Creation Template**:
````markdown
# Optimization Target List

## Node: analyze_intent

### Basic Information
- **File**: src/nodes/analyzer.py:25-45
- **Role**: Classify user input intent
- **LLM Model**: claude-3-5-sonnet-20241022
- **Current Parameters**: temperature=1.0, max_tokens=default

### Current Prompt
```python
SystemMessage(content="You are an intent analyzer. Analyze user input.")
HumanMessage(content=f"Analyze: {user_input}")
```

### Issues
1. **Vague instructions**: No specific criteria for what "Analyze" means
2. **No few-shot examples**: No expected output examples
3. **Undefined output format**: Unstructured free text
4. **High temperature**: 1.0 is too high for a classification task

### Improvement Ideas
1. Specify concrete classification categories
2. Add 3-5 few-shot examples
3. Specify a JSON output format
4. Lower temperature to 0.3-0.5

### Estimated Improvement Effect
- **Accuracy**: +10-15% (current misclassification 20% → 5-10%)
- **Latency**: ±0 (no change)
- **Cost**: ±0 (no change)

### Priority
⭐⭐⭐⭐⭐ (Highest) - Direct impact on accuracy improvement

---

## Node: generate_response

### Basic Information
- **File**: src/nodes/generator.py:45-68
- **Role**: Generate the final user-facing response
- **LLM Model**: claude-3-5-sonnet-20241022
- **Current Parameters**: temperature=0.7, max_tokens=default

### Current Prompt
```python
ChatPromptTemplate.from_messages([
    ("system", "Generate helpful response based on context."),
    ("human", "{context}\n\nQuestion: {question}")
])
```

### Issues
1. **No verbosity control**: No conciseness instructions
2. **max_tokens not set**: Risk of unnecessarily long output
3. **Undefined response style**: No tone or style specifications

### Improvement Ideas
1. Add length instructions such as "be concise" or "answer in 2-3 sentences"
2. Limit max_tokens to 500
3. Clarify the response style ("friendly", "professional", etc.)

### Estimated Improvement Effect
- **Accuracy**: ±0 (no change)
- **Latency**: -0.3-0.5s (from reduced output tokens)
- **Cost**: -20-30% (from reduced token count)

### Priority
⭐⭐⭐ (Medium) - Improves latency and cost
````
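
To make the improvement ideas concrete, here is a hypothetical rework of the `analyze_intent` call applying ideas 1-4 above; the categories, few-shot examples, and JSON schema are placeholders, not values taken from the spec:

```python
from langchain_anthropic import ChatAnthropic
from langchain_core.messages import SystemMessage, HumanMessage

# Hypothetical rework of analyze_intent applying ideas 1-4 above.
# Categories, examples, and the JSON schema are placeholders.
llm = ChatAnthropic(model="claude-3-5-sonnet-20241022", temperature=0.3)

SYSTEM_PROMPT = """You are an intent classifier.
Classify the user input into exactly one category:
question, request, complaint, or other.

Examples:
Input: "How do I reset my password?" -> {"intent": "question"}
Input: "Please cancel my order." -> {"intent": "request"}
Input: "This app keeps crashing!" -> {"intent": "complaint"}

Respond with JSON only: {"intent": "<category>"}"""

def analyze_intent(user_input: str) -> str:
    response = llm.invoke([
        SystemMessage(content=SYSTEM_PROMPT),
        HumanMessage(content=f"Input: {user_input}"),
    ])
    return response.content
```

The same pattern applies to `generate_response`: set `max_tokens=500` on the client and add a conciseness instruction to the system prompt.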