# Phase 1: Preparation and Analysis
Preparation phase to clarify the optimization direction and identify concrete targets for improvement.
**Time Required**: 30 minutes to 1 hour
**📋 Related Documents**: [Overall Workflow](./workflow.md) | [Practical Examples](./examples.md)
---
## Phase 1: Preparation and Analysis
### Step 1: Read and Understand fine-tune.md
**Purpose**: Clarify optimization direction
**Execution**:
```python
# Read .langgraph-master/fine-tune.md
file_path = ".langgraph-master/fine-tune.md"
with open(file_path, "r") as f:
    fine_tune_spec = f.read()
# Extract the following information:
# - Optimization goals (accuracy, latency, cost, etc.)
# - Evaluation methods (test cases, metrics, calculation methods)
# - Passing criteria (target values for each metric)
# - Test data location
```
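For a concrete starting point, the target values can be pulled out of the spec programmatically. A minimal sketch, assuming the `### <Metric>` / `- Target value:` layout shown in the typical structure below; adapt the patterns to your actual file:

```python
import re

def extract_targets(spec: str) -> dict:
    """Collect 'Target value:' lines, keyed by the nearest ### heading."""
    targets = {}
    metric = None
    for line in spec.splitlines():
        heading = re.match(r"^###\s+(.+)", line)
        if heading:
            metric = heading.group(1).strip()
        value = re.match(r"^-\s*Target value:\s*(.+)", line)
        if value and metric:
            targets[metric] = value.group(1).strip()
    return targets

# e.g. {"Accuracy": "90% or higher", "Latency": "2.0 seconds or less", ...}
print(extract_targets(fine_tune_spec))
```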
**Typical fine-tune.md structure**:
```markdown
# Fine-Tuning Goals
## Optimization Objectives
- **Accuracy**: Improve user intent classification accuracy to 90% or higher
- **Latency**: Reduce response time to 2.0 seconds or less
- **Cost**: Reduce cost per request to $0.010 or less
## Evaluation Methods
- **Test Cases**: tests/evaluation/test_cases.json (20 cases)
- **Execution Command**: uv run python -m src.evaluate
- **Evaluation Script**: tests/evaluation/evaluator.py
## Evaluation Metrics
### Accuracy
- Calculation method: (Correct count / Total cases) × 100
- Target value: 90% or higher
### Latency
- Calculation method: Average time per execution
- Target value: 2.0 seconds or less
### Cost
- Calculation method: Total API cost / Total requests
- Target value: $0.010 or less
## Passing Criteria
All evaluation metrics must achieve their target values
```
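Before touching any code, run the evaluation once and record a baseline to compare against. A minimal sketch, assuming the `uv run python -m src.evaluate` command from the spec prints its metrics to stdout:

```python
import subprocess

# Capture a pre-optimization baseline using the evaluation command
# from fine-tune.md. Assumes the script prints metrics to stdout.
result = subprocess.run(
    ["uv", "run", "python", "-m", "src.evaluate"],
    capture_output=True,
    text=True,
    check=True,
)
print(result.stdout)  # save this report for before/after comparison
```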
### Step 2: Identify Optimization Targets with Serena MCP
**Purpose**: Comprehensively identify nodes calling LLMs
**Execution Steps**:
1. **Search for LLM clients**
```python
# Use Serena MCP: find_symbol
# Search for ChatAnthropic, ChatOpenAI, ChatGoogleGenerativeAI, etc.
patterns = [
    "ChatAnthropic",
    "ChatOpenAI",
    "ChatGoogleGenerativeAI",
    "ChatVertexAI",
]

llm_usages = []
for pattern in patterns:
    results = serena.find_symbol(
        name_path=pattern,
        substring_matching=True,
        include_body=False,
    )
    llm_usages.extend(results)
```
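If Serena MCP is not available, a plain text search over the source tree gives a rough equivalent. A sketch using only the standard library, reusing the `patterns` list above and assuming the project code lives under `src/`:

```python
from pathlib import Path

# Fallback: plain text search for LLM client names when Serena MCP
# is unavailable. Records (file, line number, matched pattern).
matches = []
for path in Path("src").rglob("*.py"):
    for lineno, line in enumerate(path.read_text().splitlines(), start=1):
        for pattern in patterns:
            if pattern in line:
                matches.append((str(path), lineno, pattern))

for file, lineno, pattern in matches:
    print(f"{file}:{lineno}: {pattern}")
```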
2. **Identify prompt construction locations**
```python
# For each LLM call, investigate how prompts are constructed
for usage in llm_usages:
    # Get surrounding context with find_referencing_symbols
    context = serena.find_referencing_symbols(
        name_path=usage.name,
        relative_path=usage.file_path,
    )
    # Identify prompt templates and message construction logic:
    # - Use of ChatPromptTemplate
    # - SystemMessage, HumanMessage definitions
    # - Prompt construction with f-strings or format()
```
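For small codebases, a direct regex scan for the common construction idioms complements the symbol-based search. A sketch, again assuming the code lives under `src/`:

```python
import re
from pathlib import Path

# Regexes for common LangChain prompt-construction idioms.
PROMPT_IDIOMS = {
    "ChatPromptTemplate": re.compile(r"ChatPromptTemplate\.from_messages"),
    "SystemMessage": re.compile(r"\bSystemMessage\s*\("),
    "HumanMessage": re.compile(r"\bHumanMessage\s*\("),
}

for path in Path("src").rglob("*.py"):
    for lineno, line in enumerate(path.read_text().splitlines(), start=1):
        for label, idiom in PROMPT_IDIOMS.items():
            if idiom.search(line):
                print(f"{path}:{lineno}: {label}")
```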
3. **Per-node analysis**
```python
# Analyze LLM usage patterns within each node function
# - Prompt clarity
# - Presence of few-shot examples
# - Structured output format
# - Parameter settings (temperature, max_tokens, etc.)
```
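To keep the per-node findings comparable, the checks can be recorded in a small data structure. A sketch; the fields and scoring are illustrative, not part of any API:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class NodeAnalysis:
    """Checklist for one node's LLM usage (illustrative fields)."""
    name: str
    file: str
    has_clear_instructions: bool
    has_few_shot_examples: bool
    has_structured_output: bool
    temperature: Optional[float] = None

    def improvement_potential(self) -> int:
        """Crude 0-4 score: one point per missing best practice."""
        return (
            (not self.has_clear_instructions)
            + (not self.has_few_shot_examples)
            + (not self.has_structured_output)
            + (self.temperature is not None and self.temperature > 0.7)
        )
```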
**Example Output**:
````markdown
## LLM Call Location Analysis
### 1. analyze_intent node
- **File**: src/nodes/analyzer.py
- **Line numbers**: 25-45
- **LLM**: ChatAnthropic(model="claude-3-5-sonnet-20241022")
- **Prompt structure**:
```python
SystemMessage: "You are an intent analyzer..."
HumanMessage: f"Analyze: {user_input}"
```
- **Improvement potential**: ⭐⭐⭐⭐⭐ (High)
- Prompt is vague ("Analyze" criteria unclear)
- No few-shot examples
- Output format is free text
- **Estimated improvement effect**: Accuracy +10-15%
### 2. generate_response node
- **File**: src/nodes/generator.py
- **Line numbers**: 45-68
- **LLM**: ChatAnthropic(model="claude-3-5-sonnet-20241022")
- **Prompt structure**:
```python
ChatPromptTemplate.from_messages([
("system", "Generate helpful response..."),
("human", "{context}\n\nQuestion: {question}")
])
```
- **Improvement potential**: ⭐⭐⭐ (Medium)
- Prompt is structured but lacks conciseness instructions
- No max_tokens limit → possibility of verbose output
- **Estimated improvement effect**: Latency -0.3 to -0.5s, Cost -20 to -30%
````
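The "output format is free text" issue flagged for analyze_intent is typically fixed with structured output. A hedged sketch of what that could look like, using LangChain's `with_structured_output`; the `IntentResult` schema and the example input are hypothetical:

```python
from langchain_anthropic import ChatAnthropic
from pydantic import BaseModel, Field

class IntentResult(BaseModel):
    """Hypothetical output schema for the analyze_intent node."""
    intent: str = Field(description="One of the defined intent categories")
    confidence: float = Field(ge=0.0, le=1.0)

llm = ChatAnthropic(model="claude-3-5-sonnet-20241022", temperature=0.3)
structured_llm = llm.with_structured_output(IntentResult)

result = structured_llm.invoke("Analyze: How do I reset my password?")
print(result.intent, result.confidence)
```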
### Step 3: Create Optimization Target List
**Purpose**: Organize information to determine improvement priorities
**List Creation Template**:
````markdown
# Optimization Target List
## Node: analyze_intent
### Basic Information
- **File**: src/nodes/analyzer.py:25-45
- **Role**: Classify user input intent
- **LLM Model**: claude-3-5-sonnet-20241022
- **Current Parameters**: temperature=1.0, max_tokens=default
### Current Prompt
```python
SystemMessage(content="You are an intent analyzer. Analyze user input.")
HumanMessage(content=f"Analyze: {user_input}")
```
### Issues
1. **Vague instructions**: Specific criteria for "Analyze" unclear
2. **No few-shot**: No expected output examples
3. **Undefined output format**: Unstructured free text
4. **High temperature**: 1.0 is too high for classification tasks
### Improvement Ideas
1. Specify concrete classification categories
2. Add 3-5 few-shot examples
3. Specify JSON output format
4. Lower temperature to 0.3-0.5
### Estimated Improvement Effect
- **Accuracy**: +10-15% (Current misclassification 20% → 5-10%)
- **Latency**: ±0 (No change)
- **Cost**: ±0 (No change)
### Priority
⭐⭐⭐⭐⭐ (Highest) - Direct impact on accuracy improvement
---
## Node: generate_response
### Basic Information
- **File**: src/nodes/generator.py:45-68
- **Role**: Generate final user-facing response
- **LLM Model**: claude-3-5-sonnet-20241022
- **Current Parameters**: temperature=0.7, max_tokens=default
### Current Prompt
```python
ChatPromptTemplate.from_messages([
("system", "Generate helpful response based on context."),
("human", "{context}\n\nQuestion: {question}")
])
```
### Issues
1. **No verbosity control**: No conciseness instructions
2. **max_tokens not set**: Possibility of unnecessarily long output
3. **Undefined response style**: No tone or style specifications
### Improvement Ideas
1. Add length instructions such as "be concise" or "in 2-3 sentences"
2. Limit max_tokens to 500
3. Clarify the response style ("friendly", "professional", etc.)
### Estimated Improvement Effect
- **Accuracy**: ±0 (No change)
- **Latency**: -0.3 to -0.5s (Due to reduced output tokens)
- **Cost**: -20 to -30% (Due to reduced token count)
### Priority
⭐⭐⭐ (Medium) - Improvement in latency and cost
````
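Putting the generate_response improvement ideas together, a sketch of a revised node setup; the conciseness wording and the 500-token cap follow the suggestions above, everything else is illustrative:

```python
from langchain_anthropic import ChatAnthropic
from langchain_core.prompts import ChatPromptTemplate

# Revised generate_response setup: conciseness instruction,
# explicit style, and a max_tokens cap for latency and cost.
prompt = ChatPromptTemplate.from_messages([
    (
        "system",
        "Generate a helpful, professional response based on the context. "
        "Be concise: answer in 2-3 sentences.",
    ),
    ("human", "{context}\n\nQuestion: {question}"),
])

llm = ChatAnthropic(
    model="claude-3-5-sonnet-20241022",
    temperature=0.7,
    max_tokens=500,  # cap output length
)

chain = prompt | llm
# response = chain.invoke({"context": "...", "question": "..."})
```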