# Phase 1: Preparation and Analysis

A preparation phase that clarifies the optimization direction and identifies the targets for improvement.

**Time Required**: 30 minutes - 1 hour

**📋 Related Documents**: [Overall Workflow](./workflow.md) | [Practical Examples](./examples.md)

---

### Step 1: Read and Understand fine-tune.md

**Purpose**: Clarify the optimization direction

**Execution**:
```python
# Read .langgraph-master/fine-tune.md
file_path = ".langgraph-master/fine-tune.md"
with open(file_path, "r") as f:
    fine_tune_spec = f.read()

# Extract the following information:
# - Optimization goals (accuracy, latency, cost, etc.)
# - Evaluation methods (test cases, metrics, calculation methods)
# - Passing criteria (target values for each metric)
# - Test data location
```

**Typical fine-tune.md structure**:
```markdown
# Fine-Tuning Goals

## Optimization Objectives
- **Accuracy**: Improve user intent classification accuracy to 90% or higher
- **Latency**: Reduce response time to 2.0 seconds or less
- **Cost**: Reduce cost per request to $0.010 or less

## Evaluation Methods
- **Test Cases**: tests/evaluation/test_cases.json (20 cases)
- **Execution Command**: uv run python -m src.evaluate
- **Evaluation Script**: tests/evaluation/evaluator.py

## Evaluation Metrics

### Accuracy
- Calculation method: (Correct count / Total cases) × 100
- Target value: 90% or higher

### Latency
- Calculation method: Average time per execution
- Target value: 2.0 seconds or less

### Cost
- Calculation method: Total API cost / Total requests
- Target value: $0.010 or less

## Passing Criteria
All evaluation metrics must achieve their target values
```
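
Once read, the spec can be reduced to machine-checkable targets. Below is a minimal parsing sketch, assuming the bullet format shown above; the regex and the `Target` class are illustrative, not part of the project:

```python
import re
from dataclasses import dataclass

@dataclass
class Target:
    metric: str   # e.g. "Accuracy"
    value: float  # numeric target value parsed from the spec
    unit: str     # "%", "seconds", or "$" (inferred from the spec's own units)

def parse_targets(spec: str) -> list[Target]:
    """Pull '- Target value: ...' lines out of a fine-tune.md-style spec."""
    targets, metric = [], None
    for line in spec.splitlines():
        if line.startswith("### "):
            metric = line[4:].strip()  # e.g. "Accuracy", "Latency", "Cost"
        m = re.search(r"Target value:\s*\$?([\d.]+)\s*(%|seconds)?", line)
        if m and metric:
            targets.append(Target(metric, float(m.group(1)), m.group(2) or "$"))
    return targets

# For the structure above, parse_targets(fine_tune_spec) yields:
# [Target('Accuracy', 90.0, '%'), Target('Latency', 2.0, 'seconds'), Target('Cost', 0.01, '$')]
```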

### Step 2: Identify Optimization Targets with Serena MCP

**Purpose**: Comprehensively identify the nodes that call LLMs

**Execution Steps**:

1. **Search for LLM clients**

   ```python
   # Use Serena MCP: find_symbol
   # Search for ChatAnthropic, ChatOpenAI, ChatGoogleGenerativeAI, etc.

   patterns = [
       "ChatAnthropic",
       "ChatOpenAI",
       "ChatGoogleGenerativeAI",
       "ChatVertexAI",
   ]

   llm_usages = []
   for pattern in patterns:
       results = serena.find_symbol(
           name_path=pattern,
           substring_matching=True,
           include_body=False,
       )
       llm_usages.extend(results)
   ```

2. **Identify prompt construction locations**

   ```python
   # For each LLM call, investigate how its prompt is constructed
   for usage in llm_usages:
       # Get the surrounding context with find_referencing_symbols
       context = serena.find_referencing_symbols(
           name_path=usage.name,
           relative_path=usage.file_path,
       )

       # Identify prompt templates and message construction logic:
       # - Use of ChatPromptTemplate
       # - SystemMessage / HumanMessage definitions
       # - Prompt construction with f-strings or format()
   ```

3. **Per-node analysis** (a heuristic sketch follows this list)

   ```python
   # Analyze the LLM usage pattern within each node function:
   # - Prompt clarity
   # - Presence of few-shot examples
   # - Structured output format
   # - Parameter settings (temperature, max_tokens, etc.)
   ```
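
The checklist in item 3 can be approximated with simple source-level signals before reading each node in detail. A rough sketch, assuming each candidate's source body has been fetched as a string (for example via find_symbol with include_body=True); the helper and its signal names are hypothetical:

```python
def analyze_node_source(source: str) -> dict[str, bool]:
    """Crude, source-level signals for the per-node checklist (hypothetical helper)."""
    return {
        "uses_prompt_template": "ChatPromptTemplate" in source,
        "has_few_shot_examples": "example" in source.lower(),  # very crude proxy
        "has_structured_output": "with_structured_output" in source,
        "sets_temperature": "temperature=" in source,
        "sets_max_tokens": "max_tokens=" in source,
    }

# Nodes where every signal is False are usually the highest-value targets.
```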

**Example Output**:
````markdown
## LLM Call Location Analysis

### 1. analyze_intent node
- **File**: src/nodes/analyzer.py
- **Line numbers**: 25-45
- **LLM**: ChatAnthropic(model="claude-3-5-sonnet-20241022")
- **Prompt structure**:
  ```python
  SystemMessage: "You are an intent analyzer..."
  HumanMessage: f"Analyze: {user_input}"
  ```
- **Improvement potential**: ⭐⭐⭐⭐⭐ (High)
  - Prompt is vague ("Analyze" criteria unclear)
  - No few-shot examples
  - Output format is free text
- **Estimated improvement effect**: Accuracy +10-15%

### 2. generate_response node
- **File**: src/nodes/generator.py
- **Line numbers**: 45-68
- **LLM**: ChatAnthropic(model="claude-3-5-sonnet-20241022")
- **Prompt structure**:
  ```python
  ChatPromptTemplate.from_messages([
      ("system", "Generate helpful response..."),
      ("human", "{context}\n\nQuestion: {question}")
  ])
  ```
- **Improvement potential**: ⭐⭐⭐ (Medium)
  - Prompt is structured but lacks conciseness instructions
  - No max_tokens limit → possibility of verbose output
- **Estimated improvement effect**: Latency -0.3-0.5s, Cost -20-30%
````

### Step 3: Create Optimization Target List

**Purpose**: Organize the information needed to determine improvement priorities

**List Creation Template**:
````markdown
# Optimization Target List

## Node: analyze_intent

### Basic Information
- **File**: src/nodes/analyzer.py:25-45
- **Role**: Classify user input intent
- **LLM Model**: claude-3-5-sonnet-20241022
- **Current Parameters**: temperature=1.0, max_tokens=default

### Current Prompt
```python
SystemMessage(content="You are an intent analyzer. Analyze user input.")
HumanMessage(content=f"Analyze: {user_input}")
```

### Issues
1. **Vague instructions**: Specific criteria for "Analyze" are unclear
2. **No few-shot examples**: No expected output examples
3. **Undefined output format**: Unstructured free text
4. **High temperature**: 1.0 is too high for a classification task

### Improvement Ideas
1. Specify concrete classification categories
2. Add 3-5 few-shot examples
3. Specify a JSON output format
4. Lower temperature to 0.3-0.5

### Estimated Improvement Effect
- **Accuracy**: +10-15% (current misclassification 20% → 5-10%)
- **Latency**: ±0 (no change)
- **Cost**: ±0 (no change)

### Priority
⭐⭐⭐⭐⭐ (Highest) - Direct impact on accuracy improvement

---

## Node: generate_response

### Basic Information
- **File**: src/nodes/generator.py:45-68
- **Role**: Generate the final user-facing response
- **LLM Model**: claude-3-5-sonnet-20241022
- **Current Parameters**: temperature=0.7, max_tokens=default

### Current Prompt
```python
ChatPromptTemplate.from_messages([
    ("system", "Generate helpful response based on context."),
    ("human", "{context}\n\nQuestion: {question}")
])
```

### Issues
1. **No verbosity control**: No conciseness instructions
2. **max_tokens not set**: Risk of unnecessarily long output
3. **Undefined response style**: No tone or style specifications

### Improvement Ideas
1. Add length instructions such as "be concise" or "in 2-3 sentences"
2. Limit max_tokens to 500
3. Clarify the response style ("friendly", "professional", etc.)

### Estimated Improvement Effect
- **Accuracy**: ±0 (no change)
- **Latency**: -0.3-0.5s (due to reduced output tokens)
- **Cost**: -20-30% (due to reduced token count)

### Priority
⭐⭐⭐ (Medium) - Improves latency and cost
````
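
To decide the order of implementation work, the per-node entries can be held in a small structure and sorted by priority. A minimal sketch; the `OptimizationTarget` class and its fields are illustrative, not part of the project:

```python
from dataclasses import dataclass

@dataclass
class OptimizationTarget:
    node: str
    location: str      # "file:start-end"
    issues: list[str]
    priority: int      # 1-5, mirroring the star rating in the template

targets = [
    OptimizationTarget(
        node="analyze_intent",
        location="src/nodes/analyzer.py:25-45",
        issues=["vague instructions", "no few-shot examples",
                "free-text output", "temperature=1.0"],
        priority=5,
    ),
    OptimizationTarget(
        node="generate_response",
        location="src/nodes/generator.py:45-68",
        issues=["no verbosity control", "max_tokens unset", "style undefined"],
        priority=3,
    ),
]

# Tackle the highest-priority targets first.
for t in sorted(targets, key=lambda t: t.priority, reverse=True):
    print(f"{'⭐' * t.priority} {t.node} ({t.location})")
```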