Initial commit

Zhongwei Li
2025-11-29 18:45:58 +08:00
commit 4b6db3349f
68 changed files with 15165 additions and 0 deletions

# Phase 1: Preparation and Analysis
This preparation phase clarifies the optimization direction and identifies the targets for improvement.
**Time Required**: 30 minutes to 1 hour
**📋 Related Documents**: [Overall Workflow](./workflow.md) | [Practical Examples](./examples.md)
---
## Phase 1: Preparation and Analysis
### Step 1: Read and Understand fine-tune.md
**Purpose**: Clarify optimization direction
**Execution**:
```python
# Read .langgraph-master/fine-tune.md
file_path = ".langgraph-master/fine-tune.md"
with open(file_path, "r") as f:
fine_tune_spec = f.read()
# Extract the following information:
# - Optimization goals (accuracy, latency, cost, etc.)
# - Evaluation methods (test cases, metrics, calculation methods)
# - Passing criteria (target values for each metric)
# - Test data location
```
**Typical fine-tune.md structure**:
```markdown
# Fine-Tuning Goals
## Optimization Objectives
- **Accuracy**: Improve user intent classification accuracy to 90% or higher
- **Latency**: Reduce response time to 2.0 seconds or less
- **Cost**: Reduce cost per request to $0.010 or less
## Evaluation Methods
- **Test Cases**: tests/evaluation/test_cases.json (20 cases)
- **Execution Command**: uv run python -m src.evaluate
- **Evaluation Script**: tests/evaluation/evaluator.py
## Evaluation Metrics
### Accuracy
- Calculation method: (Correct count / Total cases) × 100
- Target value: 90% or higher
### Latency
- Calculation method: Average time per execution
- Target value: 2.0 seconds or less
### Cost
- Calculation method: Total API cost / Total requests
- Target value: $0.010 or less
## Passing Criteria
All evaluation metrics must achieve their target values
```
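As a quick illustration, the target values can be pulled out of a spec with this structure using a regex. This is a minimal sketch that assumes the exact `### <Metric>` / `- Target value:` layout shown above; `extract_targets` is a hypothetical helper, not part of the workflow's tooling:
```python
import re

def extract_targets(spec: str) -> dict[str, str]:
    # Assumes the "### <Metric>" / "- Target value: ..." layout shown above
    return {
        metric.lower(): value.strip()
        for metric, value in re.findall(
            r"### (\w+)\n.*?- Target value: ([^\n]+)", spec, flags=re.DOTALL
        )
    }

# With fine_tune_spec read in Step 1:
# extract_targets(fine_tune_spec)
# -> {"accuracy": "90% or higher",
#     "latency": "2.0 seconds or less",
#     "cost": "$0.010 or less"}
```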
### Step 2: Identify Optimization Targets with Serena MCP
**Purpose**: Identify all nodes that call LLMs
**Execution Steps**:
1. **Search for LLM clients**
```python
# Use Serena MCP: find_symbol
# ("serena" below is shorthand for issuing Serena MCP tool calls,
#  not an importable Python object)
# Search for ChatAnthropic, ChatOpenAI, ChatGoogleGenerativeAI, etc.
patterns = [
"ChatAnthropic",
"ChatOpenAI",
"ChatGoogleGenerativeAI",
"ChatVertexAI"
]
llm_usages = []
for pattern in patterns:
results = serena.find_symbol(
name_path=pattern,
substring_matching=True,
include_body=False
)
llm_usages.extend(results)
```
2. **Identify prompt construction locations**
```python
# For each LLM call, investigate how prompts are constructed
for usage in llm_usages:
# Get surrounding context with find_referencing_symbols
context = serena.find_referencing_symbols(
name_path=usage.name,
relative_path=usage.file_path
)
# Identify prompt templates and message construction logic
# - Use of ChatPromptTemplate
# - SystemMessage, HumanMessage definitions
# - Prompt construction with f-strings or format()
```
3. **Per-node analysis**
```python
# Analyze LLM usage patterns within each node function
# - Prompt clarity
# - Presence of few-shot examples
# - Structured output format
# - Parameter settings (temperature, max_tokens, etc.)
```
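If Serena MCP is unavailable, a plain-text scan over the source tree gives a rough first pass at the same list. A minimal sketch, assuming the sources live under `src/`:
```python
# Fallback sketch: grep-style scan for LLM client usage without Serena MCP
import re
from pathlib import Path

LLM_CLIENTS = re.compile(
    r"\b(ChatAnthropic|ChatOpenAI|ChatGoogleGenerativeAI|ChatVertexAI)\b"
)

for path in Path("src").rglob("*.py"):
    text = path.read_text(encoding="utf-8")
    for lineno, line in enumerate(text.splitlines(), start=1):
        if LLM_CLIENTS.search(line):
            print(f"{path}:{lineno}: {line.strip()}")
```
Unlike `find_symbol`, this provides no symbol-level context, so it only seeds the list for manual review.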
**Example Output**:
````markdown
## LLM Call Location Analysis
### 1. analyze_intent node
- **File**: src/nodes/analyzer.py
- **Line numbers**: 25-45
- **LLM**: ChatAnthropic(model="claude-3-5-sonnet-20241022")
- **Prompt structure**:
```python
SystemMessage: "You are an intent analyzer..."
HumanMessage: f"Analyze: {user_input}"
```
- **Improvement potential**: ⭐⭐⭐⭐⭐ (High)
- Prompt is vague ("Analyze" criteria unclear)
- No few-shot examples
- Output format is free text
- **Estimated improvement effect**: Accuracy +10-15%
### 2. generate_response node
- **File**: src/nodes/generator.py
- **Line numbers**: 45-68
- **LLM**: ChatAnthropic(model="claude-3-5-sonnet-20241022")
- **Prompt structure**:
```python
ChatPromptTemplate.from_messages([
("system", "Generate helpful response..."),
("human", "{context}\n\nQuestion: {question}")
])
```
- **Improvement potential**: ⭐⭐⭐ (Medium)
- Prompt is structured but lacks conciseness instructions
- No max_tokens limit → possibility of verbose output
- **Estimated improvement effect**: Latency -0.3-0.5s, Cost -20-30%
````
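For orientation, here is a minimal sketch of what addressing the `analyze_intent` findings could look like in a later phase. The model name is taken from the example above; the classification categories, JSON schema, and few-shot example are assumptions:
```python
# Sketch only: illustrates the kind of fix the analysis points at
# (concrete categories, a few-shot example, JSON output, lower temperature)
from langchain_anthropic import ChatAnthropic
from langchain_core.messages import HumanMessage, SystemMessage

llm = ChatAnthropic(model="claude-3-5-sonnet-20241022", temperature=0.3)

system = SystemMessage(content=(
    "You are an intent classifier. Classify the user input into exactly one "
    "of: question, request, complaint, other.\n"  # categories are assumptions
    'Respond only as JSON: {"intent": "<category>"}.\n'
    'Example: "How do I reset my password?" -> {"intent": "question"}'
))
response = llm.invoke([system, HumanMessage(content="I want a refund")])
```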
### Step 3: Create Optimization Target List
**Purpose**: Organize information to determine improvement priorities
**List Creation Template**:
````markdown
# Optimization Target List
## Node: analyze_intent
### Basic Information
- **File**: src/nodes/analyzer.py:25-45
- **Role**: Classify user input intent
- **LLM Model**: claude-3-5-sonnet-20241022
- **Current Parameters**: temperature=1.0, max_tokens=default
### Current Prompt
```python
SystemMessage(content="You are an intent analyzer. Analyze user input.")
HumanMessage(content=f"Analyze: {user_input}")
```
### Issues
1. **Vague instructions**: Specific criteria for "Analyze" unclear
2. **No few-shot**: No expected output examples
3. **Undefined output format**: Unstructured free text
4. **High temperature**: 1.0 is too high for classification tasks
### Improvement Ideas
1. Specify concrete classification categories
2. Add 3-5 few-shot examples
3. Specify JSON output format
4. Lower temperature to 0.3-0.5
### Estimated Improvement Effect
- **Accuracy**: +10-15% (Current misclassification 20% → 5-10%)
- **Latency**: ±0 (No change)
- **Cost**: ±0 (No change)
### Priority
⭐⭐⭐⭐⭐ (Highest) - Direct impact on accuracy improvement
---
## Node: generate_response
### Basic Information
- **File**: src/nodes/generator.py:45-68
- **Role**: Generate final user-facing response
- **LLM Model**: claude-3-5-sonnet-20241022
- **Current Parameters**: temperature=0.7, max_tokens=default
### Current Prompt
```python
ChatPromptTemplate.from_messages([
("system", "Generate helpful response based on context."),
("human", "{context}\n\nQuestion: {question}")
])
```
### Issues
1. **No verbosity control**: No conciseness instructions
2. **max_tokens not set**: Possibility of unnecessarily long output
3. **Undefined response style**: No tone or style specifications
### Improvement Ideas
1. Add length instructions such as "be concise" or "answer in 2-3 sentences"
2. Limit max_tokens to 500
3. Clarify the response style ("friendly", "professional", etc.)
### Estimated Improvement Effect
- **Accuracy**: ±0 (No change)
- **Latency**: -0.3-0.5s (Due to reduced output tokens)
- **Cost**: -20-30% (Due to reduced token count)
### Priority
⭐⭐⭐ (Medium) - Improvement in latency and cost
````
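Once the list exists, it helps to hold each entry in a structure that can be sorted by priority to drive the order of work in later phases. A minimal sketch; the field names are illustrative, not a prescribed schema:
```python
# Illustrative structure for tracking optimization targets
from dataclasses import dataclass

@dataclass
class OptimizationTarget:
    node: str
    location: str          # "file:start-end"
    priority: int          # 1 (low) .. 5 (highest), mirrors the star rating
    issues: list[str]
    estimated_effect: str

targets = [
    OptimizationTarget(
        node="analyze_intent",
        location="src/nodes/analyzer.py:25-45",
        priority=5,
        issues=["vague instructions", "no few-shot examples",
                "free-text output", "temperature too high"],
        estimated_effect="accuracy +10-15%",
    ),
    OptimizationTarget(
        node="generate_response",
        location="src/nodes/generator.py:45-68",
        priority=3,
        issues=["no verbosity control", "max_tokens unset"],
        estimated_effect="latency -0.3-0.5s, cost -20-30%",
    ),
]

# Work through targets highest-priority first
for target in sorted(targets, key=lambda t: t.priority, reverse=True):
    print(target.node, target.priority)
```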