---
name: Troubleshooting gpt-oss and vLLM Errors
description: Use when diagnosing openai_harmony.HarmonyError or gpt-oss tool calling issues with vLLM. Identifies error sources (vLLM server vs client), maps specific error messages to known GitHub issues, and provides configuration fixes for tool calling problems with gpt-oss models.
---
# Troubleshooting gpt-oss and vLLM Errors
## When to Use This Skill
Invoke this skill when you encounter:
- `openai_harmony.HarmonyError` messages in any context
- gpt-oss tool calling failures or unexpected behavior
- Token parsing errors with vLLM serving gpt-oss models
- Users asking about gpt-oss compatibility with frameworks like llama-stack
## Critical First Step: Identify Error Source
**IMPORTANT**: `openai_harmony.HarmonyError` messages originate from the **vLLM server**, NOT from client applications (like llama-stack, LangChain, etc.).
### Error Source Identification
1. **Check the error origin**:
   - If the error contains `openai_harmony.HarmonyError`, it comes from vLLM's serving layer
   - The client application is just reporting what vLLM returned
   - Do NOT search the client codebase for fixes (see the triage sketch after this list)
2. **Correct investigation path**:
   - Search vLLM GitHub issues and PRs
   - Check the openai/harmony repository for parser issues
   - Review vLLM server configuration and startup flags
   - Examine HuggingFace model files (generation_config.json)
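The triage itself is just a string check; a minimal sketch (the `is_vllm_harmony_error` helper is illustrative, not part of any library):
```python
def is_vllm_harmony_error(error_text: str) -> bool:
    """HarmonyError strings originate in vLLM's serving layer, so their
    presence means the fix lives server-side, not in the client."""
    return "openai_harmony.HarmonyError" in error_text

# Example log line relayed by a client such as llama-stack:
log_line = (
    "openai_harmony.HarmonyError: Unexpected token 12606 "
    "while expecting start token 200006"
)
if is_vllm_harmony_error(log_line):
    print("Investigate the vLLM server, not the client codebase.")
```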
## Common Error Patterns
### Token Mismatch Errors
**Error Pattern**: `Unexpected token X while expecting start token Y`
**Example**: `Unexpected token 12606 while expecting start token 200006`
**Meaning**:
- vLLM expects special Harmony format control tokens
- Model is generating regular text tokens instead
- Token 12606 decodes to "comment", indicating the model is generating reasoning text instead of tool calls
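To confirm what a numeric token ID from the error message actually decodes to, a short sketch, assuming a tiktoken release that ships the `o200k_harmony` (gpt-oss) encoding:
```python
import tiktoken

# o200k_harmony is the gpt-oss tokenizer; requires a recent tiktoken release.
enc = tiktoken.get_encoding("o200k_harmony")

# Decode the IDs from the error message to see what the model actually
# emitted versus what the parser expected.
print(enc.decode([12606]))   # ordinary text token ("comment")
print(enc.decode([200006]))  # the Harmony start control token the parser expected
```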
**Known Issues**:
- vLLM #22519: gpt-oss-20b tool_call token errors
- vLLM #22515: Same error, fixed by updating generation_config.json
**Fixes**:
1. Update model files from HuggingFace (see reference/model-updates.md)
2. Verify vLLM server flags for tool calling
3. Check generation_config.json EOS tokens
### Tool Calling Not Working
**Symptoms**:
- Model describes tools in text but doesn't call them
- Empty `tool_calls=[]` arrays
- Tool responses in wrong format
**Root Causes**:
1. Missing vLLM server flags
2. Outdated model configuration files
3. Configuration mismatch between client and server
**Configuration Requirements**:
The vLLM server must be started with:
```bash
--tool-call-parser openai --enable-auto-tool-choice
```
For the demo tool server:
```bash
--tool-server demo
```
For MCP tool servers:
```bash
--tool-server ip-1:port-1,ip-2:port-2
```
**Important**: Only `tool_choice='auto'` is supported.
## Investigation Workflow
1. **Identify the error message**:
   - Copy the exact error text
   - Note any token IDs mentioned
2. **Search vLLM GitHub**:
   - Use the error text in the issue search
   - Include "gpt-oss" and the model size (20b/120b)
   - Check both open and closed issues (a search sketch follows this list)
3. **Check model configuration**:
   - Verify generation_config.json is current
   - Compare against the latest HuggingFace version
   - Look for recent commits that updated the config
4. **Review server configuration**:
   - Check vLLM startup flags
   - Verify tool-call-parser settings
   - Confirm vLLM version compatibility
5. **Check vLLM version**:
   - Many tool calling issues are resolved in recent vLLM releases
   - Update to the latest version if errors persist
   - Check the vLLM changelog for gpt-oss-specific fixes
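For step 2, a minimal sketch against GitHub's public issue-search API (unauthenticated requests are rate-limited; the `requests` package is assumed):
```python
import requests

error_text = "Unexpected token 12606 while expecting start token 200006"
query = f'repo:vllm-project/vllm gpt-oss "{error_text}"'

resp = requests.get(
    "https://api.github.com/search/issues",
    params={"q": query},
    headers={"Accept": "application/vnd.github+json"},
    timeout=30,
)
resp.raise_for_status()
for item in resp.json()["items"]:
    # Closed issues often contain the fix; check both states.
    print(item["state"], item["html_url"], item["title"])
```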
## Quick Reference
### Key Resources
- vLLM gpt-oss recipe: https://docs.vllm.ai/projects/recipes/en/latest/OpenAI/GPT-OSS.html
- Common issues: See reference/known-issues.md
- Model update procedure: See reference/model-updates.md
### Diagnostic Commands
Check vLLM server health:
```bash
curl http://localhost:8000/health
```
List available models:
```bash
curl http://localhost:8000/v1/models
```
Check vLLM version:
```bash
pip show vllm
```
## Progressive Disclosure
For detailed information:
- **Known GitHub issues**: See reference/known-issues.md
- **Model file updates**: See reference/model-updates.md
- **Tool calling configuration**: See reference/tool-calling-setup.md
## Validation Steps
After implementing fixes:
1. Test simple tool calling with a single function (a smoke test follows this list)
2. Verify Harmony format tokens in responses
3. Check for token mismatch errors in logs
4. Test multi-turn conversations with tools
5. Monitor for "unexpected token" errors
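A minimal smoke test for step 1, assuming a local vLLM server on port 8000 and the `openai` Python package (the `get_weather` tool is illustrative):
```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="openai/gpt-oss-20b",
    messages=[{"role": "user", "content": "What is the weather in Paris?"}],
    tools=[{
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the weather for a location",
            "parameters": {
                "type": "object",
                "properties": {"location": {"type": "string"}},
                "required": ["location"],
            },
        },
    }],
    tool_choice="auto",
)

message = response.choices[0].message
# A healthy setup returns a populated tool_calls array, not a prose answer.
assert message.tool_calls, f"No tool calls; model said: {message.content!r}"
print("Tool call OK:", message.tool_calls[0].function.name)
```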
If errors persist:
- Update vLLM to latest version
- Check vLLM GitHub for recent fixes and PRs
- Try different model variant (120b vs 20b)
- Review vLLM logs for additional error context

# Known GitHub Issues for gpt-oss and vLLM
## Active Issues
### vLLM Repository
#### Issue #22519: Token Error with gpt-oss-20b Tool Calls
- **URL**: https://github.com/vllm-project/vllm/issues/22519
- **Error**: `Unexpected token 12606 while expecting start token 200006`
- **Status**: Open, To Triage
- **Model**: gpt-oss-20b
- **Symptoms**:
- Error occurs after model returns token 200012
- Token 12606 = "comment"
- Hypothesis: Model incorrectly splitting "commentary" into "comment" + "ary"
- **Workaround**: None currently documented
#### Issue #22515: Same Error, Fixed by Config Update
- **URL**: https://github.com/vllm-project/vllm/issues/22515
- **Error**: Same token parsing error
- **Status**: Open
- **Fix**: Update generation_config.json from HuggingFace
- Specific commit: 8b193b0ef83bd41b40eb71fee8f1432315e02a3e
- User andresC98 confirmed this resolved the issue
- **Version**: Reported in vLLM v0.10.2
#### Issue #22578: gpt-oss-120b Tool Call Support
- **URL**: https://github.com/vllm-project/vllm/issues/22578
- **Error**: Chat Completions endpoint tool_call not working
- **Status**: Open
- **Model**: gpt-oss-120b
- **Symptoms**: Tool calling doesn't work correctly via /v1/chat/completions
#### Issue #22337: Empty Tool Calls Array
- **URL**: https://github.com/vllm-project/vllm/issues/22337
- **Error**: tool_calls returning empty arrays
- **Status**: Open
- **Model**: gpt-oss-120b
- **Symptoms**: Content appears in wrong format, tool_calls=[]
#### Issue #23567: Unexpected Tokens in Message Header
- **URL**: https://github.com/vllm-project/vllm/issues/23567
- **Error**: `openai_harmony.HarmonyError: unexpected tokens remaining in message header`
- **Status**: Open
- **Symptoms**: Occurs in multi-turn conversations with gpt-oss-120b
- **Version**: vLLM v0.10.1 and v0.10.1.1
#### PR #24787: Tool Call Turn Tracking
- **URL**: https://github.com/vllm-project/vllm/pull/24787
- **Title**: Pass toolcall turn to kv cache manager
- **Status**: Merged (September 2025)
- **Description**: Adds toolcall_turn parameter for tracking turns in tool-calling conversations
- **Impact**: Enables better prefix cache statistics for tool calling
### HuggingFace Discussions
#### gpt-oss-20b Discussion #80: Tool Calling Configuration
- **URL**: https://huggingface.co/openai/gpt-oss-20b/discussions/80
- **Summary**: Community discussion about tool calling best practices
- **Key Findings**:
- Explicit tool listing in system prompt improves results
- Better results with tool_choice='required' or 'auto'
- Avoid requiring JSON response format
- Configuration and prompt engineering significantly impact tool calling behavior
#### gpt-oss-120b Discussion #69: Chat Template Spec Errors
- **URL**: https://huggingface.co/openai/gpt-oss-120b/discussions/69
- **Summary**: Errors in chat template compared to spec
- **Impact**: May affect tool calling format
### openai/harmony Repository
#### Issue #33: EOS Error While Waiting for Message Header
- **URL**: https://github.com/openai/harmony/issues/33
- **Error**: `HarmonyError: Unexpected EOS while waiting for message header to complete`
- **Status**: Open
- **Context**: Core Harmony parser issue affecting message parsing
## Error Pattern Summary
### Token Mismatch Errors
- **Pattern**: `Unexpected token X while expecting start token Y`
- **Root Cause**: Model generating text tokens instead of Harmony control tokens
- **Common Triggers**: Tool calling, multi-turn conversations
- **Primary Fix**: Update generation_config.json
### Streaming Errors
- **Pattern**: Parse failures during streaming responses
- **Root Cause**: Incompatibility between request format and vLLM token generation
- **Affected**: Both 20b and 120b models
### Tool Calling Failures
- **Pattern**: Empty tool_calls arrays or text descriptions instead of calls
- **Root Cause**: Configuration issues or outdated model files
- **Primary Fix**: Correct vLLM flags and update generation_config.json
## Version Compatibility
### vLLM Versions
- **v0.10.2**: Multiple token parsing errors reported
- **v0.10.1/v0.10.1.1**: Multi-turn conversation errors
- **Latest**: Check for fixes in newer releases
### Recommended Actions by Version
- **Pre-v0.11**: Update to latest, refresh model files
- **v0.11+**: Verify tool calling flags are set correctly
## Cross-References
- Model file updates: See model-updates.md
- Tool calling configuration: See tool-calling-setup.md

# Updating gpt-oss Model Files
## Why Update Model Files?
The `openai_harmony.HarmonyError: Unexpected token` errors are often caused by outdated `generation_config.json` files. HuggingFace updates these files to fix token parsing issues.
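One way to spot drift is to diff the local file against the latest Hub version; a sketch assuming the `huggingface_hub` package (replace the local path with the one your vLLM server actually loads):
```python
import json
from pathlib import Path

from huggingface_hub import hf_hub_download

# Download the current generation_config.json from the Hub (cached locally).
latest_path = hf_hub_download("openai/gpt-oss-20b", "generation_config.json")
latest = json.loads(Path(latest_path).read_text())

# Path to the copy your vLLM server is loading (adjust as needed).
local = json.loads(Path("/path/to/model/generation_config.json").read_text())

for key in sorted(set(latest) | set(local)):
    if latest.get(key) != local.get(key):
        print(f"{key}: local={local.get(key)!r} hub={latest.get(key)!r}")
```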
## Current Configuration Files
### gpt-oss-20b generation_config.json
Latest version includes:
```json
{
  "bos_token_id": 199998,
  "do_sample": true,
  "eos_token_id": [
    200002,
    199999,
    200012
  ],
  "pad_token_id": 199999,
  "transformers_version": "4.55.0.dev0"
}
```
**Key elements**:
- **eos_token_id**: Multiple EOS tokens including 200012 (tool call completion)
- **do_sample**: Enabled for generation diversity
- **transformers_version**: Indicates compatible transformers version
### gpt-oss-120b Critical Commit
**Commit**: 8b193b0ef83bd41b40eb71fee8f1432315e02a3e
- Fixed generation_config.json
- Confirmed to resolve token parsing errors by user andresC98
- Applied to gpt-oss-120b model
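To fetch the file at exactly that commit, `huggingface_hub` accepts a `revision` argument; a short sketch using the hash above:
```python
from huggingface_hub import hf_hub_download

# Pin the download to the commit that fixed generation_config.json.
path = hf_hub_download(
    repo_id="openai/gpt-oss-120b",
    filename="generation_config.json",
    revision="8b193b0ef83bd41b40eb71fee8f1432315e02a3e",
)
print("Fixed config downloaded to:", path)
```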
## How to Update Model Files
### Method 1: Re-download with HuggingFace CLI
```bash
# Install or update huggingface-hub
pip install --upgrade huggingface-hub
# For gpt-oss-20b
huggingface-cli download openai/gpt-oss-20b --local-dir ./gpt-oss-20b
# For gpt-oss-120b
huggingface-cli download openai/gpt-oss-120b --local-dir ./gpt-oss-120b
```
### Method 2: Manual Update via Web
1. Visit HuggingFace model page:
- gpt-oss-20b: https://huggingface.co/openai/gpt-oss-20b
- gpt-oss-120b: https://huggingface.co/openai/gpt-oss-120b
2. Navigate to "Files and versions" tab
3. Download latest `generation_config.json`
4. Replace in your local model directory:
```bash
# Find your model directory (varies by vLLM installation)
# Common locations:
# ~/.cache/huggingface/hub/models--openai--gpt-oss-20b/
# ./models/gpt-oss-20b/
# Replace the file
cp ~/Downloads/generation_config.json /path/to/model/directory/
```
### Method 3: Update with git (if model was cloned)
```bash
cd /path/to/model/directory
git pull origin main
```
## Verification Steps
After updating:
1. **Check file contents**:
```bash
cat generation_config.json
```
Verify it matches the current version shown above.
2. **Check modification date**:
```bash
ls -l generation_config.json
```
Should be recent (after the commit date).
3. **Restart vLLM server**:
```bash
# Stop existing server
# Start with correct flags (see tool-calling-setup.md)
vllm serve openai/gpt-oss-20b \
--tool-call-parser openai \
--enable-auto-tool-choice
```
4. **Test tool calling**:
```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1")

response = client.chat.completions.create(
    model="openai/gpt-oss-20b",
    messages=[{"role": "user", "content": "What's the weather?"}],
    tools=[{
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the weather",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {"type": "string"}
                }
            }
        }
    }]
)
print(response)
```
## Troubleshooting Update Issues
### vLLM Not Picking Up Changes
**Symptom**: Updated files but still getting errors
**Solutions**:
1. Clear vLLM cache:
```bash
rm -rf ~/.cache/vllm/
```
2. Restart vLLM with fresh model load:
```bash
# Use --download-dir to force specific directory
vllm serve openai/gpt-oss-20b \
--download-dir /path/to/models \
--tool-call-parser openai \
--enable-auto-tool-choice
```
3. Check vLLM is loading the correct model directory:
- Look for model path in vLLM startup logs
- Verify it matches where you updated files
### File Permission Issues
```bash
# Ensure files are readable
chmod 644 generation_config.json
# Check ownership
ls -l generation_config.json
```
### Multiple Model Copies
**Problem**: vLLM might be loading from a different location
**Solution**:
1. Find all copies (see also the Python sketch after this list):
```bash
find ~/.cache -name "generation_config.json" -path "*/gpt-oss*"
```
2. Update all copies or remove duplicates
3. Use explicit `--download-dir` flag when starting vLLM
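As a complement to `find`, a sketch that enumerates the HuggingFace cache programmatically, assuming the `huggingface_hub` package:
```python
from huggingface_hub import scan_cache_dir

# Enumerate every cached gpt-oss snapshot so you know which copy of the
# model files vLLM may actually be loading.
for repo in scan_cache_dir().repos:
    if "gpt-oss" in repo.repo_id:
        print(repo.repo_id)
        for revision in repo.revisions:
            print("  ", revision.commit_hash[:12], revision.snapshot_path)
```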
## Additional Files to Check
While `generation_config.json` is the primary fix, also verify these files are current:
### config.json
Contains model architecture configuration
### tokenizer_config.json
Token encoding settings, including special tokens
### special_tokens_map.json
Maps special token strings to IDs
**To update all**:
```bash
huggingface-cli download openai/gpt-oss-20b \
--local-dir ./gpt-oss-20b \
--force-download
```
## When to Update
Update model files when:
- Encountering token parsing errors
- HuggingFace shows recent commits to model repo
- vLLM error messages reference token IDs
- After vLLM version upgrades
- Community reports fixes via file updates
## Cross-References
- Known issues: See known-issues.md
- vLLM configuration: See tool-calling-setup.md

# Tool Calling Configuration for gpt-oss with vLLM
## Required vLLM Server Flags
For gpt-oss tool calling to work, vLLM must be started with specific flags.
### Minimal Configuration
```bash
vllm serve openai/gpt-oss-20b \
--tool-call-parser openai \
--enable-auto-tool-choice
```
### Full Configuration with Tool Server
```bash
vllm serve openai/gpt-oss-20b \
--tool-call-parser openai \
--enable-auto-tool-choice \
--tool-server demo \
--max-model-len 8192 \
--dtype auto
```
## Flag Explanations
### --tool-call-parser openai
- **Required**: Yes
- **Purpose**: Uses OpenAI-compatible tool calling format
- **Effect**: Enables proper parsing of tool call tokens
- **Alternatives**: None for gpt-oss compatibility
### --enable-auto-tool-choice
- **Required**: Yes
- **Purpose**: Allows automatic tool selection
- **Effect**: Model can choose which tool to call
- **Note**: Only `tool_choice='auto'` is supported
### --tool-server
- **Required**: No; needed only for the built-in demo tools or MCP tool servers
- **Options**:
- `demo`: Built-in demo tools (browser, Python interpreter)
- `ip:port`: Custom MCP tool server
- Multiple servers: `ip1:port1,ip2:port2`
### --max-model-len
- **Required**: No
- **Purpose**: Sets maximum context length
- **Recommended**: 8192 or higher for tool calling contexts
- **Effect**: Prevents truncation during multi-turn tool conversations
## Tool Server Options
### Demo Tool Server
Requires gpt-oss library:
```bash
pip install gpt-oss
```
Provides:
- Web browser tool
- Python interpreter tool
Start command:
```bash
vllm serve openai/gpt-oss-20b \
--tool-call-parser openai \
--enable-auto-tool-choice \
--tool-server demo
```
### MCP Tool Servers
Start vLLM with MCP server URLs:
```bash
vllm serve openai/gpt-oss-20b \
--tool-call-parser openai \
--enable-auto-tool-choice \
--tool-server localhost:5000,localhost:5001
```
Requirements:
- MCP servers must be running before vLLM starts
- Must implement MCP protocol
- Should return results in expected format
### No Tool Server (Client-Managed Tools)
For client-side tool management (e.g., llama-stack with MCP):
```bash
vllm serve openai/gpt-oss-20b \
--tool-call-parser openai \
--enable-auto-tool-choice
```
Tools are provided in the API request, not in the server configuration.
## Environment Variables
### Search Tools
For demo tool server with search:
```bash
export EXA_API_KEY="your-exa-api-key"
```
### Python Execution
To enable Python execution in the demo tool server:
```bash
export PYTHON_EXECUTION_BACKEND=dangerously_use_uv
```
**Warning**: Demo Python execution is for testing only.
## Client Configuration
### OpenAI-Compatible Clients
```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed"  # vLLM doesn't require auth by default
)

response = client.chat.completions.create(
    model="openai/gpt-oss-20b",
    messages=[{"role": "user", "content": "What's 2+2?"}],
    tools=[{
        "type": "function",
        "function": {
            "name": "calculator",
            "description": "Perform calculations",
            "parameters": {
                "type": "object",
                "properties": {
                    "expression": {"type": "string"}
                },
                "required": ["expression"]
            }
        }
    }],
    tool_choice="auto"  # MUST be 'auto' - the only supported value
)
```
### llama-stack Configuration
Example run.yaml for llama-stack with vLLM:
```yaml
inference:
  - provider_id: vllm-provider
    provider_type: remote::vllm
    config:
      url: http://localhost:8000/v1
      # No auth_credential needed if vLLM has no auth
tool_runtime:
  - provider_id: mcp-provider
    provider_type: mcp
    config:
      servers:
        - server_name: my-tools
          url: http://localhost:5000
```
## Common Configuration Issues
### Issue: tool_choice Not Working
**Symptom**: Error about unsupported tool_choice value
**Solution**: Use only `tool_choice="auto"`; other values are not supported:
```python
# GOOD
tool_choice="auto"
# BAD - will fail
tool_choice="required"
tool_choice={"type": "function", "function": {"name": "my_func"}}
```
### Issue: Tools Not Being Called
**Symptoms**:
- Model describes tool usage in text
- No tool_calls in response
- Empty tool_calls array
**Checklist**:
1. Verify `--tool-call-parser openai` flag is set
2. Verify `--enable-auto-tool-choice` flag is set
3. Check generation_config.json is up to date (see model-updates.md)
4. Try simpler tool schemas first
5. Check vLLM logs for parsing errors
### Issue: Token Parsing Errors
**Error**: `openai_harmony.HarmonyError: Unexpected token X`
**Solutions**:
1. Update model files (see model-updates.md)
2. Verify vLLM version is recent
3. Check vLLM startup logs for warnings
4. Restart vLLM server after any config changes
## Performance Tuning
### GPU Memory
```bash
vllm serve openai/gpt-oss-20b \
--tool-call-parser openai \
--enable-auto-tool-choice \
--gpu-memory-utilization 0.9
```
### Tensor Parallelism
For multi-GPU:
```bash
vllm serve openai/gpt-oss-20b \
--tool-call-parser openai \
--enable-auto-tool-choice \
--tensor-parallel-size 2
```
### Batching
```bash
vllm serve openai/gpt-oss-20b \
--tool-call-parser openai \
--enable-auto-tool-choice \
--max-num-batched-tokens 8192 \
--max-num-seqs 256
```
## Testing Your Configuration
### Basic Test
```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "openai/gpt-oss-20b",
    "messages": [{"role": "user", "content": "Calculate 15 * 7"}],
    "tools": [{
      "type": "function",
      "function": {
        "name": "calculator",
        "description": "Perform math",
        "parameters": {
          "type": "object",
          "properties": {"expr": {"type": "string"}}
        }
      }
    }],
    "tool_choice": "auto"
  }'
```
### Expected Response
A successful response contains a `tool_calls` array with the function call:
```json
{
  "choices": [{
    "message": {
      "role": "assistant",
      "content": null,
      "tool_calls": [{
        "id": "call_123",
        "type": "function",
        "function": {
          "name": "calculator",
          "arguments": "{\"expr\": \"15 * 7\"}"
        }
      }]
    }
  }]
}
```
### Failure Indicators
- `content` field contains text describing the calculation instead of null
- `tool_calls` is empty or null
- Error in the response about tool parsing
- HarmonyError in the vLLM logs
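The first two indicators can be checked mechanically; a small illustrative sketch over a raw `/v1/chat/completions` response dict:
```python
def check_tool_call_response(response: dict) -> list[str]:
    """Return the failure indicators present in a /v1/chat/completions
    response dict (an empty list means the tool call looks healthy)."""
    problems = []
    message = response["choices"][0]["message"]
    if not message.get("tool_calls"):
        problems.append("tool_calls is empty or null")
    if message.get("content"):
        problems.append("content holds prose instead of null")
    return problems

# Example: a failing response where the model answered in text.
bad = {"choices": [{"message": {
    "role": "assistant", "content": "15 * 7 = 105", "tool_calls": None,
}}]}
print(check_tool_call_response(bad))
```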
## Cross-References
- Model file updates: See model-updates.md
- Known issues: See known-issues.md