---
name: Troubleshooting gpt-oss and vLLM Errors
description: Use when diagnosing openai_harmony.HarmonyError or gpt-oss tool calling issues with vLLM. Identifies error sources (vLLM server vs client), maps specific error messages to known GitHub issues, and provides configuration fixes for tool calling problems with gpt-oss models.
---
# Troubleshooting gpt-oss and vLLM Errors
## When to Use This Skill
Invoke this skill when you encounter:
- `openai_harmony.HarmonyError` messages in any context
- gpt-oss tool calling failures or unexpected behavior
- Token parsing errors with vLLM serving gpt-oss models
- Users asking about gpt-oss compatibility with frameworks like llama-stack
## Critical First Step: Identify Error Source
**IMPORTANT**: `openai_harmony.HarmonyError` messages originate from the **vLLM server**, NOT from client applications (like llama-stack, LangChain, etc.).
### Error Source Identification
1. **Check the error origin**:
   - If the error contains `openai_harmony.HarmonyError`, it comes from vLLM's serving layer
   - The client application is just reporting what vLLM returned
   - Do NOT search the client codebase for fixes (see the triage sketch after this list)
2. **Correct investigation path**:
   - Search vLLM GitHub issues and PRs
   - Check the openai/harmony repository for parser issues
   - Review vLLM server configuration and startup flags
   - Examine HuggingFace model files (generation_config.json)
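The triage itself is just a string check; a minimal sketch (the `is_vllm_harmony_error` helper is illustrative, not part of any library):
```python
def is_vllm_harmony_error(error_text: str) -> bool:
    """HarmonyError strings originate in vLLM's serving layer, so their
    presence means the fix lives server-side, not in the client."""
    return "openai_harmony.HarmonyError" in error_text

# Example log line relayed by a client such as llama-stack:
log_line = (
    "openai_harmony.HarmonyError: Unexpected token 12606 "
    "while expecting start token 200006"
)
if is_vllm_harmony_error(log_line):
    print("Investigate the vLLM server, not the client codebase.")
```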
## Common Error Patterns
### Token Mismatch Errors
**Error Pattern**: `Unexpected token X while expecting start token Y`
**Example**: `Unexpected token 12606 while expecting start token 200006`
**Meaning**:
- vLLM expects special Harmony format control tokens
- Model is generating regular text tokens instead
- Token 12606 decodes to "comment", indicating the model is generating reasoning text instead of tool calls
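To confirm what a numeric token ID from the error message actually decodes to, a short sketch, assuming a tiktoken release that ships the `o200k_harmony` (gpt-oss) encoding:
```python
import tiktoken

# o200k_harmony is the gpt-oss tokenizer; requires a recent tiktoken release.
enc = tiktoken.get_encoding("o200k_harmony")

# Decode the IDs from the error message to see what the model actually
# emitted versus what the parser expected.
print(enc.decode([12606]))   # ordinary text token ("comment")
print(enc.decode([200006]))  # the Harmony start control token the parser expected
```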
**Known Issues**:
- vLLM #22519: gpt-oss-20b tool_call token errors
- vLLM #22515: Same error, fixed by updating generation_config.json
**Fixes**:
1. Update model files from HuggingFace (see reference/model-updates.md)
2. Verify vLLM server flags for tool calling
3. Check generation_config.json EOS tokens
### Tool Calling Not Working
**Symptoms**:
- Model describes tools in text but doesn't call them
- Empty `tool_calls=[]` arrays
- Tool responses in wrong format
**Root Causes**:
1. Missing vLLM server flags
2. Outdated model configuration files
3. Configuration mismatch between client and server
**Configuration Requirements**:
The vLLM server must be started with:
```bash
--tool-call-parser openai --enable-auto-tool-choice
```
For the demo tool server:
```bash
--tool-server demo
```
For MCP tool servers:
```bash
--tool-server ip-1:port-1,ip-2:port-2
```
**Important**: Only `tool_choice='auto'` is supported.
## Investigation Workflow
1. **Identify the error message**:
   - Copy the exact error text
   - Note any token IDs mentioned
2. **Search vLLM GitHub**:
   - Use the error text in the issue search
   - Include "gpt-oss" and the model size (20b/120b)
   - Check both open and closed issues (a search sketch follows this list)
3. **Check model configuration**:
   - Verify generation_config.json is current
   - Compare against the latest HuggingFace version
   - Look for recent commits that updated the config
4. **Review server configuration**:
   - Check vLLM startup flags
   - Verify tool-call-parser settings
   - Confirm vLLM version compatibility
5. **Check vLLM version**:
   - Many tool calling issues are resolved in recent vLLM releases
   - Update to the latest version if errors persist
   - Check the vLLM changelog for gpt-oss-specific fixes
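For step 2, a minimal sketch against GitHub's public issue-search API (unauthenticated requests are rate-limited; the `requests` package is assumed):
```python
import requests

error_text = "Unexpected token 12606 while expecting start token 200006"
query = f'repo:vllm-project/vllm gpt-oss "{error_text}"'

resp = requests.get(
    "https://api.github.com/search/issues",
    params={"q": query},
    headers={"Accept": "application/vnd.github+json"},
    timeout=30,
)
resp.raise_for_status()
for item in resp.json()["items"]:
    # Closed issues often contain the fix; check both states.
    print(item["state"], item["html_url"], item["title"])
```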
## Quick Reference
### Key Resources
- vLLM gpt-oss recipe: https://docs.vllm.ai/projects/recipes/en/latest/OpenAI/GPT-OSS.html
- Common issues: See reference/known-issues.md
- Model update procedure: See reference/model-updates.md
### Diagnostic Commands
Check vLLM server health:
```bash
curl http://localhost:8000/health
```
List available models:
```bash
curl http://localhost:8000/v1/models
```
Check vLLM version:
```bash
pip show vllm
```
## Progressive Disclosure
For detailed information:
- **Known GitHub issues**: See reference/known-issues.md
- **Model file updates**: See reference/model-updates.md
- **Tool calling configuration**: See reference/tool-calling-setup.md
## Validation Steps
After implementing fixes:
1. Test simple tool calling with a single function (a smoke test follows this list)
2. Verify Harmony format tokens in responses
3. Check for token mismatch errors in logs
4. Test multi-turn conversations with tools
5. Monitor for "unexpected token" errors
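A minimal smoke test for step 1, assuming a local vLLM server on port 8000 and the `openai` Python package (the `get_weather` tool is illustrative):
```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="openai/gpt-oss-20b",
    messages=[{"role": "user", "content": "What is the weather in Paris?"}],
    tools=[{
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the weather for a location",
            "parameters": {
                "type": "object",
                "properties": {"location": {"type": "string"}},
                "required": ["location"],
            },
        },
    }],
    tool_choice="auto",
)

message = response.choices[0].message
# A healthy setup returns a populated tool_calls array, not a prose answer.
assert message.tool_calls, f"No tool calls; model said: {message.content!r}"
print("Tool call OK:", message.tool_calls[0].function.name)
```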
If errors persist:
- Update vLLM to latest version
- Check vLLM GitHub for recent fixes and PRs
- Try different model variant (120b vs 20b)
- Review vLLM logs for additional error context

# Known GitHub Issues for gpt-oss and vLLM
## Active Issues
### vLLM Repository
#### Issue #22519: Token Error with gpt-oss-20b Tool Calls
- **URL**: https://github.com/vllm-project/vllm/issues/22519
- **Error**: `Unexpected token 12606 while expecting start token 200006`
- **Status**: Open, To Triage
- **Model**: gpt-oss-20b
- **Symptoms**:
- Error occurs after model returns token 200012
- Token 12606 = "comment"
- Hypothesis: Model incorrectly splitting "commentary" into "comment" + "ary"
- **Workaround**: None currently documented
#### Issue #22515: Same Error, Fixed by Config Update
- **URL**: https://github.com/vllm-project/vllm/issues/22515
- **Error**: Same token parsing error
- **Status**: Open
- **Fix**: Update generation_config.json from HuggingFace
- Specific commit: 8b193b0ef83bd41b40eb71fee8f1432315e02a3e
- User andresC98 confirmed this resolved the issue
- **Version**: Reported in vLLM v0.10.2
#### Issue #22578: gpt-oss-120b Tool Call Support
- **URL**: https://github.com/vllm-project/vllm/issues/22578
- **Error**: Chat Completions endpoint tool_call not working
- **Status**: Open
- **Model**: gpt-oss-120b
- **Symptoms**: Tool calling doesn't work correctly via /v1/chat/completions
#### Issue #22337: Empty Tool Calls Array
- **URL**: https://github.com/vllm-project/vllm/issues/22337
- **Error**: tool_calls returning empty arrays
- **Status**: Open
- **Model**: gpt-oss-120b
- **Symptoms**: Content appears in wrong format, tool_calls=[]
#### Issue #23567: Unexpected Tokens in Message Header
- **URL**: https://github.com/vllm-project/vllm/issues/23567
- **Error**: `openai_harmony.HarmonyError: unexpected tokens remaining in message header`
- **Status**: Open
- **Symptoms**: Occurs in multi-turn conversations with gpt-oss-120b
- **Version**: vLLM v0.10.1 and v0.10.1.1
#### PR #24787: Tool Call Turn Tracking
- **URL**: https://github.com/vllm-project/vllm/pull/24787
- **Title**: Pass toolcall turn to kv cache manager
- **Status**: Merged (September 2025)
- **Description**: Adds toolcall_turn parameter for tracking turns in tool-calling conversations
- **Impact**: Enables better prefix cache statistics for tool calling
### HuggingFace Discussions
#### gpt-oss-20b Discussion #80: Tool Calling Configuration
- **URL**: https://huggingface.co/openai/gpt-oss-20b/discussions/80
- **Summary**: Community discussion about tool calling best practices
- **Key Findings**:
- Explicit tool listing in system prompt improves results
- Better results with tool_choice='required' or 'auto'
- Avoid requiring JSON response format
- Configuration and prompt engineering significantly impact tool calling behavior
#### gpt-oss-120b Discussion #69: Chat Template Spec Errors
- **URL**: https://huggingface.co/openai/gpt-oss-120b/discussions/69
- **Summary**: Errors in chat template compared to spec
- **Impact**: May affect tool calling format
### openai/harmony Repository
#### Issue #33: EOS Error While Waiting for Message Header
- **URL**: https://github.com/openai/harmony/issues/33
- **Error**: `HarmonyError: Unexpected EOS while waiting for message header to complete`
- **Status**: Open
- **Context**: Core Harmony parser issue affecting message parsing
## Error Pattern Summary
### Token Mismatch Errors
- **Pattern**: `Unexpected token X while expecting start token Y`
- **Root Cause**: Model generating text tokens instead of Harmony control tokens
- **Common Triggers**: Tool calling, multi-turn conversations
- **Primary Fix**: Update generation_config.json
### Streaming Errors
- **Pattern**: Parse failures during streaming responses
- **Root Cause**: Incompatibility between request format and vLLM token generation
- **Affected**: Both 20b and 120b models
### Tool Calling Failures
- **Pattern**: Empty tool_calls arrays or text descriptions instead of calls
- **Root Cause**: Configuration issues or outdated model files
- **Primary Fix**: Correct vLLM flags and update generation_config.json
## Version Compatibility
### vLLM Versions
- **v0.10.2**: Multiple token parsing errors reported
- **v0.10.1/v0.10.1.1**: Multi-turn conversation errors
- **Latest**: Check for fixes in newer releases
### Recommended Actions by Version
- **Pre-v0.11**: Update to latest, refresh model files
- **v0.11+**: Verify tool calling flags are set correctly
## Cross-References
- Model file updates: See model-updates.md
- Tool calling configuration: See tool-calling-setup.md

# Updating gpt-oss Model Files
## Why Update Model Files?
The `openai_harmony.HarmonyError: Unexpected token` errors are often caused by outdated `generation_config.json` files. HuggingFace updates these files to fix token parsing issues.
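One way to spot drift is to diff the local file against the latest Hub version; a sketch assuming the `huggingface_hub` package (replace the local path with the one your vLLM server actually loads):
```python
import json
from pathlib import Path

from huggingface_hub import hf_hub_download

# Download the current generation_config.json from the Hub (cached locally).
latest_path = hf_hub_download("openai/gpt-oss-20b", "generation_config.json")
latest = json.loads(Path(latest_path).read_text())

# Path to the copy your vLLM server is loading (adjust as needed).
local = json.loads(Path("/path/to/model/generation_config.json").read_text())

for key in sorted(set(latest) | set(local)):
    if latest.get(key) != local.get(key):
        print(f"{key}: local={local.get(key)!r} hub={latest.get(key)!r}")
```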
## Current Configuration Files
### gpt-oss-20b generation_config.json
Latest version includes:
```json
{
  "bos_token_id": 199998,
  "do_sample": true,
  "eos_token_id": [
    200002,
    199999,
    200012
  ],
  "pad_token_id": 199999,
  "transformers_version": "4.55.0.dev0"
}
```
**Key elements**:
- **eos_token_id**: Multiple EOS tokens including 200012 (tool call completion)
- **do_sample**: Enabled for generation diversity
- **transformers_version**: Indicates compatible transformers version
### gpt-oss-120b Critical Commit
**Commit**: 8b193b0ef83bd41b40eb71fee8f1432315e02a3e
- Fixed generation_config.json
- Confirmed to resolve token parsing errors by user andresC98
- Applied to gpt-oss-120b model
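To fetch the file at exactly that commit, `huggingface_hub` accepts a `revision` argument; a short sketch using the hash above:
```python
from huggingface_hub import hf_hub_download

# Pin the download to the commit that fixed generation_config.json.
path = hf_hub_download(
    repo_id="openai/gpt-oss-120b",
    filename="generation_config.json",
    revision="8b193b0ef83bd41b40eb71fee8f1432315e02a3e",
)
print("Fixed config downloaded to:", path)
```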
## How to Update Model Files
### Method 1: Re-download with HuggingFace CLI
```bash
# Install or update huggingface-hub
pip install --upgrade huggingface-hub
# For gpt-oss-20b
huggingface-cli download openai/gpt-oss-20b --local-dir ./gpt-oss-20b
# For gpt-oss-120b
huggingface-cli download openai/gpt-oss-120b --local-dir ./gpt-oss-120b
```
### Method 2: Manual Update via Web
1. Visit HuggingFace model page:
- gpt-oss-20b: https://huggingface.co/openai/gpt-oss-20b
- gpt-oss-120b: https://huggingface.co/openai/gpt-oss-120b
2. Navigate to "Files and versions" tab
3. Download latest `generation_config.json`
4. Replace in your local model directory:
```bash
# Find your model directory (varies by vLLM installation)
# Common locations:
# ~/.cache/huggingface/hub/models--openai--gpt-oss-20b/
# ./models/gpt-oss-20b/
# Replace the file
cp ~/Downloads/generation_config.json /path/to/model/directory/
```
### Method 3: Update with git (if model was cloned)
```bash
cd /path/to/model/directory
git pull origin main
```
## Verification Steps
After updating:
1. **Check file contents**:
```bash
cat generation_config.json
```
Verify it matches the current version shown above.
2. **Check modification date**:
```bash
ls -l generation_config.json
```
Should be recent (after the commit date).
3. **Restart vLLM server**:
```bash
# Stop existing server
# Start with correct flags (see tool-calling-setup.md)
vllm serve openai/gpt-oss-20b \
--tool-call-parser openai \
--enable-auto-tool-choice
```
4. **Test tool calling**:
```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1")

response = client.chat.completions.create(
    model="openai/gpt-oss-20b",
    messages=[{"role": "user", "content": "What's the weather?"}],
    tools=[{
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the weather",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {"type": "string"}
                }
            }
        }
    }]
)
print(response)
```
## Troubleshooting Update Issues
### vLLM Not Picking Up Changes
**Symptom**: Updated files but still getting errors
**Solutions**:
1. Clear vLLM cache:
```bash
rm -rf ~/.cache/vllm/
```
2. Restart vLLM with fresh model load:
```bash
# Use --download-dir to force specific directory
vllm serve openai/gpt-oss-20b \
--download-dir /path/to/models \
--tool-call-parser openai \
--enable-auto-tool-choice
```
3. Check vLLM is loading the correct model directory:
- Look for model path in vLLM startup logs
- Verify it matches where you updated files
### File Permission Issues
```bash
# Ensure files are readable
chmod 644 generation_config.json
# Check ownership
ls -l generation_config.json
```
### Multiple Model Copies
**Problem**: vLLM might be loading from a different location
**Solution**:
1. Find all copies (see also the Python sketch after this list):
```bash
find ~/.cache -name "generation_config.json" -path "*/gpt-oss*"
```
2. Update all copies or remove duplicates
3. Use explicit `--download-dir` flag when starting vLLM
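As a complement to `find`, a sketch that enumerates the HuggingFace cache programmatically, assuming the `huggingface_hub` package:
```python
from huggingface_hub import scan_cache_dir

# Enumerate every cached gpt-oss snapshot so you know which copy of the
# model files vLLM may actually be loading.
for repo in scan_cache_dir().repos:
    if "gpt-oss" in repo.repo_id:
        print(repo.repo_id)
        for revision in repo.revisions:
            print("  ", revision.commit_hash[:12], revision.snapshot_path)
```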
## Additional Files to Check
While `generation_config.json` is the primary fix, also verify these files are current:
### config.json
Contains model architecture configuration
### tokenizer_config.json
Token encoding settings, including special tokens
### special_tokens_map.json
Maps special token strings to IDs
**To update all**:
```bash
huggingface-cli download openai/gpt-oss-20b \
--local-dir ./gpt-oss-20b \
--force-download
```
## When to Update
Update model files when:
- Encountering token parsing errors
- HuggingFace shows recent commits to model repo
- vLLM error messages reference token IDs
- After vLLM version upgrades
- Community reports fixes via file updates
## Cross-References
- Known issues: See known-issues.md
- vLLM configuration: See tool-calling-setup.md

# Tool Calling Configuration for gpt-oss with vLLM
## Required vLLM Server Flags
For gpt-oss tool calling to work, vLLM must be started with specific flags.
### Minimal Configuration
```bash
vllm serve openai/gpt-oss-20b \
--tool-call-parser openai \
--enable-auto-tool-choice
```
### Full Configuration with Tool Server
```bash
vllm serve openai/gpt-oss-20b \
--tool-call-parser openai \
--enable-auto-tool-choice \
--tool-server demo \
--max-model-len 8192 \
--dtype auto
```
## Flag Explanations
### --tool-call-parser openai
- **Required**: Yes
- **Purpose**: Uses OpenAI-compatible tool calling format
- **Effect**: Enables proper parsing of tool call tokens
- **Alternatives**: None for gpt-oss compatibility
### --enable-auto-tool-choice
- **Required**: Yes
- **Purpose**: Allows automatic tool selection
- **Effect**: Model can choose which tool to call
- **Note**: Only `tool_choice='auto'` is supported
### --tool-server
- **Required**: No; needed only for the built-in demo tools or MCP tool servers
- **Options**:
- `demo`: Built-in demo tools (browser, Python interpreter)
- `ip:port`: Custom MCP tool server
- Multiple servers: `ip1:port1,ip2:port2`
### --max-model-len
- **Required**: No
- **Purpose**: Sets maximum context length
- **Recommended**: 8192 or higher for tool calling contexts
- **Effect**: Prevents truncation during multi-turn tool conversations
## Tool Server Options
### Demo Tool Server
Requires gpt-oss library:
```bash
pip install gpt-oss
```
Provides:
- Web browser tool
- Python interpreter tool
Start command:
```bash
vllm serve openai/gpt-oss-20b \
--tool-call-parser openai \
--enable-auto-tool-choice \
--tool-server demo
```
### MCP Tool Servers
Start vLLM with MCP server URLs:
```bash
vllm serve openai/gpt-oss-20b \
--tool-call-parser openai \
--enable-auto-tool-choice \
--tool-server localhost:5000,localhost:5001
```
Requirements:
- MCP servers must be running before vLLM starts
- Must implement MCP protocol
- Should return results in expected format
### No Tool Server (Client-Managed Tools)
For client-side tool management (e.g., llama-stack with MCP):
```bash
vllm serve openai/gpt-oss-20b \
--tool-call-parser openai \
--enable-auto-tool-choice
```
Tools are provided in the API request, not in the server configuration.
## Environment Variables
### Search Tools
For demo tool server with search:
```bash
export EXA_API_KEY="your-exa-api-key"
```
### Python Execution
To enable Python execution in the demo tool server:
```bash
export PYTHON_EXECUTION_BACKEND=dangerously_use_uv
```
**Warning**: Demo Python execution is for testing only.
## Client Configuration
### OpenAI-Compatible Clients
```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed"  # vLLM doesn't require auth by default
)

response = client.chat.completions.create(
    model="openai/gpt-oss-20b",
    messages=[{"role": "user", "content": "What's 2+2?"}],
    tools=[{
        "type": "function",
        "function": {
            "name": "calculator",
            "description": "Perform calculations",
            "parameters": {
                "type": "object",
                "properties": {
                    "expression": {"type": "string"}
                },
                "required": ["expression"]
            }
        }
    }],
    tool_choice="auto"  # MUST be 'auto' - the only supported value
)
```
### llama-stack Configuration
Example run.yaml for llama-stack with vLLM:
```yaml
inference:
  - provider_id: vllm-provider
    provider_type: remote::vllm
    config:
      url: http://localhost:8000/v1
      # No auth_credential needed if vLLM has no auth
tool_runtime:
  - provider_id: mcp-provider
    provider_type: mcp
    config:
      servers:
        - server_name: my-tools
          url: http://localhost:5000
```
## Common Configuration Issues
### Issue: tool_choice Not Working
**Symptom**: Error about unsupported tool_choice value
**Solution**: Use only `tool_choice="auto"`; other values are not supported:
```python
# GOOD
tool_choice="auto"
# BAD - will fail
tool_choice="required"
tool_choice={"type": "function", "function": {"name": "my_func"}}
```
### Issue: Tools Not Being Called
**Symptoms**:
- Model describes tool usage in text
- No tool_calls in response
- Empty tool_calls array
**Checklist**:
1. Verify `--tool-call-parser openai` flag is set
2. Verify `--enable-auto-tool-choice` flag is set
3. Check generation_config.json is up to date (see model-updates.md)
4. Try simpler tool schemas first
5. Check vLLM logs for parsing errors
### Issue: Token Parsing Errors
**Error**: `openai_harmony.HarmonyError: Unexpected token X`
**Solutions**:
1. Update model files (see model-updates.md)
2. Verify vLLM version is recent
3. Check vLLM startup logs for warnings
4. Restart vLLM server after any config changes
## Performance Tuning
### GPU Memory
```bash
vllm serve openai/gpt-oss-20b \
--tool-call-parser openai \
--enable-auto-tool-choice \
--gpu-memory-utilization 0.9
```
### Tensor Parallelism
For multi-GPU:
```bash
vllm serve openai/gpt-oss-20b \
--tool-call-parser openai \
--enable-auto-tool-choice \
--tensor-parallel-size 2
```
### Batching
```bash
vllm serve openai/gpt-oss-20b \
--tool-call-parser openai \
--enable-auto-tool-choice \
--max-num-batched-tokens 8192 \
--max-num-seqs 256
```
## Testing Your Configuration
### Basic Test
```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "openai/gpt-oss-20b",
    "messages": [{"role": "user", "content": "Calculate 15 * 7"}],
    "tools": [{
      "type": "function",
      "function": {
        "name": "calculator",
        "description": "Perform math",
        "parameters": {
          "type": "object",
          "properties": {"expr": {"type": "string"}}
        }
      }
    }],
    "tool_choice": "auto"
  }'
```
### Expected Response
A successful response contains a `tool_calls` array with the function call:
```json
{
  "choices": [{
    "message": {
      "role": "assistant",
      "content": null,
      "tool_calls": [{
        "id": "call_123",
        "type": "function",
        "function": {
          "name": "calculator",
          "arguments": "{\"expr\": \"15 * 7\"}"
        }
      }]
    }
  }]
}
```
### Failure Indicators
- `content` field contains text describing the calculation instead of null
- `tool_calls` is empty or null
- Error in the response about tool parsing
- HarmonyError in the vLLM logs
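The first two indicators can be checked mechanically; a small illustrative sketch over a raw `/v1/chat/completions` response dict:
```python
def check_tool_call_response(response: dict) -> list[str]:
    """Return the failure indicators present in a /v1/chat/completions
    response dict (an empty list means the tool call looks healthy)."""
    problems = []
    message = response["choices"][0]["message"]
    if not message.get("tool_calls"):
        problems.append("tool_calls is empty or null")
    if message.get("content"):
        problems.append("content holds prose instead of null")
    return problems

# Example: a failing response where the model answered in text.
bad = {"choices": [{"message": {
    "role": "assistant", "content": "15 * 7 = 105", "tool_calls": None,
}}]}
print(check_tool_call_response(bad))
```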
## Cross-References
- Model file updates: See model-updates.md
- Known issues: See known-issues.md