# Known GitHub Issues for gpt-oss and vLLM
## Active Issues
### vLLM Repository
#### Issue #22519: Token Error with gpt-oss-20b Tool Calls
- **URL**: https://github.com/vllm-project/vllm/issues/22519
- **Error**: `Unexpected token 12606 while expecting start token 200006`
- **Status**: Open, To Triage
- **Model**: gpt-oss-20b
- **Symptoms**:
  - Error occurs after the model returns token 200012
  - Token 12606 = "comment"
  - Hypothesis: the model incorrectly splits "commentary" into "comment" + "ary" (a decoding sketch follows this entry)
- **Workaround**: None currently documented
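You can inspect these token IDs locally. A minimal sketch, assuming `tiktoken`'s `o200k_base` vocabulary matches gpt-oss's text tokens (IDs in the 2000xx range are Harmony control tokens and are not decodable with it):
```python
# pip install tiktoken
import tiktoken

enc = tiktoken.get_encoding("o200k_base")

# The issue reports token 12606 decoding to "comment"
print(repr(enc.decode([12606])))

# If "commentary" tokenizes as "comment" + "ary", that is consistent
# with the splitting hypothesis above
print(enc.encode("commentary"))
```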
#### Issue #22515: Same Error, Fixed by Config Update
- **URL**: https://github.com/vllm-project/vllm/issues/22515
- **Error**: Same token parsing error
- **Status**: Open
- **Fix**: Update generation_config.json from HuggingFace
  - Specific commit: 8b193b0ef83bd41b40eb71fee8f1432315e02a3e
  - User andresC98 confirmed this resolved the issue
- **Version**: Reported in vLLM v0.10.2
#### Issue #22578: gpt-oss-120b Tool Call Support
- **URL**: https://github.com/vllm-project/vllm/issues/22578
- **Error**: Chat Completions endpoint tool_call not working
- **Status**: Open
- **Model**: gpt-oss-120b
- **Symptoms**: Tool calling doesn't work correctly via /v1/chat/completions
#### Issue #22337: Empty Tool Calls Array
- **URL**: https://github.com/vllm-project/vllm/issues/22337
- **Error**: tool_calls returning empty arrays
- **Status**: Open
- **Model**: gpt-oss-120b
- **Symptoms**: Content appears in wrong format, tool_calls=[]
#### Issue #23567: Unexpected Tokens in Message Header
- **URL**: https://github.com/vllm-project/vllm/issues/23567
- **Error**: `openai_harmony.HarmonyError: unexpected tokens remaining in message header`
- **Status**: Open
- **Symptoms**: Occurs in multi-turn conversations with gpt-oss-120b
- **Version**: vLLM v0.10.1 and v0.10.1.1
#### PR #24787: Tool Call Turn Tracking
- **URL**: https://github.com/vllm-project/vllm/pull/24787
- **Title**: Pass toolcall turn to kv cache manager
- **Status**: Merged (September 2025)
- **Description**: Adds toolcall_turn parameter for tracking turns in tool-calling conversations
- **Impact**: Enables better prefix cache statistics for tool calling
### HuggingFace Discussions
#### gpt-oss-20b Discussion #80: Tool Calling Configuration
- **URL**: https://huggingface.co/openai/gpt-oss-20b/discussions/80
- **Summary**: Community discussion about tool calling best practices
- **Key Findings** (illustrated after this list):
  - Explicit tool listing in the system prompt improves results
  - Better results with tool_choice='required' or 'auto' (note that vLLM currently supports only 'auto'; see tool-calling-setup.md)
  - Avoid requiring a JSON response format
  - Configuration and prompt engineering significantly impact tool calling behavior
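For example, the "explicit tool listing" finding amounts to describing the tools in the system prompt in addition to passing them in the `tools` parameter. A hypothetical sketch (the tool name and wording are illustrative, not taken from the discussion):
```python
# Illustrative only: restate the available tools in the system prompt
messages = [
    {
        "role": "system",
        "content": (
            "You can call the following tool:\n"
            "- get_weather(location: str): returns the current weather"
        ),
    },
    {"role": "user", "content": "What's the weather in Paris?"},
]
```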
#### gpt-oss-120b Discussion #69: Chat Template Spec Errors
- **URL**: https://huggingface.co/openai/gpt-oss-120b/discussions/69
- **Summary**: Errors in chat template compared to spec
- **Impact**: May affect tool calling format
### openai/harmony Repository
#### Issue #33: EOS Error While Waiting for Message Header
- **URL**: https://github.com/openai/harmony/issues/33
- **Error**: `HarmonyError: Unexpected EOS while waiting for message header to complete`
- **Status**: Open
- **Context**: Core Harmony parser issue affecting message parsing
## Error Pattern Summary
### Token Mismatch Errors
- **Pattern**: `Unexpected token X while expecting start token Y` (see the log-scanning sketch after this list)
- **Root Cause**: Model generating text tokens instead of Harmony control tokens
- **Common Triggers**: Tool calling, multi-turn conversations
- **Primary Fix**: Update generation_config.json
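A small sketch for pulling these token pairs out of vLLM logs; the regex follows the error string quoted above, and the log path is illustrative:
```python
import re

# Matches e.g. "Unexpected token 12606 while expecting start token 200006"
PATTERN = re.compile(r"Unexpected token (\d+) while expecting start token (\d+)")

with open("vllm.log", encoding="utf-8") as f:  # adjust path to your setup
    for line in f:
        m = PATTERN.search(line)
        if m:
            got, expected = map(int, m.groups())
            print(f"got token {got}, expected start token {expected}")
```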
### Streaming Errors
- **Pattern**: Parse failures during streaming responses
- **Root Cause**: The streamed token sequence does not match what the Harmony parser expects for the request format
- **Affected**: Both 20b and 120b models
### Tool Calling Failures
- **Pattern**: Empty tool_calls arrays or text descriptions instead of calls
- **Root Cause**: Configuration issues or outdated model files
- **Primary Fix**: Correct vLLM flags and update generation_config.json
## Version Compatibility
### vLLM Versions
- **v0.10.2**: Multiple token parsing errors reported
- **v0.10.1/v0.10.1.1**: Multi-turn conversation errors
- **Latest**: Check for fixes in newer releases
### Recommended Actions by Version
- **Pre-v0.11**: Update to latest, refresh model files
- **v0.11+**: Verify tool calling flags are set correctly
## Cross-References
- Model file updates: See model-updates.md
- Tool calling configuration: See tool-calling-setup.md

# Updating gpt-oss Model Files
## Why Update Model Files?
The `openai_harmony.HarmonyError: Unexpected token` errors are often caused by outdated `generation_config.json` files. HuggingFace updates these files to fix token parsing issues.
## Current Configuration Files
### gpt-oss-20b generation_config.json
Latest version includes:
```json
{
  "bos_token_id": 199998,
  "do_sample": true,
  "eos_token_id": [
    200002,
    199999,
    200012
  ],
  "pad_token_id": 199999,
  "transformers_version": "4.55.0.dev0"
}
```
**Key elements** (a verification sketch follows this list):
- **eos_token_id**: Multiple EOS tokens including 200012 (tool call completion)
- **do_sample**: Enables sampling rather than greedy decoding
- **transformers_version**: Indicates compatible transformers version
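A quick sanity check of a local copy against the version above (the path is illustrative; adjust to your layout):
```python
import json
from pathlib import Path

config_path = Path("./gpt-oss-20b/generation_config.json")  # adjust to your layout
config = json.loads(config_path.read_text())

eos = config.get("eos_token_id", [])
eos = eos if isinstance(eos, list) else [eos]

# 200012 is the tool-call completion token listed above; configs that
# predate it are a known cause of token parsing errors
if 200012 not in eos:
    print("eos_token_id is missing 200012 - update generation_config.json")
else:
    print("eos_token_id looks current:", eos)
```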
### gpt-oss-120b Critical Commit
**Commit**: 8b193b0ef83bd41b40eb71fee8f1432315e02a3e
- Fixed generation_config.json
- User andresC98 confirmed it resolves the token parsing errors (see issue #22515)
- Applied to gpt-oss-120b model
## How to Update Model Files
### Method 1: Re-download with HuggingFace CLI
```bash
# Install or update huggingface-hub
pip install --upgrade huggingface-hub
# For gpt-oss-20b
huggingface-cli download openai/gpt-oss-20b --local-dir ./gpt-oss-20b
# For gpt-oss-120b
huggingface-cli download openai/gpt-oss-120b --local-dir ./gpt-oss-120b
```
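The same download from Python, if you prefer the `huggingface_hub` API:
```python
from huggingface_hub import snapshot_download

# Downloads (or refreshes) the full snapshot, including generation_config.json
snapshot_download(repo_id="openai/gpt-oss-20b", local_dir="./gpt-oss-20b")
```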
### Method 2: Manual Update via Web
1. Visit HuggingFace model page:
- gpt-oss-20b: https://huggingface.co/openai/gpt-oss-20b
- gpt-oss-120b: https://huggingface.co/openai/gpt-oss-120b
2. Navigate to "Files and versions" tab
3. Download latest `generation_config.json`
4. Replace in your local model directory:
```bash
# Find your model directory (varies by vLLM installation)
# Common locations:
# ~/.cache/huggingface/hub/models--openai--gpt-oss-20b/
# ./models/gpt-oss-20b/
# Replace the file
cp ~/Downloads/generation_config.json /path/to/model/directory/
```
### Method 3: Update with git (if model was cloned)
```bash
cd /path/to/model/directory
git pull origin main
```
## Verification Steps
After updating:
1. **Check file contents**:
```bash
cat generation_config.json
```
Verify it matches the current version shown above.
2. **Check modification date**:
```bash
ls -l generation_config.json
```
Should be recent (after the commit date).
3. **Restart vLLM server**:
```bash
# Stop existing server
# Start with correct flags (see tool-calling-setup.md)
vllm serve openai/gpt-oss-20b \
    --tool-call-parser openai \
    --enable-auto-tool-choice
```
4. **Test tool calling**:
```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1")
response = client.chat.completions.create(
    model="openai/gpt-oss-20b",
    messages=[{"role": "user", "content": "What's the weather?"}],
    tools=[{
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the weather",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {"type": "string"}
                }
            }
        }
    }]
)
print(response)
```
## Troubleshooting Update Issues
### vLLM Not Picking Up Changes
**Symptom**: Updated files but still getting errors
**Solutions**:
1. Clear vLLM cache:
```bash
rm -rf ~/.cache/vllm/
```
2. Restart vLLM with fresh model load:
```bash
# Use --download-dir to force specific directory
vllm serve openai/gpt-oss-20b \
    --download-dir /path/to/models \
    --tool-call-parser openai \
    --enable-auto-tool-choice
```
3. Check vLLM is loading the correct model directory:
- Look for model path in vLLM startup logs
- Verify it matches where you updated files
### File Permission Issues
```bash
# Ensure files are readable
chmod 644 generation_config.json
# Check ownership
ls -l generation_config.json
```
### Multiple Model Copies
**Problem**: vLLM might be loading from a different location
**Solution**:
1. Find all copies (see the Python sketch after this list):
```bash
find ~/.cache -name "generation_config.json" -path "*/gpt-oss*"
```
2. Update all copies or remove duplicates
3. Use explicit `--download-dir` flag when starting vLLM
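A Python alternative for step 1, using `huggingface_hub`'s cache scanner:
```python
from huggingface_hub import scan_cache_dir

# Enumerates every repo cached under ~/.cache/huggingface/hub
for repo in scan_cache_dir().repos:
    if "gpt-oss" in repo.repo_id:
        print(repo.repo_id, "->", repo.repo_path)
```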
## Additional Files to Check
While `generation_config.json` is the primary fix, also verify these files are current:
### config.json
Contains model architecture configuration
### tokenizer_config.json
Token encoding settings, including special tokens
### special_tokens_map.json
Maps special token strings to IDs
**To update all**:
```bash
huggingface-cli download openai/gpt-oss-20b \
    --local-dir ./gpt-oss-20b \
    --force-download
```
## When to Update
Update model files when:
- Encountering token parsing errors
- HuggingFace shows recent commits to model repo
- vLLM error messages reference token IDs
- After vLLM version upgrades
- Community reports fixes via file updates
## Cross-References
- Known issues: See known-issues.md
- vLLM configuration: See tool-calling-setup.md

# Tool Calling Configuration for gpt-oss with vLLM
## Required vLLM Server Flags
For gpt-oss tool calling to work, vLLM must be started with specific flags.
### Minimal Configuration
```bash
vllm serve openai/gpt-oss-20b \
    --tool-call-parser openai \
    --enable-auto-tool-choice
```
### Full Configuration with Tool Server
```bash
vllm serve openai/gpt-oss-20b \
    --tool-call-parser openai \
    --enable-auto-tool-choice \
    --tool-server demo \
    --max-model-len 8192 \
    --dtype auto
```
## Flag Explanations
### --tool-call-parser openai
- **Required**: Yes
- **Purpose**: Uses OpenAI-compatible tool calling format
- **Effect**: Enables proper parsing of tool call tokens
- **Alternatives**: None for gpt-oss compatibility
### --enable-auto-tool-choice
- **Required**: Yes
- **Purpose**: Allows automatic tool selection
- **Effect**: Model can choose which tool to call
- **Note**: Only `tool_choice='auto'` is supported
### --tool-server
- **Required**: No; needed only when vLLM should run tools server-side (e.g. the demo tools)
- **Options**:
  - `demo`: Built-in demo tools (browser, Python interpreter)
  - `ip:port`: Custom MCP tool server
  - Multiple servers: `ip1:port1,ip2:port2`
### --max-model-len
- **Required**: No
- **Purpose**: Sets maximum context length
- **Recommended**: 8192 or higher for tool calling contexts
- **Effect**: Prevents truncation during multi-turn tool conversations
## Tool Server Options
### Demo Tool Server
Requires the `gpt-oss` library:
```bash
pip install gpt-oss
```
Provides:
- Web browser tool
- Python interpreter tool
Start command:
```bash
vllm serve openai/gpt-oss-20b \
    --tool-call-parser openai \
    --enable-auto-tool-choice \
    --tool-server demo
```
### MCP Tool Servers
Start vLLM with MCP server URLs:
```bash
vllm serve openai/gpt-oss-20b \
    --tool-call-parser openai \
    --enable-auto-tool-choice \
    --tool-server localhost:5000,localhost:5001
```
Requirements:
- MCP servers must be running before vLLM starts
- Must implement MCP protocol
- Should return results in expected format
### No Tool Server (Client-Managed Tools)
For client-side tool management (e.g., llama-stack with MCP):
```bash
vllm serve openai/gpt-oss-20b \
    --tool-call-parser openai \
    --enable-auto-tool-choice
```
Tools are provided via API request, not server config.
## Environment Variables
### Search Tools
For demo tool server with search:
```bash
export EXA_API_KEY="your-exa-api-key"
```
### Python Execution
For safe Python execution (recommended for demo):
```bash
export PYTHON_EXECUTION_BACKEND=dangerously_use_uv
```
**Warning**: Demo Python execution is for testing only.
## Client Configuration
### OpenAI-Compatible Clients
```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed"  # vLLM doesn't require auth by default
)
response = client.chat.completions.create(
    model="openai/gpt-oss-20b",
    messages=[{"role": "user", "content": "What's 2+2?"}],
    tools=[{
        "type": "function",
        "function": {
            "name": "calculator",
            "description": "Perform calculations",
            "parameters": {
                "type": "object",
                "properties": {
                    "expression": {"type": "string"}
                },
                "required": ["expression"]
            }
        }
    }],
    tool_choice="auto"  # MUST be 'auto' - only supported value
)
```
### llama-stack Configuration
Example run.yaml for llama-stack with vLLM:
```yaml
inference:
  - provider_id: vllm-provider
    provider_type: remote::vllm
    config:
      url: http://localhost:8000/v1
      # No auth_credential needed if vLLM has no auth
tool_runtime:
  - provider_id: mcp-provider
    provider_type: mcp
    config:
      servers:
        - server_name: my-tools
          url: http://localhost:5000
```
## Common Configuration Issues
### Issue: tool_choice Not Working
**Symptom**: Error about unsupported tool_choice value
**Solution**: Use only `tool_choice="auto"`; other values are not supported:
```python
# GOOD
tool_choice="auto"
# BAD - will fail
tool_choice="required"
tool_choice={"type": "function", "function": {"name": "my_func"}}
```
### Issue: Tools Not Being Called
**Symptoms**:
- Model describes tool usage in text
- No tool_calls in response
- Empty tool_calls array
**Checklist** (a scripted smoke test follows this list):
1. Verify `--tool-call-parser openai` flag is set
2. Verify `--enable-auto-tool-choice` flag is set
3. Check generation_config.json is up to date (see model-updates.md)
4. Try simpler tool schemas first
5. Check vLLM logs for parsing errors
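A minimal scripted version of this checklist, assuming the server from the examples above is running on localhost:8000:
```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
response = client.chat.completions.create(
    model="openai/gpt-oss-20b",
    messages=[{"role": "user", "content": "What's the weather in Paris?"}],
    tools=[{
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the weather",
            "parameters": {
                "type": "object",
                "properties": {"location": {"type": "string"}},
            },
        },
    }],
    tool_choice="auto",
)
message = response.choices[0].message
if message.tool_calls:
    print("OK:", [t.function.name for t in message.tool_calls])
else:
    # Text instead of a tool call points back at items 1-3 above
    print("FAIL: no tool_calls; content was:", message.content)
```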
### Issue: Token Parsing Errors
**Error**: `openai_harmony.HarmonyError: Unexpected token X`
**Solutions**:
1. Update model files (see model-updates.md)
2. Verify vLLM version is recent
3. Check vLLM startup logs for warnings
4. Restart vLLM server after any config changes
## Performance Tuning
### GPU Memory
```bash
vllm serve openai/gpt-oss-20b \
    --tool-call-parser openai \
    --enable-auto-tool-choice \
    --gpu-memory-utilization 0.9
```
### Tensor Parallelism
For multi-GPU:
```bash
vllm serve openai/gpt-oss-20b \
    --tool-call-parser openai \
    --enable-auto-tool-choice \
    --tensor-parallel-size 2
```
### Batching
```bash
vllm serve openai/gpt-oss-20b \
    --tool-call-parser openai \
    --enable-auto-tool-choice \
    --max-num-batched-tokens 8192 \
    --max-num-seqs 256
```
## Testing Your Configuration
### Basic Test
```bash
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "openai/gpt-oss-20b",
"messages": [{"role": "user", "content": "Calculate 15 * 7"}],
"tools": [{
"type": "function",
"function": {
"name": "calculator",
"description": "Perform math",
"parameters": {
"type": "object",
"properties": {"expr": {"type": "string"}}
}
}
}],
"tool_choice": "auto"
}'
```
### Expected Response
A successful response contains a `tool_calls` array with the function call:
```json
{
  "choices": [{
    "message": {
      "role": "assistant",
      "content": null,
      "tool_calls": [{
        "id": "call_123",
        "type": "function",
        "function": {
          "name": "calculator",
          "arguments": "{\"expr\": \"15 * 7\"}"
        }
      }]
    }
  }]
}
```
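A sketch of the same check from Python, using the OpenAI client's response objects (the failure branch corresponds to the indicators listed next):
```python
import json

def check_tool_call(response) -> str:
    """Classify a chat completion as a working tool call or a failure."""
    message = response.choices[0].message
    if not message.tool_calls:
        # Text content instead of a call, or an empty tool_calls array
        return f"FAIL: tool_calls empty; content={message.content!r}"
    call = message.tool_calls[0]
    args = json.loads(call.function.arguments)  # arguments arrive as a JSON string
    return f"OK: {call.function.name}({args})"
```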
### Failure Indicators
- `content` field has text describing calculation instead of null
- `tool_calls` is empty or null
- Error in response about tool parsing
- HarmonyError in vLLM logs
## Cross-References
- Model file updates: See model-updates.md
- Known issues: See known-issues.md