# Tool Calling Configuration for gpt-oss with vLLM

## Required vLLM Server Flags

For gpt-oss tool calling to work, vLLM must be started with specific flags.

### Minimal Configuration

```bash
vllm serve openai/gpt-oss-20b \
  --tool-call-parser openai \
  --enable-auto-tool-choice
```
### Full Configuration with Tool Server

```bash
vllm serve openai/gpt-oss-20b \
  --tool-call-parser openai \
  --enable-auto-tool-choice \
  --tool-server demo \
  --max-model-len 8192 \
  --dtype auto
```
### Flag Explanations

#### `--tool-call-parser openai`

- Required: Yes
- Purpose: Uses the OpenAI-compatible tool calling format
- Effect: Enables proper parsing of tool call tokens
- Alternatives: None for gpt-oss compatibility

#### `--enable-auto-tool-choice`

- Required: Yes
- Purpose: Allows automatic tool selection
- Effect: Model can choose which tool to call
- Note: Only `tool_choice="auto"` is supported

#### `--tool-server`

- Required: Optional, but needed for demo tools
- Options:
  - `demo`: Built-in demo tools (browser, Python interpreter)
  - `ip:port`: Custom MCP tool server
  - Multiple servers: `ip1:port1,ip2:port2`

#### `--max-model-len`

- Required: No
- Purpose: Sets maximum context length
- Recommended: 8192 or higher for tool calling contexts
- Effect: Prevents truncation during multi-turn tool conversations
## Tool Server Options

### Demo Tool Server

Requires the gpt-oss library:

```bash
pip install gpt-oss
```

Provides:

- Web browser tool
- Python interpreter tool

Start command:

```bash
vllm serve openai/gpt-oss-20b \
  --tool-call-parser openai \
  --enable-auto-tool-choice \
  --tool-server demo
```
### MCP Tool Servers

Start vLLM with MCP server URLs:

```bash
vllm serve openai/gpt-oss-20b \
  --tool-call-parser openai \
  --enable-auto-tool-choice \
  --tool-server localhost:5000,localhost:5001
```

Requirements:

- MCP servers must be running before vLLM starts
- Servers must implement the MCP protocol
- Servers should return results in the expected format (see the sketch below)
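
As a concrete reference, here is a minimal sketch of a custom tool server built with the `mcp` Python SDK (`pip install mcp`). The server name, tool, port, and transport setting are illustrative assumptions, not requirements of vLLM; check your SDK version's documentation for exact settings:

```python
# Minimal MCP tool server sketch (hypothetical example).
# Assumes the `mcp` Python SDK; the tool, name, and port are illustrative.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("my-tools", port=5000)  # port chosen to match the vLLM example

@mcp.tool()
def add(a: int, b: int) -> int:
    """Add two integers and return the sum."""
    return a + b

if __name__ == "__main__":
    mcp.run(transport="sse")  # serve over a network transport so vLLM can connect
```

Start this server first, then launch vLLM with `--tool-server localhost:5000`.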
### No Tool Server (Client-Managed Tools)

For client-side tool management (e.g., llama-stack with MCP):

```bash
vllm serve openai/gpt-oss-20b \
  --tool-call-parser openai \
  --enable-auto-tool-choice
```

Tools are provided via the API request, not server configuration.
## Environment Variables

### Search Tools

For the demo tool server with search:

```bash
export EXA_API_KEY="your-exa-api-key"
```

### Python Execution

To enable Python execution in the demo tool server:

```bash
export PYTHON_EXECUTION_BACKEND=dangerously_use_uv
```

Warning: As the backend name suggests, demo Python execution is for testing only.
## Client Configuration

### OpenAI-Compatible Clients

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed",  # vLLM doesn't require auth by default
)

response = client.chat.completions.create(
    model="openai/gpt-oss-20b",
    messages=[{"role": "user", "content": "What's 2+2?"}],
    tools=[{
        "type": "function",
        "function": {
            "name": "calculator",
            "description": "Perform calculations",
            "parameters": {
                "type": "object",
                "properties": {
                    "expression": {"type": "string"}
                },
                "required": ["expression"]
            }
        }
    }],
    tool_choice="auto",  # MUST be 'auto' - only supported value
)
```
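
When the model returns a tool call, the client executes the function locally and sends the result back as a `tool` message. The following continuation of the example above is a minimal sketch; the `eval`-based calculator is a stand-in, not a real implementation:

```python
import json

message = response.choices[0].message

if message.tool_calls:
    call = message.tool_calls[0]
    args = json.loads(call.function.arguments)
    # Placeholder tool implementation; never eval untrusted input in production.
    result = str(eval(args["expression"]))

    follow_up = client.chat.completions.create(
        model="openai/gpt-oss-20b",
        messages=[
            {"role": "user", "content": "What's 2+2?"},
            # Echo the assistant turn that requested the tool call.
            {"role": "assistant", "tool_calls": [{
                "id": call.id,
                "type": "function",
                "function": {"name": call.function.name,
                             "arguments": call.function.arguments},
            }]},
            # Return the tool result keyed by the call id.
            {"role": "tool", "tool_call_id": call.id, "content": result},
        ],
    )
    print(follow_up.choices[0].message.content)
```

Pass the same `tools` list on the follow-up request if the model may need to call the tool again.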
### llama-stack Configuration

Example `run.yaml` for llama-stack with vLLM:

```yaml
inference:
  - provider_id: vllm-provider
    provider_type: remote::vllm
    config:
      url: http://localhost:8000/v1
      # No auth_credential needed if vLLM has no auth

tool_runtime:
  - provider_id: mcp-provider
    provider_type: mcp
    config:
      servers:
        - server_name: my-tools
          url: http://localhost:5000
```
## Common Configuration Issues

### Issue: tool_choice Not Working

Symptom: An error about an unsupported `tool_choice` value.

Solution: Use only `tool_choice="auto"`; other values are not supported:

```python
# GOOD
tool_choice="auto"

# BAD - will fail
tool_choice="required"
tool_choice={"type": "function", "function": {"name": "my_func"}}
```
### Issue: Tools Not Being Called

Symptoms:

- Model describes tool usage in text
- No `tool_calls` in response
- Empty `tool_calls` array

Checklist (see the diagnostic sketch after this list):

- Verify the `--tool-call-parser openai` flag is set
- Verify the `--enable-auto-tool-choice` flag is set
- Check generation_config.json is up to date (see model-updates.md)
- Try simpler tool schemas first
- Check vLLM logs for parsing errors
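
If you are unsure which failure mode you are hitting, inspecting the raw response narrows it down quickly. This sketch assumes a `response` object created as in the client example above:

```python
# Quick diagnostic for tool-calling problems.
choice = response.choices[0]
print("finish_reason:", choice.finish_reason)
print("content:", choice.message.content)
print("tool_calls:", choice.message.tool_calls)
# Healthy setup: tool_calls is populated and content is typically None.
# Prose about the tool in `content` with tool_calls=None points to a
# missing --tool-call-parser / --enable-auto-tool-choice flag.
```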
### Issue: Token Parsing Errors

Error: `openai_harmony.HarmonyError: Unexpected token X`

Solutions:

- Update model files (see model-updates.md)
- Verify the vLLM version is recent
- Check vLLM startup logs for warnings
- Restart the vLLM server after any config changes
## Performance Tuning

### GPU Memory

```bash
vllm serve openai/gpt-oss-20b \
  --tool-call-parser openai \
  --enable-auto-tool-choice \
  --gpu-memory-utilization 0.9
```

### Tensor Parallelism

For multi-GPU setups:

```bash
vllm serve openai/gpt-oss-20b \
  --tool-call-parser openai \
  --enable-auto-tool-choice \
  --tensor-parallel-size 2
```

### Batching

```bash
vllm serve openai/gpt-oss-20b \
  --tool-call-parser openai \
  --enable-auto-tool-choice \
  --max-num-batched-tokens 8192 \
  --max-num-seqs 256
```
## Testing Your Configuration

### Basic Test

```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "openai/gpt-oss-20b",
    "messages": [{"role": "user", "content": "Calculate 15 * 7"}],
    "tools": [{
      "type": "function",
      "function": {
        "name": "calculator",
        "description": "Perform math",
        "parameters": {
          "type": "object",
          "properties": {"expr": {"type": "string"}}
        }
      }
    }],
    "tool_choice": "auto"
  }'
```
### Expected Response

A successful response contains a `tool_calls` array with the function call:

```json
{
  "choices": [{
    "message": {
      "role": "assistant",
      "content": null,
      "tool_calls": [{
        "id": "call_123",
        "type": "function",
        "function": {
          "name": "calculator",
          "arguments": "{\"expr\": \"15 * 7\"}"
        }
      }]
    }
  }]
}
```
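
The same smoke test can be scripted. This sketch assumes the `requests` package and the server from the curl example running on localhost:8000:

```python
import requests

# Same payload as the curl test above.
payload = {
    "model": "openai/gpt-oss-20b",
    "messages": [{"role": "user", "content": "Calculate 15 * 7"}],
    "tools": [{
        "type": "function",
        "function": {
            "name": "calculator",
            "description": "Perform math",
            "parameters": {
                "type": "object",
                "properties": {"expr": {"type": "string"}},
            },
        },
    }],
    "tool_choice": "auto",
}

resp = requests.post("http://localhost:8000/v1/chat/completions", json=payload)
resp.raise_for_status()
message = resp.json()["choices"][0]["message"]
assert message.get("tool_calls"), "no tool_calls returned - check server flags"
print(message["tool_calls"][0]["function"])
```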
### Failure Indicators

- The `content` field has text describing the calculation instead of null
- `tool_calls` is empty or null
- Error in response about tool parsing
- HarmonyError in vLLM logs
## Cross-References

- Model file updates: see model-updates.md
- Known issues: see known-issues.md