Initial commit

Zhongwei Li
2025-11-29 17:59:39 +08:00
commit 0b993003eb
9 changed files with 2987 additions and 0 deletions

skills/ollama/SKILL.md

@@ -0,0 +1,450 @@
---
name: ollama
description: Use this skill whenever the user wants to connect to Ollama or use Ollama in any way in their project. Guides users through integrating Ollama for local AI inference, covering installation, connection setup, model management, and API usage for both Python and Node.js. Helps with text generation, chat interfaces, embeddings, streaming responses, and building AI-powered applications using local LLMs.
---
# Ollama
## Overview
This skill helps users integrate Ollama into their projects for running large language models locally. The skill guides users through setup, connection validation, model management, and API integration for both Python and Node.js applications. Ollama provides a simple API for running models like Llama, Mistral, Gemma, and others locally without cloud dependencies.
## When to Use This Skill
Use this skill when users want to:
- Run large language models locally on their machine
- Build AI-powered applications without cloud dependencies
- Implement text generation, chat, or embeddings functionality
- Stream LLM responses in real-time
- Create RAG (Retrieval-Augmented Generation) systems
- Integrate local AI capabilities into Python or Node.js projects
- Manage Ollama models (pull, list, delete)
- Validate Ollama connectivity and troubleshoot connection issues
## Installation and Setup
### Step 1: Collect Ollama URL
**IMPORTANT**: Always ask users for their Ollama URL. Do not assume it's running locally.
Ask the user: "What is your Ollama server URL?"
Common scenarios:
- **Local installation**: `http://localhost:11434` (default)
- **Remote server**: `http://192.168.1.100:11434`
- **Custom port**: `http://localhost:8080`
- **Docker**: `http://localhost:11434` (if port mapped to 11434)
If the user says they're running Ollama locally or doesn't know the URL, suggest trying `http://localhost:11434`.
### Step 2: Check if Ollama is Installed
Before proceeding, verify if Ollama is installed and running at the provided URL. Users can check by visiting the URL in their browser or running:
```bash
curl <OLLAMA_URL>/api/version
```
If Ollama is not installed, guide users to install it:
**macOS/Linux:**
```bash
curl -fsSL https://ollama.com/install.sh | sh
```
**Windows:**
Download from https://ollama.com/download
**Docker:**
```bash
docker run -d -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama
```
### Step 3: Start Ollama Service
Ensure Ollama is running:
**macOS/Linux:**
```bash
ollama serve
```
**Docker:**
```bash
docker start ollama
```
The service typically runs at `http://localhost:11434` by default.
### Step 4: Validate Connection
Use the validation script to test connectivity and list available models.
**IMPORTANT**: The script path is relative to the skill directory. When running the script, either:
1. Use the full path from the skill directory (e.g., `/path/to/ollama/scripts/validate_connection.py`)
2. Change to the skill directory first and then run `python scripts/validate_connection.py`
```bash
# Run from the skill directory
cd /path/to/ollama
python scripts/validate_connection.py <OLLAMA_URL>
```
Example with the user's Ollama URL:
```bash
cd /path/to/ollama
python scripts/validate_connection.py http://192.168.1.100:11434
```
The script will:
- Normalize the URL (remove any path components)
- Check if Ollama is accessible
- Display the Ollama version
- List all installed models with sizes
- Provide troubleshooting guidance if connection fails
**Success output:**
```
✓ Connection successful!
URL: http://localhost:11434
Version: Ollama 0.1.0
Models available: 2
Installed models:
- llama3.2 (4.7 GB)
- mistral (7.2 GB)
```
**Failure output:**
```
✗ Connection failed: Connection refused
URL: http://localhost:11434
Troubleshooting:
1. Ensure Ollama is installed and running
2. Check that the URL is correct
3. Verify Ollama is accessible at the specified URL
4. Try: curl http://localhost:11434/api/version
```
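For reference, a minimal sketch of what such a validation script could look like is shown below; the bundled `scripts/validate_connection.py` may differ in detail. It relies only on the Python standard library and the documented `/api/version` and `/api/tags` endpoints.
```python
# Minimal sketch of a connection validator; the bundled scripts/validate_connection.py
# may differ. Standard library only, using the /api/version and /api/tags endpoints.
import json
import sys
import urllib.request
from urllib.parse import urlparse

def validate(url: str) -> bool:
    # Normalize the URL: keep scheme://host:port, drop any path components
    parsed = urlparse(url)
    base = f"{parsed.scheme}://{parsed.netloc}"
    try:
        with urllib.request.urlopen(f"{base}/api/version", timeout=5) as resp:
            version = json.load(resp).get("version", "unknown")
        with urllib.request.urlopen(f"{base}/api/tags", timeout=5) as resp:
            models = json.load(resp).get("models", [])
    except OSError as exc:  # covers connection refused, DNS failures, timeouts
        print(f"✗ Connection failed: {exc}")
        print(f"  Try: curl {base}/api/version")
        return False
    print(f"✓ Connection successful!  URL: {base}  Version: {version}")
    print(f"  Models available: {len(models)}")
    for model in models:
        print(f"  - {model.get('name')} ({model.get('size', 0) / 1e9:.1f} GB)")
    return True

if __name__ == "__main__":
    url = sys.argv[1] if len(sys.argv) > 1 else "http://localhost:11434"
    sys.exit(0 if validate(url) else 1)
```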
## Model Management
### Pulling Models
Help users download models from the Ollama library. Common models include:
- `llama3.2` - Meta's Llama 3.2 (various sizes: 1B, 3B)
- `llama3.1` - Meta's Llama 3.1 (8B, 70B, 405B)
- `mistral` - Mistral 7B
- `phi3` - Microsoft Phi-3
- `gemma2` - Google Gemma 2
Users can pull models using:
```bash
ollama pull llama3.2
```
Or programmatically using the API (examples in reference docs).
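As a rough sketch, pulling a model through the official `ollama` Python package might look like this (progress fields such as `status`, `completed`, and `total` can vary between library versions):
```python
# Sketch: pull a model through the official `ollama` package (pip install ollama).
# Older versions yield plain dicts; newer versions yield objects that also support
# dict-style access, so .get() is used here.
import ollama

def pull_model(name: str = "llama3.2") -> None:
    for progress in ollama.pull(name, stream=True):
        status = progress.get("status", "")
        completed, total = progress.get("completed"), progress.get("total")
        if completed and total:
            print(f"\r{status}: {completed / total:.0%}", end="", flush=True)
        else:
            print(f"\r{status}", end="", flush=True)
    print("\nPull complete")

pull_model("llama3.2")
```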
### Listing Models
Guide users to list installed models:
```bash
ollama list
```
Or use the validation script to see models with detailed information.
### Removing Models
Help users delete models to free space:
```bash
ollama rm llama3.2
```
### Model Selection Guidance
Help users choose appropriate models based on their needs:
- **Small models (1-3B)**: Fast, good for simple tasks, lower resource requirements
- **Medium models (7-13B)**: Balanced performance and quality
- **Large models (70B+)**: Best quality, require significant resources
## Implementation Guidance
### Python Projects
For Python-based projects, refer to the Python API reference:
- **File**: `references/python_api.md`
- **Usage**: Load this reference when implementing Python integrations
- **Contains**:
- Examples built on the official `ollama` Python package
- Text generation with the Generate API
- Conversational interfaces with the Chat API
- **Streaming responses for real-time output (RECOMMENDED)**
- Embeddings for semantic search
- Complete RAG system example
- Error handling patterns
- PEP 723 inline script metadata for dependencies
- **Single dependency**: the official `ollama` package, declared via PEP 723 metadata
**IMPORTANT**: When creating Python scripts for users, include PEP 723 inline script metadata to declare dependencies. See the reference docs for examples.
**DEFAULT TO STREAMING**: When implementing text generation or chat, use streaming responses unless the user explicitly requests non-streaming.
Common Python use cases:
```python
# generate_stream, chat_stream, and get_embeddings are helper functions
# defined in references/python_api.md.

# Streaming text generation (RECOMMENDED)
for token in generate_stream("Explain quantum computing"):
    print(token, end="", flush=True)

# Streaming chat conversation (RECOMMENDED)
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the capital of France?"}
]
for token in chat_stream(messages):
    print(token, end="", flush=True)

# Non-streaming (use only when needed)
response = generate("Explain quantum computing")

# Embeddings for semantic search
embedding = get_embeddings("Hello, world!")
```
### Node.js Projects
For Node.js-based projects, refer to the Node.js API reference:
- **File**: `references/nodejs_api.md`
- **Usage**: Load this reference when implementing Node.js integrations
- **Contains**:
- Official `ollama` npm package examples
- Alternative Fetch API examples (Node.js 18+)
- Text generation and chat APIs
- **Streaming with async iterators (RECOMMENDED)**
- Embeddings and semantic similarity
- Complete RAG system example
- Error handling and retry logic
- TypeScript support examples
Installation:
```bash
npm install ollama
```
**DEFAULT TO STREAMING**: When implementing text generation or chat, use streaming responses unless the user explicitly requests non-streaming.
Common Node.js use cases:
```javascript
import { Ollama } from 'ollama';
const ollama = new Ollama();
// Streaming text generation (RECOMMENDED)
const stream = await ollama.generate({
model: 'llama3.2',
prompt: 'Explain quantum computing',
stream: true
});
for await (const chunk of stream) {
process.stdout.write(chunk.response);
}
// Streaming chat conversation (RECOMMENDED)
const chatStream = await ollama.chat({
model: 'llama3.2',
messages: [
{ role: 'system', content: 'You are a helpful assistant.' },
{ role: 'user', content: 'What is the capital of France?' }
],
stream: true
});
for await (const chunk of chatStream) {
process.stdout.write(chunk.message.content);
}
// Non-streaming (use only when needed)
const response = await ollama.generate({
model: 'llama3.2',
prompt: 'Explain quantum computing'
});
// Embeddings (prefer a dedicated embedding model such as nomic-embed-text)
const embedding = await ollama.embeddings({
  model: 'nomic-embed-text',
  prompt: 'Hello, world!'
});
```
## Common Integration Patterns
### Text Generation
Generate text completions from prompts. Use cases:
- Content generation
- Code completion
- Question answering
- Summarization
Guide users to use the Generate API with appropriate parameters (temperature, top_p, etc.) for their use case.
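As a brief illustration, a parameterized streaming call with the official `ollama` Python package might look like the sketch below (the values are illustrative; `references/python_api.md` has the fuller version):
```python
# Sketch: streaming generation with sampling options via the `ollama` package.
import ollama

stream = ollama.generate(
    model="llama3.2",  # assumes this model is already pulled
    prompt="Summarize the benefits of local inference in two sentences.",
    stream=True,
    options={
        "temperature": 0.3,   # lower = more deterministic; raise for creative tasks
        "top_p": 0.9,
        "num_predict": 120,   # cap the number of generated tokens
    },
)
for chunk in stream:
    print(chunk["response"], end="", flush=True)
print()
```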
### Conversational Interfaces
Build chat applications with conversation history. Use cases:
- Chatbots
- Virtual assistants
- Customer support
- Interactive tutorials
Guide users to use the Chat API with message history management. Explain the importance of system prompts for behavior control.
### Embeddings & Semantic Search
Generate vector embeddings for text. Use cases:
- Semantic search
- Document similarity
- RAG systems
- Recommendation systems
Guide users to use the Embeddings API and implement cosine similarity for comparing embeddings.
### Streaming Responses
**RECOMMENDED APPROACH**: Always prefer streaming for better user experience.
Stream LLM output token-by-token. Use cases:
- Real-time chat interfaces
- Progressive content generation
- Better user experience for long outputs
- Immediate feedback to users
**When creating code for users, default to streaming API** unless they specifically request non-streaming responses.
Guide users to:
- Enable `stream: true` in API calls
- Handle async iteration (Node.js) or generators (Python)
- Display tokens as they arrive for real-time feedback
- Show progress indicators during generation
### RAG (Retrieval-Augmented Generation)
Combine document retrieval with generation. Use cases:
- Question answering over documents
- Knowledge base chatbots
- Context-aware assistance
Guide users to:
1. Generate embeddings for documents
2. Store embeddings with associated text
3. Search for relevant documents using query embeddings
4. Inject retrieved context into prompts
5. Generate answers with context
Both reference docs include complete RAG system examples.
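As a condensed illustration of these five steps, a minimal in-memory sketch using the `ollama` Python package could look like the following (no vector database; `nomic-embed-text` and `llama3.2` are assumed to be installed):
```python
# Sketch: minimal in-memory RAG with the `ollama` package.
# Assumes `nomic-embed-text` (embeddings) and `llama3.2` (generation) are installed.
import math
import ollama

documents = [
    "Ollama exposes a local HTTP API on port 11434 by default.",
    "Models are downloaded with `ollama pull <name>` and listed with `ollama list`.",
]

def embed(text):
    return ollama.embeddings(model="nomic-embed-text", prompt=text)["embedding"]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# 1-2. Embed documents and store them alongside their text
index = [(doc, embed(doc)) for doc in documents]

# 3. Retrieve the most relevant document for the query
query = "Which port does Ollama listen on?"
query_embedding = embed(query)
context, _ = max(index, key=lambda item: cosine(query_embedding, item[1]))

# 4-5. Inject the retrieved context into the prompt and generate an answer (streaming)
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
for chunk in ollama.generate(model="llama3.2", prompt=prompt, stream=True):
    print(chunk["response"], end="", flush=True)
print()
```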
## Best Practices
### Security
- Never hardcode sensitive information
- Use environment variables for configuration (see the sketch after this list)
- Validate and sanitize user inputs before sending to LLM
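For example, reading the server address from an environment variable instead of hardcoding it (here `OLLAMA_HOST`, the variable Ollama's own tooling uses, though any name works):
```python
# Sketch: configure the client from an environment variable rather than a hardcoded URL.
import os
import ollama

host = os.environ.get("OLLAMA_HOST", "http://localhost:11434")
client = ollama.Client(host=host)
models = client.list().get("models", [])
print(f"Using Ollama at {host} ({len(models)} models installed)")
```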
### Performance
- Use streaming for long responses to improve perceived performance
- Cache embeddings for documents that don't change
- Choose appropriate model sizes for your use case
- Consider response time requirements when selecting models
### Error Handling
- Always implement proper error handling for network failures
- Check model availability before making requests
- Provide helpful error messages to users
- Implement retry logic for transient failures
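For the retry point above, a simple wrapper with exponential backoff might look like this sketch (the delays and the broad `except` are placeholders; narrow them for production use):
```python
# Sketch: retry a streaming generate call with exponential backoff on transient failures.
# Note: if a stream fails midway, the retry restarts generation from the beginning.
import time
import ollama

def generate_with_retry(prompt, model="llama3.2", retries=3, base_delay=1.0):
    for attempt in range(retries):
        try:
            for chunk in ollama.generate(model=model, prompt=prompt, stream=True):
                yield chunk["response"]
            return
        except Exception as exc:  # narrow to connection/timeout errors in real code
            if attempt == retries - 1:
                raise
            delay = base_delay * (2 ** attempt)
            print(f"\nTransient error ({exc}); retrying in {delay:.0f}s...")
            time.sleep(delay)

for token in generate_with_retry("Explain retries in one sentence."):
    print(token, end="", flush=True)
print()
```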
### Connection Management
- Validate connections before proceeding with implementation
- Handle connection timeouts gracefully
- For remote Ollama instances, ensure network accessibility
- Use the validation script during development
### Model Management
- Check available disk space before pulling large models (a sketch follows this list)
- Keep only models you actively use
- Inform users about model download sizes
- Provide model selection guidance based on requirements
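For the disk-space check, a quick standard-library sketch could look like this (Ollama stores models under `~/.ollama` by default; the 5 GB threshold is an arbitrary example):
```python
# Sketch: warn if free disk space looks too small for a model download.
import os
import shutil

def enough_disk_space(required_gb: float = 5.0) -> bool:
    # Ollama keeps models under ~/.ollama by default; fall back to the home directory
    models_dir = os.path.expanduser("~/.ollama")
    path = models_dir if os.path.isdir(models_dir) else os.path.expanduser("~")
    free_gb = shutil.disk_usage(path).free / 1e9
    if free_gb < required_gb:
        print(f"Only {free_gb:.1f} GB free at {path}; the model needs roughly {required_gb} GB.")
        return False
    return True

if enough_disk_space(required_gb=5.0):
    print("Enough free space to pull the model.")
```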
### Context Management
- For chat applications, manage conversation history to avoid token limits
- Trim old messages when conversations get too long (see the sketch after this list)
- Consider using summarization for long conversation histories
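For trimming, a minimal strategy that keeps the system prompt plus the most recent messages might look like this sketch (the window size is an arbitrary example; `chat_stream` refers to the helper in `references/python_api.md`):
```python
# Sketch: keep the system prompt plus only the most recent messages before each request.
def trim_history(messages, max_messages=10):
    """Keep the first (system) message and the last `max_messages` entries."""
    if len(messages) <= max_messages + 1:
        return messages
    return [messages[0]] + messages[-max_messages:]

# Usage inside a chat loop, before calling the Chat API:
# messages = trim_history(messages)
# for token in chat_stream(messages):  # chat_stream as defined in references/python_api.md
#     print(token, end="", flush=True)
```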
## Troubleshooting
### Connection Issues
If connection fails:
1. Verify Ollama is installed: `ollama --version`
2. Check if Ollama is running: `curl http://localhost:11434/api/version`
3. Restart Ollama service: `ollama serve`
4. Check firewall settings for remote connections
5. Verify the URL format (should be `http://host:port` with no path)
### Model Not Found
If model is not available:
1. List installed models: `ollama list`
2. Pull the required model: `ollama pull model-name`
3. Verify model name spelling (case-sensitive)
### Out of Memory
If running out of memory:
1. Use a smaller model variant
2. Close other applications
3. Increase system swap space
4. Consider using a machine with more RAM
### Slow Performance
If responses are slow:
1. Use a smaller model
2. Reduce `num_predict` parameter
3. Check CPU/GPU usage
4. Ensure Ollama is using GPU if available
5. Close other resource-intensive applications
## Resources
### scripts/validate_connection.py
Python script to validate Ollama connection and list available models. Normalizes URLs, tests connectivity, displays version information, and provides troubleshooting guidance.
### references/python_api.md
Comprehensive Python API reference with examples for:
- Installation and setup
- Connection verification
- Model management (list, pull, delete)
- Generate API for text completion
- Chat API for conversations
- Streaming responses
- Embeddings and semantic search
- Complete RAG system implementation
- Error handling patterns
- Best practices
### references/nodejs_api.md
Comprehensive Node.js API reference with examples for:
- Installation using npm
- Official `ollama` package usage
- Alternative Fetch API examples
- Model management
- Generate and Chat APIs
- Streaming with async iterators
- Embeddings and semantic similarity
- Complete RAG system implementation
- Error handling and retry logic
- TypeScript support
- Best practices

skills/ollama/references/nodejs_api.md

@@ -0,0 +1,507 @@
# Ollama Node.js API Reference
This reference provides comprehensive examples for integrating Ollama into Node.js projects using the official `ollama` npm package.
**IMPORTANT**: Always use streaming responses for better user experience.
## Table of Contents
1. [Package Setup](#package-setup)
2. [Installation & Setup](#installation--setup)
3. [Verifying Ollama Connection](#verifying-ollama-connection)
4. [Model Selection](#model-selection)
5. [Generate API (Text Completion)](#generate-api-text-completion)
6. [Chat API (Conversational)](#chat-api-conversational)
7. [Embeddings](#embeddings)
8. [Error Handling](#error-handling)
9. [Best Practices](#best-practices)
10. [Complete Example Script](#complete-example-script)
## Package Setup
### ES Modules (package.json)
When creating Node.js scripts for users, always use ES modules. Create a `package.json` with:
```json
{
"type": "module",
"dependencies": {
"ollama": "^0.5.0"
}
}
```
This allows using modern `import` syntax instead of `require`.
### Running Scripts
```bash
# Install dependencies
npm install
# Run script
node script.js
```
## Installation & Setup
### Installation
```bash
npm install ollama
```
### Import
```javascript
import { Ollama } from 'ollama';
```
### Configuration
**IMPORTANT**: Always ask users for their Ollama URL. Do not assume localhost.
```javascript
import { Ollama } from 'ollama';
// Create client with custom URL
const ollama = new Ollama({ host: 'http://localhost:11434' });
// Or for remote Ollama instance
// const ollama = new Ollama({ host: 'http://192.168.1.100:11434' });
```
## Verifying Ollama Connection
### Check Connection (Development)
During development, verify Ollama is running and check available models using curl:
```bash
# Check Ollama is running and get version
curl http://localhost:11434/api/version
# List available models
curl http://localhost:11434/api/tags
```
### Check Ollama Version (Node.js)
```javascript
import { Ollama } from 'ollama';
const ollama = new Ollama();
async function checkOllama() {
try {
// Simple way to verify connection
const models = await ollama.list();
console.log('✓ Connected to Ollama');
console.log(` Available models: ${models.models.length}`);
return true;
} catch (error) {
console.log(`✗ Failed to connect to Ollama: ${error.message}`);
return false;
}
}
// Usage
await checkOllama();
```
## Model Selection
**IMPORTANT**: Always ask users which model they want to use. Don't assume a default.
### Listing Available Models
```javascript
import { Ollama } from 'ollama';
const ollama = new Ollama();
async function listAvailableModels() {
const { models } = await ollama.list();
return models.map(m => m.name);
}
// Usage - show available models to user
const available = await listAvailableModels();
console.log('Available models:');
available.forEach(model => {
console.log(` - ${model}`);
});
```
### Finding Models
If the user doesn't have a model installed or wants to use a different one:
- **Browse models**: Direct them to https://ollama.com/search
- **Popular choices**: llama3.2, llama3.1, mistral, phi3, qwen2.5
- **Specialized models**: codellama (coding), llava (vision), nomic-embed-text (embeddings)
### Model Selection Flow
```javascript
async function selectModel() {
const available = await listAvailableModels();
if (available.length === 0) {
console.log('No models installed!');
console.log('Visit https://ollama.com/search to find models');
console.log('Then run: ollama pull <model-name>');
return null;
}
console.log('Available models:');
available.forEach((model, i) => {
console.log(` ${i + 1}. ${model}`);
});
// In practice, you'd ask the user to choose
return available[0]; // Default to first available
}
```
## Generate API (Text Completion)
### Streaming Text Generation
```javascript
import { Ollama } from 'ollama';
const ollama = new Ollama();
async function generateStream(prompt, model = 'llama3.2') {
const response = await ollama.generate({
model: model,
prompt: prompt,
stream: true
});
for await (const chunk of response) {
process.stdout.write(chunk.response);
}
}
// Usage
process.stdout.write('Response: ');
await generateStream('Why is the sky blue?', 'llama3.2');
process.stdout.write('\n');
```
### With Options (Temperature, Top-P, etc.)
```javascript
async function generateWithOptions(prompt, model = 'llama3.2') {
const response = await ollama.generate({
model: model,
prompt: prompt,
stream: true,
options: {
temperature: 0.7,
top_p: 0.9,
top_k: 40,
num_predict: 100 // Max tokens
}
});
for await (const chunk of response) {
process.stdout.write(chunk.response);
}
}
// Usage
process.stdout.write('Response: ');
await generateWithOptions('Write a haiku about programming');
process.stdout.write('\n');
```
## Chat API (Conversational)
### Streaming Chat
```javascript
import { Ollama } from 'ollama';
const ollama = new Ollama();
async function chatStream(messages, model = 'llama3.2') {
/*
* Chat with a model using conversation history with streaming.
*
* Args:
* messages: Array of message objects with 'role' and 'content'
* role can be 'system', 'user', or 'assistant'
*/
const response = await ollama.chat({
model: model,
messages: messages,
stream: true
});
for await (const chunk of response) {
process.stdout.write(chunk.message.content);
}
}
// Usage
const messages = [
{ role: 'system', content: 'You are a helpful assistant.' },
{ role: 'user', content: 'What is the capital of France?' }
];
process.stdout.write('Response: ');
await chatStream(messages);
process.stdout.write('\n');
```
### Multi-turn Conversation
```javascript
import * as readline from 'readline';
import { Ollama } from 'ollama';

const ollama = new Ollama();
async function conversationLoop(model = 'llama3.2') {
const rl = readline.createInterface({
input: process.stdin,
output: process.stdout
});
const messages = [
{ role: 'system', content: 'You are a helpful assistant.' }
];
const askQuestion = () => {
rl.question('\nYou: ', async (input) => {
if (input.toLowerCase() === 'exit' || input.toLowerCase() === 'quit') {
rl.close();
return;
}
// Add user message
messages.push({ role: 'user', content: input });
// Stream response
process.stdout.write('Assistant: ');
let fullResponse = '';
const response = await ollama.chat({
model: model,
messages: messages,
stream: true
});
for await (const chunk of response) {
const content = chunk.message.content;
process.stdout.write(content);
fullResponse += content;
}
process.stdout.write('\n');
// Add assistant response to history
messages.push({ role: 'assistant', content: fullResponse });
askQuestion();
});
};
askQuestion();
}
// Usage
await conversationLoop();
```
## Embeddings
### Generate Embeddings
```javascript
import { Ollama } from 'ollama';
const ollama = new Ollama();
async function getEmbeddings(text, model = 'nomic-embed-text') {
/*
* Generate embeddings for text.
*
* Note: Use an embedding-specific model like 'nomic-embed-text'
* Regular models can generate embeddings, but dedicated models work better.
*/
const response = await ollama.embeddings({
model: model,
prompt: text
});
return response.embedding;
}
// Usage
const embedding = await getEmbeddings('Hello, world!');
console.log(`Embedding dimension: ${embedding.length}`);
console.log(`First 5 values: ${embedding.slice(0, 5)}`);
```
### Semantic Similarity
```javascript
function cosineSimilarity(vec1, vec2) {
const dotProduct = vec1.reduce((sum, val, i) => sum + val * vec2[i], 0);
const magnitude1 = Math.sqrt(vec1.reduce((sum, val) => sum + val * val, 0));
const magnitude2 = Math.sqrt(vec2.reduce((sum, val) => sum + val * val, 0));
return dotProduct / (magnitude1 * magnitude2);
}
// Usage
const text1 = 'The cat sat on the mat';
const text2 = 'A feline rested on a rug';
const text3 = 'JavaScript is a programming language';
const emb1 = await getEmbeddings(text1);
const emb2 = await getEmbeddings(text2);
const emb3 = await getEmbeddings(text3);
console.log(`Similarity 1-2: ${cosineSimilarity(emb1, emb2).toFixed(3)}`); // High
console.log(`Similarity 1-3: ${cosineSimilarity(emb1, emb3).toFixed(3)}`); // Low
```
## Error Handling
### Comprehensive Error Handling
```javascript
import { Ollama } from 'ollama';
const ollama = new Ollama();
async function* safeGenerateStream(prompt, model = 'llama3.2') {
try {
const response = await ollama.generate({
model: model,
prompt: prompt,
stream: true
});
for await (const chunk of response) {
yield chunk.response;
}
} catch (error) {
// Model not found or other API errors
if (error.message.toLowerCase().includes('not found')) {
console.log(`\n✗ Model '${model}' not found`);
console.log(` Run: ollama pull ${model}`);
console.log(` Or browse models at: https://ollama.com/search`);
} else if (error.code === 'ECONNREFUSED') {
console.log('\n✗ Connection failed. Is Ollama running?');
console.log(' Start Ollama with: ollama serve');
} else {
console.log(`\n✗ Unexpected error: ${error.message}`);
}
}
}
// Usage
process.stdout.write('Response: ');
for await (const token of safeGenerateStream('Hello, world!', 'llama3.2')) {
process.stdout.write(token);
}
process.stdout.write('\n');
```
### Checking Model Availability
```javascript
async function ensureModelAvailable(model) {
try {
const { models } = await ollama.list();
const modelNames = models.map(m => m.name);
if (!modelNames.includes(model)) {
console.log(`Model '${model}' not available locally`);
console.log(`Available models: ${modelNames.join(', ')}`);
console.log(`\nTo download: ollama pull ${model}`);
console.log(`Browse models: https://ollama.com/search`);
return false;
}
return true;
} catch (error) {
console.log(`Failed to check models: ${error.message}`);
return false;
}
}
// Usage
if (await ensureModelAvailable('llama3.2')) {
// Proceed with using the model
}
```
## Best Practices
1. **Always Use Streaming**: Stream responses for better user experience
2. **Ask About Models**: Don't assume models - ask users which model they want to use
3. **Verify Connection**: Check Ollama connection during development with curl
4. **Error Handling**: Handle model not found and connection errors gracefully
5. **Context Management**: Manage conversation history to avoid token limits
6. **Model Selection**: Direct users to https://ollama.com/search to find models
7. **Custom Hosts**: Always ask users for their Ollama URL, don't assume localhost
8. **ES Modules**: Use `"type": "module"` in package.json for modern import syntax
## Complete Example Script
```javascript
// script.js
import { Ollama } from 'ollama';
const ollama = new Ollama();
async function main() {
const model = 'llama3.2';
// Check connection
try {
await ollama.list();
} catch (error) {
console.log(`Error: Cannot connect to Ollama - ${error.message}`);
console.log('Make sure Ollama is running: ollama serve');
return;
}
// Stream a response
console.log('Asking about JavaScript...\n');
const response = await ollama.generate({
model: model,
prompt: 'Explain JavaScript in one sentence',
stream: true
});
process.stdout.write('Response: ');
for await (const chunk of response) {
process.stdout.write(chunk.response);
}
process.stdout.write('\n');
}
main();
```
### package.json
```json
{
"type": "module",
"dependencies": {
"ollama": "^0.5.0"
}
}
```
### Running
```bash
npm install
node script.js
```

skills/ollama/references/python_api.md

@@ -0,0 +1,454 @@
# Ollama Python API Reference
This reference provides comprehensive examples for integrating Ollama into Python projects using the official `ollama` Python library.
**IMPORTANT**: Always use streaming responses for better user experience.
## Table of Contents
1. [Installation & Setup](#installation--setup)
2. [Verifying Ollama Connection](#verifying-ollama-connection)
3. [Model Selection](#model-selection)
4. [Generate API (Text Completion)](#generate-api-text-completion)
5. [Chat API (Conversational)](#chat-api-conversational)
6. [Embeddings](#embeddings)
7. [Error Handling](#error-handling)
8. [Best Practices](#best-practices)
9. [PEP 723 Inline Script Metadata](#pep-723-inline-script-metadata)
## Installation & Setup
### Installation
```bash
pip install ollama
```
### Import
```python
import ollama
```
### Configuration
**IMPORTANT**: Always ask users for their Ollama URL. Do not assume localhost.
```python
# Create client with custom URL
client = ollama.Client(host='http://localhost:11434')
# Or for remote Ollama instance
# client = ollama.Client(host='http://192.168.1.100:11434')
```
## Verifying Ollama Connection
### Check Connection (Development)
During development, verify Ollama is running and check available models using curl:
```bash
# Check Ollama is running and get version
curl http://localhost:11434/api/version
# List available models
curl http://localhost:11434/api/tags
```
### Check Ollama Version (Python)
```python
import ollama
def check_ollama():
"""Check if Ollama is running."""
try:
# Simple way to verify connection
models = ollama.list()
print(f"✓ Connected to Ollama")
print(f" Available models: {len(models.get('models', []))}")
return True
except Exception as e:
print(f"✗ Failed to connect to Ollama: {e}")
return False
# Usage
check_ollama()
```
## Model Selection
**IMPORTANT**: Always ask users which model they want to use. Don't assume a default.
### Listing Available Models
```python
import ollama
def list_available_models():
"""List all locally installed models."""
models = ollama.list()
return [model['name'] for model in models.get('models', [])]
# Usage - show available models to user
available = list_available_models()
print("Available models:")
for model in available:
print(f" - {model}")
```
### Finding Models
If the user doesn't have a model installed or wants to use a different one:
- **Browse models**: Direct them to https://ollama.com/search
- **Popular choices**: llama3.2, llama3.1, mistral, phi3, qwen2.5
- **Specialized models**: codellama (coding), llava (vision), nomic-embed-text (embeddings)
### Model Selection Flow
```python
def select_model():
"""Interactive model selection."""
available = list_available_models()
if not available:
print("No models installed!")
print("Visit https://ollama.com/search to find models")
print("Then run: ollama pull <model-name>")
return None
print("Available models:")
for i, model in enumerate(available, 1):
print(f" {i}. {model}")
# In practice, you'd ask the user to choose
return available[0] # Default to first available
```
## Generate API (Text Completion)
### Streaming Text Generation
```python
import ollama
def generate_stream(prompt, model="llama3.2"):
"""Generate text with streaming (yields tokens as they arrive)."""
stream = ollama.generate(
model=model,
prompt=prompt,
stream=True
)
for chunk in stream:
yield chunk['response']
# Usage
print("Response: ", end="", flush=True)
for token in generate_stream("Why is the sky blue?", model="llama3.2"):
print(token, end="", flush=True)
print()
```
### With Options (Temperature, Top-P, etc.)
```python
def generate_with_options(prompt, model="llama3.2"):
"""Generate with custom sampling parameters."""
stream = ollama.generate(
model=model,
prompt=prompt,
stream=True,
options={
'temperature': 0.7,
'top_p': 0.9,
'top_k': 40,
'num_predict': 100 # Max tokens
}
)
for chunk in stream:
yield chunk['response']
# Usage
print("Response: ", end="", flush=True)
for token in generate_with_options("Write a haiku about programming"):
print(token, end="", flush=True)
print()
```
## Chat API (Conversational)
### Streaming Chat
```python
import ollama
def chat_stream(messages, model="llama3.2"):
"""
Chat with a model using conversation history with streaming.
Args:
messages: List of message dicts with 'role' and 'content'
role can be 'system', 'user', or 'assistant'
"""
stream = ollama.chat(
model=model,
messages=messages,
stream=True
)
for chunk in stream:
yield chunk['message']['content']
# Usage
messages = [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "What is the capital of France?"}
]
print("Response: ", end="", flush=True)
for token in chat_stream(messages):
print(token, end="", flush=True)
print()
```
### Multi-turn Conversation
```python
def conversation_loop(model="llama3.2"):
"""Interactive chat loop with streaming responses."""
messages = [
{"role": "system", "content": "You are a helpful assistant."}
]
while True:
user_input = input("\nYou: ")
if user_input.lower() in ['exit', 'quit']:
break
# Add user message
messages.append({"role": "user", "content": user_input})
# Stream response
print("Assistant: ", end="", flush=True)
full_response = ""
for token in chat_stream(messages, model):
print(token, end="", flush=True)
full_response += token
print()
# Add assistant response to history
messages.append({"role": "assistant", "content": full_response})
# Usage
conversation_loop()
```
## Embeddings
### Generate Embeddings
```python
import ollama
def get_embeddings(text, model="nomic-embed-text"):
"""
Generate embeddings for text.
Note: Use an embedding-specific model like 'nomic-embed-text'
Regular models can generate embeddings, but dedicated models work better.
"""
response = ollama.embeddings(
model=model,
prompt=text
)
return response['embedding']
# Usage
embedding = get_embeddings("Hello, world!")
print(f"Embedding dimension: {len(embedding)}")
print(f"First 5 values: {embedding[:5]}")
```
### Semantic Similarity
```python
import math
def cosine_similarity(vec1, vec2):
"""Calculate cosine similarity between two vectors."""
dot_product = sum(a * b for a, b in zip(vec1, vec2))
magnitude1 = math.sqrt(sum(a * a for a in vec1))
magnitude2 = math.sqrt(sum(b * b for b in vec2))
return dot_product / (magnitude1 * magnitude2)
# Usage
text1 = "The cat sat on the mat"
text2 = "A feline rested on a rug"
text3 = "Python is a programming language"
emb1 = get_embeddings(text1)
emb2 = get_embeddings(text2)
emb3 = get_embeddings(text3)
print(f"Similarity 1-2: {cosine_similarity(emb1, emb2):.3f}") # High
print(f"Similarity 1-3: {cosine_similarity(emb1, emb3):.3f}") # Low
```
## Error Handling
### Comprehensive Error Handling
```python
import ollama
def safe_generate_stream(prompt, model="llama3.2"):
"""Generate with comprehensive error handling."""
try:
stream = ollama.generate(
model=model,
prompt=prompt,
stream=True
)
for chunk in stream:
yield chunk['response']
except ollama.ResponseError as e:
# Model not found or other API errors
if "not found" in str(e).lower():
print(f"\n✗ Model '{model}' not found")
print(f" Run: ollama pull {model}")
print(f" Or browse models at: https://ollama.com/search")
else:
print(f"\n✗ API Error: {e}")
except ConnectionError:
print("\n✗ Connection failed. Is Ollama running?")
print(" Start Ollama with: ollama serve")
except Exception as e:
print(f"\n✗ Unexpected error: {e}")
# Usage
print("Response: ", end="", flush=True)
for token in safe_generate_stream("Hello, world!", model="llama3.2"):
print(token, end="", flush=True)
print()
```
### Checking Model Availability
```python
def ensure_model_available(model):
"""Check if model is available, provide guidance if not."""
try:
available = ollama.list()
model_names = [m['name'] for m in available.get('models', [])]
if model not in model_names:
print(f"Model '{model}' not available locally")
print(f"Available models: {', '.join(model_names)}")
print(f"\nTo download: ollama pull {model}")
print(f"Browse models: https://ollama.com/search")
return False
return True
except Exception as e:
print(f"Failed to check models: {e}")
return False
# Usage
if ensure_model_available("llama3.2"):
# Proceed with using the model
pass
```
## Best Practices
1. **Always Use Streaming**: Stream responses for better user experience
2. **Ask About Models**: Don't assume models - ask users which model they want to use
3. **Verify Connection**: Check Ollama connection during development with curl
4. **Error Handling**: Handle model not found and connection errors gracefully
5. **Context Management**: Manage conversation history to avoid token limits
6. **Model Selection**: Direct users to https://ollama.com/search to find models
7. **Custom Hosts**: Always ask users for their Ollama URL, don't assume localhost
## PEP 723 Inline Script Metadata
When creating standalone Python scripts for users, always include inline script metadata at the top of the file using PEP 723 format. This allows tools like `uv` and `pipx` to automatically manage dependencies.
### Format
```python
# /// script
# requires-python = ">=3.8"
# dependencies = [
# "ollama>=0.1.0",
# ]
# ///
import ollama
# Your code here
```
### Running Scripts
Users can run scripts with PEP 723 metadata using:
```bash
# Using uv (recommended)
uv run script.py
# Using pipx
pipx run script.py
# Traditional approach
pip install ollama
python script.py
```
### Complete Example Script
```python
# /// script
# requires-python = ">=3.8"
# dependencies = [
# "ollama>=0.1.0",
# ]
# ///
import ollama
def main():
"""Simple streaming chat example."""
model = "llama3.2"
# Check connection
try:
ollama.list()
except Exception as e:
print(f"Error: Cannot connect to Ollama - {e}")
print("Make sure Ollama is running: ollama serve")
return
# Stream a response
print("Asking about Python...\n")
stream = ollama.generate(
model=model,
prompt="Explain Python in one sentence",
stream=True
)
print("Response: ", end="", flush=True)
for chunk in stream:
print(chunk['response'], end="", flush=True)
print()
if __name__ == "__main__":
main()
```