
Ollama Python API Reference

This reference provides comprehensive examples for integrating Ollama into Python projects using the official ollama Python library.

IMPORTANT: Always use streaming responses for better user experience.

Table of Contents

  1. Installation & Setup
  2. Verifying Ollama Connection
  3. Model Selection
  4. Generate API (Text Completion)
  5. Chat API (Conversational)
  6. Embeddings
  7. Error Handling
  8. Best Practices
  9. PEP 723 Inline Script Metadata

Installation & Setup

Installation

pip install ollama

Import

import ollama

Configuration

IMPORTANT: Always ask users for their Ollama URL. Do not assume localhost.

# Create client with custom URL
client = ollama.Client(host='http://localhost:11434')

# Or for remote Ollama instance
# client = ollama.Client(host='http://192.168.1.100:11434')
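
The client exposes the same methods as the module-level functions used throughout this reference (generate, chat, list, embeddings), so every call can go through a user-supplied host. A minimal sketch, assuming the URL comes from an OLLAMA_HOST environment variable or an interactive prompt (both illustrative conventions, not requirements of the library):

import os
import ollama

# Minimal sketch: take the Ollama URL from the user (or the OLLAMA_HOST
# environment variable) rather than hard-coding localhost.
host = os.environ.get("OLLAMA_HOST")
if not host:
    host = input("Ollama URL [http://localhost:11434]: ").strip() or "http://localhost:11434"

client = ollama.Client(host=host)

# Client methods mirror the module-level API (chat, generate, list, embeddings)
stream = client.chat(
    model="llama3.2",
    messages=[{"role": "user", "content": "Hello!"}],
    stream=True,
)
for chunk in stream:
    print(chunk["message"]["content"], end="", flush=True)
print()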

Verifying Ollama Connection

Check Connection (Development)

During development, verify Ollama is running and check available models using curl:

# Check Ollama is running and get version
curl http://localhost:11434/api/version

# List available models
curl http://localhost:11434/api/tags

Check Ollama Version (Python)

import ollama

def check_ollama():
    """Check if Ollama is running."""
    try:
        # Simple way to verify connection
        models = ollama.list()
        print(f"✓ Connected to Ollama")
        print(f"  Available models: {len(models.get('models', []))}")
        return True
    except Exception as e:
        print(f"✗ Failed to connect to Ollama: {e}")
        return False

# Usage
check_ollama()

Model Selection

IMPORTANT: Always ask users which model they want to use. Don't assume a default.

Listing Available Models

import ollama

def list_available_models():
    """List all locally installed models."""
    models = ollama.list()
    # Note: depending on the ollama library version, the model name may be
    # exposed under the 'model' key instead of 'name'.
    return [model['name'] for model in models.get('models', [])]

# Usage - show available models to user
available = list_available_models()
print("Available models:")
for model in available:
    print(f"  - {model}")

Finding Models

If the user doesn't have a model installed or wants to use a different one (a pull sketch follows the list):

  • Browse models: Direct them to https://ollama.com/search
  • Popular choices: llama3.2, llama3.1, mistral, phi3, qwen2.5
  • Specialized models: codellama (coding), llava (vision), nomic-embed-text (embeddings)
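
Models can also be downloaded from Python via ollama.pull. A minimal sketch that streams download progress; the exact progress fields vary between chunks, so they are read defensively here:

import ollama

def pull_model(model):
    """Download a model, streaming progress as it arrives."""
    for progress in ollama.pull(model, stream=True):
        # 'status' is present on every chunk; 'completed'/'total' only appear
        # while layers are downloading, so they are accessed with .get()
        status = progress.get('status', '')
        completed = progress.get('completed')
        total = progress.get('total')
        if completed and total:
            print(f"\r{status}: {completed / total:.0%}   ", end="", flush=True)
        else:
            print(f"\r{status}   ", end="", flush=True)
    print()

# Usage
# pull_model("llama3.2")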

Model Selection Flow

def select_model():
    """Interactive model selection."""
    available = list_available_models()

    if not available:
        print("No models installed!")
        print("Visit https://ollama.com/search to find models")
        print("Then run: ollama pull <model-name>")
        return None

    print("Available models:")
    for i, model in enumerate(available, 1):
        print(f"  {i}. {model}")

    # In practice, you'd ask the user to choose
    return available[0]  # Default to first available
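
The function above defers the actual prompt to the caller; a minimal interactive sketch (illustrative only, using input()) could look like this:

# Usage sketch: prompt the user for their choice
available = list_available_models()
if available:
    for i, name in enumerate(available, 1):
        print(f"  {i}. {name}")
    choice = input(f"Choose a model [1-{len(available)}]: ").strip()
    if choice.isdigit() and 1 <= int(choice) <= len(available):
        model = available[int(choice) - 1]
    else:
        model = available[0]  # Fall back to the first model on invalid input
    print(f"Using model: {model}")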

Generate API (Text Completion)

Streaming Text Generation

import ollama

def generate_stream(prompt, model="llama3.2"):
    """Generate text with streaming (yields tokens as they arrive)."""
    stream = ollama.generate(
        model=model,
        prompt=prompt,
        stream=True
    )

    for chunk in stream:
        yield chunk['response']

# Usage
print("Response: ", end="", flush=True)
for token in generate_stream("Why is the sky blue?", model="llama3.2"):
    print(token, end="", flush=True)
print()

With Options (Temperature, Top-P, etc.)

def generate_with_options(prompt, model="llama3.2"):
    """Generate with custom sampling parameters."""
    stream = ollama.generate(
        model=model,
        prompt=prompt,
        stream=True,
        options={
            'temperature': 0.7,
            'top_p': 0.9,
            'top_k': 40,
            'num_predict': 100  # Max tokens
        }
    )

    for chunk in stream:
        yield chunk['response']

# Usage
print("Response: ", end="", flush=True)
for token in generate_with_options("Write a haiku about programming"):
    print(token, end="", flush=True)
print()

Chat API (Conversational)

Streaming Chat

import ollama

def chat_stream(messages, model="llama3.2"):
    """
    Chat with a model using conversation history with streaming.

    Args:
        messages: List of message dicts with 'role' and 'content'
                 role can be 'system', 'user', or 'assistant'
    """
    stream = ollama.chat(
        model=model,
        messages=messages,
        stream=True
    )

    for chunk in stream:
        yield chunk['message']['content']

# Usage
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the capital of France?"}
]

print("Response: ", end="", flush=True)
for token in chat_stream(messages):
    print(token, end="", flush=True)
print()

Multi-turn Conversation

def conversation_loop(model="llama3.2"):
    """Interactive chat loop with streaming responses."""
    messages = [
        {"role": "system", "content": "You are a helpful assistant."}
    ]

    while True:
        user_input = input("\nYou: ")
        if user_input.lower() in ['exit', 'quit']:
            break

        # Add user message
        messages.append({"role": "user", "content": user_input})

        # Stream response
        print("Assistant: ", end="", flush=True)
        full_response = ""
        for token in chat_stream(messages, model):
            print(token, end="", flush=True)
            full_response += token
        print()

        # Add assistant response to history
        messages.append({"role": "assistant", "content": full_response})

# Usage
conversation_loop()

Embeddings

Generate Embeddings

import ollama

def get_embeddings(text, model="nomic-embed-text"):
    """
    Generate embeddings for text.

    Note: Use an embedding-specific model like 'nomic-embed-text'.
    Regular models can generate embeddings, but dedicated models work better.
    """
    response = ollama.embeddings(
        model=model,
        prompt=text
    )
    return response['embedding']

# Usage
embedding = get_embeddings("Hello, world!")
print(f"Embedding dimension: {len(embedding)}")
print(f"First 5 values: {embedding[:5]}")

Semantic Similarity

import math

def cosine_similarity(vec1, vec2):
    """Calculate cosine similarity between two vectors."""
    dot_product = sum(a * b for a, b in zip(vec1, vec2))
    magnitude1 = math.sqrt(sum(a * a for a in vec1))
    magnitude2 = math.sqrt(sum(b * b for b in vec2))
    return dot_product / (magnitude1 * magnitude2)

# Usage
text1 = "The cat sat on the mat"
text2 = "A feline rested on a rug"
text3 = "Python is a programming language"

emb1 = get_embeddings(text1)
emb2 = get_embeddings(text2)
emb3 = get_embeddings(text3)

print(f"Similarity 1-2: {cosine_similarity(emb1, emb2):.3f}")  # High
print(f"Similarity 1-3: {cosine_similarity(emb1, emb3):.3f}")  # Low

Error Handling

Comprehensive Error Handling

import ollama

def safe_generate_stream(prompt, model="llama3.2"):
    """Generate with comprehensive error handling."""
    try:
        stream = ollama.generate(
            model=model,
            prompt=prompt,
            stream=True
        )

        for chunk in stream:
            yield chunk['response']

    except ollama.ResponseError as e:
        # Model not found or other API errors
        if "not found" in str(e).lower():
            print(f"\n✗ Model '{model}' not found")
            print(f"  Run: ollama pull {model}")
            print(f"  Or browse models at: https://ollama.com/search")
        else:
            print(f"\n✗ API Error: {e}")

    except ConnectionError:
        # Note: depending on the library version, connection failures may be
        # raised by the underlying HTTP client and land in the generic
        # handler below instead.
        print("\n✗ Connection failed. Is Ollama running?")
        print("  Start Ollama with: ollama serve")

    except Exception as e:
        print(f"\n✗ Unexpected error: {e}")

# Usage
print("Response: ", end="", flush=True)
for token in safe_generate_stream("Hello, world!", model="llama3.2"):
    print(token, end="", flush=True)
print()

Checking Model Availability

def ensure_model_available(model):
    """Check if model is available, provide guidance if not."""
    try:
        available = ollama.list()
        model_names = [m['name'] for m in available.get('models', [])]

        # Installed models are usually listed with a tag (e.g. "llama3.2:latest"),
        # so accept either an exact name or the implicit ":latest" tag.
        if model not in model_names and f"{model}:latest" not in model_names:
            print(f"Model '{model}' not available locally")
            print(f"Available models: {', '.join(model_names)}")
            print(f"\nTo download: ollama pull {model}")
            print(f"Browse models: https://ollama.com/search")
            return False

        return True

    except Exception as e:
        print(f"Failed to check models: {e}")
        return False

# Usage
if ensure_model_available("llama3.2"):
    # Proceed with using the model
    pass

Best Practices

  1. Always Use Streaming: Stream responses for better user experience
  2. Ask About Models: Don't assume models - ask users which model they want to use
  3. Verify Connection: Check Ollama connection during development with curl
  4. Error Handling: Handle model not found and connection errors gracefully
  5. Context Management: Manage conversation history to avoid token limits (see the sketch after this list)
  6. Model Selection: Direct users to https://ollama.com/search to find models
  7. Custom Hosts: Always ask users for their Ollama URL, don't assume localhost
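
For point 5, a simple approach is to cap how many messages are kept while always preserving the system prompt. A minimal sketch (the cap of 20 messages is an arbitrary illustration; token-based trimming is also common):

def trim_history(messages, max_messages=20):
    """Keep the system prompt plus the most recent messages.

    A simple sketch; real applications may count tokens instead of messages.
    """
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    return system + rest[-max_messages:]

# Usage: trim before each request in a chat loop
# messages = trim_history(messages)
# for token in chat_stream(messages, model):
#     ...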

PEP 723 Inline Script Metadata

When creating standalone Python scripts for users, always include inline script metadata at the top of the file using PEP 723 format. This allows tools like uv and pipx to automatically manage dependencies.

Format

# /// script
# requires-python = ">=3.8"
# dependencies = [
#   "ollama>=0.1.0",
# ]
# ///

import ollama

# Your code here

Running Scripts

Users can run scripts with PEP 723 metadata using:

# Using uv (recommended)
uv run script.py

# Using pipx
pipx run script.py

# Traditional approach
pip install ollama
python script.py

Complete Example Script

# /// script
# requires-python = ">=3.8"
# dependencies = [
#   "ollama>=0.1.0",
# ]
# ///

import ollama

def main():
    """Simple streaming chat example."""
    model = "llama3.2"

    # Check connection
    try:
        ollama.list()
    except Exception as e:
        print(f"Error: Cannot connect to Ollama - {e}")
        print("Make sure Ollama is running: ollama serve")
        return

    # Stream a response
    print("Asking about Python...\n")
    stream = ollama.generate(
        model=model,
        prompt="Explain Python in one sentence",
        stream=True
    )

    print("Response: ", end="", flush=True)
    for chunk in stream:
        print(chunk['response'], end="", flush=True)
    print()

if __name__ == "__main__":
    main()