# Model Loading and Management

## Overview

The transformers library provides flexible model loading with automatic architecture detection, device management, and configuration control.

## Loading Models

### AutoModel Classes

Use AutoModel classes for automatic architecture selection:
```python
from transformers import AutoModel, AutoModelForSequenceClassification, AutoModelForCausalLM

# Base model (no task head)
model = AutoModel.from_pretrained("bert-base-uncased")

# Sequence classification
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased")

# Causal language modeling (GPT-style)
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Masked language modeling (BERT-style)
from transformers import AutoModelForMaskedLM
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

# Sequence-to-sequence (T5-style)
from transformers import AutoModelForSeq2SeqLM
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")
```
### Common AutoModel Classes

**NLP Tasks:**
- `AutoModelForSequenceClassification`: Text classification, sentiment analysis
- `AutoModelForTokenClassification`: NER, POS tagging
- `AutoModelForQuestionAnswering`: Extractive QA
- `AutoModelForCausalLM`: Text generation (GPT, Llama)
- `AutoModelForMaskedLM`: Masked language modeling (BERT)
- `AutoModelForSeq2SeqLM`: Translation, summarization (T5, BART)

**Vision Tasks:**
- `AutoModelForImageClassification`: Image classification
- `AutoModelForObjectDetection`: Object detection
- `AutoModelForImageSegmentation`: Image segmentation

**Audio Tasks:**
- `AutoModelForAudioClassification`: Audio classification
- `AutoModelForSpeechSeq2Seq`: Speech recognition

**Multimodal:**
- `AutoModelForVision2Seq`: Image captioning, VQA

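Tasks from the lists above that are not shown earlier follow the same loading pattern; only the preprocessing class changes. A minimal sketch for image classification (the ViT checkpoint name is an illustrative choice):

```python
from transformers import AutoImageProcessor, AutoModelForImageClassification

# Vision checkpoints pair with an image processor instead of a tokenizer
processor = AutoImageProcessor.from_pretrained("google/vit-base-patch16-224")
model = AutoModelForImageClassification.from_pretrained("google/vit-base-patch16-224")
```
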
## Loading Parameters

### Basic Parameters

**pretrained_model_name_or_path**: Model identifier or local path
```python
model = AutoModel.from_pretrained("bert-base-uncased")   # From Hub
model = AutoModel.from_pretrained("./local/model/path")  # From disk
```

**num_labels**: Number of output labels for classification
```python
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=3
)
```
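When setting `num_labels`, it can also help to attach human-readable label names; they are stored in the model config and picked up by pipelines. A sketch with illustrative label names:

```python
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=3,
    # Example mappings; replace with your task's classes
    id2label={0: "negative", 1: "neutral", 2: "positive"},
    label2id={"negative": 0, "neutral": 1, "positive": 2},
)
```
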
**cache_dir**: Custom cache location
```python
model = AutoModel.from_pretrained("model-id", cache_dir="./my_cache")
```
### Device Management

**device_map**: Automatic device allocation for large models
```python
# Automatically distribute across GPUs and CPU
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    device_map="auto"
)

# Sequential placement
model = AutoModelForCausalLM.from_pretrained(
    "model-id",
    device_map="sequential"
)

# Custom device map
device_map = {
    "transformer.layers.0": 0,      # GPU 0
    "transformer.layers.1": 1,      # GPU 1
    "transformer.layers.2": "cpu",  # CPU
}
model = AutoModel.from_pretrained("model-id", device_map=device_map)
```
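After loading with a `device_map`, the resulting placement can be inspected: models dispatched this way expose it as `hf_device_map`. A small check, assuming the model was loaded with `device_map="auto"` as above:

```python
# Prints which device each module group ended up on,
# e.g. {"model.embed_tokens": 0, "model.layers.0": 0, ..., "lm_head": "cpu"}
print(model.hf_device_map)
```
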
Manual device placement:
```python
import torch

model = AutoModel.from_pretrained("model-id")
model.to("cuda:0")  # Move to GPU 0
model.to(torch.device("cuda" if torch.cuda.is_available() else "cpu"))
```
### Precision Control

**torch_dtype**: Set model precision
```python
import torch

# Float16 (half precision)
model = AutoModel.from_pretrained("model-id", torch_dtype=torch.float16)

# BFloat16 (wider exponent range than float16)
model = AutoModel.from_pretrained("model-id", torch_dtype=torch.bfloat16)

# Auto (use the dtype the checkpoint was saved in)
model = AutoModel.from_pretrained("model-id", torch_dtype="auto")
```
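A quick way to confirm the precision that was actually applied (a minimal check, not specific to any model):

```python
print(model.dtype)                     # e.g. torch.float16
print(next(model.parameters()).dtype)  # dtype of an individual weight tensor
```
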
### Attention Implementation

**attn_implementation**: Choose attention mechanism
```python
# Scaled Dot Product Attention (PyTorch 2.0+; used by default when available)
model = AutoModel.from_pretrained("model-id", attn_implementation="sdpa")

# Flash Attention 2 (requires the flash-attn package)
model = AutoModel.from_pretrained("model-id", attn_implementation="flash_attention_2")

# Eager (plain PyTorch implementation, most compatible)
model = AutoModel.from_pretrained("model-id", attn_implementation="eager")
```
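Since Flash Attention 2 is an optional dependency, one pattern is to select it only when the package is importable. A small sketch (assumes the installed package exposes the `flash_attn` module, which is its import name):

```python
import importlib.util

from transformers import AutoModel

# Prefer Flash Attention 2 if flash-attn is installed, otherwise fall back to SDPA
attn = "flash_attention_2" if importlib.util.find_spec("flash_attn") else "sdpa"
model = AutoModel.from_pretrained("model-id", attn_implementation=attn)
```
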
### Memory Optimization

**low_cpu_mem_usage**: Reduce CPU memory during loading
```python
model = AutoModelForCausalLM.from_pretrained(
    "large-model-id",
    low_cpu_mem_usage=True,
    device_map="auto"
)
```

**load_in_8bit**: 8-bit quantization (requires bitsandbytes)
```python
model = AutoModelForCausalLM.from_pretrained(
    "model-id",
    load_in_8bit=True,
    device_map="auto"
)
```

**load_in_4bit**: 4-bit quantization
```python
import torch
from transformers import BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16
)

model = AutoModelForCausalLM.from_pretrained(
    "model-id",
    quantization_config=quantization_config,
    device_map="auto"
)
```
## Model Configuration

### Loading with Custom Config

```python
from transformers import AutoConfig, AutoModel

# Load and modify config
config = AutoConfig.from_pretrained("bert-base-uncased")
config.hidden_dropout_prob = 0.2
config.attention_probs_dropout_prob = 0.2

# Initialize model with custom config
model = AutoModel.from_pretrained("bert-base-uncased", config=config)
```

### Initializing from Config Only

```python
config = AutoConfig.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_config(config)  # Random weights
```
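Because `from_config` initializes random weights, it is also a convenient way to build a scaled-down architecture for tests or training from scratch. A minimal sketch; the overridden fields (`n_layer`, `n_head`, `n_embd`) follow GPT-2's config and differ for other architectures:

```python
from transformers import AutoConfig, AutoModelForCausalLM

# Shrink GPT-2 to a tiny model for quick experiments (illustrative sizes)
config = AutoConfig.from_pretrained("gpt2", n_layer=2, n_head=4, n_embd=128)
model = AutoModelForCausalLM.from_config(config)
print(f"{model.num_parameters():,} parameters")
```
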
## Model Modes

### Training vs Evaluation Mode

Models load in evaluation mode by default:

```python
model = AutoModel.from_pretrained("model-id")
print(model.training)  # False

# Switch to training mode
model.train()

# Switch back to evaluation mode
model.eval()
```

Evaluation mode disables dropout and makes batch normalization use its running statistics instead of per-batch statistics.

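Note that `eval()` only changes layer behavior; it does not disable gradient tracking. For inference, pair it with `torch.no_grad()`. A short sketch, reusing `inputs` from a tokenizer call as in the forward-pass example later:

```python
import torch

model.eval()           # dropout off, batch norm uses running stats
with torch.no_grad():  # additionally skip building the autograd graph
    outputs = model(**inputs)
```
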
## Saving Models

### Save Locally

```python
model.save_pretrained("./my_model")
```

This creates:
- `config.json`: Model configuration
- `pytorch_model.bin` or `model.safetensors`: Model weights

### Save to Hugging Face Hub

```python
model.push_to_hub("username/model-name")

# With custom commit message
model.push_to_hub("username/model-name", commit_message="Update model")

# Private repository
model.push_to_hub("username/model-name", private=True)
```
## Model Inspection

### Parameter Count

```python
# Total parameters
total_params = model.num_parameters()

# Trainable parameters only
trainable_params = model.num_parameters(only_trainable=True)

print(f"Total: {total_params:,}")
print(f"Trainable: {trainable_params:,}")
```
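For a rough per-module breakdown (useful when parts of the model are frozen), standard PyTorch iteration works. A small sketch:

```python
# Trainable parameter count for each top-level submodule
for name, module in model.named_children():
    count = sum(p.numel() for p in module.parameters() if p.requires_grad)
    print(f"{name}: {count:,}")
```
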
### Memory Footprint

```python
memory_bytes = model.get_memory_footprint()
memory_mb = memory_bytes / 1024**2
print(f"Memory: {memory_mb:.2f} MB")
```

### Model Architecture

```python
print(model)  # Print full architecture

# Access specific components
print(model.config)
print(model.base_model)
```
## Forward Pass

Basic inference:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("model-id")
model = AutoModelForSequenceClassification.from_pretrained("model-id")

inputs = tokenizer("Sample text", return_tensors="pt")
outputs = model(**inputs)

logits = outputs.logits
predictions = logits.argmax(dim=-1)
```
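To turn the predicted class ids into label names, the config's `id2label` mapping can be used (assuming the checkpoint defines meaningful labels):

```python
labels = [model.config.id2label[int(i)] for i in predictions]
print(labels)
```
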
## Model Formats

### SafeTensors vs PyTorch

SafeTensors loads faster and, unlike pickle-based `pytorch_model.bin` files, cannot execute arbitrary code when deserialized:

```python
# Save as safetensors (recommended)
model.save_pretrained("./model", safe_serialization=True)

# Load either format automatically
model = AutoModel.from_pretrained("./model")
```
### ONNX Export

Export for optimized inference. `transformers.onnx.export` expects an `OnnxConfig` (not the model config) and an opset version; a sketch, assuming the sequence-classification `model` and `tokenizer` from the forward-pass example above:

```python
from pathlib import Path

from transformers.onnx import export
from transformers.onnx.features import FeaturesManager

# Resolve the ONNX config for this model and task
model_kind, onnx_config_factory = FeaturesManager.check_supported_model_or_raise(
    model, feature="sequence-classification"
)
onnx_config = onnx_config_factory(model.config)

# Export to ONNX
export(
    preprocessor=tokenizer,
    model=model,
    config=onnx_config,
    opset=onnx_config.default_onnx_opset,
    output=Path("model.onnx"),
)
```
## Best Practices

1. **Use AutoModel classes**: Automatic architecture detection
2. **Specify dtype explicitly**: Control precision and memory
3. **Use device_map="auto"**: For large models
4. **Enable low_cpu_mem_usage**: When loading large models
5. **Use safetensors format**: Faster and safer serialization
6. **Check model.training**: Ensure correct mode for task
7. **Consider quantization**: For deployment on resource-constrained devices
8. **Cache models locally**: Set the TRANSFORMERS_CACHE environment variable (see the sketch after this list)

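A minimal sketch of pointing the cache at a custom location; the path is an example, and the variable should be set before `transformers` is imported so it takes effect (newer releases also read `HF_HOME` / `HF_HUB_CACHE`):

```python
import os

# Example path; set this before importing transformers
os.environ["TRANSFORMERS_CACHE"] = "/data/hf_cache"

from transformers import AutoModel

model = AutoModel.from_pretrained("bert-base-uncased")
```
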
## Common Issues

**CUDA out of memory:**
```python
# Use smaller precision
model = AutoModel.from_pretrained("model-id", torch_dtype=torch.float16)

# Or use quantization
model = AutoModel.from_pretrained("model-id", load_in_8bit=True)

# Or use CPU
model = AutoModel.from_pretrained("model-id", device_map="cpu")
```
**Slow loading:**
```python
# Enable low CPU memory mode
model = AutoModel.from_pretrained("model-id", low_cpu_mem_usage=True)
```
**Model not found:**
```python
# Verify the model ID on huggingface.co
# Check authentication for private models
from huggingface_hub import login

login()
```