# Model Loading and Management

## Overview

The transformers library provides flexible model loading with automatic architecture detection, device management, and configuration control.

## Loading Models

### AutoModel Classes

Use AutoModel classes for automatic architecture selection:
```python
from transformers import AutoModel, AutoModelForSequenceClassification, AutoModelForCausalLM

# Base model (no task head)
model = AutoModel.from_pretrained("bert-base-uncased")

# Sequence classification
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased")

# Causal language modeling (GPT-style)
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Masked language modeling (BERT-style)
from transformers import AutoModelForMaskedLM
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

# Sequence-to-sequence (T5-style)
from transformers import AutoModelForSeq2SeqLM
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")
```
### Common AutoModel Classes

**NLP Tasks:**
- `AutoModelForSequenceClassification`: Text classification, sentiment analysis
- `AutoModelForTokenClassification`: NER, POS tagging
- `AutoModelForQuestionAnswering`: Extractive QA
- `AutoModelForCausalLM`: Text generation (GPT, Llama)
- `AutoModelForMaskedLM`: Masked language modeling (BERT)
- `AutoModelForSeq2SeqLM`: Translation, summarization (T5, BART)

**Vision Tasks:**
- `AutoModelForImageClassification`: Image classification
- `AutoModelForObjectDetection`: Object detection
- `AutoModelForImageSegmentation`: Image segmentation

**Audio Tasks:**
- `AutoModelForAudioClassification`: Audio classification
- `AutoModelForSpeechSeq2Seq`: Speech recognition

**Multimodal:**
- `AutoModelForVision2Seq`: Image captioning, VQA

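Tasks from the lists above that are not shown earlier follow the same loading pattern; only the preprocessing class changes. A minimal sketch for image classification (the ViT checkpoint name is an illustrative choice):

```python
from transformers import AutoImageProcessor, AutoModelForImageClassification

# Vision checkpoints pair with an image processor instead of a tokenizer
processor = AutoImageProcessor.from_pretrained("google/vit-base-patch16-224")
model = AutoModelForImageClassification.from_pretrained("google/vit-base-patch16-224")
```
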
## Loading Parameters

### Basic Parameters

**pretrained_model_name_or_path**: Model identifier or local path
```python
model = AutoModel.from_pretrained("bert-base-uncased")   # From Hub
model = AutoModel.from_pretrained("./local/model/path")  # From disk
```

**num_labels**: Number of output labels for classification
```python
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=3
)
```
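When setting `num_labels`, it can also help to attach human-readable label names; they are stored in the model config and picked up by pipelines. A sketch with illustrative label names:

```python
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=3,
    # Example mappings; replace with your task's classes
    id2label={0: "negative", 1: "neutral", 2: "positive"},
    label2id={"negative": 0, "neutral": 1, "positive": 2},
)
```
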
**cache_dir**: Custom cache location
```python
model = AutoModel.from_pretrained("model-id", cache_dir="./my_cache")
```
### Device Management

**device_map**: Automatic device allocation for large models
```python
# Automatically distribute across GPUs and CPU
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    device_map="auto"
)

# Sequential placement
model = AutoModelForCausalLM.from_pretrained(
    "model-id",
    device_map="sequential"
)

# Custom device map
device_map = {
    "transformer.layers.0": 0,      # GPU 0
    "transformer.layers.1": 1,      # GPU 1
    "transformer.layers.2": "cpu",  # CPU
}
model = AutoModel.from_pretrained("model-id", device_map=device_map)
```
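After loading with a `device_map`, the resulting placement can be inspected: models dispatched this way expose it as `hf_device_map`. A small check, assuming the model was loaded with `device_map="auto"` as above:

```python
# Prints which device each module group ended up on,
# e.g. {"model.embed_tokens": 0, "model.layers.0": 0, ..., "lm_head": "cpu"}
print(model.hf_device_map)
```
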
Manual device placement:
```python
import torch

model = AutoModel.from_pretrained("model-id")
model.to("cuda:0")  # Move to GPU 0
model.to(torch.device("cuda" if torch.cuda.is_available() else "cpu"))
```
### Precision Control

**torch_dtype**: Set model precision
```python
import torch

# Float16 (half precision)
model = AutoModel.from_pretrained("model-id", torch_dtype=torch.float16)

# BFloat16 (wider exponent range than float16)
model = AutoModel.from_pretrained("model-id", torch_dtype=torch.bfloat16)

# Auto (use the dtype the checkpoint was saved in)
model = AutoModel.from_pretrained("model-id", torch_dtype="auto")
```
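A quick way to confirm the precision that was actually applied (a minimal check, not specific to any model):

```python
print(model.dtype)                     # e.g. torch.float16
print(next(model.parameters()).dtype)  # dtype of an individual weight tensor
```
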
### Attention Implementation

**attn_implementation**: Choose attention mechanism
```python
# Scaled Dot Product Attention (PyTorch 2.0+; used by default when available)
model = AutoModel.from_pretrained("model-id", attn_implementation="sdpa")

# Flash Attention 2 (requires the flash-attn package)
model = AutoModel.from_pretrained("model-id", attn_implementation="flash_attention_2")

# Eager (plain PyTorch implementation, most compatible)
model = AutoModel.from_pretrained("model-id", attn_implementation="eager")
```
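Since Flash Attention 2 is an optional dependency, one pattern is to select it only when the package is importable. A small sketch (assumes the installed package exposes the `flash_attn` module, which is its import name):

```python
import importlib.util

from transformers import AutoModel

# Prefer Flash Attention 2 if flash-attn is installed, otherwise fall back to SDPA
attn = "flash_attention_2" if importlib.util.find_spec("flash_attn") else "sdpa"
model = AutoModel.from_pretrained("model-id", attn_implementation=attn)
```
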
### Memory Optimization

**low_cpu_mem_usage**: Reduce CPU memory during loading
```python
model = AutoModelForCausalLM.from_pretrained(
    "large-model-id",
    low_cpu_mem_usage=True,
    device_map="auto"
)
```

**load_in_8bit**: 8-bit quantization (requires bitsandbytes)
```python
model = AutoModelForCausalLM.from_pretrained(
    "model-id",
    load_in_8bit=True,
    device_map="auto"
)
```

**load_in_4bit**: 4-bit quantization
```python
import torch
from transformers import BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16
)

model = AutoModelForCausalLM.from_pretrained(
    "model-id",
    quantization_config=quantization_config,
    device_map="auto"
)
```
## Model Configuration

### Loading with Custom Config

```python
from transformers import AutoConfig, AutoModel

# Load and modify config
config = AutoConfig.from_pretrained("bert-base-uncased")
config.hidden_dropout_prob = 0.2
config.attention_probs_dropout_prob = 0.2

# Initialize model with custom config
model = AutoModel.from_pretrained("bert-base-uncased", config=config)
```

### Initializing from Config Only

```python
config = AutoConfig.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_config(config)  # Random weights
```
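Because `from_config` initializes random weights, it is also a convenient way to build a scaled-down architecture for tests or training from scratch. A minimal sketch; the overridden fields (`n_layer`, `n_head`, `n_embd`) follow GPT-2's config and differ for other architectures:

```python
from transformers import AutoConfig, AutoModelForCausalLM

# Shrink GPT-2 to a tiny model for quick experiments (illustrative sizes)
config = AutoConfig.from_pretrained("gpt2", n_layer=2, n_head=4, n_embd=128)
model = AutoModelForCausalLM.from_config(config)
print(f"{model.num_parameters():,} parameters")
```
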
## Model Modes

### Training vs Evaluation Mode

Models load in evaluation mode by default:

```python
model = AutoModel.from_pretrained("model-id")
print(model.training)  # False

# Switch to training mode
model.train()

# Switch back to evaluation mode
model.eval()
```

Evaluation mode disables dropout and makes batch normalization use its running statistics instead of per-batch statistics.

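Note that `eval()` only changes layer behavior; it does not disable gradient tracking. For inference, pair it with `torch.no_grad()`. A short sketch, reusing `inputs` from a tokenizer call as in the forward-pass example later:

```python
import torch

model.eval()           # dropout off, batch norm uses running stats
with torch.no_grad():  # additionally skip building the autograd graph
    outputs = model(**inputs)
```
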
## Saving Models

### Save Locally

```python
model.save_pretrained("./my_model")
```

This creates:
- `config.json`: Model configuration
- `pytorch_model.bin` or `model.safetensors`: Model weights

### Save to Hugging Face Hub

```python
model.push_to_hub("username/model-name")

# With custom commit message
model.push_to_hub("username/model-name", commit_message="Update model")

# Private repository
model.push_to_hub("username/model-name", private=True)
```
## Model Inspection

### Parameter Count

```python
# Total parameters
total_params = model.num_parameters()

# Trainable parameters only
trainable_params = model.num_parameters(only_trainable=True)

print(f"Total: {total_params:,}")
print(f"Trainable: {trainable_params:,}")
```
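For a rough per-module breakdown (useful when parts of the model are frozen), standard PyTorch iteration works. A small sketch:

```python
# Trainable parameter count for each top-level submodule
for name, module in model.named_children():
    count = sum(p.numel() for p in module.parameters() if p.requires_grad)
    print(f"{name}: {count:,}")
```
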
### Memory Footprint

```python
memory_bytes = model.get_memory_footprint()
memory_mb = memory_bytes / 1024**2
print(f"Memory: {memory_mb:.2f} MB")
```

### Model Architecture

```python
print(model)  # Print full architecture

# Access specific components
print(model.config)
print(model.base_model)
```
## Forward Pass

Basic inference:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("model-id")
model = AutoModelForSequenceClassification.from_pretrained("model-id")

inputs = tokenizer("Sample text", return_tensors="pt")
outputs = model(**inputs)

logits = outputs.logits
predictions = logits.argmax(dim=-1)
```
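To turn the predicted class ids into label names, the config's `id2label` mapping can be used (assuming the checkpoint defines meaningful labels):

```python
labels = [model.config.id2label[int(i)] for i in predictions]
print(labels)
```
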
## Model Formats

### SafeTensors vs PyTorch

SafeTensors loads faster and, unlike pickle-based `pytorch_model.bin` files, cannot execute arbitrary code when deserialized:

```python
# Save as safetensors (recommended)
model.save_pretrained("./model", safe_serialization=True)

# Load either format automatically
model = AutoModel.from_pretrained("./model")
```
### ONNX Export

Export for optimized inference. `transformers.onnx.export` expects an `OnnxConfig` (not the model config) and an opset version; a sketch, assuming the sequence-classification `model` and `tokenizer` from the forward-pass example above:

```python
from pathlib import Path

from transformers.onnx import export
from transformers.onnx.features import FeaturesManager

# Resolve the ONNX config for this model and task
model_kind, onnx_config_factory = FeaturesManager.check_supported_model_or_raise(
    model, feature="sequence-classification"
)
onnx_config = onnx_config_factory(model.config)

# Export to ONNX
export(
    preprocessor=tokenizer,
    model=model,
    config=onnx_config,
    opset=onnx_config.default_onnx_opset,
    output=Path("model.onnx"),
)
```
## Best Practices

1. **Use AutoModel classes**: Automatic architecture detection
2. **Specify dtype explicitly**: Control precision and memory
3. **Use device_map="auto"**: For large models
4. **Enable low_cpu_mem_usage**: When loading large models
5. **Use safetensors format**: Faster and safer serialization
6. **Check model.training**: Ensure correct mode for task
7. **Consider quantization**: For deployment on resource-constrained devices
8. **Cache models locally**: Set the TRANSFORMERS_CACHE environment variable (see the sketch after this list)

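A minimal sketch of pointing the cache at a custom location; the path is an example, and the variable should be set before `transformers` is imported so it takes effect (newer releases also read `HF_HOME` / `HF_HUB_CACHE`):

```python
import os

# Example path; set this before importing transformers
os.environ["TRANSFORMERS_CACHE"] = "/data/hf_cache"

from transformers import AutoModel

model = AutoModel.from_pretrained("bert-base-uncased")
```
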
## Common Issues

**CUDA out of memory:**
```python
# Use smaller precision
model = AutoModel.from_pretrained("model-id", torch_dtype=torch.float16)

# Or use quantization
model = AutoModel.from_pretrained("model-id", load_in_8bit=True)

# Or use CPU
model = AutoModel.from_pretrained("model-id", device_map="cpu")
```
**Slow loading:**
```python
# Enable low CPU memory mode
model = AutoModel.from_pretrained("model-id", low_cpu_mem_usage=True)
```
**Model not found:**
```python
# Verify the model ID on huggingface.co
# Check authentication for private models
from huggingface_hub import login

login()
```