Model Loading and Management
Overview
The transformers library provides flexible model loading with automatic architecture detection, device management, and configuration control.
Loading Models
AutoModel Classes
Use AutoModel classes for automatic architecture selection:
from transformers import AutoModel, AutoModelForSequenceClassification, AutoModelForCausalLM
# Base model (no task head)
model = AutoModel.from_pretrained("bert-base-uncased")
# Sequence classification
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased")
# Causal language modeling (GPT-style)
model = AutoModelForCausalLM.from_pretrained("gpt2")
# Masked language modeling (BERT-style)
from transformers import AutoModelForMaskedLM
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
# Sequence-to-sequence (T5-style)
from transformers import AutoModelForSeq2SeqLM
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")
Common AutoModel Classes
NLP Tasks:
- AutoModelForSequenceClassification: Text classification, sentiment analysis
- AutoModelForTokenClassification: NER, POS tagging
- AutoModelForQuestionAnswering: Extractive QA
- AutoModelForCausalLM: Text generation (GPT, Llama)
- AutoModelForMaskedLM: Masked language modeling (BERT)
- AutoModelForSeq2SeqLM: Translation, summarization (T5, BART)
Vision Tasks:
- AutoModelForImageClassification: Image classification
- AutoModelForObjectDetection: Object detection
- AutoModelForImageSegmentation: Image segmentation
Audio Tasks:
- AutoModelForAudioClassification: Audio classification
- AutoModelForSpeechSeq2Seq: Speech recognition
Multimodal:
- AutoModelForVision2Seq: Image captioning, VQA
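The task-specific classes listed above all load the same way; the checkpoints and label count below are common public examples used purely for illustration:
from transformers import AutoModelForTokenClassification, AutoModelForQuestionAnswering
# Token classification head (e.g., NER) on top of a BERT encoder; 9 labels as in CoNLL-style NER
ner_model = AutoModelForTokenClassification.from_pretrained("bert-base-uncased", num_labels=9)
# Extractive question answering head on a DistilBERT encoder
qa_model = AutoModelForQuestionAnswering.from_pretrained("distilbert-base-uncased")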
Loading Parameters
Basic Parameters
pretrained_model_name_or_path: Model identifier or local path
model = AutoModel.from_pretrained("bert-base-uncased") # From Hub
model = AutoModel.from_pretrained("./local/model/path") # From disk
num_labels: Number of output labels for classification
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=3
)
cache_dir: Custom cache location
model = AutoModel.from_pretrained("model-id", cache_dir="./my_cache")
Device Management
device_map: Automatic device allocation for large models
# Automatically distribute across GPUs and CPU
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    device_map="auto"
)
# Sequential placement
model = AutoModelForCausalLM.from_pretrained(
    "model-id",
    device_map="sequential"
)
# Custom device map
device_map = {
    "transformer.layers.0": 0,      # GPU 0
    "transformer.layers.1": 1,      # GPU 1
    "transformer.layers.2": "cpu",  # CPU
}
model = AutoModel.from_pretrained("model-id", device_map=device_map)
Manual device placement:
import torch
model = AutoModel.from_pretrained("model-id")
model.to("cuda:0") # Move to GPU 0
model.to(torch.device("cuda" if torch.cuda.is_available() else "cpu"))
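To confirm where the weights actually ended up, the loaded model can be inspected directly (a quick check against the model above; hf_device_map is only populated when the model was loaded with device_map):
print(model.device)                           # device of the first parameter
print(next(model.parameters()).device)        # same check, parameter by parameter
print(getattr(model, "hf_device_map", None))  # per-module placement when device_map was used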
Precision Control
torch_dtype: Set model precision
import torch
# Float16 (half precision)
model = AutoModel.from_pretrained("model-id", torch_dtype=torch.float16)
# BFloat16 (better range than float16)
model = AutoModel.from_pretrained("model-id", torch_dtype=torch.bfloat16)
# Auto (use original dtype)
model = AutoModel.from_pretrained("model-id", torch_dtype="auto")
Attention Implementation
attn_implementation: Choose attention mechanism
# Scaled Dot Product Attention (PyTorch 2.0+; used by default when available)
model = AutoModel.from_pretrained("model-id", attn_implementation="sdpa")
# Flash Attention 2 (requires the flash-attn package; typically fastest on supported GPUs)
model = AutoModel.from_pretrained("model-id", attn_implementation="flash_attention_2")
# Eager (plain PyTorch implementation, most compatible)
model = AutoModel.from_pretrained("model-id", attn_implementation="eager")
Memory Optimization
low_cpu_mem_usage: Reduce CPU memory during loading
model = AutoModelForCausalLM.from_pretrained(
    "large-model-id",
    low_cpu_mem_usage=True,
    device_map="auto"
)
load_in_8bit: 8-bit quantization (requires bitsandbytes; newer releases prefer passing BitsAndBytesConfig(load_in_8bit=True) via quantization_config)
model = AutoModelForCausalLM.from_pretrained(
    "model-id",
    load_in_8bit=True,
    device_map="auto"
)
load_in_4bit: 4-bit quantization
import torch
from transformers import BitsAndBytesConfig
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16
)
model = AutoModelForCausalLM.from_pretrained(
    "model-id",
    quantization_config=quantization_config,
    device_map="auto"
)
Model Configuration
Loading with Custom Config
from transformers import AutoConfig, AutoModel
# Load and modify config
config = AutoConfig.from_pretrained("bert-base-uncased")
config.hidden_dropout_prob = 0.2
config.attention_probs_dropout_prob = 0.2
# Initialize model with custom config
model = AutoModel.from_pretrained("bert-base-uncased", config=config)
Initializing from Config Only
config = AutoConfig.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_config(config) # Random weights
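Initializing from a config is also a convenient way to build a scaled-down, randomly initialized model for quick tests (a sketch; n_layer and n_head are GPT-2 config attributes):
config = AutoConfig.from_pretrained("gpt2")
config.n_layer = 2   # fewer transformer blocks
config.n_head = 4    # fewer attention heads
small_model = AutoModelForCausalLM.from_config(config)  # random weights, small footprint
print(small_model.num_parameters())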
Model Modes
Training vs Evaluation Mode
Models load in evaluation mode by default:
model = AutoModel.from_pretrained("model-id")
print(model.training) # False
# Switch to training mode
model.train()
# Switch back to evaluation mode
model.eval()
Evaluation mode disables dropout and makes batch normalization layers use their stored running statistics instead of per-batch statistics.
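A typical pattern is model.train() around the optimization loop and model.eval() plus torch.no_grad() for inference (a minimal, self-contained sketch):
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased")
inputs = tokenizer("Sample text", return_tensors="pt")
model.train()            # dropout active: use inside the training loop
# ... optimizer steps would go here ...
model.eval()             # dropout off, running statistics used
with torch.no_grad():    # also skip gradient tracking during inference
    outputs = model(**inputs)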
Saving Models
Save Locally
model.save_pretrained("./my_model")
This creates:
- config.json: Model configuration
- pytorch_model.bin or model.safetensors: Model weights
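In practice the tokenizer is usually saved alongside the model so the directory can be reloaded as a unit (a sketch; the local path is arbitrary):
from transformers import AutoModel, AutoTokenizer
model = AutoModel.from_pretrained("bert-base-uncased")
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
# Save both into the same directory
model.save_pretrained("./my_model")
tokenizer.save_pretrained("./my_model")
# Reload later from disk
model = AutoModel.from_pretrained("./my_model")
tokenizer = AutoTokenizer.from_pretrained("./my_model")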
Save to Hugging Face Hub
model.push_to_hub("username/model-name")
# With custom commit message
model.push_to_hub("username/model-name", commit_message="Update model")
# Private repository
model.push_to_hub("username/model-name", private=True)
Model Inspection
Parameter Count
# Total parameters
total_params = model.num_parameters()
# Trainable parameters only
trainable_params = model.num_parameters(only_trainable=True)
print(f"Total: {total_params:,}")
print(f"Trainable: {trainable_params:,}")
Memory Footprint
memory_bytes = model.get_memory_footprint()
memory_mb = memory_bytes / 1024**2
print(f"Memory: {memory_mb:.2f} MB")
Model Architecture
print(model) # Print full architecture
# Access specific components
print(model.config)
print(model.base_model)
Forward Pass
Basic inference:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("model-id")
model = AutoModelForSequenceClassification.from_pretrained("model-id")
inputs = tokenizer("Sample text", return_tensors="pt")
outputs = model(**inputs)
logits = outputs.logits
predictions = logits.argmax(dim=-1)
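For classification checkpoints that ship label names in their config, the predicted index can be mapped back to a string via id2label (assuming the checkpoint defines it; the sentiment model below is one public example):
from transformers import AutoTokenizer, AutoModelForSequenceClassification
model_id = "distilbert-base-uncased-finetuned-sst-2-english"  # example sentiment checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)
inputs = tokenizer("Sample text", return_tensors="pt")
outputs = model(**inputs)
pred_id = outputs.logits.argmax(dim=-1).item()
print(model.config.id2label[pred_id])  # e.g. "POSITIVE" or "NEGATIVE"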
Model Formats
SafeTensors vs PyTorch
SafeTensors loads faster and, unlike pickle-based .bin checkpoints, cannot execute arbitrary code when deserialized; recent transformers releases save in this format by default:
# Save as safetensors (recommended)
model.save_pretrained("./model", safe_serialization=True)
# Load either format automatically
model = AutoModel.from_pretrained("./model")
ONNX Export
Export for optimized inference. The example below uses the legacy transformers.onnx API; newer releases recommend the optimum package instead (for example: optimum-cli export onnx --model <model-id> <output-dir>):
from pathlib import Path
from transformers.onnx import FeaturesManager, export
# Look up the ONNX export config for this architecture
# ("default" exports the base model; use e.g. "sequence-classification" for a classification head)
model_kind, onnx_config_cls = FeaturesManager.check_supported_model_or_raise(model, feature="default")
onnx_config = onnx_config_cls(model.config)
# Trace the model and write the ONNX graph
onnx_inputs, onnx_outputs = export(
    preprocessor=tokenizer,
    model=model,
    config=onnx_config,
    opset=onnx_config.default_onnx_opset,
    output=Path("model.onnx"),
)
Best Practices
- Use AutoModel classes: Automatic architecture detection
- Specify dtype explicitly: Control precision and memory
- Use device_map="auto": For large models
- Enable low_cpu_mem_usage: When loading large models
- Use safetensors format: Faster and safer serialization
- Check model.training: Ensure correct mode for task
- Consider quantization: For deployment on resource-constrained devices
- Cache models locally: Set the HF_HOME environment variable (the older TRANSFORMERS_CACHE is deprecated) or pass cache_dir, as sketched below
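A minimal sketch of redirecting the cache, either globally through the environment or per call through cache_dir (the paths are placeholders):
import os
# Set before importing transformers (or export it in the shell)
os.environ["HF_HOME"] = "/data/hf_cache"
from transformers import AutoModel
# Or override the cache location for a single call
model = AutoModel.from_pretrained("bert-base-uncased", cache_dir="/data/hf_cache/models")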
Common Issues
CUDA out of memory:
# Use smaller precision
model = AutoModel.from_pretrained("model-id", torch_dtype=torch.float16)
# Or use quantization
model = AutoModel.from_pretrained("model-id", load_in_8bit=True)
# Or use CPU
model = AutoModel.from_pretrained("model-id", device_map="cpu")
Slow loading:
# Enable low CPU memory mode
model = AutoModel.from_pretrained("model-id", low_cpu_mem_usage=True)
Model not found:
# Verify the model ID exists on huggingface.co
# Check authentication for private models
from huggingface_hub import login
login()