# Model Loading and Management

## Overview

The transformers library provides flexible model loading with automatic architecture detection, device management, and configuration control.

## Loading Models

### AutoModel Classes

Use AutoModel classes for automatic architecture selection:

```python
from transformers import AutoModel, AutoModelForSequenceClassification, AutoModelForCausalLM

# Base model (no task head)
model = AutoModel.from_pretrained("bert-base-uncased")

# Sequence classification
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased")

# Causal language modeling (GPT-style)
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Masked language modeling (BERT-style)
from transformers import AutoModelForMaskedLM
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

# Sequence-to-sequence (T5-style)
from transformers import AutoModelForSeq2SeqLM
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")
```

### Common AutoModel Classes

**NLP Tasks:**

- `AutoModelForSequenceClassification`: Text classification, sentiment analysis
- `AutoModelForTokenClassification`: NER, POS tagging
- `AutoModelForQuestionAnswering`: Extractive QA
- `AutoModelForCausalLM`: Text generation (GPT, Llama)
- `AutoModelForMaskedLM`: Masked language modeling (BERT)
- `AutoModelForSeq2SeqLM`: Translation, summarization (T5, BART)

**Vision Tasks:**

- `AutoModelForImageClassification`: Image classification
- `AutoModelForObjectDetection`: Object detection
- `AutoModelForImageSegmentation`: Image segmentation

**Audio Tasks:**

- `AutoModelForAudioClassification`: Audio classification
- `AutoModelForSpeechSeq2Seq`: Speech recognition

**Multimodal:**

- `AutoModelForVision2Seq`: Image captioning, VQA

## Loading Parameters

### Basic Parameters

**pretrained_model_name_or_path**: Model identifier or local path

```python
model = AutoModel.from_pretrained("bert-base-uncased")   # From Hub
model = AutoModel.from_pretrained("./local/model/path")  # From disk
```

**num_labels**: Number of output labels for classification

```python
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=3
)
```

**cache_dir**: Custom cache location

```python
model = AutoModel.from_pretrained("model-id", cache_dir="./my_cache")
```

### Device Management

**device_map**: Automatic device allocation for large models

```python
# Automatically distribute across GPUs and CPU
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    device_map="auto"
)

# Sequential placement
model = AutoModelForCausalLM.from_pretrained(
    "model-id",
    device_map="sequential"
)

# Custom device map
device_map = {
    "transformer.layers.0": 0,      # GPU 0
    "transformer.layers.1": 1,      # GPU 1
    "transformer.layers.2": "cpu",  # CPU
}
model = AutoModel.from_pretrained("model-id", device_map=device_map)
```

Manual device placement:

```python
import torch

model = AutoModel.from_pretrained("model-id")
model.to("cuda:0")  # Move to GPU 0
model.to(torch.device("cuda" if torch.cuda.is_available() else "cpu"))
```
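After loading with `device_map`, the resulting placement can be inspected through the model's `hf_device_map` attribute. A minimal sketch, assuming `accelerate` is installed and using a small placeholder checkpoint:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "gpt2"  # placeholder; substitute any causal LM you have access to

model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Module-name -> device mapping, populated when a device_map is used
print(model.hf_device_map)

# Move inputs to the device holding the first parameters before a forward pass
inputs = tokenizer("Hello", return_tensors="pt")
inputs = inputs.to(next(model.parameters()).device)
with torch.no_grad():
    outputs = model(**inputs)
```

When the model spans multiple devices, accelerate's dispatch hooks handle the cross-device transfers during the forward pass.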
### Precision Control

**torch_dtype**: Set model precision

```python
import torch

# Float16 (half precision)
model = AutoModel.from_pretrained("model-id", torch_dtype=torch.float16)

# BFloat16 (better range than float16)
model = AutoModel.from_pretrained("model-id", torch_dtype=torch.bfloat16)

# Auto (use original dtype)
model = AutoModel.from_pretrained("model-id", torch_dtype="auto")
```

### Attention Implementation

**attn_implementation**: Choose the attention implementation

```python
# Scaled dot-product attention (PyTorch 2.0+; the default on recent versions when available)
model = AutoModel.from_pretrained("model-id", attn_implementation="sdpa")

# Flash Attention 2 (requires the flash-attn package)
model = AutoModel.from_pretrained("model-id", attn_implementation="flash_attention_2")

# Eager (most compatible)
model = AutoModel.from_pretrained("model-id", attn_implementation="eager")
```

### Memory Optimization

**low_cpu_mem_usage**: Reduce CPU memory during loading

```python
model = AutoModelForCausalLM.from_pretrained(
    "large-model-id",
    low_cpu_mem_usage=True,
    device_map="auto"
)
```

**load_in_8bit**: 8-bit quantization (requires bitsandbytes)

```python
model = AutoModelForCausalLM.from_pretrained(
    "model-id",
    load_in_8bit=True,
    device_map="auto"
)
```

**load_in_4bit**: 4-bit quantization

```python
import torch
from transformers import BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16
)

model = AutoModelForCausalLM.from_pretrained(
    "model-id",
    quantization_config=quantization_config,
    device_map="auto"
)
```

## Model Configuration

### Loading with Custom Config

```python
from transformers import AutoConfig, AutoModel

# Load and modify config
config = AutoConfig.from_pretrained("bert-base-uncased")
config.hidden_dropout_prob = 0.2
config.attention_probs_dropout_prob = 0.2

# Initialize model with custom config
model = AutoModel.from_pretrained("bert-base-uncased", config=config)
```

### Initializing from Config Only

```python
config = AutoConfig.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_config(config)  # Random weights
```

## Model Modes

### Training vs Evaluation Mode

Models load in evaluation mode by default:

```python
model = AutoModel.from_pretrained("model-id")
print(model.training)  # False

# Switch to training mode
model.train()

# Switch back to evaluation mode
model.eval()
```

Evaluation mode disables dropout and makes batch normalization layers use their running statistics.
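Note that `train()` and `eval()` only switch the behavior of layers such as dropout; they do not freeze weights. As a brief sketch of freezing the pretrained encoder for fine-tuning (checkpoint and label count are placeholders), disable gradients per parameter:

```python
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=3
)

# Freeze the pretrained encoder; only the classification head remains trainable
for param in model.base_model.parameters():
    param.requires_grad = False

print(f"Trainable: {model.num_parameters(only_trainable=True):,} "
      f"of {model.num_parameters():,} parameters")
```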
## Saving Models

### Save Locally

```python
model.save_pretrained("./my_model")
```

This creates:

- `config.json`: Model configuration
- `pytorch_model.bin` or `model.safetensors`: Model weights

### Save to Hugging Face Hub

```python
model.push_to_hub("username/model-name")

# With custom commit message
model.push_to_hub("username/model-name", commit_message="Update model")

# Private repository
model.push_to_hub("username/model-name", private=True)
```

## Model Inspection

### Parameter Count

```python
# Total parameters
total_params = model.num_parameters()

# Trainable parameters only
trainable_params = model.num_parameters(only_trainable=True)

print(f"Total: {total_params:,}")
print(f"Trainable: {trainable_params:,}")
```

### Memory Footprint

```python
memory_bytes = model.get_memory_footprint()
memory_mb = memory_bytes / 1024**2
print(f"Memory: {memory_mb:.2f} MB")
```

### Model Architecture

```python
print(model)  # Print full architecture

# Access specific components
print(model.config)
print(model.base_model)
```

## Forward Pass

Basic inference:

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("model-id")
model = AutoModelForSequenceClassification.from_pretrained("model-id")

inputs = tokenizer("Sample text", return_tensors="pt")
outputs = model(**inputs)

logits = outputs.logits
predictions = logits.argmax(dim=-1)
```

## Model Formats

### SafeTensors vs PyTorch

The safetensors format loads faster and, unlike pickle-based `.bin` checkpoints, cannot execute arbitrary code when loaded:

```python
# Save as safetensors (recommended)
model.save_pretrained("./model", safe_serialization=True)

# Load either format automatically
model = AutoModel.from_pretrained("./model")
```

### ONNX Export

Export for optimized inference. Recent transformers releases recommend the `optimum` library for ONNX export; the built-in `transformers.onnx` helper shown here requires a model-specific `OnnxConfig`:

```python
from pathlib import Path
from transformers import AutoModel, AutoTokenizer
from transformers.models.distilbert import DistilBertOnnxConfig
from transformers.onnx import export

model_ckpt = "distilbert-base-uncased"
model = AutoModel.from_pretrained(model_ckpt)
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)

# Export to ONNX using the architecture's OnnxConfig and its default opset
onnx_config = DistilBertOnnxConfig(model.config)
onnx_inputs, onnx_outputs = export(
    tokenizer, model, onnx_config, onnx_config.default_onnx_opset, Path("model.onnx")
)
```

## Best Practices

1. **Use AutoModel classes**: Automatic architecture detection
2. **Specify dtype explicitly**: Control precision and memory
3. **Use device_map="auto"**: For large models
4. **Enable low_cpu_mem_usage**: When loading large models
5. **Use safetensors format**: Faster and safer serialization
6. **Check model.training**: Ensure correct mode for task
7. **Consider quantization**: For deployment on resource-constrained devices
8. **Cache models locally**: Set the `HF_HOME` (or legacy `TRANSFORMERS_CACHE`) environment variable

Several of these practices are combined in the sketch at the end of this page.

## Common Issues

**CUDA out of memory:**

```python
# Use smaller precision
model = AutoModel.from_pretrained("model-id", torch_dtype=torch.float16)

# Or use quantization
model = AutoModel.from_pretrained("model-id", load_in_8bit=True)

# Or use CPU
model = AutoModel.from_pretrained("model-id", device_map="cpu")
```

**Slow loading:**

```python
# Enable low CPU memory mode
model = AutoModel.from_pretrained("model-id", low_cpu_mem_usage=True)
```

**Model not found:**

```python
# Verify the model ID on huggingface.co
# Check authentication for private models
from huggingface_hub import login
login()
```
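Putting several of the practices above together, a minimal loading sketch (the checkpoint name is a placeholder, and `device_map="auto"` assumes `accelerate` is installed):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "gpt2"  # placeholder; substitute your model

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",      # keep the checkpoint's native precision
    device_map="auto",       # spread across available GPUs and CPU
    low_cpu_mem_usage=True,  # avoid a second full copy in CPU RAM while loading
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

model.eval()  # models already load in eval mode; being explicit costs nothing
print(f"{model.num_parameters():,} parameters, "
      f"{model.get_memory_footprint() / 1024**2:.1f} MB")
```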