# Tokenizers
## Overview
Tokenizers convert text into numerical representations (tokens) that models can process. They handle special tokens, padding, truncation, and attention masks.
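As a quick preview, here is a minimal round trip; this sketch assumes the `bert-base-uncased` checkpoint (loading is covered below):
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Text -> token IDs (special tokens added automatically)
ids = tokenizer("Hello, how are you?")["input_ids"]

# Token IDs -> text
print(tokenizer.decode(ids, skip_special_tokens=True))  # "hello, how are you?"
```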
## Loading Tokenizers
### AutoTokenizer
Automatically load the correct tokenizer for a model:
```python
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
```
Load from local path:
```python
tokenizer = AutoTokenizer.from_pretrained("./local/tokenizer/path")
```
## Basic Tokenization
### Encode Text
```python
# Simple encoding
text = "Hello, how are you?"
tokens = tokenizer.encode(text)
print(tokens) # [101, 7592, 1010, 2129, 2024, 2017, 1029, 102]
# Split into string tokens instead of IDs
tokens = tokenizer.tokenize(text)
print(tokens) # ['hello', ',', 'how', 'are', 'you', '?']
```
### Decode Tokens
```python
token_ids = [101, 7592, 1010, 2129, 2024, 2017, 1029, 102]
text = tokenizer.decode(token_ids)
print(text) # "[CLS] hello, how are you? [SEP]"
# Skip special tokens
text = tokenizer.decode(token_ids, skip_special_tokens=True)
print(text) # "hello, how are you?"
```
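For a batch of sequences (e.g., the output of `model.generate`), `batch_decode` decodes every row in one call:
```python
batch_ids = [
    [101, 7592, 102],
    [101, 2129, 2024, 2017, 1029, 102],
]
texts = tokenizer.batch_decode(batch_ids, skip_special_tokens=True)
print(texts)  # ['hello', 'how are you?']
```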
## The `__call__` Method
Primary tokenization interface:
```python
# Single text
inputs = tokenizer("Hello, how are you?")
# Returns dictionary with input_ids, token_type_ids (BERT-style models), attention_mask
print(inputs)
# {
# 'input_ids': [101, 7592, 1010, 2129, 2024, 2017, 1029, 102],
# 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0],
# 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1]
# }
```
Multiple texts:
```python
texts = ["Hello", "How are you?"]
inputs = tokenizer(texts, padding=True, truncation=True)
```
## Key Parameters
### Return Tensors
**return_tensors**: Output format ("pt", "tf", "np")
```python
# PyTorch tensors
inputs = tokenizer("text", return_tensors="pt")
# TensorFlow tensors
inputs = tokenizer("text", return_tensors="tf")
# NumPy arrays
inputs = tokenizer("text", return_tensors="np")
```
### Padding
**padding**: Pad sequences to the same length
```python
# Pad to longest sequence in batch
inputs = tokenizer(texts, padding=True)
# Pad to specific length
inputs = tokenizer(texts, padding="max_length", max_length=128)
# No padding
inputs = tokenizer(texts, padding=False)
```
**pad_to_multiple_of**: Pad to a multiple of the given value (useful on GPUs, where dimensions that are multiples of 8 make better use of tensor cores)
```python
inputs = tokenizer(texts, padding=True, pad_to_multiple_of=8)
```
### Truncation
**truncation**: Limit sequence length
```python
# Truncate to max_length
inputs = tokenizer(text, truncation=True, max_length=512)
# Truncate first sequence in pairs
inputs = tokenizer(text1, text2, truncation="only_first")
# Truncate second sequence
inputs = tokenizer(text1, text2, truncation="only_second")
# Truncate longest first (default for pairs)
inputs = tokenizer(text1, text2, truncation="longest_first", max_length=512)
```
### Max Length
**max_length**: Maximum sequence length
```python
inputs = tokenizer(text, max_length=512, truncation=True)
```
### Additional Outputs
**return_attention_mask**: Include attention mask (default True)
```python
inputs = tokenizer(text, return_attention_mask=True)
```
**return_token_type_ids**: Segment IDs for sentence pairs
```python
inputs = tokenizer(text1, text2, return_token_type_ids=True)
```
**return_offsets_mapping**: Character position mapping (Fast tokenizers only)
```python
inputs = tokenizer(text, return_offsets_mapping=True)
```
**return_length**: Include sequence lengths
```python
inputs = tokenizer(texts, padding=True, return_length=True)
```
## Special Tokens
### Predefined Special Tokens
Access special tokens:
```python
print(tokenizer.cls_token) # [CLS] (or <s> for some models)
print(tokenizer.sep_token) # [SEP] (or </s> for some models)
print(tokenizer.pad_token) # [PAD]
print(tokenizer.unk_token) # [UNK]
print(tokenizer.mask_token) # [MASK]
print(tokenizer.eos_token) # End of sequence
print(tokenizer.bos_token) # Beginning of sequence
# Get IDs
print(tokenizer.cls_token_id)
print(tokenizer.sep_token_id)
```
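These IDs line up with what `encode` produces, which you can check by assembling a sequence by hand; the sketch assumes a BERT-style tokenizer that wraps input in `[CLS]` ... `[SEP]`:
```python
token_ids = tokenizer.convert_tokens_to_ids(tokenizer.tokenize("hello world"))
manual = [tokenizer.cls_token_id] + token_ids + [tokenizer.sep_token_id]
assert manual == tokenizer.encode("hello world")
```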
### Add Special Tokens
Control whether special tokens are added automatically:
```python
# Automatically add special tokens (default True)
inputs = tokenizer(text, add_special_tokens=True)
# Skip special tokens
inputs = tokenizer(text, add_special_tokens=False)
```
### Custom Special Tokens
```python
special_tokens_dict = {
    # Placeholder token names; use whatever your task needs
    "additional_special_tokens": ["<custom_token_1>", "<custom_token_2>"]
}
num_added = tokenizer.add_special_tokens(special_tokens_dict)
print(f"Added {num_added} tokens")
# Resize model embeddings after adding tokens
model.resize_token_embeddings(len(tokenizer))
```
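After resizing, the new strings tokenize as single atomic tokens instead of being split into subwords (using the placeholder `<custom_token_1>` from above):
```python
print(tokenizer.tokenize("hello <custom_token_1> world"))
# ['hello', '<custom_token_1>', 'world']
```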
## Sentence Pairs
Tokenize text pairs:
```python
text1 = "What is the capital of France?"
text2 = "Paris is the capital of France."
# Automatically handles separation
inputs = tokenizer(text1, text2, padding=True, truncation=True)
# Results in: [CLS] text1 [SEP] text2 [SEP]
```
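For BERT-style models the result also carries `token_type_ids`, marking which segment each token belongs to (0 for the first text and its separators, 1 for the second):
```python
print(tokenizer.decode(inputs["input_ids"]))
# [CLS] what is the capital of france? [SEP] paris is the capital of france. [SEP]
print(inputs["token_type_ids"])
# 0s over the first segment, 1s over the second
```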
## Batch Encoding
Process multiple texts:
```python
texts = ["First text", "Second text", "Third text"]
# Basic batch encoding
batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
# Access individual encodings
for i in range(len(texts)):
input_ids = batch["input_ids"][i]
attention_mask = batch["attention_mask"][i]
```
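The result is a dict of stacked tensors: `padding=True` makes every row the length of the longest sequence in the batch, and the attention mask distinguishes real tokens from padding:
```python
print(batch["input_ids"].shape)  # torch.Size([3, longest_len_in_batch])
print(batch["attention_mask"])   # 1 = real token, 0 = padding
```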
## Fast Tokenizers
Use Rust-based tokenizers for speed:
```python
from transformers import AutoTokenizer
# Automatically loads Fast version if available
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
# Check if Fast
print(tokenizer.is_fast) # True
# Explicitly request the Fast tokenizer (the default)
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased", use_fast=True)
# Force slow (Python) tokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased", use_fast=False)
```
### Fast Tokenizer Features
**Offset mapping** (character positions):
```python
inputs = tokenizer("Hello world", return_offsets_mapping=True)
print(inputs["offset_mapping"])
# [(0, 0), (0, 5), (6, 11), (0, 0)] # [CLS], "Hello", "world", [SEP]
```
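These spans index into the original string, so each token can be mapped back to its exact source characters; a small sketch, assuming a Fast tokenizer:
```python
text = "Hello world"
enc = tokenizer(text, return_offsets_mapping=True)
for tok, (start, end) in zip(enc.tokens(), enc["offset_mapping"]):
    # Print each token next to the source characters it covers
    print(tok, repr(text[start:end]))
# [CLS] ''
# hello 'Hello'
# world 'world'
# [SEP] ''
```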
**Token to word mapping**:
```python
encoding = tokenizer("Hello world")
word_ids = encoding.word_ids()
print(word_ids) # [None, 0, 1, None] # [CLS]=None, "Hello"=0, "world"=1, [SEP]=None
```
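A typical use of `word_ids()` is aligning word-level labels (e.g., NER tags) with subword tokens; a minimal sketch, assuming one label per word and `-100` to mask special tokens out of the loss:
```python
word_labels = [3, 0]  # hypothetical tags: "Hello" -> 3, "world" -> 0
encoding = tokenizer("Hello world")
aligned = [-100 if wid is None else word_labels[wid] for wid in encoding.word_ids()]
print(aligned)  # [-100, 3, 0, -100]
```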
## Saving Tokenizers
Save locally:
```python
tokenizer.save_pretrained("./my_tokenizer")
```
Push to Hub:
```python
tokenizer.push_to_hub("username/my-tokenizer")
```
## Advanced Usage
### Vocabulary
Access vocabulary:
```python
vocab = tokenizer.get_vocab()
vocab_size = len(vocab)
# Get token for ID
token = tokenizer.convert_ids_to_tokens(100)
# Get ID for token
token_id = tokenizer.convert_tokens_to_ids("hello")
```
### Encoding Details
Get detailed encoding information:
```python
encoding = tokenizer("Hello world", return_tensors="pt")
# BatchEncoding helper methods (Fast tokenizers only)
tokens = encoding.tokens()
word_ids = encoding.word_ids()
sequence_ids = encoding.sequence_ids()
```
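`sequence_ids()` reports which input text each token came from (`None` for special tokens, 0 for the first text, 1 for the second), which is how QA pipelines locate the context span; a small sketch:
```python
enc = tokenizer("What is it?", "It is a test.")
print(enc.sequence_ids())
# e.g. [None, 0, 0, 0, 0, None, 1, 1, 1, 1, 1, None] for a BERT-style tokenizer
```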
### Custom Preprocessing
Subclass a concrete tokenizer class for custom behavior. `AutoTokenizer` is only a factory, so subclass the class it actually returns (e.g. `BertTokenizerFast`):
```python
from transformers import BertTokenizerFast

class CustomTokenizer(BertTokenizerFast):
    def __call__(self, text, **kwargs):
        # Custom preprocessing before tokenization
        text = text.lower().strip()
        return super().__call__(text, **kwargs)
```
## Chat Templates
For conversational models:
```python
messages = [
{"role": "system", "content": "You are helpful."},
{"role": "user", "content": "Hello!"},
{"role": "assistant", "content": "Hi there!"},
{"role": "user", "content": "How are you?"}
]
# Apply chat template
text = tokenizer.apply_chat_template(messages, tokenize=False)
print(text)
# Tokenize directly
inputs = tokenizer.apply_chat_template(messages, tokenize=True, return_tensors="pt")
```
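When the model should respond next (rather than formatting a finished conversation), pass `add_generation_prompt=True` so the template appends the header that cues an assistant turn; a sketch assuming `model` is a loaded chat model:
```python
input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,  # append the assistant-turn header
    tokenize=True,
    return_tensors="pt",
)
outputs = model.generate(input_ids, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```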
## Common Patterns
### Pattern 1: Simple Text Classification
```python
import torch

texts = ["I love this!", "I hate this!"]
labels = [1, 0]
inputs = tokenizer(
texts,
padding=True,
truncation=True,
max_length=512,
return_tensors="pt"
)
# Use with model
outputs = model(**inputs, labels=torch.tensor(labels))
```
### Pattern 2: Question Answering
```python
question = "What is the capital?"
context = "Paris is the capital of France."
inputs = tokenizer(
question,
context,
padding=True,
truncation=True,
max_length=384,
return_tensors="pt"
)
```
### Pattern 3: Text Generation
```python
prompt = "Once upon a time"
inputs = tokenizer(prompt, return_tensors="pt")
# Generate
outputs = model.generate(
inputs["input_ids"],
max_new_tokens=50,
pad_token_id=tokenizer.eos_token_id
)
# Decode
text = tokenizer.decode(outputs[0], skip_special_tokens=True)
```
### Pattern 4: Dataset Tokenization
```python
def tokenize_function(examples):
return tokenizer(
examples["text"],
padding="max_length",
truncation=True,
max_length=512
)
# Apply to dataset
tokenized_dataset = dataset.map(tokenize_function, batched=True)
```
## Best Practices
1. **Always specify return_tensors**: For model input
2. **Use padding and truncation**: For batch processing
3. **Set max_length explicitly**: Prevent memory issues
4. **Use Fast tokenizers**: When available for speed
5. **Handle pad_token**: Set to eos_token if None for generation
6. **Add special tokens**: Leave enabled (default) unless specific reason
7. **Resize embeddings**: After adding custom tokens
8. **Decode with skip_special_tokens**: For cleaner output
9. **Use batched processing**: For efficiency with datasets
10. **Save tokenizer with model**: Ensure compatibility
## Common Issues
**Padding token not set:**
```python
if tokenizer.pad_token is None:
tokenizer.pad_token = tokenizer.eos_token
```
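When reusing `eos_token` as padding for decoder-only models, it is also common to pad on the left so generation continues from real tokens rather than padding:
```python
tokenizer.padding_side = "left"  # decoder-only models generate past the last real token
inputs = tokenizer(["short", "a much longer prompt"], padding=True, return_tensors="pt")
```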
**Sequence too long:**
```python
# Enable truncation
inputs = tokenizer(text, truncation=True, max_length=512)
```
**Mismatched vocabulary:**
```python
# Always load tokenizer and model from same checkpoint
tokenizer = AutoTokenizer.from_pretrained("model-id")
model = AutoModel.from_pretrained("model-id")
```
**Attention mask issues:**
```python
# Ensure attention_mask is passed
outputs = model(
input_ids=inputs["input_ids"],
attention_mask=inputs["attention_mask"]
)
```