Text Generation
Overview
Generate text with language models using the generate() method. Control output quality and style through generation strategies and parameters.
Basic Generation
from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")
# Tokenize input
inputs = tokenizer("Once upon a time", return_tensors="pt")
# Generate
outputs = model.generate(**inputs, max_new_tokens=50)
# Decode
text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(text)
Generation Strategies
Greedy Decoding
Select highest probability token at each step (deterministic):
outputs = model.generate(
**inputs,
max_new_tokens=50,
do_sample=False # Greedy decoding (default)
)
Use for: Factual text, translations, and other cases where determinism is needed.
Sampling
Randomly sample from probability distribution:
outputs = model.generate(
**inputs,
max_new_tokens=50,
do_sample=True,
temperature=0.7,
top_k=50,
top_p=0.95
)
Use for: Creative writing, diverse outputs, open-ended generation.
Beam Search
Explore multiple hypotheses in parallel:
outputs = model.generate(
**inputs,
max_new_tokens=50,
num_beams=5,
early_stopping=True
)
Use for: Translations, summarization, and other tasks where output quality is critical.
Contrastive Search
Balance quality and diversity:
outputs = model.generate(
**inputs,
max_new_tokens=50,
penalty_alpha=0.6,
top_k=4
)
Use for: Long-form generation, reducing repetition.
Key Parameters
Length Control
max_new_tokens: Maximum tokens to generate
max_new_tokens=100 # Generate up to 100 new tokens
max_length: Maximum total length (input + output)
max_length=512 # Total sequence length
min_new_tokens: Minimum tokens to generate
min_new_tokens=50 # Force at least 50 tokens
min_length: Minimum total length
min_length=100
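A minimal sketch combining these, reusing the model and inputs from the basic example above (if both max_new_tokens and max_length are set, max_new_tokens takes precedence):
# Bound the output: at least 20 and at most 100 newly generated tokens
outputs = model.generate(
    **inputs,
    min_new_tokens=20,
    max_new_tokens=100
)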
Temperature
Controls randomness (only with sampling):
temperature=1.0 # Default, balanced
temperature=0.7 # More focused, less random
temperature=1.5 # More creative, more random
Lower temperature → more deterministic; higher temperature → more random.
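As a quick comparison, here is a sketch that samples the same prompt at two temperatures, reusing the model and inputs defined above:
# Lower temperature gives more focused text, higher gives more varied text
for temp in (0.7, 1.5):
    outputs = model.generate(**inputs, do_sample=True, temperature=temp, max_new_tokens=30)
    print(f"temperature={temp}: {tokenizer.decode(outputs[0], skip_special_tokens=True)}")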
Top-K Sampling
Consider only top K most likely tokens:
do_sample=True
top_k=50 # Sample from top 50 tokens
Common values: 40-100 for balanced output, 10-20 for focused output.
Top-P (Nucleus) Sampling
Sample from the smallest set of tokens whose cumulative probability reaches P:
do_sample=True
top_p=0.95 # Sample from smallest set with 95% cumulative probability
Common values: 0.9-0.95 for balanced, 0.7-0.85 for focused.
Repetition Penalty
Discourage repetition:
repetition_penalty=1.2 # Penalize repeated tokens
Values: 1.0 = no penalty, 1.2-1.5 = moderate, 2.0+ = strong penalty.
Beam Search Parameters
num_beams: Number of beams
num_beams=5 # Keep 5 hypotheses
early_stopping: Stop beam search once num_beams complete candidates have been found
early_stopping=True
no_repeat_ngram_size: Prevent n-gram repetition
no_repeat_ngram_size=3 # Don't repeat any 3-gram
Output Control
num_return_sequences: Generate multiple outputs
outputs = model.generate(
**inputs,
max_new_tokens=50,
num_beams=5,
num_return_sequences=3 # Return 3 different sequences
)
pad_token_id: Specify padding token
pad_token_id=tokenizer.eos_token_id
eos_token_id: Stop generation at specific token
eos_token_id=tokenizer.eos_token_id
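Putting these together, a sketch that samples three candidates and decodes each one (GPT-2 has no pad token, so EOS is reused):
# Sample 3 candidate continuations and decode each
outputs = model.generate(
    **inputs,
    max_new_tokens=50,
    do_sample=True,
    num_return_sequences=3,
    pad_token_id=tokenizer.eos_token_id,
    eos_token_id=tokenizer.eos_token_id
)
for i, output in enumerate(outputs):
    print(f"Candidate {i}: {tokenizer.decode(output, skip_special_tokens=True)}")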
Advanced Features
Batch Generation
Generate for multiple prompts:
prompts = ["Hello, my name is", "Once upon a time"]
# GPT-2 has no pad token, so reuse EOS and pad on the left for decoder-only models
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "left"
inputs = tokenizer(prompts, return_tensors="pt", padding=True)
outputs = model.generate(**inputs, max_new_tokens=50, pad_token_id=tokenizer.eos_token_id)
for i, output in enumerate(outputs):
    text = tokenizer.decode(output, skip_special_tokens=True)
    print(f"Prompt {i}: {text}\n")
Streaming Generation
Stream tokens as generated:
from transformers import TextIteratorStreamer
from threading import Thread
streamer = TextIteratorStreamer(tokenizer, skip_special_tokens=True)
generation_kwargs = dict(
inputs,
streamer=streamer,
max_new_tokens=100
)
thread = Thread(target=model.generate, kwargs=generation_kwargs)
thread.start()
for text in streamer:
    print(text, end="", flush=True)
thread.join()
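If you only need to print tokens to the console, TextStreamer is a simpler alternative that avoids the extra thread:
from transformers import TextStreamer
# Prints tokens to stdout as they are generated; skip_prompt hides the input prompt
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
outputs = model.generate(**inputs, streamer=streamer, max_new_tokens=100)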
Constrained Generation
Force specific words to appear in the output (requires beam search):
# Each forced word must appear somewhere in the generated text
force_words = ["Paris", "France"]
force_words_ids = [tokenizer.encode(word, add_special_tokens=False) for word in force_words]
outputs = model.generate(
**inputs,
force_words_ids=force_words_ids,
num_beams=5
)
Guidance and Control
Prevent bad words:
bad_words = ["offensive", "inappropriate"]
bad_words_ids = [tokenizer.encode(word, add_special_tokens=False) for word in bad_words]
outputs = model.generate(
**inputs,
bad_words_ids=bad_words_ids
)
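One caveat with BPE tokenizers such as GPT-2's: a word that appears mid-sentence is tokenized with a leading space, so banning both variants is usually safer. A sketch:
# Ban both the bare word and its mid-sentence (leading-space) form
bad_words = ["offensive", "inappropriate"]
variants = bad_words + [" " + w for w in bad_words]
bad_words_ids = [tokenizer.encode(w, add_special_tokens=False) for w in variants]
outputs = model.generate(**inputs, bad_words_ids=bad_words_ids)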
Generation Config
Save and reuse generation parameters:
from transformers import GenerationConfig
# Create config
generation_config = GenerationConfig(
max_new_tokens=100,
temperature=0.7,
top_k=50,
top_p=0.95,
do_sample=True
)
# Save
generation_config.save_pretrained("./my_generation_config")
# Load and use
generation_config = GenerationConfig.from_pretrained("./my_generation_config")
outputs = model.generate(**inputs, generation_config=generation_config)
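You can also attach the config to the model so every generate() call uses it by default; keyword arguments passed to generate() still override it:
# Make this config the model's default for generate()
model.generation_config = generation_config
outputs = model.generate(**inputs)  # Picks up max_new_tokens, sampling settings, etc.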
Model-Specific Generation
Chat Models
Use chat templates:
messages = [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "What is the capital of France?"}
]
input_text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
# The template already adds special tokens, so don't add them again when tokenizing
inputs = tokenizer(input_text, return_tensors="pt", add_special_tokens=False)
outputs = model.generate(**inputs, max_new_tokens=100)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
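With decoder-only chat models the returned sequence includes the prompt, so a common pattern is to slice off the input tokens before decoding to get only the assistant's reply:
# Decode only the newly generated tokens (everything after the prompt)
new_tokens = outputs[0][inputs["input_ids"].shape[-1]:]
reply = tokenizer.decode(new_tokens, skip_special_tokens=True)
print(reply)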
Encoder-Decoder Models
For T5, BART, etc.:
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")
tokenizer = AutoTokenizer.from_pretrained("t5-small")
# T5 uses task prefixes
input_text = "translate English to French: Hello, how are you?"
inputs = tokenizer(input_text, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50)
translation = tokenizer.decode(outputs[0], skip_special_tokens=True)
Optimization
Caching
Enable KV cache for faster generation:
outputs = model.generate(
**inputs,
max_new_tokens=100,
use_cache=True # Default, faster generation
)
Static Cache
For fixed sequence lengths:
from transformers import StaticCache
cache = StaticCache(model.config, max_batch_size=1, max_cache_len=1024, device="cuda")
outputs = model.generate(
**inputs,
max_new_tokens=100,
past_key_values=cache
)
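Recent transformers releases can also build the static cache for you; this sketch assumes a version that supports the cache_implementation argument:
# Let generate() allocate a static KV cache internally
outputs = model.generate(
    **inputs,
    max_new_tokens=100,
    cache_implementation="static"
)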
Attention Implementation
Use Flash Attention for speed:
import torch

model = AutoModelForCausalLM.from_pretrained(
    "model-id",
    attn_implementation="flash_attention_2",
    torch_dtype=torch.bfloat16  # Flash Attention 2 requires fp16 or bf16 weights
)
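If flash-attn is not installed, PyTorch's built-in scaled-dot-product attention (SDPA) backend is a dependency-free alternative:
# SDPA ships with PyTorch 2.x; no extra package required
model = AutoModelForCausalLM.from_pretrained(
    "model-id",
    attn_implementation="sdpa"
)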
Generation Recipes
Creative Writing
outputs = model.generate(
**inputs,
max_new_tokens=200,
do_sample=True,
temperature=0.8,
top_k=50,
top_p=0.95,
repetition_penalty=1.2
)
Factual Generation
outputs = model.generate(
**inputs,
max_new_tokens=100,
do_sample=False, # Greedy
repetition_penalty=1.1
)
Diverse Outputs
outputs = model.generate(
**inputs,
max_new_tokens=100,
num_beams=5,
num_return_sequences=5,
temperature=1.5,
do_sample=True
)
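An alternative for distinct candidates is group (diverse) beam search, which penalizes different beam groups for producing the same tokens; note that it requires do_sample=False:
outputs = model.generate(
    **inputs,
    max_new_tokens=100,
    num_beams=6,
    num_beam_groups=3,      # Split the 6 beams into 3 groups
    diversity_penalty=1.0,  # Penalize overlap between groups
    num_return_sequences=3
)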
Long-Form Generation
outputs = model.generate(
**inputs,
max_new_tokens=1000,
penalty_alpha=0.6, # Contrastive search
top_k=4,
repetition_penalty=1.2
)
Translation/Summarization
outputs = model.generate(
**inputs,
max_new_tokens=100,
num_beams=5,
early_stopping=True,
no_repeat_ngram_size=3
)
Common Issues
Repetitive output:
- Increase repetition_penalty (1.2-1.5)
- Use no_repeat_ngram_size (2-3)
- Try contrastive search
- Raise temperature slightly (when sampling)
Poor quality:
- Use beam search (num_beams=5)
- Lower temperature
- Adjust top_k/top_p
Too deterministic:
- Enable sampling (do_sample=True)
- Increase temperature (0.7-1.0)
- Adjust top_k/top_p
Slow generation:
- Reduce batch size
- Enable use_cache=True
- Use Flash Attention
- Reduce max_new_tokens
Best Practices
- Start with defaults: Then tune based on output
- Use appropriate strategy: Greedy for factual, sampling for creative
- Set max_new_tokens: Avoid unnecessarily long generation
- Enable caching: For faster sequential generation
- Tune temperature: Most impactful parameter for sampling
- Use beam search carefully: Slower but higher quality
- Set a fixed seed: For reproducible sampling runs (see the sketch after this list)
- Monitor memory: Large beams use significant memory
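A minimal sketch for the seed tip above: transformers.set_seed fixes the Python, NumPy, and PyTorch RNGs so sampled outputs are repeatable:
from transformers import set_seed
set_seed(42)  # Same seed + same parameters => same sampled output
outputs = model.generate(**inputs, do_sample=True, temperature=0.8, max_new_tokens=50)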