Text Generation

Overview

Generate text with language models using the generate() method. Control output quality and style through generation strategies and parameters.

Basic Generation

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Tokenize input
inputs = tokenizer("Once upon a time", return_tensors="pt")

# Generate
outputs = model.generate(**inputs, max_new_tokens=50)

# Decode
text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(text)

Generation Strategies

Greedy Decoding

Select highest probability token at each step (deterministic):

outputs = model.generate(
    **inputs,
    max_new_tokens=50,
    do_sample=False  # Greedy decoding (default)
)

Use for: factual text, translations, and other cases where determinism is needed.

Sampling

Randomly sample from probability distribution:

outputs = model.generate(
    **inputs,
    max_new_tokens=50,
    do_sample=True,
    temperature=0.7,
    top_k=50,
    top_p=0.95
)

Use for: Creative writing, diverse outputs, open-ended generation.

Beam Search

Explore multiple hypotheses in parallel:

outputs = model.generate(
    **inputs,
    max_new_tokens=50,
    num_beams=5,
    early_stopping=True
)

Use for: translation, summarization, and other tasks where output quality is critical.

Contrastive Search

Balance quality and diversity:

outputs = model.generate(
    **inputs,
    max_new_tokens=50,
    penalty_alpha=0.6,
    top_k=4
)

Use for: Long-form generation, reducing repetition.

Key Parameters

Length Control

max_new_tokens: Maximum tokens to generate

max_new_tokens=100  # Generate up to 100 new tokens

max_length: Maximum total length (input + output)

max_length=512  # Total sequence length

min_new_tokens: Minimum tokens to generate

min_new_tokens=50  # Force at least 50 tokens

min_length: Minimum total length

min_length=100
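
These length arguments can be combined in a single call. A minimal sketch (values are illustrative):

# Generate at least 20 and at most 80 new tokens, regardless of prompt length
outputs = model.generate(
    **inputs,
    min_new_tokens=20,
    max_new_tokens=80
)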

Temperature

Controls randomness (only with sampling):

temperature=1.0   # Default, balanced
temperature=0.7   # More focused, less random
temperature=1.5   # More creative, more random

Lower temperature → more deterministic; higher temperature → more random.

Top-K Sampling

Consider only top K most likely tokens:

do_sample=True
top_k=50  # Sample from top 50 tokens

Common values: 40-100 for balanced output, 10-20 for focused output.

Top-P (Nucleus) Sampling

Sample from the smallest set of tokens whose cumulative probability exceeds P:

do_sample=True
top_p=0.95  # Sample from smallest set with 95% cumulative probability

Common values: 0.9-0.95 for balanced, 0.7-0.85 for focused.

Repetition Penalty

Discourage repetition:

repetition_penalty=1.2  # Penalize repeated tokens

Values: 1.0 = no penalty, 1.2-1.5 = moderate, 2.0+ = strong penalty.

Beam Search Parameters

num_beams: Number of beams

num_beams=5  # Keep 5 hypotheses

early_stopping: Stop the beam search once num_beams complete candidates are found

early_stopping=True

no_repeat_ngram_size: Prevent n-gram repetition

no_repeat_ngram_size=3  # Don't repeat any 3-gram

Output Control

num_return_sequences: Generate multiple outputs

outputs = model.generate(
    **inputs,
    max_new_tokens=50,
    num_beams=5,
    num_return_sequences=3  # Return 3 different sequences
)

pad_token_id: Specify padding token

pad_token_id=tokenizer.eos_token_id

eos_token_id: Stop generation at specific token

eos_token_id=tokenizer.eos_token_id
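
For GPT-2-style models that ship without a padding token, a common pattern is to reuse the EOS token for padding. A minimal sketch:

outputs = model.generate(
    **inputs,
    max_new_tokens=50,
    pad_token_id=tokenizer.eos_token_id,  # GPT-2 has no pad token; reuse EOS
    eos_token_id=tokenizer.eos_token_id   # stop once EOS is generated
)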

Advanced Features

Batch Generation

Generate for multiple prompts:

prompts = ["Hello, my name is", "Once upon a time"]

# GPT-2 has no pad token; reuse EOS and pad on the left for decoder-only models
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "left"

inputs = tokenizer(prompts, return_tensors="pt", padding=True)

outputs = model.generate(**inputs, max_new_tokens=50)

for i, output in enumerate(outputs):
    text = tokenizer.decode(output, skip_special_tokens=True)
    print(f"Prompt {i}: {text}\n")

Streaming Generation

Stream tokens as generated:

from transformers import TextIteratorStreamer
from threading import Thread

streamer = TextIteratorStreamer(tokenizer, skip_special_tokens=True)

generation_kwargs = dict(
    inputs,
    streamer=streamer,
    max_new_tokens=100
)

thread = Thread(target=model.generate, kwargs=generation_kwargs)
thread.start()

for text in streamer:
    print(text, end="", flush=True)

thread.join()

Constrained Generation

Force specific token sequences:

# Force specific words to appear somewhere in the generated output (requires beam search)
force_words = ["Paris", "France"]
force_words_ids = [tokenizer.encode(word, add_special_tokens=False) for word in force_words]

outputs = model.generate(
    **inputs,
    force_words_ids=force_words_ids,
    num_beams=5
)

Guidance and Control

Prevent bad words:

bad_words = ["offensive", "inappropriate"]
bad_words_ids = [tokenizer.encode(word, add_special_tokens=False) for word in bad_words]

outputs = model.generate(
    **inputs,
    bad_words_ids=bad_words_ids
)

Generation Config

Save and reuse generation parameters:

from transformers import GenerationConfig

# Create config
generation_config = GenerationConfig(
    max_new_tokens=100,
    temperature=0.7,
    top_k=50,
    top_p=0.95,
    do_sample=True
)

# Save
generation_config.save_pretrained("./my_generation_config")

# Load and use
generation_config = GenerationConfig.from_pretrained("./my_generation_config")
outputs = model.generate(**inputs, generation_config=generation_config)
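
A config can also be attached to the model so every generate() call picks it up by default; a sketch of that pattern:

# Make this config the model-wide default
model.generation_config = generation_config

outputs = model.generate(**inputs)  # uses max_new_tokens=100, sampling, etc.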

Model-Specific Generation

Chat Models

Use chat templates:

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the capital of France?"}
]

input_text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(input_text, return_tensors="pt")

outputs = model.generate(**inputs, max_new_tokens=100)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
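
Note that outputs[0] includes the prompt tokens. To return only the model's reply, slice them off before decoding; a minimal sketch:

prompt_length = inputs["input_ids"].shape[-1]
response = tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True)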

Encoder-Decoder Models

For T5, BART, etc.:

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")
tokenizer = AutoTokenizer.from_pretrained("t5-small")

# T5 uses task prefixes
input_text = "translate English to French: Hello, how are you?"
inputs = tokenizer(input_text, return_tensors="pt")

outputs = model.generate(**inputs, max_new_tokens=50)
translation = tokenizer.decode(outputs[0], skip_special_tokens=True)

Optimization

Caching

Enable KV cache for faster generation:

outputs = model.generate(
    **inputs,
    max_new_tokens=100,
    use_cache=True  # Default, faster generation
)

Static Cache

For fixed sequence lengths:

from transformers import StaticCache

cache = StaticCache(model.config, max_batch_size=1, max_cache_len=1024, device="cuda")

outputs = model.generate(
    **inputs,
    max_new_tokens=100,
    past_key_values=cache
)

Attention Implementation

Use Flash Attention for speed:

import torch

model = AutoModelForCausalLM.from_pretrained(
    "model-id",
    torch_dtype=torch.bfloat16,  # Flash Attention 2 requires fp16 or bf16
    attn_implementation="flash_attention_2"
)

Generation Recipes

Creative Writing

outputs = model.generate(
    **inputs,
    max_new_tokens=200,
    do_sample=True,
    temperature=0.8,
    top_k=50,
    top_p=0.95,
    repetition_penalty=1.2
)

Factual Generation

outputs = model.generate(
    **inputs,
    max_new_tokens=100,
    do_sample=False,  # Greedy
    repetition_penalty=1.1
)

Diverse Outputs

outputs = model.generate(
    **inputs,
    max_new_tokens=100,
    num_beams=5,
    num_return_sequences=5,
    temperature=1.5,
    do_sample=True
)

Long-Form Generation

outputs = model.generate(
    **inputs,
    max_new_tokens=1000,
    penalty_alpha=0.6,  # Contrastive search
    top_k=4,
    repetition_penalty=1.2
)

Translation/Summarization

outputs = model.generate(
    **inputs,
    max_new_tokens=100,
    num_beams=5,
    early_stopping=True,
    no_repeat_ngram_size=3
)

Common Issues

Repetitive output:

  • Increase repetition_penalty (1.2-1.5)
  • Use no_repeat_ngram_size (2-3)
  • Try contrastive search
  • Increase temperature (with do_sample=True)

Poor quality:

  • Use beam search (num_beams=5)
  • Lower temperature
  • Adjust top_k/top_p

Too deterministic:

  • Enable sampling (do_sample=True)
  • Increase temperature (0.7-1.0)
  • Adjust top_k/top_p

Slow generation:

  • Reduce batch size
  • Enable use_cache=True
  • Use Flash Attention
  • Reduce max_new_tokens

Best Practices

  1. Start with defaults: Then tune based on output
  2. Use appropriate strategy: Greedy for factual, sampling for creative
  3. Set max_new_tokens: Avoid unnecessarily long generation
  4. Enable caching: For faster sequential generation
  5. Tune temperature: Most impactful parameter for sampling
  6. Use beam search carefully: Slower but higher quality
  7. Set a random seed: For reproducible sampling runs (see the sketch after this list)
  8. Monitor memory: Large beams use significant memory
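
For item 7, sampling runs can be made reproducible by fixing the global seed with transformers' set_seed helper. A minimal sketch:

from transformers import set_seed

set_seed(42)  # Same seed + same parameters → same sampled output
outputs = model.generate(**inputs, max_new_tokens=50, do_sample=True, temperature=0.7)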