Files
gh-rafaelcalleja-claude-mar…/skills/ai-multimodal/references/image-generation.md
2025-11-30 08:48:52 +08:00

559 lines
13 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# Image Generation Reference
Comprehensive guide for image creation, editing, and composition using Gemini API.
## Core Capabilities
- **Text-to-Image**: Generate images from text prompts
- **Image Editing**: Modify existing images with text instructions
- **Multi-Image Composition**: Combine up to 3 images
- **Iterative Refinement**: Refine images conversationally
- **Aspect Ratios**: Multiple formats (1:1, 16:9, 9:16, 4:3, 3:4)
- **Style Control**: Control artistic style and quality
- **Text in Images**: Limited text rendering (max 25 chars)
## Model
**gemini-2.5-flash-image** - Specialized for image generation
- Input tokens: 65,536
- Output tokens: 32,768
- Knowledge cutoff: June 2025
- Supports: Text and image inputs, image outputs
## Quick Start
### Basic Generation
```python
from google import genai
from google.genai import types
import os
client = genai.Client(api_key=os.getenv('GEMINI_API_KEY'))
response = client.models.generate_content(
model='gemini-2.5-flash-image',
contents='A serene mountain landscape at sunset with snow-capped peaks',
config=types.GenerateContentConfig(
response_modalities=['image'],
aspect_ratio='16:9'
)
)
# Save image
for i, part in enumerate(response.candidates[0].content.parts):
if part.inline_data:
with open(f'output-{i}.png', 'wb') as f:
f.write(part.inline_data.data)
```
## Aspect Ratios
| Ratio | Resolution | Use Case | Token Cost |
|-------|-----------|----------|------------|
| 1:1 | 1024×1024 | Social media, avatars | 1290 |
| 16:9 | 1344×768 | Landscapes, banners | 1290 |
| 9:16 | 768×1344 | Mobile, portraits | 1290 |
| 4:3 | 1152×896 | Traditional media | 1290 |
| 3:4 | 896×1152 | Vertical posters | 1290 |
All ratios cost the same: 1,290 tokens per image.
## Response Modalities
### Image Only
```python
config = types.GenerateContentConfig(
response_modalities=['image'],
aspect_ratio='1:1'
)
```
### Text Only (No Image)
```python
config = types.GenerateContentConfig(
response_modalities=['text']
)
# Returns text description instead of generating image
```
### Both Image and Text
```python
config = types.GenerateContentConfig(
response_modalities=['image', 'text'],
aspect_ratio='16:9'
)
# Returns both generated image and description
```
## Image Editing
### Modify Existing Image
```python
import PIL.Image
# Load original
img = PIL.Image.open('original.png')
# Edit with instructions
response = client.models.generate_content(
model='gemini-2.5-flash-image',
contents=[
'Add a red balloon floating in the sky',
img
],
config=types.GenerateContentConfig(
response_modalities=['image'],
aspect_ratio='16:9'
)
)
```
### Style Transfer
```python
img = PIL.Image.open('photo.jpg')
response = client.models.generate_content(
model='gemini-2.5-flash-image',
contents=[
'Transform this into an oil painting style',
img
]
)
```
### Object Addition/Removal
```python
# Add object
response = client.models.generate_content(
model='gemini-2.5-flash-image',
contents=[
'Add a vintage car parked on the street',
img
]
)
# Remove object
response = client.models.generate_content(
model='gemini-2.5-flash-image',
contents=[
'Remove the person on the left side',
img
]
)
```
## Multi-Image Composition
### Combine Multiple Images
```python
img1 = PIL.Image.open('background.png')
img2 = PIL.Image.open('foreground.png')
img3 = PIL.Image.open('overlay.png')
response = client.models.generate_content(
model='gemini-2.5-flash-image',
contents=[
'Combine these images into a cohesive scene',
img1,
img2,
img3
],
config=types.GenerateContentConfig(
response_modalities=['image'],
aspect_ratio='16:9'
)
)
```
**Note**: Recommended maximum 3 input images for best results.
## Prompt Engineering
### Effective Prompt Structure
**Three key elements**:
1. **Subject**: What to generate
2. **Context**: Environmental setting
3. **Style**: Artistic treatment
**Example**: "A robot [subject] in a futuristic city [context], cyberpunk style with neon lighting [style]"
### Quality Modifiers
**Technical terms**:
- "4K", "8K", "high resolution"
- "HDR", "high dynamic range"
- "professional photography"
- "studio lighting"
- "ultra detailed"
**Camera settings**:
- "35mm lens", "50mm lens"
- "shallow depth of field"
- "wide angle shot"
- "macro photography"
- "golden hour lighting"
### Style Keywords
**Art styles**:
- "oil painting", "watercolor", "sketch"
- "digital art", "concept art"
- "photorealistic", "hyperrealistic"
- "minimalist", "abstract"
- "cyberpunk", "steampunk", "fantasy"
**Mood and atmosphere**:
- "dramatic lighting", "soft lighting"
- "moody", "bright and cheerful"
- "mysterious", "whimsical"
- "dark and gritty", "pastel colors"
### Subject Description
**Be specific**:
- ❌ "A cat"
- ✅ "A fluffy orange tabby cat with green eyes"
**Add context**:
- ❌ "A building"
- ✅ "A modern glass skyscraper reflecting sunset clouds"
**Include details**:
- ❌ "A person"
- ✅ "A young woman in a red dress holding an umbrella"
### Composition and Framing
**Camera angles**:
- "bird's eye view", "aerial shot"
- "low angle", "high angle"
- "close-up", "wide shot"
- "centered composition"
- "rule of thirds"
**Perspective**:
- "first person view"
- "third person perspective"
- "isometric view"
- "forced perspective"
### Text in Images
**Limitations**:
- Maximum 25 characters total
- Up to 3 distinct text phrases
- Works best with simple text
**Best practices**:
```python
response = client.models.generate_content(
model='gemini-2.5-flash-image',
contents='A vintage poster with bold text "EXPLORE" at the top, mountain landscape, retro 1950s style'
)
```
**Font control**:
- "bold sans-serif title"
- "handwritten script"
- "vintage letterpress"
- "modern minimalist font"
## Advanced Techniques
### Iterative Refinement
```python
# Initial generation
response1 = client.models.generate_content(
model='gemini-2.5-flash-image',
contents='A futuristic city skyline'
)
# Save first version
with open('v1.png', 'wb') as f:
f.write(response1.candidates[0].content.parts[0].inline_data.data)
# Refine
img = PIL.Image.open('v1.png')
response2 = client.models.generate_content(
model='gemini-2.5-flash-image',
contents=[
'Add flying vehicles and neon signs',
img
]
)
```
### Negative Prompts (Indirect)
```python
# Instead of "no blur", be specific about what you want
response = client.models.generate_content(
model='gemini-2.5-flash-image',
contents='A crystal clear, sharp photograph of a diamond ring with perfect focus and high detail'
)
```
### Consistent Style Across Images
```python
base_prompt = "Digital art, vibrant colors, cel-shaded style, clean lines"
prompts = [
f"{base_prompt}, a warrior character",
f"{base_prompt}, a mage character",
f"{base_prompt}, a rogue character"
]
for i, prompt in enumerate(prompts):
response = client.models.generate_content(
model='gemini-2.5-flash-image',
contents=prompt
)
# Save each character
```
## Safety Settings
### Configure Safety Filters
```python
config = types.GenerateContentConfig(
response_modalities=['image'],
safety_settings=[
types.SafetySetting(
category=types.HarmCategory.HARM_CATEGORY_HATE_SPEECH,
threshold=types.HarmBlockThreshold.BLOCK_MEDIUM_AND_ABOVE
),
types.SafetySetting(
category=types.HarmCategory.HARM_CATEGORY_SEXUALLY_EXPLICIT,
threshold=types.HarmBlockThreshold.BLOCK_MEDIUM_AND_ABOVE
)
]
)
```
### Available Categories
- `HARM_CATEGORY_HATE_SPEECH`
- `HARM_CATEGORY_DANGEROUS_CONTENT`
- `HARM_CATEGORY_HARASSMENT`
- `HARM_CATEGORY_SEXUALLY_EXPLICIT`
### Thresholds
- `BLOCK_NONE`: No blocking
- `BLOCK_LOW_AND_ABOVE`: Block low probability and above
- `BLOCK_MEDIUM_AND_ABOVE`: Block medium and above (default)
- `BLOCK_ONLY_HIGH`: Block only high probability
## Common Use Cases
### 1. Marketing Assets
```python
response = client.models.generate_content(
model='gemini-2.5-flash-image',
contents='''Professional product photography:
- Sleek smartphone on minimalist white surface
- Dramatic side lighting creating subtle shadows
- Shallow depth of field, crisp focus
- Clean, modern aesthetic
- 4K quality
''',
config=types.GenerateContentConfig(
response_modalities=['image'],
aspect_ratio='4:3'
)
)
```
### 2. Concept Art
```python
response = client.models.generate_content(
model='gemini-2.5-flash-image',
contents='''Fantasy concept art:
- Ancient floating islands connected by chains
- Waterfalls cascading into clouds below
- Magical crystals glowing on the islands
- Epic scale, dramatic lighting
- Detailed digital painting style
''',
config=types.GenerateContentConfig(
response_modalities=['image'],
aspect_ratio='16:9'
)
)
```
### 3. Social Media Graphics
```python
response = client.models.generate_content(
model='gemini-2.5-flash-image',
contents='''Instagram post design:
- Pastel gradient background (pink to blue)
- Motivational quote layout
- Modern minimalist style
- Clean typography
- Mobile-friendly composition
''',
config=types.GenerateContentConfig(
response_modalities=['image'],
aspect_ratio='1:1'
)
)
```
### 4. Illustration
```python
response = client.models.generate_content(
model='gemini-2.5-flash-image',
contents='''Children's book illustration:
- Friendly cartoon dragon reading a book
- Bright, cheerful colors
- Soft, rounded shapes
- Whimsical forest background
- Warm, inviting atmosphere
''',
config=types.GenerateContentConfig(
response_modalities=['image'],
aspect_ratio='4:3'
)
)
```
### 5. UI/UX Mockups
```python
response = client.models.generate_content(
model='gemini-2.5-flash-image',
contents='''Modern mobile app interface:
- Clean dashboard design
- Card-based layout
- Soft shadows and gradients
- Contemporary color scheme (blue and white)
- Professional fintech aesthetic
''',
config=types.GenerateContentConfig(
response_modalities=['image'],
aspect_ratio='9:16'
)
)
```
## Best Practices
### Prompt Quality
1. **Be specific**: More detail = better results
2. **Order matters**: Most important elements first
3. **Use examples**: Reference known styles or artists
4. **Avoid contradictions**: Don't ask for opposing styles
5. **Test and iterate**: Refine prompts based on results
### File Management
```python
# Save with descriptive names
timestamp = int(time.time())
filename = f'generated_{timestamp}_{aspect_ratio}.png'
with open(filename, 'wb') as f:
f.write(image_data)
```
### Cost Optimization
**Token costs**:
- 1 image: 1,290 tokens = $0.00129 (Flash Image at $1/1M)
- 10 images: 12,900 tokens = $0.0129
- 100 images: 129,000 tokens = $0.129
**Strategies**:
- Generate fewer iterations
- Use text modality first to validate concept
- Batch similar requests
- Cache prompts for consistent style
## Error Handling
### Safety Filter Blocking
```python
try:
response = client.models.generate_content(
model='gemini-2.5-flash-image',
contents=prompt
)
except Exception as e:
# Check block reason
if hasattr(e, 'prompt_feedback'):
print(f"Blocked: {e.prompt_feedback.block_reason}")
# Modify prompt and retry
```
### Token Limit Exceeded
```python
# Keep prompts concise
if len(prompt) > 1000:
# Truncate or simplify
prompt = prompt[:1000]
```
## Limitations
- Maximum 3 input images for composition
- Text rendering limited (25 chars max)
- No video or animation generation
- Regional restrictions (child images in EEA, CH, UK)
- Optimal language support: English, Spanish (Mexico), Japanese, Mandarin, Hindi
- No real-time generation
- Cannot perfectly replicate specific people or copyrighted characters
## Troubleshooting
### aspect_ratio Parameter Error
**Error**: `Extra inputs are not permitted [type=extra_forbidden, input_value='1:1', input_type=str]`
**Cause**: The `aspect_ratio` parameter must be nested inside an `image_config` object, not passed directly to `GenerateContentConfig`.
**Incorrect Usage**:
```python
# ❌ This will fail
config = types.GenerateContentConfig(
response_modalities=['image'],
aspect_ratio='16:9' # Wrong - not a direct parameter
)
```
**Correct Usage**:
```python
# ✅ Correct implementation
config = types.GenerateContentConfig(
response_modalities=['Image'], # Note: Capital 'I'
image_config=types.ImageConfig(
aspect_ratio='16:9'
)
)
```
### Response Modality Case Sensitivity
The `response_modalities` parameter expects capital case values:
- ✅ Correct: `['Image']`, `['Text']`, `['Image', 'Text']`
- ❌ Wrong: `['image']`, `['text']`