Initial commit

2025-11-30 08:48:52 +08:00
commit 6ec3196ecc
434 changed files with 125248 additions and 0 deletions
--- a/skills/ai-multimodal/references/image-generation.md
+++ b/skills/ai-multimodal/references/image-generation.md
@@ -0,0 +1,558 @@
+# Image Generation Reference
+
+Comprehensive guide to image creation, editing, and composition with the Gemini API.
+
+## Core Capabilities
+
+- **Text-to-Image**: Generate images from text prompts
+- **Image Editing**: Modify existing images with text instructions
+- **Multi-Image Composition**: Combine up to 3 images
+- **Iterative Refinement**: Refine images conversationally
+- **Aspect Ratios**: Multiple formats (1:1, 16:9, 9:16, 4:3, 3:4)
+- **Style Control**: Control artistic style and quality
+- **Text in Images**: Limited text rendering (max 25 chars)
+
+## Model
+
+**gemini-2.5-flash-image** - Specialized for image generation
+- Input tokens: 65,536
+- Output tokens: 32,768
+- Knowledge cutoff: June 2025
+- Supports: Text and image inputs, image outputs
+
+## Quick Start
+
+### Basic Generation
+
+```python
+from google import genai
+from google.genai import types
+import os
+
+client = genai.Client(api_key=os.getenv('GEMINI_API_KEY'))
+
+response = client.models.generate_content(
+    model='gemini-2.5-flash-image',
+    contents='A serene mountain landscape at sunset with snow-capped peaks',
+    config=types.GenerateContentConfig(
+        response_modalities=['Image'],
+        image_config=types.ImageConfig(aspect_ratio='16:9')
+    )
+)
+
+# Save image
+for i, part in enumerate(response.candidates[0].content.parts):
+    if part.inline_data:
+        with open(f'output-{i}.png', 'wb') as f:
+            f.write(part.inline_data.data)
+```
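
A response can mix text and image parts, so saving is more robust when each part is checked and the file extension follows the part's MIME type. The following is a minimal sketch, assuming each part exposes `inline_data.data` (as in the loop above) and `inline_data.mime_type`; `extract_images` is a hypothetical helper, not part of the SDK.

```python
import mimetypes


def extract_images(response):
    """Return (extension, bytes) for every image part in the first candidate."""
    images = []
    for part in response.candidates[0].content.parts:
        blob = getattr(part, 'inline_data', None)
        if blob is None or not blob.data:
            continue  # text parts carry no inline data
        # Fall back to .png when the MIME type is missing or unknown
        ext = mimetypes.guess_extension(blob.mime_type or '') or '.png'
        images.append((ext, blob.data))
    return images


# Usage (with a real response object):
# for i, (ext, data) in enumerate(extract_images(response)):
#     with open(f'output-{i}{ext}', 'wb') as f:
#         f.write(data)
```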
+
+## Aspect Ratios
+
+| Ratio | Resolution | Use Case | Token Cost |
+|-------|-----------|----------|------------|
+| 1:1 | 1024×1024 | Social media, avatars | 1290 |
+| 16:9 | 1344×768 | Landscapes, banners | 1290 |
+| 9:16 | 768×1344 | Mobile, portraits | 1290 |
+| 4:3 | 1184×864 | Traditional media | 1290 |
+| 3:4 | 864×1184 | Vertical posters | 1290 |
+
+All ratios cost the same: 1,290 tokens per image.
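
When a source image or target canvas has arbitrary dimensions, it can help to snap to the nearest ratio in the table before building the request. A small sketch under that assumption; `closest_aspect_ratio` is a hypothetical helper and only covers the five ratios listed above.

```python
# Ratios from the table above, expressed as width / height
SUPPORTED_RATIOS = {
    '1:1': 1 / 1,
    '16:9': 16 / 9,
    '9:16': 9 / 16,
    '4:3': 4 / 3,
    '3:4': 3 / 4,
}


def closest_aspect_ratio(width, height):
    """Return the supported ratio string nearest to width / height."""
    target = width / height
    return min(SUPPORTED_RATIOS, key=lambda r: abs(SUPPORTED_RATIOS[r] - target))


ratio = closest_aspect_ratio(1920, 1080)  # a 1920x1080 canvas maps to '16:9'
```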
+
+## Response Modalities
+
+### Image Only
+
+```python
+config = types.GenerateContentConfig(
+    response_modalities=['Image'],
+    image_config=types.ImageConfig(aspect_ratio='1:1')
+)
+```
+
+### Text Only (No Image)
+
+```python
+config = types.GenerateContentConfig(
+    response_modalities=['Text']
+)
+# Returns text description instead of generating image
+```
+
+### Both Image and Text
+
+```python
+config = types.GenerateContentConfig(
+    response_modalities=['Image', 'Text'],
+    image_config=types.ImageConfig(aspect_ratio='16:9')
+)
+# Returns both generated image and description
+```
+
+## Image Editing
+
+### Modify Existing Image
+
+```python
+import PIL.Image
+
+# Load original
+img = PIL.Image.open('original.png')
+
+# Edit with instructions
+response = client.models.generate_content(
+    model='gemini-2.5-flash-image',
+    contents=[
+        'Add a red balloon floating in the sky',
+        img
+    ],
+    config=types.GenerateContentConfig(
+        response_modalities=['Image'],
+        image_config=types.ImageConfig(aspect_ratio='16:9')
+    )
+)
+```
+
+### Style Transfer
+
+```python
+img = PIL.Image.open('photo.jpg')
+
+response = client.models.generate_content(
+    model='gemini-2.5-flash-image',
+    contents=[
+        'Transform this into an oil painting style',
+        img
+    ]
+)
+```
+
+### Object Addition/Removal
+
+```python
+# Add object
+response = client.models.generate_content(
+    model='gemini-2.5-flash-image',
+    contents=[
+        'Add a vintage car parked on the street',
+        img
+    ]
+)
+
+# Remove object
+response = client.models.generate_content(
+    model='gemini-2.5-flash-image',
+    contents=[
+        'Remove the person on the left side',
+        img
+    ]
+)
+```
+
+## Multi-Image Composition
+
+### Combine Multiple Images
+
+```python
+img1 = PIL.Image.open('background.png')
+img2 = PIL.Image.open('foreground.png')
+img3 = PIL.Image.open('overlay.png')
+
+response = client.models.generate_content(
+    model='gemini-2.5-flash-image',
+    contents=[
+        'Combine these images into a cohesive scene',
+        img1,
+        img2,
+        img3
+    ],
+    config=types.GenerateContentConfig(
+        response_modalities=['Image'],
+        image_config=types.ImageConfig(aspect_ratio='16:9')
+    )
+)
+```
+
+**Note**: Use at most 3 input images for best results.
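
The three-image limit can be enforced before a request is sent. This is a sketch only: `build_contents` and `MAX_INPUT_IMAGES` are hypothetical names, and the images can be any objects the SDK accepts in `contents` (for example `PIL.Image` instances).

```python
MAX_INPUT_IMAGES = 3  # recommended upper bound from the note above


def build_contents(instruction, images):
    """Return a contents list: the instruction followed by 1 to 3 images."""
    if not 1 <= len(images) <= MAX_INPUT_IMAGES:
        raise ValueError(
            f'expected 1 to {MAX_INPUT_IMAGES} images, got {len(images)}'
        )
    return [instruction, *images]
```

The returned list can be passed directly as `contents=` in the composition call.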
+
+## Prompt Engineering
+
+### Effective Prompt Structure
+
+**Three key elements**:
+1. **Subject**: What to generate
+2. **Context**: Environmental setting
+3. **Style**: Artistic treatment
+
+**Example**: "A robot [subject] in a futuristic city [context], cyberpunk style with neon lighting [style]"
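
The subject/context/style structure can be kept explicit in code, which makes it easier to vary one element at a time. A minimal sketch; `build_prompt` is a hypothetical helper, and the joining format simply mirrors the example above.

```python
def build_prompt(subject, context, style):
    """Join subject, context, and style into one prompt string."""
    return f'{subject.strip()} {context.strip()}, {style.strip()}'


prompt = build_prompt(
    'A robot',
    'in a futuristic city',
    'cyberpunk style with neon lighting',
)
```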
+
+### Quality Modifiers
+
+**Technical terms**:
+- "4K", "8K", "high resolution"
+- "HDR", "high dynamic range"
+- "professional photography"
+- "studio lighting"
+- "ultra detailed"
+
+**Camera settings**:
+- "35mm lens", "50mm lens"
+- "shallow depth of field"
+- "wide angle shot"
+- "macro photography"
+- "golden hour lighting"
+
+### Style Keywords
+
+**Art styles**:
+- "oil painting", "watercolor", "sketch"
+- "digital art", "concept art"
+- "photorealistic", "hyperrealistic"
+- "minimalist", "abstract"
+- "cyberpunk", "steampunk", "fantasy"
+
+**Mood and atmosphere**:
+- "dramatic lighting", "soft lighting"
+- "moody", "bright and cheerful"
+- "mysterious", "whimsical"
+- "dark and gritty", "pastel colors"
+
+### Subject Description
+
+**Be specific**:
+- ❌ "A cat"
+- ✅ "A fluffy orange tabby cat with green eyes"
+
+**Add context**:
+- ❌ "A building"
+- ✅ "A modern glass skyscraper reflecting sunset clouds"
+
+**Include details**:
+- ❌ "A person"
+- ✅ "A young woman in a red dress holding an umbrella"
+
+### Composition and Framing
+
+**Camera angles**:
+- "bird's eye view", "aerial shot"
+- "low angle", "high angle"
+- "close-up", "wide shot"
+- "centered composition"
+- "rule of thirds"
+
+**Perspective**:
+- "first person view"
+- "third person perspective"
+- "isometric view"
+- "forced perspective"
+
+### Text in Images
+
+**Limitations**:
+- Maximum 25 characters total
+- Up to 3 distinct text phrases
+- Works best with simple text
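
These limits can be checked before the text is placed in a prompt. A rough sketch, assuming the 25-character budget applies to the combined length of all phrases; `check_image_text` is a hypothetical helper.

```python
MAX_TOTAL_CHARS = 25  # total characters across all phrases
MAX_PHRASES = 3       # distinct text phrases per image


def check_image_text(phrases):
    """Return a list of problems; an empty list means the text fits the limits."""
    problems = []
    if len(phrases) > MAX_PHRASES:
        problems.append(f'phrase count {len(phrases)} exceeds {MAX_PHRASES}')
    total = sum(len(p) for p in phrases)
    if total > MAX_TOTAL_CHARS:
        problems.append(f'total length {total} exceeds {MAX_TOTAL_CHARS}')
    return problems
```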
+
+**Best practices**:
+```python
+response = client.models.generate_content(
+    model='gemini-2.5-flash-image',
+    contents='A vintage poster with bold text "EXPLORE" at the top, mountain landscape, retro 1950s style'
+)
+```
+
+**Font control**:
+- "bold sans-serif title"
+- "handwritten script"
+- "vintage letterpress"
+- "modern minimalist font"
+
+## Advanced Techniques
+
+### Iterative Refinement
+
+```python
+# Initial generation
+response1 = client.models.generate_content(
+    model='gemini-2.5-flash-image',
+    contents='A futuristic city skyline'
+)
+
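+# Save first version (the image is not guaranteed to be the first part)
+image_part = next(p for p in response1.candidates[0].content.parts if p.inline_data)
+with open('v1.png', 'wb') as f:
+    f.write(image_part.inline_data.data)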
+
+# Refine
+img = PIL.Image.open('v1.png')
+response2 = client.models.generate_content(
+    model='gemini-2.5-flash-image',
+    contents=[
+        'Add flying vehicles and neon signs',
+        img
+    ]
+)
+```
+
+### Negative Prompts (Indirect)
+
+```python
+# Instead of "no blur", be specific about what you want
+response = client.models.generate_content(
+    model='gemini-2.5-flash-image',
+    contents='A crystal clear, sharp photograph of a diamond ring with perfect focus and high detail'
+)
+```
+
+### Consistent Style Across Images
+
+```python
+base_prompt = "Digital art, vibrant colors, cel-shaded style, clean lines"
+
+prompts = [
+    f"{base_prompt}, a warrior character",
+    f"{base_prompt}, a mage character",
+    f"{base_prompt}, a rogue character"
+]
+
+for i, prompt in enumerate(prompts):
+    response = client.models.generate_content(
+        model='gemini-2.5-flash-image',
+        contents=prompt
+    )
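+    # Save each character to its own file
+    for part in response.candidates[0].content.parts:
+        if part.inline_data:
+            with open(f'character-{i}.png', 'wb') as f:
+                f.write(part.inline_data.data)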
+```
+
+## Safety Settings
+
+### Configure Safety Filters
+
+```python
+config = types.GenerateContentConfig(
+    response_modalities=['Image'],
+    safety_settings=[
+        types.SafetySetting(
+            category=types.HarmCategory.HARM_CATEGORY_HATE_SPEECH,
+            threshold=types.HarmBlockThreshold.BLOCK_MEDIUM_AND_ABOVE
+        ),
+        types.SafetySetting(
+            category=types.HarmCategory.HARM_CATEGORY_SEXUALLY_EXPLICIT,
+            threshold=types.HarmBlockThreshold.BLOCK_MEDIUM_AND_ABOVE
+        )
+    ]
+)
+```
+
+### Available Categories
+
+- `HARM_CATEGORY_HATE_SPEECH`
+- `HARM_CATEGORY_DANGEROUS_CONTENT`
+- `HARM_CATEGORY_HARASSMENT`
+- `HARM_CATEGORY_SEXUALLY_EXPLICIT`
+
+### Thresholds
+
+- `BLOCK_NONE`: No blocking
+- `BLOCK_LOW_AND_ABOVE`: Block low probability and above
+- `BLOCK_MEDIUM_AND_ABOVE`: Block medium and above (default)
+- `BLOCK_ONLY_HIGH`: Block only high probability
+
+## Common Use Cases
+
+### 1. Marketing Assets
+
+```python
+response = client.models.generate_content(
+    model='gemini-2.5-flash-image',
+    contents='''Professional product photography:
+    - Sleek smartphone on minimalist white surface
+    - Dramatic side lighting creating subtle shadows
+    - Shallow depth of field, crisp focus
+    - Clean, modern aesthetic
+    - 4K quality
+    ''',
+    config=types.GenerateContentConfig(
+        response_modalities=['Image'],
+        image_config=types.ImageConfig(aspect_ratio='4:3')
+    )
+)
+```
+
+### 2. Concept Art
+
+```python
+response = client.models.generate_content(
+    model='gemini-2.5-flash-image',
+    contents='''Fantasy concept art:
+    - Ancient floating islands connected by chains
+    - Waterfalls cascading into clouds below
+    - Magical crystals glowing on the islands
+    - Epic scale, dramatic lighting
+    - Detailed digital painting style
+    ''',
+    config=types.GenerateContentConfig(
+        response_modalities=['Image'],
+        image_config=types.ImageConfig(aspect_ratio='16:9')
+    )
+)
+```
+
+### 3. Social Media Graphics
+
+```python
+response = client.models.generate_content(
+    model='gemini-2.5-flash-image',
+    contents='''Instagram post design:
+    - Pastel gradient background (pink to blue)
+    - Motivational quote layout
+    - Modern minimalist style
+    - Clean typography
+    - Mobile-friendly composition
+    ''',
+    config=types.GenerateContentConfig(
+        response_modalities=['Image'],
+        image_config=types.ImageConfig(aspect_ratio='1:1')
+    )
+)
+```
+
+### 4. Illustration
+
+```python
+response = client.models.generate_content(
+    model='gemini-2.5-flash-image',
+    contents='''Children's book illustration:
+    - Friendly cartoon dragon reading a book
+    - Bright, cheerful colors
+    - Soft, rounded shapes
+    - Whimsical forest background
+    - Warm, inviting atmosphere
+    ''',
+    config=types.GenerateContentConfig(
+        response_modalities=['Image'],
+        image_config=types.ImageConfig(aspect_ratio='4:3')
+    )
+)
+```
+
+### 5. UI/UX Mockups
+
+```python
+response = client.models.generate_content(
+    model='gemini-2.5-flash-image',
+    contents='''Modern mobile app interface:
+    - Clean dashboard design
+    - Card-based layout
+    - Soft shadows and gradients
+    - Contemporary color scheme (blue and white)
+    - Professional fintech aesthetic
+    ''',
+    config=types.GenerateContentConfig(
+        response_modalities=['Image'],
+        image_config=types.ImageConfig(aspect_ratio='9:16')
+    )
+)
+```
+
+## Best Practices
+
+### Prompt Quality
+
+1. **Be specific**: More detail = better results
+2. **Order matters**: Most important elements first
+3. **Use examples**: Reference known styles or artists
+4. **Avoid contradictions**: Don't ask for opposing styles
+5. **Test and iterate**: Refine prompts based on results
+
+### File Management
+
+```python
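+import time
+
+# Save with descriptive names. aspect_ratio (e.g. '16:9') and image_data (bytes)
+# come from the generation step; ':' is replaced because Windows filenames reject it.
+timestamp = int(time.time())
+filename = f"generated_{timestamp}_{aspect_ratio.replace(':', 'x')}.png"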
+
+with open(filename, 'wb') as f:
+    f.write(image_data)
+```
+
+### Cost Optimization
+
+**Token costs**:
+- 1 image: 1,290 output tokens ≈ $0.039 (Flash Image at $30/1M output tokens)
+- 10 images: 12,900 tokens ≈ $0.39
+- 100 images: 129,000 tokens ≈ $3.87
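
The arithmetic behind these figures is a flat token count per image multiplied by the per-token price. A small sketch with the price passed in rather than hard-coded, since rates change; `estimate_cost_usd` is a hypothetical helper.

```python
TOKENS_PER_IMAGE = 1290  # flat per-image token cost, same for every aspect ratio


def estimate_cost_usd(num_images, usd_per_million_tokens):
    """Estimate spend in USD for num_images at the given token price."""
    tokens = num_images * TOKENS_PER_IMAGE
    return tokens * usd_per_million_tokens / 1_000_000
```

Calling it with the image count and the current published rate gives a quick budget check before running a batch.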
+
+**Strategies**:
+- Generate fewer iterations
+- Use text modality first to validate concept
+- Batch similar requests
+- Cache prompts for consistent style
+
+## Error Handling
+
+### Safety Filter Blocking
+
+```python
+response = client.models.generate_content(
+    model='gemini-2.5-flash-image',
+    contents=prompt
+)
+
+# Blocked prompts are typically reported on the response rather than raised
+feedback = response.prompt_feedback
+if feedback and feedback.block_reason:
+    print(f"Blocked: {feedback.block_reason}")
+    # Modify prompt and retry
+elif not response.candidates:
+    print("No candidates returned")
+
+### Token Limit Exceeded
+
+```python
+# Keep prompts concise. len() counts characters, not tokens,
+# so this is only a rough guard.
+if len(prompt) > 1000:
+    # Truncate or simplify
+    prompt = prompt[:1000]
+```
+
+## Limitations
+
+- Maximum 3 input images for composition
+- Text rendering limited (25 chars max)
+- No video or animation generation
+- Regional restrictions (child images in EEA, CH, UK)
+- Optimal language support: English, Spanish (Mexico), Japanese, Mandarin, Hindi
+- No real-time generation
+- Cannot perfectly replicate specific people or copyrighted characters
+
+## Troubleshooting
+
+### aspect_ratio Parameter Error
+
+**Error**: `Extra inputs are not permitted [type=extra_forbidden, input_value='1:1', input_type=str]`
+
+**Cause**: The `aspect_ratio` parameter must be nested inside an `image_config` object, not passed directly to `GenerateContentConfig`.
+
+**Incorrect Usage**:
+```python
+# ❌ This will fail
+config = types.GenerateContentConfig(
+    response_modalities=['image'],
+    aspect_ratio='16:9'  # Wrong - not a direct parameter
+)
+```
+
+**Correct Usage**:
+```python
+# ✅ Correct implementation
+config = types.GenerateContentConfig(
+    response_modalities=['Image'],  # Note: Capital 'I'
+    image_config=types.ImageConfig(
+        aspect_ratio='16:9'
+    )
+)
+```
+
+### Response Modality Case Sensitivity
+
+The `response_modalities` parameter expects capitalized values:
+- ✅ Correct: `['Image']`, `['Text']`, `['Image', 'Text']`
+- ❌ Wrong: `['image']`, `['text']`
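
Modality strings that come from user input or config files can be normalized before they reach `GenerateContentConfig`. A minimal sketch, assuming only the two modalities shown above are needed; `normalize_modalities` is a hypothetical helper.

```python
# Map of accepted lowercase names to the capitalized form shown above
_MODALITIES = {'image': 'Image', 'text': 'Text'}


def normalize_modalities(modalities):
    """Return modality names in capitalized form, e.g. 'image' -> 'Image'."""
    normalized = []
    for name in modalities:
        key = name.strip().lower()
        if key not in _MODALITIES:
            raise ValueError(f'unsupported modality: {name!r}')
        normalized.append(_MODALITIES[key])
    return normalized
```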