2025-11-30 08:48:52 +08:00

Image Generation Reference

Comprehensive guide to image creation, editing, and composition using the Gemini API.

Core Capabilities

  • Text-to-Image: Generate images from text prompts
  • Image Editing: Modify existing images with text instructions
  • Multi-Image Composition: Combine up to 3 images
  • Iterative Refinement: Refine images conversationally
  • Aspect Ratios: Multiple formats (1:1, 16:9, 9:16, 4:3, 3:4)
  • Style Control: Control artistic style and quality
  • Text in Images: Limited text rendering (max 25 chars)

Model

gemini-2.5-flash-image - Specialized for image generation

  • Input tokens: 65,536
  • Output tokens: 32,768
  • Knowledge cutoff: June 2025
  • Supports: Text and image inputs, image outputs

Quick Start

Basic Generation

from google import genai
from google.genai import types
import os

client = genai.Client(api_key=os.getenv('GEMINI_API_KEY'))

response = client.models.generate_content(
    model='gemini-2.5-flash-image',
    contents='A serene mountain landscape at sunset with snow-capped peaks',
    config=types.GenerateContentConfig(
        response_modalities=['Image'],
        # aspect_ratio is nested inside image_config, not passed directly
        image_config=types.ImageConfig(aspect_ratio='16:9')
    )
)

# Save image
for i, part in enumerate(response.candidates[0].content.parts):
    if part.inline_data:
        with open(f'output-{i}.png', 'wb') as f:
            f.write(part.inline_data.data)
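
The save loop above always writes a `.png` file. A slightly more defensive sketch picks the extension from each part's `mime_type` and skips text parts. `save_image_parts` and the `EXTENSIONS` map are hypothetical helpers, not part of the SDK; the code only assumes that each part exposes `inline_data.data` and `inline_data.mime_type`.

```python
from pathlib import Path

# Hypothetical mapping; extend it if the API returns other image types.
EXTENSIONS = {'image/png': '.png', 'image/jpeg': '.jpg', 'image/webp': '.webp'}


def save_image_parts(parts, out_dir='.', stem='output'):
    """Write every part that carries inline image data and return the paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    saved = []
    for i, part in enumerate(parts):
        blob = getattr(part, 'inline_data', None)
        if not blob or not blob.data:
            continue  # text parts carry no inline data
        ext = EXTENSIONS.get(blob.mime_type, '.png')  # fall back to PNG
        path = out_dir / f'{stem}-{i}{ext}'
        path.write_bytes(blob.data)
        saved.append(path)
    return saved
```

Typical use would be `save_image_parts(response.candidates[0].content.parts)`.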

Aspect Ratios

Ratio   Resolution   Use Case                Token Cost
1:1     1024×1024    Social media, avatars   1,290
16:9    1344×768     Landscapes, banners     1,290
9:16    768×1344     Mobile, portraits       1,290
4:3     1184×864     Traditional media       1,290
3:4     864×1184     Vertical posters        1,290

All ratios cost the same: 1,290 tokens per image.
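
When the target dimensions are known but the ratio is not, the nearest listed ratio can be chosen programmatically. The sketch below is a hypothetical helper (`closest_aspect_ratio` is not an SDK function) that compares ratios on a log scale so that landscape and portrait mismatches are weighted equally. It only considers the five ratios listed in this section.

```python
import math

# The five ratios listed in the table above.
SUPPORTED_RATIOS = ('1:1', '16:9', '9:16', '4:3', '3:4')


def closest_aspect_ratio(width, height):
    """Return the listed ratio string closest to width / height."""
    target = math.log(width / height)

    def distance(ratio):
        w, h = (int(n) for n in ratio.split(':'))
        return abs(math.log(w / h) - target)

    return min(SUPPORTED_RATIOS, key=distance)
```

For example, a 1920×1080 banner maps to `'16:9'` and a 1080×1920 story maps to `'9:16'`.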

Response Modalities

Image Only

config = types.GenerateContentConfig(
    response_modalities=['Image'],
    # aspect_ratio is nested inside image_config, not passed directly
    image_config=types.ImageConfig(aspect_ratio='1:1')
)

Text Only (No Image)

config = types.GenerateContentConfig(
    response_modalities=['Text']
)
# Returns text description instead of generating image

Both Image and Text

config = types.GenerateContentConfig(
    response_modalities=['Image', 'Text'],
    # aspect_ratio is nested inside image_config, not passed directly
    image_config=types.ImageConfig(aspect_ratio='16:9')
)
# Returns both generated image and description

Image Editing

Modify Existing Image

import PIL.Image

# Load original
img = PIL.Image.open('original.png')

# Edit with instructions
response = client.models.generate_content(
    model='gemini-2.5-flash-image',
    contents=[
        'Add a red balloon floating in the sky',
        img
    ],
    config=types.GenerateContentConfig(
        response_modalities=['Image'],
        # aspect_ratio is nested inside image_config, not passed directly
        image_config=types.ImageConfig(aspect_ratio='16:9')
    )
)

Style Transfer

img = PIL.Image.open('photo.jpg')

response = client.models.generate_content(
    model='gemini-2.5-flash-image',
    contents=[
        'Transform this into an oil painting style',
        img
    ]
)

Object Addition/Removal

# Add object
response = client.models.generate_content(
    model='gemini-2.5-flash-image',
    contents=[
        'Add a vintage car parked on the street',
        img
    ]
)

# Remove object
response = client.models.generate_content(
    model='gemini-2.5-flash-image',
    contents=[
        'Remove the person on the left side',
        img
    ]
)

Multi-Image Composition

Combine Multiple Images

img1 = PIL.Image.open('background.png')
img2 = PIL.Image.open('foreground.png')
img3 = PIL.Image.open('overlay.png')

response = client.models.generate_content(
    model='gemini-2.5-flash-image',
    contents=[
        'Combine these images into a cohesive scene',
        img1,
        img2,
        img3
    ],
    config=types.GenerateContentConfig(
        response_modalities=['Image'],
        # aspect_ratio is nested inside image_config, not passed directly
        image_config=types.ImageConfig(aspect_ratio='16:9')
    )
)

Note: Recommended maximum 3 input images for best results.
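
The three-image recommendation can be enforced before a request is sent. The following is a minimal sketch; `build_composition_contents` and `MAX_INPUT_IMAGES` are hypothetical names, and the function simply returns the `[instruction, image, ...]` list used in the example above.

```python
MAX_INPUT_IMAGES = 3  # recommended ceiling from the note above


def build_composition_contents(instruction, images):
    """Return [instruction, *images], rejecting empty or oversized inputs."""
    images = list(images)
    if not images:
        raise ValueError('at least one image is required')
    if len(images) > MAX_INPUT_IMAGES:
        raise ValueError(
            f'got {len(images)} images; recommended maximum is {MAX_INPUT_IMAGES}'
        )
    return [instruction, *images]
```

The result can be passed as `contents=` to `generate_content`.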

Prompt Engineering

Effective Prompt Structure

Three key elements:

  1. Subject: What to generate
  2. Context: Environmental setting
  3. Style: Artistic treatment

Example: "A robot [subject] in a futuristic city [context], cyberpunk style with neon lighting [style]"
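
The subject, context, style structure can be captured in a small helper so that every prompt is assembled the same way. This is an illustrative sketch; `build_prompt` is a hypothetical function and the joining rule (space between subject and context, comma before style) is just one reasonable choice.

```python
def build_prompt(subject, context='', style=''):
    """Join subject, context, and style into a single prompt string."""
    parts = [subject.strip(), context.strip()]
    core = ' '.join(p for p in parts if p)
    style = style.strip()
    return f'{core}, {style}' if style else core
```

Calling `build_prompt('A robot', 'in a futuristic city', 'cyberpunk style with neon lighting')` reproduces the example prompt above without the bracketed labels.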

Quality Modifiers

Technical terms:

  • "4K", "8K", "high resolution"
  • "HDR", "high dynamic range"
  • "professional photography"
  • "studio lighting"
  • "ultra detailed"

Camera settings:

  • "35mm lens", "50mm lens"
  • "shallow depth of field"
  • "wide angle shot"
  • "macro photography"
  • "golden hour lighting"

Style Keywords

Art styles:

  • "oil painting", "watercolor", "sketch"
  • "digital art", "concept art"
  • "photorealistic", "hyperrealistic"
  • "minimalist", "abstract"
  • "cyberpunk", "steampunk", "fantasy"

Mood and atmosphere:

  • "dramatic lighting", "soft lighting"
  • "moody", "bright and cheerful"
  • "mysterious", "whimsical"
  • "dark and gritty", "pastel colors"

Subject Description

Be specific:

  • Vague: "A cat"
  • Specific: "A fluffy orange tabby cat with green eyes"

Add context:

  • Vague: "A building"
  • Specific: "A modern glass skyscraper reflecting sunset clouds"

Include details:

  • Vague: "A person"
  • Specific: "A young woman in a red dress holding an umbrella"

Composition and Framing

Camera angles:

  • "bird's eye view", "aerial shot"
  • "low angle", "high angle"
  • "close-up", "wide shot"
  • "centered composition"
  • "rule of thirds"

Perspective:

  • "first person view"
  • "third person perspective"
  • "isometric view"
  • "forced perspective"

Text in Images

Limitations:

  • Maximum 25 characters total
  • Up to 3 distinct text phrases
  • Works best with simple text
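
These limits can be checked before a prompt is sent. The sketch below assumes that the text to render is written in double quotes inside the prompt, and that the 25-character budget includes spaces; both are assumptions, and `check_text_limits` is a hypothetical helper.

```python
import re

MAX_TOTAL_CHARS = 25  # limit stated above
MAX_PHRASES = 3       # limit stated above


def check_text_limits(prompt):
    """Return a list of problems found in the double-quoted text of a prompt."""
    phrases = re.findall(r'"([^"]*)"', prompt)
    total = sum(len(p) for p in phrases)
    problems = []
    if len(phrases) > MAX_PHRASES:
        problems.append(f'{len(phrases)} phrases (max {MAX_PHRASES})')
    if total > MAX_TOTAL_CHARS:
        problems.append(f'{total} characters (max {MAX_TOTAL_CHARS})')
    return problems
```

An empty list means the quoted text fits within both limits.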

Best practices:

response = client.models.generate_content(
    model='gemini-2.5-flash-image',
    contents='A vintage poster with bold text "EXPLORE" at the top, mountain landscape, retro 1950s style'
)

Font control:

  • "bold sans-serif title"
  • "handwritten script"
  • "vintage letterpress"
  • "modern minimalist font"

Advanced Techniques

Iterative Refinement

# Initial generation
response1 = client.models.generate_content(
    model='gemini-2.5-flash-image',
    contents='A futuristic city skyline'
)

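# Save first version (the image is not guaranteed to be the first part)
parts = response1.candidates[0].content.parts
image_part = next(p for p in parts if p.inline_data)
with open('v1.png', 'wb') as f:
    f.write(image_part.inline_data.data)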

# Refine
img = PIL.Image.open('v1.png')
response2 = client.models.generate_content(
    model='gemini-2.5-flash-image',
    contents=[
        'Add flying vehicles and neon signs',
        img
    ]
)

Negative Prompts (Indirect)

# Instead of "no blur", be specific about what you want
response = client.models.generate_content(
    model='gemini-2.5-flash-image',
    contents='A crystal clear, sharp photograph of a diamond ring with perfect focus and high detail'
)

Consistent Style Across Images

base_prompt = "Digital art, vibrant colors, cel-shaded style, clean lines"

prompts = [
    f"{base_prompt}, a warrior character",
    f"{base_prompt}, a mage character",
    f"{base_prompt}, a rogue character"
]

for i, prompt in enumerate(prompts):
    response = client.models.generate_content(
        model='gemini-2.5-flash-image',
        contents=prompt
    )
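    # Save each character; skip any text parts in the response
    for part in response.candidates[0].content.parts:
        if part.inline_data:
            with open(f'character-{i}.png', 'wb') as f:
                f.write(part.inline_data.data)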

Safety Settings

Configure Safety Filters

config = types.GenerateContentConfig(
    response_modalities=['Image'],
    safety_settings=[
        types.SafetySetting(
            category=types.HarmCategory.HARM_CATEGORY_HATE_SPEECH,
            threshold=types.HarmBlockThreshold.BLOCK_MEDIUM_AND_ABOVE
        ),
        types.SafetySetting(
            category=types.HarmCategory.HARM_CATEGORY_SEXUALLY_EXPLICIT,
            threshold=types.HarmBlockThreshold.BLOCK_MEDIUM_AND_ABOVE
        )
    ]
)

Available Categories

  • HARM_CATEGORY_HATE_SPEECH
  • HARM_CATEGORY_DANGEROUS_CONTENT
  • HARM_CATEGORY_HARASSMENT
  • HARM_CATEGORY_SEXUALLY_EXPLICIT

Thresholds

  • BLOCK_NONE: No blocking
  • BLOCK_LOW_AND_ABOVE: Block low probability and above
  • BLOCK_MEDIUM_AND_ABOVE: Block medium and above (default)
  • BLOCK_ONLY_HIGH: Block only high probability
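
To apply one threshold to all four categories, the settings list can be generated rather than written out by hand. This sketch builds plain dictionaries and assumes the SDK accepts them in place of `types.SafetySetting` objects; if it does not, each category and threshold pair can be passed to `types.SafetySetting` instead. `uniform_safety_settings` is a hypothetical helper.

```python
CATEGORIES = (
    'HARM_CATEGORY_HATE_SPEECH',
    'HARM_CATEGORY_DANGEROUS_CONTENT',
    'HARM_CATEGORY_HARASSMENT',
    'HARM_CATEGORY_SEXUALLY_EXPLICIT',
)
THRESHOLDS = (
    'BLOCK_NONE',
    'BLOCK_LOW_AND_ABOVE',
    'BLOCK_MEDIUM_AND_ABOVE',
    'BLOCK_ONLY_HIGH',
)


def uniform_safety_settings(threshold='BLOCK_MEDIUM_AND_ABOVE'):
    """Return one setting per listed category, all with the same threshold."""
    if threshold not in THRESHOLDS:
        raise ValueError(f'unknown threshold: {threshold}')
    return [{'category': c, 'threshold': threshold} for c in CATEGORIES]
```

The result would be passed as `safety_settings=uniform_safety_settings('BLOCK_ONLY_HIGH')`.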

Common Use Cases

1. Marketing Assets

response = client.models.generate_content(
    model='gemini-2.5-flash-image',
    contents='''Professional product photography:
    - Sleek smartphone on minimalist white surface
    - Dramatic side lighting creating subtle shadows
    - Shallow depth of field, crisp focus
    - Clean, modern aesthetic
    - 4K quality
    ''',
    config=types.GenerateContentConfig(
        response_modalities=['Image'],
        # aspect_ratio is nested inside image_config, not passed directly
        image_config=types.ImageConfig(aspect_ratio='4:3')
    )
)

2. Concept Art

response = client.models.generate_content(
    model='gemini-2.5-flash-image',
    contents='''Fantasy concept art:
    - Ancient floating islands connected by chains
    - Waterfalls cascading into clouds below
    - Magical crystals glowing on the islands
    - Epic scale, dramatic lighting
    - Detailed digital painting style
    ''',
    config=types.GenerateContentConfig(
        response_modalities=['Image'],
        # aspect_ratio is nested inside image_config, not passed directly
        image_config=types.ImageConfig(aspect_ratio='16:9')
    )
)

3. Social Media Graphics

response = client.models.generate_content(
    model='gemini-2.5-flash-image',
    contents='''Instagram post design:
    - Pastel gradient background (pink to blue)
    - Motivational quote layout
    - Modern minimalist style
    - Clean typography
    - Mobile-friendly composition
    ''',
    config=types.GenerateContentConfig(
        response_modalities=['Image'],
        # aspect_ratio is nested inside image_config, not passed directly
        image_config=types.ImageConfig(aspect_ratio='1:1')
    )
)

4. Illustration

response = client.models.generate_content(
    model='gemini-2.5-flash-image',
    contents='''Children's book illustration:
    - Friendly cartoon dragon reading a book
    - Bright, cheerful colors
    - Soft, rounded shapes
    - Whimsical forest background
    - Warm, inviting atmosphere
    ''',
    config=types.GenerateContentConfig(
        response_modalities=['Image'],
        # aspect_ratio is nested inside image_config, not passed directly
        image_config=types.ImageConfig(aspect_ratio='4:3')
    )
)

5. UI/UX Mockups

response = client.models.generate_content(
    model='gemini-2.5-flash-image',
    contents='''Modern mobile app interface:
    - Clean dashboard design
    - Card-based layout
    - Soft shadows and gradients
    - Contemporary color scheme (blue and white)
    - Professional fintech aesthetic
    ''',
    config=types.GenerateContentConfig(
        response_modalities=['Image'],
        # aspect_ratio is nested inside image_config, not passed directly
        image_config=types.ImageConfig(aspect_ratio='9:16')
    )
)

Best Practices

Prompt Quality

  1. Be specific: More detail = better results
  2. Order matters: Most important elements first
  3. Use examples: Reference known styles or artists
  4. Avoid contradictions: Don't ask for opposing styles
  5. Test and iterate: Refine prompts based on results

File Management

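import time

# Save with descriptive names
timestamp = int(time.time())
# ':' is not allowed in Windows filenames, so write 16:9 as 16x9
ratio_label = aspect_ratio.replace(':', 'x')
filename = f'generated_{timestamp}_{ratio_label}.png'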

with open(filename, 'wb') as f:
    f.write(image_data)

Cost Optimization

Token costs:

  • 1 image: 1,290 tokens ≈ $0.039 (Flash Image output at $30/1M tokens)
  • 10 images: 12,900 tokens ≈ $0.39
  • 100 images: 129,000 tokens ≈ $3.87
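
The arithmetic behind these figures is a single multiplication, sketched below. The per-token rate is left as a required argument because pricing changes over time; `estimate_cost` is a hypothetical helper and the only constant taken from this document is the 1,290 tokens per image.

```python
TOKENS_PER_IMAGE = 1290  # flat cost per generated image, per the table above


def estimate_cost(num_images, usd_per_million_tokens):
    """Return (total_tokens, cost_in_usd) for a batch of generated images."""
    tokens = num_images * TOKENS_PER_IMAGE
    cost = tokens * usd_per_million_tokens / 1_000_000
    return tokens, cost
```

Calling `estimate_cost(100, rate)` with the current output-token rate gives the token total and dollar cost for a batch of 100 images.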

Strategies:

  • Generate fewer iterations
  • Use text modality first to validate concept
  • Batch similar requests
  • Cache prompts for consistent style

Error Handling

Safety Filter Blocking

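from google.genai import errors

try:
    response = client.models.generate_content(
        model='gemini-2.5-flash-image',
        contents=prompt
    )
except errors.APIError as e:
    # Quota, network, and invalid-request failures are raised as API errors
    print(f"Request failed: {e}")
else:
    # A blocked prompt does not raise; the reason is reported on the response
    feedback = response.prompt_feedback
    if feedback and feedback.block_reason:
        print(f"Blocked: {feedback.block_reason}")
        # Modify prompt and retry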
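
The "modify prompt and retry" step can be structured as a loop over progressively softer prompt variants. The sketch below is deliberately independent of the SDK: `generate` is any callable that takes a prompt and returns a response, and the code assumes a blocked response exposes `prompt_feedback.block_reason`. `first_unblocked` is a hypothetical helper.

```python
def first_unblocked(generate, prompts):
    """Try each prompt in order and return (prompt, response) for the first
    response that has no block reason, or (None, None) if all are blocked."""
    for prompt in prompts:
        response = generate(prompt)
        feedback = getattr(response, 'prompt_feedback', None)
        if getattr(feedback, 'block_reason', None):
            continue  # blocked: fall through to the next variant
        return prompt, response
    return None, None
```

In practice `generate` could be a small lambda wrapping `client.models.generate_content` with a fixed model and config.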

Token Limit Exceeded

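# Keep prompts concise. Character count is only a rough proxy for tokens,
# so treat the 1000-character cap as a heuristic, not an exact limit.
if len(prompt) > 1000:
    # Truncate or simplify
    prompt = prompt[:1000]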
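
A plain slice can cut a prompt in the middle of a word. A gentler variant trims at the last word boundary, as sketched below; `shorten_prompt` is a hypothetical helper and the 1000-character default simply mirrors the heuristic above.

```python
def shorten_prompt(prompt, max_chars=1000):
    """Trim a prompt to at most max_chars, cutting at a word boundary."""
    if len(prompt) <= max_chars:
        return prompt
    cut = prompt[:max_chars]
    if prompt[max_chars].isspace():
        return cut.rstrip()  # the cut already falls between two words
    head, sep, _ = cut.rpartition(' ')
    # With no space to cut at, fall back to the hard slice
    return (head if sep else cut).rstrip(' ,;')
```

Simplifying the prompt by hand usually preserves intent better than any automatic truncation, so this is best treated as a fallback.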

Limitations

  • Maximum 3 input images for composition
  • Text rendering limited (25 chars max)
  • No video or animation generation
  • Regional restrictions: uploading images of children is not supported in the EEA, Switzerland, and the UK
  • Optimal language support: English, Spanish (Mexico), Japanese, Mandarin, Hindi
  • No real-time generation
  • Cannot perfectly replicate specific people or copyrighted characters

Troubleshooting

aspect_ratio Parameter Error

Error: Extra inputs are not permitted [type=extra_forbidden, input_value='1:1', input_type=str]

Cause: The aspect_ratio parameter must be nested inside an image_config object, not passed directly to GenerateContentConfig.

Incorrect Usage:

# ❌ This will fail
config = types.GenerateContentConfig(
    response_modalities=['image'],
    aspect_ratio='16:9'  # Wrong - not a direct parameter
)

Correct Usage:

# ✅ Correct implementation
config = types.GenerateContentConfig(
    response_modalities=['Image'],  # Note: Capital 'I'
    image_config=types.ImageConfig(
        aspect_ratio='16:9'
    )
)

Response Modality Case Sensitivity

The response_modalities parameter expects capitalized values:

  • Correct: ['Image'], ['Text'], ['Image', 'Text']
  • Wrong: ['image'], ['text']
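
Both fixes in this section can be applied in one place by building the config through a helper. The sketch below returns a plain dictionary and assumes the SDK accepts a dict for `config` in place of `types.GenerateContentConfig`; if it does not, the same values can be passed to `types.GenerateContentConfig` and `types.ImageConfig`. `image_generation_config` is a hypothetical helper.

```python
def image_generation_config(aspect_ratio, modalities=('Image',)):
    """Build a config dict with capitalized modalities and a nested
    image_config, matching the correct usage shown above."""
    return {
        'response_modalities': [m.capitalize() for m in modalities],
        'image_config': {'aspect_ratio': aspect_ratio},
    }
```

For example, `image_generation_config('16:9', ['image', 'text'])` yields modalities `['Image', 'Text']` with the aspect ratio nested under `image_config`.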