Initial commit

Zhongwei Li
2025-11-30 09:03:52 +08:00
commit 0b586b3216
42 changed files with 5241 additions and 0 deletions


@@ -0,0 +1,190 @@
---
name: gemini-imagegen
description: Generate and edit images using the Gemini API (Nano Banana). Use this skill when creating images from text prompts, editing existing images, applying style transfers, generating logos with text, creating stickers, product mockups, or any image generation/manipulation task. Supports text-to-image, image editing, multi-turn refinement, and composition from multiple reference images.
---
# Gemini Image Generation (Nano Banana)
Generate and edit images using Google's Gemini API. The environment variable `GEMINI_API_KEY` must be set.
## Available Models
| Model | Alias | Resolution | Best For |
|-------|-------|------------|----------|
| `gemini-2.5-flash-image` | Nano Banana | 1024px | Speed, high-volume tasks |
| `gemini-3-pro-image-preview` | Nano Banana Pro | Up to 4K | Professional assets, complex instructions, text rendering |
## Quick Start Scripts
### Text-to-Image
```bash
python scripts/generate_image.py "A cat wearing a wizard hat" output.png
```
### Edit Existing Image
```bash
python scripts/edit_image.py input.png "Add a rainbow in the background" output.png
```
### Multi-Turn Chat (Iterative Refinement)
```bash
python scripts/multi_turn_chat.py
```
## Core API Pattern
All image generation uses the `generateContent` endpoint with `responseModalities: ["TEXT", "IMAGE"]`:
```python
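import os

from google import genai
from google.genai import types

client = genai.Client(api_key=os.environ["GEMINI_API_KEY"])

response = client.models.generate_content(
    model="gemini-2.5-flash-image",
    contents=["Your prompt here"],
    # Request both text and image parts, matching the description above.
    config=types.GenerateContentConfig(response_modalities=["TEXT", "IMAGE"]),
)

# A response can mix text parts and image parts; handle each kind.
for part in response.parts:
    if part.text:
        print(part.text)
    elif part.inline_data:
        image = part.as_image()
        image.save("output.png")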
```
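The loop above overwrites `output.png` if the model returns more than one image, and it assumes the response has parts at all. A minimal sketch of a more defensive variant follows. `save_parts` is a hypothetical helper, not part of the SDK; it only assumes that each part exposes `text`, `inline_data`, and `as_image()` as in the snippet above.

```python
from pathlib import Path


def save_parts(parts, output_path):
    """Save every image part and collect text parts.

    Hypothetical helper: returns (list of saved paths, joined text or None).
    """
    output_path = Path(output_path)
    saved, texts = [], []
    for part in parts or []:  # tolerate a response with no parts
        if getattr(part, "text", None):
            texts.append(part.text)
        elif getattr(part, "inline_data", None) is not None:
            # The first image keeps the requested name; later ones get a numeric suffix.
            index = len(saved)
            if index == 0:
                target = output_path
            else:
                target = output_path.with_name(
                    f"{output_path.stem}_{index + 1}{output_path.suffix}"
                )
            part.as_image().save(target)
            saved.append(target)
    return saved, ("\n".join(texts) or None)
```

Called as `save_parts(response.parts, "output.png")`, it would write `output.png`, then `output_2.png`, and so on.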
## Image Configuration Options
Control output with `image_config`:
```python
from google.genai import types

response = client.models.generate_content(
    model="gemini-3-pro-image-preview",
    contents=[prompt],
    config=types.GenerateContentConfig(
        response_modalities=['TEXT', 'IMAGE'],
        image_config=types.ImageConfig(
            aspect_ratio="16:9",  # 1:1, 2:3, 3:2, 3:4, 4:3, 4:5, 5:4, 9:16, 16:9, 21:9
            image_size="2K",  # 1K, 2K, 4K (Pro only for 4K)
        ),
    ),
)
```
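Invalid option combinations are cheaper to catch before the request is sent. The sketch below checks the values listed in the comments above and the "4K is Pro only" rule. `validate_image_options` and the constant names are hypothetical, and the allowed values are copied from this document rather than queried from the API.

```python
ASPECT_RATIOS = {"1:1", "2:3", "3:2", "3:4", "4:3", "4:5", "5:4", "9:16", "16:9", "21:9"}
IMAGE_SIZES = {"1K", "2K", "4K"}
PRO_MODEL = "gemini-3-pro-image-preview"


def validate_image_options(model, aspect_ratio=None, image_size=None):
    """Return keyword arguments for an image config, or raise ValueError.

    Hypothetical helper: only options that were actually given appear in the result.
    """
    kwargs = {}
    if aspect_ratio is not None:
        if aspect_ratio not in ASPECT_RATIOS:
            raise ValueError(f"Unsupported aspect ratio: {aspect_ratio}")
        kwargs["aspect_ratio"] = aspect_ratio
    if image_size is not None:
        if image_size not in IMAGE_SIZES:
            raise ValueError(f"Unsupported image size: {image_size}")
        if image_size == "4K" and model != PRO_MODEL:
            raise ValueError("4K output is listed as Pro only")
        kwargs["image_size"] = image_size
    return kwargs
```

The returned dictionary can be unpacked into `types.ImageConfig(**kwargs)` when it is non-empty.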
## Editing Images
Pass existing images with text prompts:
```python
from PIL import Image

img = Image.open("input.png")

response = client.models.generate_content(
    model="gemini-2.5-flash-image",
    contents=["Add a sunset to this scene", img],
)
```
## Multi-Turn Refinement
Use chat for iterative editing:
```python
from google.genai import types

chat = client.chats.create(
    model="gemini-2.5-flash-image",
    config=types.GenerateContentConfig(response_modalities=['TEXT', 'IMAGE']),
)

response = chat.send_message("Create a logo for 'Acme Corp'")
# Save first image...

response = chat.send_message("Make the text bolder and add a blue gradient")
# Save refined image...
```
## Prompting Best Practices
### Photorealistic Scenes
Include camera details: lens type, lighting, angle, mood.
> "A photorealistic close-up portrait, 85mm lens, soft golden hour light, shallow depth of field"
### Stylized Art
Specify style explicitly:
> "A kawaii-style sticker of a happy red panda, bold outlines, cel-shading, white background"
### Text in Images
Be explicit about font style and placement. Use `gemini-3-pro-image-preview` for best results:
> "Create a logo with text 'Daily Grind' in clean sans-serif, black and white, coffee bean motif"
### Product Mockups
Describe lighting setup and surface:
> "Studio-lit product photo on polished concrete, three-point softbox setup, 45-degree angle"
### Landing Pages
Specify layout structure, color scheme, and target audience:
> "Modern landing page hero section, gradient background from deep purple to blue, centered headline with CTA button, clean minimalist design, SaaS product"
> "Landing page for fitness app, energetic layout with workout photos, bright orange and black color scheme, mobile-first design, prominent download buttons"
### Website Design Ideas
Describe overall aesthetic, navigation style, and content hierarchy:
> "E-commerce homepage wireframe, grid layout for products, sticky navigation bar, warm earth tones, plenty of whitespace, professional photography style"
> "Portfolio website for photographer, full-screen image galleries, dark mode interface, elegant serif typography, minimal UI elements to highlight work"
> "Tech startup homepage, glassmorphism design trend, floating cards, neon accent colors on dark background, modern illustrations, hero section with product demo"
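The example prompts above share one shape: a subject followed by comma-separated details such as lens, lighting, layout, or color scheme. A minimal sketch of assembling prompts that way is shown here. `build_prompt` is a hypothetical helper and does nothing model-specific.

```python
def build_prompt(subject, *details):
    """Join a subject and optional detail phrases into one comma-separated prompt."""
    if not subject or not subject.strip():
        raise ValueError("A prompt needs a subject")
    pieces = [subject, *details]
    # Drop empty details and stray trailing punctuation so the joins stay clean.
    cleaned = [p.strip().rstrip(",.") for p in pieces if p and p.strip()]
    return ", ".join(cleaned)


prompt = build_prompt(
    "A photorealistic close-up portrait",
    "85mm lens",
    "soft golden hour light",
    "shallow depth of field",
)
```

Keeping the details as separate arguments makes it easy to swap a single one, for example the lighting, between attempts.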
## Advanced Features (Pro Model Only)
### Google Search Grounding
Generate images based on real-time data:
```python
response = client.models.generate_content(
    model="gemini-3-pro-image-preview",
    contents=["Visualize today's weather in Tokyo as an infographic"],
    config=types.GenerateContentConfig(
        response_modalities=['TEXT', 'IMAGE'],
        tools=[{"google_search": {}}],
    ),
)
```
### Multiple Reference Images (Up to 14)
Combine elements from multiple sources:
```python
response = client.models.generate_content(
    model="gemini-3-pro-image-preview",
    contents=[
        "Create a group photo of these people in an office",
        Image.open("person1.png"),
        Image.open("person2.png"),
        Image.open("person3.png"),
    ],
)
```
## REST API (curl)
```bash
curl -s -X POST \
  "https://generativelanguage.googleapis.com/v1beta/models/gemini-2.5-flash-image:generateContent" \
  -H "x-goog-api-key: $GEMINI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "contents": [{"parts": [{"text": "A serene mountain landscape"}]}]
  }' | jq -r '.candidates[0].content.parts[] | select(.inlineData) | .inlineData.data' | base64 --decode > output.png
```
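The `jq` filter above picks the base64 payload out of the first candidate's parts. The same extraction can be sketched in Python with only the standard library. `extract_images` is a hypothetical helper, and it assumes the response has exactly the JSON shape the `jq` filter relies on (`candidates[0].content.parts[].inlineData.data`).

```python
import base64
import json


def extract_images(response_body):
    """Return decoded image bytes for each inlineData part of the first candidate."""
    if isinstance(response_body, (str, bytes)):
        payload = json.loads(response_body)
    else:
        payload = response_body  # already a parsed dict
    candidates = payload.get("candidates") or []
    if not candidates:
        return []
    parts = candidates[0].get("content", {}).get("parts", [])
    # Text parts have no "inlineData" key and are skipped.
    return [base64.b64decode(p["inlineData"]["data"]) for p in parts if "inlineData" in p]
```

Each returned item is raw image data that can be written to disk with `open("output.png", "wb").write(...)`.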
## Notes
- All generated images include SynthID watermarks
- Image-only mode (`responseModalities: ["IMAGE"]`) won't work with Google Search grounding
- For editing, describe changes conversationally—the model understands semantic masking
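
The grounding restriction in the notes above can be checked before a request is built. This is a sketch under the assumption that the note is the whole rule; `check_modalities` is a hypothetical helper, not an SDK function.

```python
def check_modalities(response_modalities, google_search=False):
    """Normalize the modality list and reject combinations the notes rule out."""
    modalities = [m.upper() for m in response_modalities]
    if "IMAGE" not in modalities:
        raise ValueError("Image generation needs IMAGE in the response modalities")
    if google_search and "TEXT" not in modalities:
        # Image-only mode is documented above as incompatible with grounding.
        raise ValueError("Google Search grounding needs TEXT alongside IMAGE")
    return modalities
```

The returned list can be passed as `response_modalities` when building the generation config.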


@@ -0,0 +1,157 @@
#!/usr/bin/env python3
"""
Compose multiple images into a new image using Gemini API.

Usage:
    python compose_images.py "instruction" output.png image1.png [image2.png ...]

Examples:
    python compose_images.py "Create a group photo of these people" group.png person1.png person2.png
    python compose_images.py "Put the cat from the first image on the couch from the second" result.png cat.png couch.png
    python compose_images.py "Apply the art style from the first image to the scene in the second" styled.png style.png photo.png

Note: Supports up to 14 reference images (Gemini 3 Pro only).

Environment:
    GEMINI_API_KEY - Required API key
"""

import argparse
import os
import sys

from PIL import Image
from google import genai
from google.genai import types


def compose_images(
    instruction: str,
    output_path: str,
    image_paths: list[str],
    model: str = "gemini-3-pro-image-preview",
    aspect_ratio: str | None = None,
    image_size: str | None = None,
) -> str | None:
    """Compose multiple images based on instructions.

    Args:
        instruction: Text description of how to combine images
        output_path: Path to save the result
        image_paths: List of input image paths (up to 14)
        model: Gemini model to use (pro recommended)
        aspect_ratio: Output aspect ratio
        image_size: Output resolution

    Returns:
        Any text response from the model, or None
    """
    api_key = os.environ.get("GEMINI_API_KEY")
    if not api_key:
        raise EnvironmentError("GEMINI_API_KEY environment variable not set")

    if len(image_paths) > 14:
        raise ValueError("Maximum 14 reference images supported")

    if len(image_paths) < 1:
        raise ValueError("At least one image is required")

    # Verify all images exist
    for path in image_paths:
        if not os.path.exists(path):
            raise FileNotFoundError(f"Image not found: {path}")

    client = genai.Client(api_key=api_key)

    # Load images
    images = [Image.open(path) for path in image_paths]

    # Build contents: instruction first, then images
    contents = [instruction] + images

    # Build config
    config_kwargs = {"response_modalities": ["TEXT", "IMAGE"]}

    image_config_kwargs = {}
    if aspect_ratio:
        image_config_kwargs["aspect_ratio"] = aspect_ratio
    if image_size:
        image_config_kwargs["image_size"] = image_size

    if image_config_kwargs:
        config_kwargs["image_config"] = types.ImageConfig(**image_config_kwargs)

    config = types.GenerateContentConfig(**config_kwargs)

    response = client.models.generate_content(
        model=model,
        contents=contents,
        config=config,
    )

    text_response = None
    image_saved = False

    for part in response.parts:
        if part.text is not None:
            text_response = part.text
        elif part.inline_data is not None:
            image = part.as_image()
            image.save(output_path)
            image_saved = True

    if not image_saved:
        raise RuntimeError("No image was generated.")

    return text_response


def main():
    parser = argparse.ArgumentParser(
        description="Compose multiple images using Gemini API",
        formatter_class=argparse.RawDescriptionHelpFormatter,
        epilog=__doc__
    )
    parser.add_argument("instruction", help="Composition instruction")
    parser.add_argument("output", help="Output file path")
    parser.add_argument("images", nargs="+", help="Input images (up to 14)")
    parser.add_argument(
        "--model", "-m",
        default="gemini-3-pro-image-preview",
        choices=["gemini-2.5-flash-image", "gemini-3-pro-image-preview"],
        help="Model to use (pro recommended for composition)"
    )
    parser.add_argument(
        "--aspect", "-a",
        choices=["1:1", "2:3", "3:2", "3:4", "4:3", "4:5", "5:4", "9:16", "16:9", "21:9"],
        help="Output aspect ratio"
    )
    parser.add_argument(
        "--size", "-s",
        choices=["1K", "2K", "4K"],
        help="Output resolution"
    )

    args = parser.parse_args()

    try:
        text = compose_images(
            instruction=args.instruction,
            output_path=args.output,
            image_paths=args.images,
            model=args.model,
            aspect_ratio=args.aspect,
            image_size=args.size,
        )

        print(f"Composed image saved to: {args.output}")
        if text:
            print(f"Model response: {text}")

    except Exception as e:
        print(f"Error: {e}", file=sys.stderr)
        sys.exit(1)


if __name__ == "__main__":
    main()


@@ -0,0 +1,144 @@
#!/usr/bin/env python3
"""
Edit existing images using Gemini API.

Usage:
    python edit_image.py input.png "edit instruction" output.png [options]

Examples:
    python edit_image.py photo.png "Add a rainbow in the sky" edited.png
    python edit_image.py room.jpg "Change the sofa to red leather" room_edited.jpg
    python edit_image.py portrait.png "Make it look like a Van Gogh painting" artistic.png --model gemini-3-pro-image-preview

Environment:
    GEMINI_API_KEY - Required API key
"""

import argparse
import os
import sys

from PIL import Image
from google import genai
from google.genai import types


def edit_image(
    input_path: str,
    instruction: str,
    output_path: str,
    model: str = "gemini-2.5-flash-image",
    aspect_ratio: str | None = None,
    image_size: str | None = None,
) -> str | None:
    """Edit an existing image based on text instructions.

    Args:
        input_path: Path to the input image
        instruction: Text description of edits to make
        output_path: Path to save the edited image
        model: Gemini model to use
        aspect_ratio: Output aspect ratio
        image_size: Output resolution

    Returns:
        Any text response from the model, or None
    """
    api_key = os.environ.get("GEMINI_API_KEY")
    if not api_key:
        raise EnvironmentError("GEMINI_API_KEY environment variable not set")

    if not os.path.exists(input_path):
        raise FileNotFoundError(f"Input image not found: {input_path}")

    client = genai.Client(api_key=api_key)

    # Load input image
    input_image = Image.open(input_path)

    # Build config
    config_kwargs = {"response_modalities": ["TEXT", "IMAGE"]}

    image_config_kwargs = {}
    if aspect_ratio:
        image_config_kwargs["aspect_ratio"] = aspect_ratio
    if image_size:
        image_config_kwargs["image_size"] = image_size

    if image_config_kwargs:
        config_kwargs["image_config"] = types.ImageConfig(**image_config_kwargs)

    config = types.GenerateContentConfig(**config_kwargs)

    response = client.models.generate_content(
        model=model,
        contents=[instruction, input_image],
        config=config,
    )

    text_response = None
    image_saved = False

    for part in response.parts:
        if part.text is not None:
            text_response = part.text
        elif part.inline_data is not None:
            image = part.as_image()
            image.save(output_path)
            image_saved = True

    if not image_saved:
        raise RuntimeError("No image was generated. Check your instruction and try again.")

    return text_response


def main():
    parser = argparse.ArgumentParser(
        description="Edit images using Gemini API",
        formatter_class=argparse.RawDescriptionHelpFormatter,
        epilog=__doc__
    )
    parser.add_argument("input", help="Input image path")
    parser.add_argument("instruction", help="Edit instruction")
    parser.add_argument("output", help="Output file path")
    parser.add_argument(
        "--model", "-m",
        default="gemini-2.5-flash-image",
        choices=["gemini-2.5-flash-image", "gemini-3-pro-image-preview"],
        help="Model to use (default: gemini-2.5-flash-image)"
    )
    parser.add_argument(
        "--aspect", "-a",
        choices=["1:1", "2:3", "3:2", "3:4", "4:3", "4:5", "5:4", "9:16", "16:9", "21:9"],
        help="Output aspect ratio"
    )
    parser.add_argument(
        "--size", "-s",
        choices=["1K", "2K", "4K"],
        help="Output resolution"
    )

    args = parser.parse_args()

    try:
        text = edit_image(
            input_path=args.input,
            instruction=args.instruction,
            output_path=args.output,
            model=args.model,
            aspect_ratio=args.aspect,
            image_size=args.size,
        )

        print(f"Edited image saved to: {args.output}")
        if text:
            print(f"Model response: {text}")

    except Exception as e:
        print(f"Error: {e}", file=sys.stderr)
        sys.exit(1)


if __name__ == "__main__":
    main()


@@ -0,0 +1,263 @@
"""
Gemini Image Generation Library

A simple Python library for generating and editing images with the Gemini API.

Usage:
    from gemini_images import GeminiImageGenerator

    gen = GeminiImageGenerator()
    gen.generate("A sunset over mountains", "sunset.png")
    gen.edit("input.png", "Add clouds", "output.png")

Environment:
    GEMINI_API_KEY - Required API key
"""

import os
from pathlib import Path
from typing import Literal

from PIL import Image
from google import genai
from google.genai import types


AspectRatio = Literal["1:1", "2:3", "3:2", "3:4", "4:3", "4:5", "5:4", "9:16", "16:9", "21:9"]
ImageSize = Literal["1K", "2K", "4K"]
Model = Literal["gemini-2.5-flash-image", "gemini-3-pro-image-preview"]


class GeminiImageGenerator:
    """High-level interface for Gemini image generation."""

    FLASH = "gemini-2.5-flash-image"
    PRO = "gemini-3-pro-image-preview"

    def __init__(self, api_key: str | None = None, model: Model = FLASH):
        """Initialize the generator.

        Args:
            api_key: Gemini API key (defaults to GEMINI_API_KEY env var)
            model: Default model to use
        """
        self.api_key = api_key or os.environ.get("GEMINI_API_KEY")
        if not self.api_key:
            raise EnvironmentError("GEMINI_API_KEY not set")

        self.client = genai.Client(api_key=self.api_key)
        self.model = model

    def _build_config(
        self,
        aspect_ratio: AspectRatio | None = None,
        image_size: ImageSize | None = None,
        google_search: bool = False,
    ) -> types.GenerateContentConfig:
        """Build generation config."""
        kwargs = {"response_modalities": ["TEXT", "IMAGE"]}

        img_config = {}
        if aspect_ratio:
            img_config["aspect_ratio"] = aspect_ratio
        if image_size:
            img_config["image_size"] = image_size

        if img_config:
            kwargs["image_config"] = types.ImageConfig(**img_config)

        if google_search:
            kwargs["tools"] = [{"google_search": {}}]

        return types.GenerateContentConfig(**kwargs)

    def generate(
        self,
        prompt: str,
        output: str | Path,
        *,
        model: Model | None = None,
        aspect_ratio: AspectRatio | None = None,
        image_size: ImageSize | None = None,
        google_search: bool = False,
    ) -> tuple[Path, str | None]:
        """Generate an image from a text prompt.

        Args:
            prompt: Text description
            output: Output file path
            model: Override default model
            aspect_ratio: Output aspect ratio
            image_size: Output resolution
            google_search: Enable Google Search grounding (Pro only)

        Returns:
            Tuple of (output path, optional text response)
        """
        output = Path(output)
        config = self._build_config(aspect_ratio, image_size, google_search)

        response = self.client.models.generate_content(
            model=model or self.model,
            contents=[prompt],
            config=config,
        )

        text = None
        for part in response.parts:
            if part.text:
                text = part.text
            elif part.inline_data:
                part.as_image().save(output)

        return output, text

    def edit(
        self,
        input_image: str | Path | Image.Image,
        instruction: str,
        output: str | Path,
        *,
        model: Model | None = None,
        aspect_ratio: AspectRatio | None = None,
        image_size: ImageSize | None = None,
    ) -> tuple[Path, str | None]:
        """Edit an existing image.

        Args:
            input_image: Input image (path or PIL Image)
            instruction: Edit instruction
            output: Output file path
            model: Override default model
            aspect_ratio: Output aspect ratio
            image_size: Output resolution

        Returns:
            Tuple of (output path, optional text response)
        """
        output = Path(output)

        if isinstance(input_image, (str, Path)):
            input_image = Image.open(input_image)

        config = self._build_config(aspect_ratio, image_size)

        response = self.client.models.generate_content(
            model=model or self.model,
            contents=[instruction, input_image],
            config=config,
        )

        text = None
        for part in response.parts:
            if part.text:
                text = part.text
            elif part.inline_data:
                part.as_image().save(output)

        return output, text

    def compose(
        self,
        instruction: str,
        images: list[str | Path | Image.Image],
        output: str | Path,
        *,
        model: Model | None = None,
        aspect_ratio: AspectRatio | None = None,
        image_size: ImageSize | None = None,
    ) -> tuple[Path, str | None]:
        """Compose multiple images into one.

        Args:
            instruction: Composition instruction
            images: List of input images (up to 14)
            output: Output file path
            model: Override default model (Pro recommended)
            aspect_ratio: Output aspect ratio
            image_size: Output resolution

        Returns:
            Tuple of (output path, optional text response)
        """
        output = Path(output)

        # Load images
        loaded = []
        for img in images:
            if isinstance(img, (str, Path)):
                loaded.append(Image.open(img))
            else:
                loaded.append(img)

        config = self._build_config(aspect_ratio, image_size)
        contents = [instruction] + loaded

        response = self.client.models.generate_content(
            model=model or self.PRO,  # Pro recommended for composition
            contents=contents,
            config=config,
        )

        text = None
        for part in response.parts:
            if part.text:
                text = part.text
            elif part.inline_data:
                part.as_image().save(output)

        return output, text

    def chat(self) -> "ImageChat":
        """Start an interactive chat session for iterative refinement."""
        return ImageChat(self.client, self.model)


class ImageChat:
    """Multi-turn chat session for iterative image generation."""

    def __init__(self, client: genai.Client, model: Model):
        self.client = client
        self.model = model
        self._chat = client.chats.create(
            model=model,
            config=types.GenerateContentConfig(response_modalities=["TEXT", "IMAGE"]),
        )
        self.current_image: Image.Image | None = None

    def send(
        self,
        message: str,
        image: Image.Image | str | Path | None = None,
    ) -> tuple[Image.Image | None, str | None]:
        """Send a message and optionally an image.

        Returns:
            Tuple of (generated image or None, text response or None)
        """
        contents = [message]
        if image:
            if isinstance(image, (str, Path)):
                image = Image.open(image)
            contents.append(image)

        response = self._chat.send_message(contents)

        text = None
        img = None
        for part in response.parts:
            if part.text:
                text = part.text
            elif part.inline_data:
                img = part.as_image()
                self.current_image = img

        return img, text

    def reset(self):
        """Reset the chat session."""
        self._chat = self.client.chats.create(
            model=self.model,
            config=types.GenerateContentConfig(response_modalities=["TEXT", "IMAGE"]),
        )
        self.current_image = None


@@ -0,0 +1,133 @@
#!/usr/bin/env python3
"""
Generate images from text prompts using Gemini API.

Usage:
    python generate_image.py "prompt" output.png [--model MODEL] [--aspect RATIO] [--size SIZE]

Examples:
    python generate_image.py "A cat in space" cat.png
    python generate_image.py "A logo for Acme Corp" logo.png --model gemini-3-pro-image-preview --aspect 1:1
    python generate_image.py "Epic landscape" landscape.png --aspect 16:9 --size 2K

Environment:
    GEMINI_API_KEY - Required API key
"""

import argparse
import os
import sys

from google import genai
from google.genai import types


def generate_image(
    prompt: str,
    output_path: str,
    model: str = "gemini-2.5-flash-image",
    aspect_ratio: str | None = None,
    image_size: str | None = None,
) -> str | None:
    """Generate an image from a text prompt.

    Args:
        prompt: Text description of the image to generate
        output_path: Path to save the generated image
        model: Gemini model to use
        aspect_ratio: Aspect ratio (1:1, 16:9, 9:16, etc.)
        image_size: Resolution (1K, 2K, 4K - 4K only for pro model)

    Returns:
        Any text response from the model, or None
    """
    api_key = os.environ.get("GEMINI_API_KEY")
    if not api_key:
        raise EnvironmentError("GEMINI_API_KEY environment variable not set")

    client = genai.Client(api_key=api_key)

    # Build config
    config_kwargs = {"response_modalities": ["TEXT", "IMAGE"]}

    image_config_kwargs = {}
    if aspect_ratio:
        image_config_kwargs["aspect_ratio"] = aspect_ratio
    if image_size:
        image_config_kwargs["image_size"] = image_size

    if image_config_kwargs:
        config_kwargs["image_config"] = types.ImageConfig(**image_config_kwargs)

    config = types.GenerateContentConfig(**config_kwargs)

    response = client.models.generate_content(
        model=model,
        contents=[prompt],
        config=config,
    )

    text_response = None
    image_saved = False

    for part in response.parts:
        if part.text is not None:
            text_response = part.text
        elif part.inline_data is not None:
            image = part.as_image()
            image.save(output_path)
            image_saved = True

    if not image_saved:
        raise RuntimeError("No image was generated. Check your prompt and try again.")

    return text_response


def main():
    parser = argparse.ArgumentParser(
        description="Generate images from text prompts using Gemini API",
        formatter_class=argparse.RawDescriptionHelpFormatter,
        epilog=__doc__
    )
    parser.add_argument("prompt", help="Text prompt describing the image")
    parser.add_argument("output", help="Output file path (e.g., output.png)")
    parser.add_argument(
        "--model", "-m",
        default="gemini-2.5-flash-image",
        choices=["gemini-2.5-flash-image", "gemini-3-pro-image-preview"],
        help="Model to use (default: gemini-2.5-flash-image)"
    )
    parser.add_argument(
        "--aspect", "-a",
        choices=["1:1", "2:3", "3:2", "3:4", "4:3", "4:5", "5:4", "9:16", "16:9", "21:9"],
        help="Aspect ratio"
    )
    parser.add_argument(
        "--size", "-s",
        choices=["1K", "2K", "4K"],
        help="Image resolution (4K only available with pro model)"
    )

    args = parser.parse_args()

    try:
        text = generate_image(
            prompt=args.prompt,
            output_path=args.output,
            model=args.model,
            aspect_ratio=args.aspect,
            image_size=args.size,
        )

        print(f"Image saved to: {args.output}")
        if text:
            print(f"Model response: {text}")

    except Exception as e:
        print(f"Error: {e}", file=sys.stderr)
        sys.exit(1)


if __name__ == "__main__":
    main()


@@ -0,0 +1,216 @@
#!/usr/bin/env python3
"""
Interactive multi-turn image generation and refinement using Gemini API.

Usage:
    python multi_turn_chat.py [--model MODEL] [--output-dir DIR]

This starts an interactive session where you can:
- Generate images from prompts
- Iteratively refine images through conversation
- Load existing images for editing
- Save images at any point

Commands:
    /save [filename] - Save current image
    /load <path> - Load an image into the conversation
    /clear - Start fresh conversation
    /quit - Exit

Environment:
    GEMINI_API_KEY - Required API key
"""

import argparse
import os
import sys
from datetime import datetime
from pathlib import Path

from PIL import Image
from google import genai
from google.genai import types


class ImageChat:
    """Interactive chat session for image generation and refinement."""

    def __init__(
        self,
        model: str = "gemini-2.5-flash-image",
        output_dir: str = ".",
    ):
        api_key = os.environ.get("GEMINI_API_KEY")
        if not api_key:
            raise EnvironmentError("GEMINI_API_KEY environment variable not set")

        self.client = genai.Client(api_key=api_key)
        self.model = model
        self.output_dir = Path(output_dir)
        self.output_dir.mkdir(parents=True, exist_ok=True)

        self.chat = None
        self.current_image = None
        self.image_count = 0

        self._init_chat()

    def _init_chat(self):
        """Initialize or reset the chat session."""
        config = types.GenerateContentConfig(
            response_modalities=["TEXT", "IMAGE"]
        )
        self.chat = self.client.chats.create(
            model=self.model,
            config=config,
        )
        self.current_image = None

    def send_message(self, message: str, image: Image.Image | None = None) -> tuple[str | None, Image.Image | None]:
        """Send a message and optionally an image, return response text and image."""
        contents = []
        if message:
            contents.append(message)
        if image:
            contents.append(image)

        if not contents:
            return None, None

        response = self.chat.send_message(contents)

        text_response = None
        image_response = None

        for part in response.parts:
            if part.text is not None:
                text_response = part.text
            elif part.inline_data is not None:
                image_response = part.as_image()

        # Keep the previous image when this turn returned text only,
        # so /save still works after a text-only reply.
        if image_response is not None:
            self.current_image = image_response

        return text_response, image_response

    def save_image(self, filename: str | None = None) -> str | None:
        """Save the current image to a file."""
        if self.current_image is None:
            return None

        if filename is None:
            self.image_count += 1
            timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
            filename = f"image_{timestamp}_{self.image_count}.png"

        filepath = self.output_dir / filename
        self.current_image.save(filepath)
        return str(filepath)

    def load_image(self, path: str) -> Image.Image:
        """Load an image from disk."""
        img = Image.open(path)
        self.current_image = img
        return img


def main():
    parser = argparse.ArgumentParser(
        description="Interactive multi-turn image generation",
        formatter_class=argparse.RawDescriptionHelpFormatter,
        epilog=__doc__
    )
    parser.add_argument(
        "--model", "-m",
        default="gemini-2.5-flash-image",
        choices=["gemini-2.5-flash-image", "gemini-3-pro-image-preview"],
        help="Model to use"
    )
    parser.add_argument(
        "--output-dir", "-o",
        default=".",
        help="Directory to save images"
    )

    args = parser.parse_args()

    try:
        chat = ImageChat(model=args.model, output_dir=args.output_dir)
    except Exception as e:
        print(f"Error initializing: {e}", file=sys.stderr)
        sys.exit(1)

    print(f"Gemini Image Chat ({args.model})")
    print("Commands: /save [name], /load <path>, /clear, /quit")
    print("-" * 50)

    while True:
        try:
            user_input = input("\nYou: ").strip()
        except (EOFError, KeyboardInterrupt):
            print("\nGoodbye!")
            break

        if not user_input:
            continue

        # Handle commands
        if user_input.startswith("/"):
            parts = user_input.split(maxsplit=1)
            cmd = parts[0].lower()
            arg = parts[1] if len(parts) > 1 else None

            if cmd == "/quit":
                print("Goodbye!")
                break

            elif cmd == "/clear":
                chat._init_chat()
                print("Conversation cleared.")
                continue

            elif cmd == "/save":
                path = chat.save_image(arg)
                if path:
                    print(f"Image saved to: {path}")
                else:
                    print("No image to save.")
                continue

            elif cmd == "/load":
                if not arg:
                    print("Usage: /load <path>")
                    continue
                try:
                    chat.load_image(arg)
                    print(f"Loaded: {arg}")
                    print("You can now describe edits to make.")
                except Exception as e:
                    print(f"Error loading image: {e}")
                continue

            else:
                print(f"Unknown command: {cmd}")
                continue

        # Send message to model
        try:
            # If we have a loaded image and this is the first message, include it.
            # The google-genai chat object exposes its history via get_history().
            image_to_send = None
            if chat.current_image and not chat.chat.get_history():
                image_to_send = chat.current_image

            text, image = chat.send_message(user_input, image_to_send)

            if text:
                print(f"\nGemini: {text}")

            if image:
                # Auto-save
                path = chat.save_image()
                print(f"\n[Image generated: {path}]")

        except Exception as e:
            print(f"\nError: {e}")


if __name__ == "__main__":
    main()


@@ -0,0 +1,202 @@
Apache License
Version 2.0, January 2004
http://www.apache.org/licenses/
TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
1. Definitions.
"License" shall mean the terms and conditions for use, reproduction,
and distribution as defined by Sections 1 through 9 of this document.
"Licensor" shall mean the copyright owner or entity authorized by
the copyright owner that is granting the License.
"Legal Entity" shall mean the union of the acting entity and all
other entities that control, are controlled by, or are under common
control with that entity. For the purposes of this definition,
"control" means (i) the power, direct or indirect, to cause the
direction or management of such entity, whether by contract or
otherwise, or (ii) ownership of fifty percent (50%) or more of the
outstanding shares, or (iii) beneficial ownership of such entity.
"You" (or "Your") shall mean an individual or Legal Entity
exercising permissions granted by this License.
"Source" form shall mean the preferred form for making modifications,
including but not limited to software source code, documentation
source, and configuration files.
"Object" form shall mean any form resulting from mechanical
transformation or translation of a Source form, including but
not limited to compiled object code, generated documentation,
and conversions to other media types.
"Work" shall mean the work of authorship, whether in Source or
Object form, made available under the License, as indicated by a
copyright notice that is included in or attached to the work
(an example is provided in the Appendix below).
"Derivative Works" shall mean any work, whether in Source or Object
form, that is based on (or derived from) the Work and for which the
editorial revisions, annotations, elaborations, or other modifications
represent, as a whole, an original work of authorship. For the purposes
of this License, Derivative Works shall not include works that remain
separable from, or merely link (or bind by name) to the interfaces of,
the Work and Derivative Works thereof.
"Contribution" shall mean any work of authorship, including
the original version of the Work and any modifications or additions
to that Work or Derivative Works thereof, that is intentionally
submitted to Licensor for inclusion in the Work by the copyright owner
or by an individual or Legal Entity authorized to submit on behalf of
the copyright owner. For the purposes of this definition, "submitted"
means any form of electronic, verbal, or written communication sent
to the Licensor or its representatives, including but not limited to
communication on electronic mailing lists, source code control systems,
and issue tracking systems that are managed by, or on behalf of, the
Licensor for the purpose of discussing and improving the Work, but
excluding communication that is conspicuously marked or otherwise
designated in writing by the copyright owner as "Not a Contribution."
"Contributor" shall mean Licensor and any individual or Legal Entity
on behalf of whom a Contribution has been received by Licensor and
subsequently incorporated within the Work.
2. Grant of Copyright License. Subject to the terms and conditions of
this License, each Contributor hereby grants to You a perpetual,
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
copyright license to reproduce, prepare Derivative Works of,
publicly display, publicly perform, sublicense, and distribute the
Work and such Derivative Works in Source or Object form.
3. Grant of Patent License. Subject to the terms and conditions of
this License, each Contributor hereby grants to You a perpetual,
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
(except as stated in this section) patent license to make, have made,
use, offer to sell, sell, import, and otherwise transfer the Work,
where such license applies only to those patent claims licensable
by such Contributor that are necessarily infringed by their
Contribution(s) alone or by combination of their Contribution(s)
with the Work to which such Contribution(s) was submitted. If You
institute patent litigation against any entity (including a
cross-claim or counterclaim in a lawsuit) alleging that the Work
or a Contribution incorporated within the Work constitutes direct
or contributory patent infringement, then any patent licenses
granted to You under this License for that Work shall terminate
as of the date such litigation is filed.
4. Redistribution. You may reproduce and distribute copies of the
Work or Derivative Works thereof in any medium, with or without
modifications, and in Source or Object form, provided that You
meet the following conditions:
(a) You must give any other recipients of the Work or
Derivative Works a copy of this License; and
(b) You must cause any modified files to carry prominent notices
stating that You changed the files; and
(c) You must retain, in the Source form of any Derivative Works
that You distribute, all copyright, patent, trademark, and
attribution notices from the Source form of the Work,
excluding those notices that do not pertain to any part of
the Derivative Works; and
(d) If the Work includes a "NOTICE" text file as part of its
distribution, then any Derivative Works that You distribute must
include a readable copy of the attribution notices contained
within such NOTICE file, excluding those notices that do not
pertain to any part of the Derivative Works, in at least one
of the following places: within a NOTICE text file distributed
as part of the Derivative Works; within the Source form or
documentation, if provided along with the Derivative Works; or,
within a display generated by the Derivative Works, if and
wherever such third-party notices normally appear. The contents
of the NOTICE file are for informational purposes only and
do not modify the License. You may add Your own attribution
notices within Derivative Works that You distribute, alongside
or as an addendum to the NOTICE text from the Work, provided
that such additional attribution notices cannot be construed
as modifying the License.
You may add Your own copyright statement to Your modifications and
may provide additional or different license terms and conditions
for use, reproduction, or distribution of Your modifications, or
for any such Derivative Works as a whole, provided Your use,
reproduction, and distribution of the Work otherwise complies with
the conditions stated in this License.
5. Submission of Contributions. Unless You explicitly state otherwise,
any Contribution intentionally submitted for inclusion in the Work
by You to the Licensor shall be under the terms and conditions of
this License, without any additional terms or conditions.
Notwithstanding the above, nothing herein shall supersede or modify
the terms of any separate license agreement you may have executed
with Licensor regarding such Contributions.
6. Trademarks. This License does not grant permission to use the trade
names, trademarks, service marks, or product names of the Licensor,
except as required for reasonable and customary use in describing the
origin of the Work and reproducing the content of the NOTICE file.
7. Disclaimer of Warranty. Unless required by applicable law or
agreed to in writing, Licensor provides the Work (and each
Contributor provides its Contributions) on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
implied, including, without limitation, any warranties or conditions
of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
PARTICULAR PURPOSE. You are solely responsible for determining the
appropriateness of using or redistributing the Work and assume any
risks associated with Your exercise of permissions under this License.
8. Limitation of Liability. In no event and under no legal theory,
whether in tort (including negligence), contract, or otherwise,
unless required by applicable law (such as deliberate and grossly
negligent acts) or agreed to in writing, shall any Contributor be
liable to You for damages, including any direct, indirect, special,
incidental, or consequential damages of any character arising as a
result of this License or out of the use or inability to use the
Work (including but not limited to damages for loss of goodwill,
work stoppage, computer failure or malfunction, or any and all
other commercial damages or losses), even if such Contributor
has been advised of the possibility of such damages.
9. Accepting Warranty or Additional Liability. While redistributing
the Work or Derivative Works thereof, You may choose to offer,
and charge a fee for, acceptance of support, warranty, indemnity,
or other liability obligations and/or rights consistent with this
License. However, in accepting such obligations, You may act only
on Your own behalf and on Your sole responsibility, not on behalf
of any other Contributor, and only if You agree to indemnify,
defend, and hold each Contributor harmless for any liability
incurred by, or claims asserted against, such Contributor by reason
of your accepting any such warranty or additional liability.
END OF TERMS AND CONDITIONS
APPENDIX: How to apply the Apache License to your work.
To apply the Apache License to your work, attach the following
boilerplate notice, with the fields enclosed by brackets "[]"
replaced with your own identifying information. (Don't include
the brackets!) The text should be enclosed in the appropriate
comment syntax for the file format. We also recommend that a
file or class name and description of purpose be included on the
same "printed page" as the copyright notice for easier
identification within third-party archives.
Copyright [yyyy] [name of copyright owner]
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.

View File

@@ -0,0 +1,96 @@
---
name: webapp-testing
description: Toolkit for interacting with and testing local web applications using Playwright. Supports verifying frontend functionality, debugging UI behavior, capturing browser screenshots, and viewing browser logs.
license: Complete terms in LICENSE.txt
---
# Web Application Testing
To test local web applications, write native Python Playwright scripts.
**Helper Scripts Available**:
- `scripts/with_server.py` - Manages server lifecycle (supports multiple servers)
**Always run scripts with `--help` first** to see usage. DO NOT read the source until you have tried running the script and found that a customized solution is absolutely necessary. These scripts can be very large and thus pollute your context window. They exist to be called directly as black-box scripts rather than ingested into your context window.
## Decision Tree: Choosing Your Approach
```
User task → Is it static HTML?
    ├─ Yes → Read HTML file directly to identify selectors
    │         ├─ Success → Write Playwright script using selectors
    │         └─ Fails/Incomplete → Treat as dynamic (below)
    │
    └─ No (dynamic webapp) → Is the server already running?
        ├─ No → Run: python scripts/with_server.py --help
        │        Then use the helper + write simplified Playwright script
        │
        └─ Yes → Reconnaissance-then-action:
            1. Navigate and wait for networkidle
            2. Take screenshot or inspect DOM
            3. Identify selectors from rendered state
            4. Execute actions with discovered selectors
```
## Example: Using with_server.py
To start a server, run `--help` first, then use the helper:
**Single server:**
```bash
python scripts/with_server.py --server "npm run dev" --port 5173 -- python your_automation.py
```
**Multiple servers (e.g., backend + frontend):**
```bash
python scripts/with_server.py \
  --server "cd backend && python server.py" --port 3000 \
  --server "cd frontend && npm run dev" --port 5173 \
  -- python your_automation.py
```
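The helper treats a server as ready once its port accepts a TCP connection. A minimal standalone sketch of that kind of port polling follows. It assumes a hypothetical `wait_for_port` function (not part of the bundled script) and uses a throwaway listening socket as a stand-in for a real dev server:

```python
import socket
import time


def wait_for_port(port, host="127.0.0.1", timeout=5.0, interval=0.1):
    """Return True once a TCP connection to host:port succeeds, False on timeout.

    Hypothetical helper for illustration only; the names and defaults are assumptions.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=1):
                return True
        except OSError:
            # Nothing is listening yet (or the connect timed out); retry shortly.
            time.sleep(interval)
    return False


if __name__ == "__main__":
    # Stand-in for a dev server: a listening socket on an OS-assigned free port.
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))
    listener.listen(1)
    port = listener.getsockname()[1]
    print(f"Port {port} ready: {wait_for_port(port)}")
    listener.close()
```

An open port only shows that the process is listening, not that the app has finished rendering, which is why automation scripts still wait for `networkidle` after `page.goto()`.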
To create an automation script, include only Playwright logic (servers are managed automatically):
```python
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True) # Always launch chromium in headless mode
    page = browser.new_page()
    page.goto('http://localhost:5173') # Server already running and ready
    page.wait_for_load_state('networkidle') # CRITICAL: Wait for JS to execute
    # ... your automation logic
    browser.close()
```
## Reconnaissance-Then-Action Pattern
1. **Inspect rendered DOM**:
   ```python
   page.screenshot(path='/tmp/inspect.png', full_page=True)
   content = page.content()
   page.locator('button').all()
   ```
2. **Identify selectors** from inspection results
3. **Execute actions** using discovered selectors
## Common Pitfall
❌ **Don't** inspect the DOM before waiting for `networkidle` on dynamic apps
✅ **Do** wait for `page.wait_for_load_state('networkidle')` before inspection
## Best Practices
- **Use bundled scripts as black boxes** - To accomplish a task, consider whether one of the scripts available in `scripts/` can help. These scripts handle common, complex workflows reliably without cluttering the context window. Use `--help` to see usage, then invoke directly.
- Use `sync_playwright()` for synchronous scripts
- Always close the browser when done
- Use descriptive selectors: `text=`, `role=`, CSS selectors, or IDs
- Add appropriate waits: `page.wait_for_selector()` or `page.wait_for_timeout()`
## Reference Files
- **examples/** - Examples showing common patterns:
  - `element_discovery.py` - Discovering buttons, links, and inputs on a page
  - `static_html_automation.py` - Using file:// URLs for local HTML
  - `console_logging.py` - Capturing console logs during automation

View File

@@ -0,0 +1,35 @@
from playwright.sync_api import sync_playwright

# Example: Capturing console logs during browser automation

url = 'http://localhost:5173'  # Replace with your URL

console_logs = []

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page(viewport={'width': 1920, 'height': 1080})

    # Set up console log capture
    def handle_console_message(msg):
        console_logs.append(f"[{msg.type}] {msg.text}")
        print(f"Console: [{msg.type}] {msg.text}")

    page.on("console", handle_console_message)

    # Navigate to page
    page.goto(url)
    page.wait_for_load_state('networkidle')

    # Interact with the page (triggers console logs)
    page.click('text=Dashboard')
    page.wait_for_timeout(1000)

    browser.close()

# Save console logs to file
with open('/mnt/user-data/outputs/console.log', 'w') as f:
    f.write('\n'.join(console_logs))

print(f"\nCaptured {len(console_logs)} console messages")
print(f"Logs saved to: /mnt/user-data/outputs/console.log")

View File

@@ -0,0 +1,40 @@
from playwright.sync_api import sync_playwright

# Example: Discovering buttons and other elements on a page

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()

    # Navigate to page and wait for it to fully load
    page.goto('http://localhost:5173')
    page.wait_for_load_state('networkidle')

    # Discover all buttons on the page
    buttons = page.locator('button').all()
    print(f"Found {len(buttons)} buttons:")
    for i, button in enumerate(buttons):
        text = button.inner_text() if button.is_visible() else "[hidden]"
        print(f"  [{i}] {text}")

    # Discover links
    links = page.locator('a[href]').all()
    print(f"\nFound {len(links)} links:")
    for link in links[:5]:  # Show first 5
        text = link.inner_text().strip()
        href = link.get_attribute('href')
        print(f"  - {text} -> {href}")

    # Discover input fields
    inputs = page.locator('input, textarea, select').all()
    print(f"\nFound {len(inputs)} input fields:")
    for input_elem in inputs:
        name = input_elem.get_attribute('name') or input_elem.get_attribute('id') or "[unnamed]"
        input_type = input_elem.get_attribute('type') or 'text'
        print(f"  - {name} ({input_type})")

    # Take screenshot for visual reference
    page.screenshot(path='/tmp/page_discovery.png', full_page=True)
    print("\nScreenshot saved to /tmp/page_discovery.png")

    browser.close()

View File

@@ -0,0 +1,33 @@
from playwright.sync_api import sync_playwright
import os

# Example: Automating interaction with static HTML files using file:// URLs

html_file_path = os.path.abspath('path/to/your/file.html')
file_url = f'file://{html_file_path}'

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page(viewport={'width': 1920, 'height': 1080})

    # Navigate to local HTML file
    page.goto(file_url)

    # Take screenshot
    page.screenshot(path='/mnt/user-data/outputs/static_page.png', full_page=True)

    # Interact with elements
    page.click('text=Click Me')
    page.fill('#name', 'John Doe')
    page.fill('#email', 'john@example.com')

    # Submit form
    page.click('button[type="submit"]')
    page.wait_for_timeout(500)

    # Take final screenshot
    page.screenshot(path='/mnt/user-data/outputs/after_submit.png', full_page=True)

    browser.close()

print("Static HTML automation completed!")

View File

@@ -0,0 +1,106 @@
#!/usr/bin/env python3
"""
Start one or more servers, wait for them to be ready, run a command, then clean up.

Usage:
    # Single server
    python scripts/with_server.py --server "npm run dev" --port 5173 -- python automation.py
    python scripts/with_server.py --server "npm start" --port 3000 -- python test.py

    # Multiple servers
    python scripts/with_server.py \
      --server "cd backend && python server.py" --port 3000 \
      --server "cd frontend && npm run dev" --port 5173 \
      -- python test.py
"""

import subprocess
import socket
import time
import sys
import argparse

def is_server_ready(port, timeout=30):
    """Wait for server to be ready by polling the port."""
    start_time = time.time()
    while time.time() - start_time < timeout:
        try:
            with socket.create_connection(('localhost', port), timeout=1):
                return True
        except (socket.error, ConnectionRefusedError):
            time.sleep(0.5)
    return False


def main():
    parser = argparse.ArgumentParser(description='Run command with one or more servers')
    parser.add_argument('--server', action='append', dest='servers', required=True, help='Server command (can be repeated)')
    parser.add_argument('--port', action='append', dest='ports', type=int, required=True, help='Port for each server (must match --server count)')
    parser.add_argument('--timeout', type=int, default=30, help='Timeout in seconds per server (default: 30)')
    parser.add_argument('command', nargs=argparse.REMAINDER, help='Command to run after server(s) ready')

    args = parser.parse_args()

    # Remove the '--' separator if present
    if args.command and args.command[0] == '--':
        args.command = args.command[1:]

    if not args.command:
        print("Error: No command specified to run")
        sys.exit(1)

    # Parse server configurations
    if len(args.servers) != len(args.ports):
        print("Error: Number of --server and --port arguments must match")
        sys.exit(1)

    servers = []
    for cmd, port in zip(args.servers, args.ports):
        servers.append({'cmd': cmd, 'port': port})

    server_processes = []

    try:
        # Start all servers
        for i, server in enumerate(servers):
            print(f"Starting server {i+1}/{len(servers)}: {server['cmd']}")

            # Use shell=True to support commands with cd and &&
            process = subprocess.Popen(
                server['cmd'],
                shell=True,
                stdout=subprocess.PIPE,
                stderr=subprocess.PIPE
            )
            server_processes.append(process)

            # Wait for this server to be ready
            print(f"Waiting for server on port {server['port']}...")
            if not is_server_ready(server['port'], timeout=args.timeout):
                raise RuntimeError(f"Server failed to start on port {server['port']} within {args.timeout}s")

            print(f"Server ready on port {server['port']}")

        print(f"\nAll {len(servers)} server(s) ready")

        # Run the command
        print(f"Running: {' '.join(args.command)}\n")
        result = subprocess.run(args.command)
        sys.exit(result.returncode)

    finally:
        # Clean up all servers
        print(f"\nStopping {len(server_processes)} server(s)...")
        for i, process in enumerate(server_processes):
            try:
                process.terminate()
                process.wait(timeout=5)
            except subprocess.TimeoutExpired:
                process.kill()
                process.wait()
            print(f"Server {i+1} stopped")
        print("All servers stopped")


if __name__ == '__main__':
    main()