Initial commit
skills/ai-multimodal/SKILL.md
@@ -0,0 +1,357 @@
---
name: ai-multimodal
description: Process and generate multimedia content using Google Gemini API. Capabilities include analyzing audio files (transcription with timestamps, summarization, speech understanding, music/sound analysis up to 9.5 hours), understanding images (captioning, object detection, OCR, visual Q&A, segmentation), processing videos (scene detection, Q&A, temporal analysis, YouTube URLs, up to 6 hours), extracting from documents (PDF tables, forms, charts, diagrams, multi-page), and generating images (text-to-image, editing, composition, refinement). Use when working with audio/video files, analyzing images or screenshots, processing PDF documents, extracting structured data from media, creating images from text prompts, or implementing multimodal AI features. Supports multiple models (Gemini 2.5/2.0) with context windows up to 2M tokens.
license: MIT
allowed-tools:
  - Bash
  - Read
  - Write
  - Edit
---

# AI Multimodal Processing Skill

Process audio, images, videos, and documents, and generate images, using Google Gemini's multimodal API. A unified interface for multimedia content understanding and generation.

## Core Capabilities

### Audio Processing
- Transcription with timestamps (up to 9.5 hours)
- Audio summarization and analysis
- Speech understanding and speaker identification
- Music and environmental sound analysis
- Text-to-speech generation with controllable voice

### Image Understanding
- Image captioning and description
- Object detection with bounding boxes (2.0+)
- Pixel-level segmentation (2.5+)
- Visual question answering
- Multi-image comparison (up to 3,600 images)
- OCR and text extraction

### Video Analysis
- Scene detection and summarization
- Video Q&A with temporal understanding
- Transcription with visual descriptions
- YouTube URL support
- Long video processing (up to 6 hours)
- Frame-level analysis

### Document Extraction
- Native PDF vision processing (up to 1,000 pages)
- Table and form extraction
- Chart and diagram analysis
- Multi-page document understanding
- Structured data output (JSON schema)
- Format conversion (PDF to HTML/JSON)

### Image Generation
- Text-to-image generation
- Image editing and modification
- Multi-image composition (up to 3 images)
- Iterative refinement
- Multiple aspect ratios (1:1, 16:9, 9:16, 4:3, 3:4)
- Controllable style and quality

## Capability Matrix

| Task | Audio | Image | Video | Document | Generation |
|------|:-----:|:-----:|:-----:|:--------:|:----------:|
| Transcription | ✓ | - | ✓ | - | - |
| Summarization | ✓ | ✓ | ✓ | ✓ | - |
| Q&A | ✓ | ✓ | ✓ | ✓ | - |
| Object Detection | - | ✓ | ✓ | - | - |
| Text Extraction | - | ✓ | - | ✓ | - |
| Structured Output | ✓ | ✓ | ✓ | ✓ | - |
| Creation | TTS | - | - | - | ✓ |
| Timestamps | ✓ | - | ✓ | - | - |
| Segmentation | - | ✓ | - | - | - |

## Model Selection Guide

### Gemini 2.5 Series (Recommended)
- **gemini-2.5-pro**: Highest quality, all features, 1M-2M context
- **gemini-2.5-flash**: Best balance, all features, 1M-2M context
- **gemini-2.5-flash-lite**: Lightweight, segmentation support
- **gemini-2.5-flash-image**: Image generation only

### Gemini 2.0 Series
- **gemini-2.0-flash**: Fast processing, object detection
- **gemini-2.0-flash-lite**: Lightweight option

### Feature Requirements
- **Segmentation**: Requires 2.5+ models
- **Object Detection**: Requires 2.0+ models
- **Multi-video**: Requires 2.5+ models
- **Image Generation**: Requires flash-image model
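
The requirements above can be encoded as a small compatibility check. This is a minimal sketch, not part of the skill's scripts: the helper names `model_series` and `supports` and the feature keys are hypothetical, and the series number is parsed from the model names listed in this guide.

```python
# Hypothetical helper; the feature keys below are illustrative names,
# and the thresholds mirror the "Feature Requirements" list.
MIN_SERIES = {
    "object_detection": (2, 0),
    "segmentation": (2, 5),
    "multi_video": (2, 5),
}


def model_series(model: str) -> tuple:
    """Parse 'gemini-2.5-flash' into (2, 5)."""
    version = model.split("-")[1]
    return tuple(int(part) for part in version.split("."))


def supports(model: str, feature: str) -> bool:
    """Return True if the guide lists `model` as meeting `feature`."""
    is_image_model = model.endswith("flash-image")
    if feature == "image_generation":
        return is_image_model
    if is_image_model:
        # The guide lists the flash-image model as generation-only.
        return False
    return model_series(model) >= MIN_SERIES[feature]


print(supports("gemini-2.0-flash", "segmentation"))      # False
print(supports("gemini-2.5-flash-lite", "segmentation"))  # True
```

A check like this could run before a request is sent, so an unsupported model/feature pairing fails early rather than at the API.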

### Context Windows
- **2M tokens**: ~6 hours video (low-res) or ~2 hours (default)
- **1M tokens**: ~3 hours video (low-res) or ~1 hour (default)
- **Audio**: 32 tokens/second (1 min = 1,920 tokens)
- **PDF**: 258 tokens/page (fixed)
- **Image**: 258-1,548 tokens based on size
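
As a rough worked example of the figures above, the sketch below inverts the listed rates to see how much media fits in a window. The function names are illustrative, and the arithmetic ignores prompt and output tokens, so real capacity is somewhat lower.

```python
AUDIO_TOKENS_PER_SECOND = 32   # from the list above
PDF_TOKENS_PER_PAGE = 258      # from the list above


def audio_hours_that_fit(window_tokens: int) -> float:
    """Hours of audio that fill the window, ignoring prompt/output tokens."""
    return window_tokens / AUDIO_TOKENS_PER_SECOND / 3600


def pdf_pages_that_fit(window_tokens: int) -> int:
    """Whole PDF pages that fit by token count alone."""
    return window_tokens // PDF_TOKENS_PER_PAGE


def implied_video_tokens_per_second(window_tokens: int, hours: float) -> float:
    """Token rate implied by 'N hours of video per window'."""
    return window_tokens / (hours * 3600)


print(round(audio_hours_that_fit(1_000_000), 1))                # 8.7
print(pdf_pages_that_fit(1_000_000))                            # 3875
print(round(implied_video_tokens_per_second(1_000_000, 1)))     # 278
print(round(implied_video_tokens_per_second(2_000_000, 6)))     # 93
```

Note that the token count is not the only constraint: a 1M-token window holds about 3,875 pages by tokens, but the documented per-request PDF limit is 1,000 pages.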

## Quick Start

### Prerequisites

**API Key Setup**: Supports both Google AI Studio and Vertex AI.

The skill checks for `GEMINI_API_KEY` in this order:
1. Process environment: `export GEMINI_API_KEY="your-key"`
2. Project root: `.env`
3. `.claude/.env`
4. `.claude/skills/.env`
5. `.claude/skills/ai-multimodal/.env`
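
The lookup order above can be sketched as follows. This is an illustration only: `find_api_key` and `read_env_file` are hypothetical names, and the tiny `KEY=VALUE` parser stands in for whatever the skill's scripts actually use (they may rely on `python-dotenv` instead).

```python
import os
from pathlib import Path
from typing import Optional

# Search order from the list above, relative to the project root.
ENV_FILES = [
    ".env",
    ".claude/.env",
    ".claude/skills/.env",
    ".claude/skills/ai-multimodal/.env",
]


def read_env_file(path: Path) -> dict:
    """Parse simple KEY=VALUE lines; skips blanks and '#' comments."""
    values = {}
    for line in path.read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip("'\"")
    return values


def find_api_key(project_root, environ=None) -> Optional[str]:
    """Return the first GEMINI_API_KEY found, or None."""
    environ = os.environ if environ is None else environ
    # 1. The process environment wins over any file.
    if environ.get("GEMINI_API_KEY"):
        return environ["GEMINI_API_KEY"]
    # 2-5. Fall through the .env files in the documented order.
    for relative in ENV_FILES:
        path = Path(project_root) / relative
        if path.is_file():
            key = read_env_file(path).get("GEMINI_API_KEY")
            if key:
                return key
    return None
```

Calling `find_api_key(".")` from the project root returns the key from the highest-priority location that defines it.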

**Get API key**: https://aistudio.google.com/apikey

**For Vertex AI**:
```bash
export GEMINI_USE_VERTEX=true
export VERTEX_PROJECT_ID=your-gcp-project-id
export VERTEX_LOCATION=us-central1  # Optional
```
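
One way these variables might map onto the `google-genai` client constructor is sketched below. The helper `client_kwargs` is hypothetical, and defaulting the location to `us-central1` is an assumption based on the variable being marked optional; the skill's scripts may handle this differently.

```python
import os


def client_kwargs(environ=None) -> dict:
    """Build keyword arguments for genai.Client from the environment."""
    environ = os.environ if environ is None else environ
    if environ.get("GEMINI_USE_VERTEX", "").lower() == "true":
        return {
            "vertexai": True,
            "project": environ["VERTEX_PROJECT_ID"],
            # Assumed default; VERTEX_LOCATION is documented as optional.
            "location": environ.get("VERTEX_LOCATION", "us-central1"),
        }
    return {"api_key": environ["GEMINI_API_KEY"]}


# Usage (requires `pip install google-genai`):
#   from google import genai
#   client = genai.Client(**client_kwargs())
```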

**Install SDK**:
```bash
pip install google-genai python-dotenv pillow
```

### Common Patterns

**Transcribe Audio**:
```bash
python scripts/gemini_batch_process.py \
  --files audio.mp3 \
  --task transcribe \
  --model gemini-2.5-flash
```
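
For readers calling the SDK directly, a transcription request might look roughly like the sketch below. It is not the script's actual implementation: `transcribe` is a hypothetical helper and the prompt wording is an assumption. The client is passed in so the function stays independent of how credentials are configured.

```python
def transcribe(client, path: str, model: str = "gemini-2.5-flash") -> str:
    """Upload an audio file via the File API and request a timestamped transcript."""
    # Uploading suits large files; the returned handle is passed as content.
    uploaded = client.files.upload(file=path)
    response = client.models.generate_content(
        model=model,
        contents=["Transcribe this audio. Include timestamps.", uploaded],
    )
    return response.text


# Usage (requires `pip install google-genai` and a valid key):
#   import os
#   from google import genai
#   client = genai.Client(api_key=os.environ["GEMINI_API_KEY"])
#   print(transcribe(client, "audio.mp3"))
```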

**Analyze Image**:
```bash
python scripts/gemini_batch_process.py \
  --files image.jpg \
  --task analyze \
  --prompt "Describe this image" \
  --output docs/assets/<output-name>.md \
  --model gemini-2.5-flash
```

**Process Video**:
```bash
python scripts/gemini_batch_process.py \
  --files video.mp4 \
  --task analyze \
  --prompt "Summarize key points with timestamps" \
  --output docs/assets/<output-name>.md \
  --model gemini-2.5-flash
```

**Extract from PDF**:
```bash
python scripts/gemini_batch_process.py \
  --files document.pdf \
  --task extract \
  --prompt "Extract table data as JSON" \
  --output docs/assets/<output-name>.md \
  --format json
```

**Generate Image**:
```bash
python scripts/gemini_batch_process.py \
  --task generate \
  --prompt "A futuristic city at sunset" \
  --output docs/assets/<output-file-name> \
  --model gemini-2.5-flash-image \
  --aspect-ratio 16:9
```

**Optimize Media**:
```bash
# Prepare large video for processing
python scripts/media_optimizer.py \
  --input large-video.mp4 \
  --output docs/assets/<output-file-name> \
  --target-size 100MB

# Batch optimize multiple files
python scripts/media_optimizer.py \
  --input-dir ./videos \
  --output-dir docs/assets/optimized \
  --quality 85
```

**Convert Documents to Markdown**:
```bash
# Convert to Markdown
python scripts/document_converter.py \
  --input document.docx \
  --output docs/assets/document.md

# Extract pages
python scripts/document_converter.py \
  --input large.pdf \
  --output docs/assets/chapter1.md \
  --pages 1-20
```

## Supported Formats

### Audio
- WAV, MP3, AAC, FLAC, OGG Vorbis, AIFF
- Max 9.5 hours per request
- Auto-downsampled to 16 Kbps mono

### Images
- PNG, JPEG, WEBP, HEIC, HEIF
- Max 3,600 images per request
- Resolution: ≤384px = 258 tokens, larger = tiled

### Video
- MP4, MPEG, MOV, AVI, FLV, MPG, WebM, WMV, 3GPP
- Max 6 hours (low-res) or 2 hours (default)
- YouTube URLs supported (public only)

### Documents
- PDF only for vision processing
- Max 1,000 pages
- TXT, HTML, Markdown supported (text-only)

### Size Limits
- **Inline**: <20MB total request
- **File API**: 2GB per file, 20GB project quota
- **Retention**: 48 hours auto-delete
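
The size limits above suggest a simple routing rule between inline data and the File API. The sketch below is illustrative: `upload_strategy` is a hypothetical helper, and reading "MB" and "GB" as binary units (MiB, GiB) is an assumption.

```python
INLINE_LIMIT_BYTES = 20 * 1024 * 1024   # "<20MB total request" (assumed MiB)
FILE_API_LIMIT_BYTES = 2 * 1024 ** 3    # "2GB per file" (assumed GiB)


def upload_strategy(size_bytes: int) -> str:
    """Return 'inline', 'file_api', or 'too_large' for a single file."""
    if size_bytes < 0:
        raise ValueError("size_bytes must be non-negative")
    if size_bytes > FILE_API_LIMIT_BYTES:
        return "too_large"
    if size_bytes < INLINE_LIMIT_BYTES:
        return "inline"
    return "file_api"


print(upload_strategy(5 * 1024 * 1024))     # inline
print(upload_strategy(150 * 1024 * 1024))   # file_api
```

Because the inline limit applies to the whole request (prompt plus every attached file), a caller sending several files would compare their combined size against the threshold, with some headroom.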

## Reference Navigation

For detailed implementation guidance, see:

### Audio Processing
- `references/audio-processing.md` - Transcription, analysis, TTS
- Timestamp handling and segment analysis
- Multi-speaker identification
- Non-speech audio analysis
- Text-to-speech generation

### Image Understanding
- `references/vision-understanding.md` - Captioning, detection, OCR
- Object detection and localization
- Pixel-level segmentation
- Visual question answering
- Multi-image comparison

### Video Analysis
- `references/video-analysis.md` - Scene detection, temporal understanding
- YouTube URL processing
- Timestamp-based queries
- Video clipping and FPS control
- Long video optimization

### Document Extraction
- `references/document-extraction.md` - PDF processing, structured output
- Table and form extraction
- Chart and diagram analysis
- JSON schema validation
- Multi-page handling

### Image Generation
- `references/image-generation.md` - Text-to-image, editing
- Prompt engineering strategies
- Image editing and composition
- Aspect ratio selection
- Safety settings

## Cost Optimization

### Token Costs
**Model Pricing**:
- Gemini 2.5 Flash: $1.00/1M input, $0.10/1M output
- Gemini 2.5 Pro: $3.00/1M input, $12.00/1M output
- Gemini 1.5 Flash: $0.70/1M input, $0.175/1M output

**Token Rates**:
- Audio: 32 tokens/second (1 min = 1,920 tokens)
- Video: ~300 tokens/second (default) or ~100 (low-res)
- PDF: 258 tokens/page (fixed)
- Image: 258-1,548 tokens based on size

**TTS Pricing**:
- Flash TTS: $10/1M tokens
- Pro TTS: $20/1M tokens
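
To see how the token rates translate into cost, here is a small estimator. It is a sketch under stated assumptions: the function names are illustrative, the video rates are the approximate figures listed above, and the price is a parameter so it can be checked against the current pricing page rather than hard-coded.

```python
# Approximate rates from the "Token Rates" list above.
TOKENS_PER_SECOND = {"audio": 32, "video": 300, "video_low_res": 100}
PDF_TOKENS_PER_PAGE = 258


def media_tokens(kind: str, seconds: float) -> float:
    """Estimated input tokens for time-based media."""
    return TOKENS_PER_SECOND[kind] * seconds


def input_cost_usd(tokens: float, usd_per_million: float) -> float:
    """Input cost for a token count at a given per-1M-token price."""
    return tokens / 1_000_000 * usd_per_million


one_hour_audio = media_tokens("audio", 3600)
ten_min_video = media_tokens("video", 600)
print(one_hour_audio)   # 115200
print(ten_min_video)    # 180000
# At the $1.00/1M input figure listed above for Gemini 2.5 Flash:
print(round(input_cost_usd(one_hour_audio, 1.00), 4))   # 0.1152
```

The same ten minutes of video at the low-res rate comes to about 60,000 tokens, which is why lowering resolution or frame rate is among the cheaper optimizations.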

### Best Practices
1. Use `gemini-2.5-flash` for most tasks (best price/performance)
2. Use File API for files >20MB or repeated queries
3. Optimize media before upload (see `media_optimizer.py`)
4. Process specific segments instead of full videos
5. Use lower FPS for static content
6. Implement context caching for repeated queries
7. Batch process multiple files in parallel

## Rate Limits

**Free Tier**:
- 10-15 RPM (requests per minute)
- 1M-4M TPM (tokens per minute)
- 1,500 RPD (requests per day)
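
Staying under an RPM ceiling can be handled client-side by spacing requests evenly. The class below is a minimal sketch with a hypothetical name, not the rate limiting built into the skill's scripts; it paces requests only and does not track the TPM or RPD limits.

```python
import time


class MinIntervalLimiter:
    """Space calls at least 60/rpm seconds apart (client-side pacing)."""

    def __init__(self, rpm: float, clock=time.monotonic, sleep=time.sleep):
        self.interval = 60.0 / rpm
        self._clock = clock
        self._sleep = sleep
        self._next_allowed = 0.0

    def wait(self) -> None:
        """Block until the next request is allowed, then reserve a slot."""
        now = self._clock()
        delay = self._next_allowed - now
        if delay > 0:
            self._sleep(delay)
            now = self._clock()
        self._next_allowed = now + self.interval


# Usage: call limiter.wait() before each API request.
#   limiter = MinIntervalLimiter(rpm=10)   # lower bound of the free tier
#   for path in files:
#       limiter.wait()
#       process(path)                      # hypothetical request function
```

The `clock` and `sleep` parameters exist so the pacing logic can be tested without real delays.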

**YouTube Limits**:
- Free tier: 8 hours/day
- Paid tier: No length limits
- Public videos only

**Storage Limits**:
- 20GB per project
- 2GB per file
- 48-hour retention

## Error Handling

Common errors and solutions:
- **400**: Invalid format/size - validate before upload
- **401**: Invalid API key - check configuration
- **403**: Permission denied - verify API key restrictions
- **404**: File not found - ensure file uploaded and active
- **429**: Rate limit exceeded - implement exponential backoff
- **500**: Server error - retry with backoff
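
The backoff advice for 429 and 500 can be sketched as a retry wrapper. This is an illustration, not the scripts' retry logic: `ApiError` is a hypothetical stand-in for the SDK's own exception type, and the attempt count and delays are arbitrary defaults.

```python
import random
import time

RETRYABLE_STATUS = {429, 500}   # the two codes the list above says to retry


class ApiError(Exception):
    """Hypothetical stand-in for an SDK error carrying an HTTP status."""

    def __init__(self, status: int):
        super().__init__(f"HTTP {status}")
        self.status = status


def with_backoff(call, max_attempts: int = 5, base_delay: float = 1.0,
                 sleep=time.sleep):
    """Run call(); on a retryable error wait base_delay * 2**attempt plus jitter."""
    for attempt in range(max_attempts):
        try:
            return call()
        except ApiError as err:
            last_attempt = attempt == max_attempts - 1
            if err.status not in RETRYABLE_STATUS or last_attempt:
                raise  # 400/401/403/404 need a fix, not a retry
            jitter = random.uniform(0, base_delay)
            sleep(base_delay * 2 ** attempt + jitter)
```

The jitter spreads out retries from parallel workers so they do not all hit the API again at the same moment.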

## Scripts Overview

All scripts support unified API key detection and error handling:

**gemini_batch_process.py**: Batch process multiple media files
- Supports all modalities (audio, image, video, PDF)
- Progress tracking and error recovery
- Output formats: JSON, Markdown, CSV
- Rate limiting and retry logic
- Dry-run mode

**media_optimizer.py**: Prepare media for Gemini API
- Compress videos/audio for size limits
- Resize images appropriately
- Split long videos into chunks
- Format conversion
- Quality vs size optimization

**document_converter.py**: Convert documents to PDF
- Convert DOCX, XLSX, PPTX to PDF
- Extract page ranges
- Optimize PDFs for Gemini
- Extract images from PDFs
- Batch conversion support

Run any script with `--help` for detailed usage.

## Resources

- [Audio API Docs](https://ai.google.dev/gemini-api/docs/audio)
- [Image API Docs](https://ai.google.dev/gemini-api/docs/image-understanding)
- [Video API Docs](https://ai.google.dev/gemini-api/docs/video-understanding)
- [Document API Docs](https://ai.google.dev/gemini-api/docs/document-processing)
- [Image Gen Docs](https://ai.google.dev/gemini-api/docs/image-generation)
- [Get API Key](https://aistudio.google.com/apikey)
- [Pricing](https://ai.google.dev/pricing)