Initial commit

2025-11-30 08:48:52 +08:00
commit 6ec3196ecc
434 changed files with 125248 additions and 0 deletions
--- a/skills/ai-multimodal/SKILL.md
+++ b/skills/ai-multimodal/SKILL.md
@@ -0,0 +1,357 @@
+---
+name: ai-multimodal
+description: Process and generate multimedia content using the Google Gemini API. Capabilities: analyze audio files (transcription with timestamps, summarization, speech understanding, music/sound analysis up to 9.5 hours), understand images (captioning, object detection, OCR, visual Q&A, segmentation), process videos (scene detection, Q&A, temporal analysis, YouTube URLs, up to 6 hours), extract from documents (PDF tables, forms, charts, diagrams, multi-page), generate images (text-to-image, editing, composition, refinement). Use when working with audio/video files, analyzing images or screenshots, processing PDF documents, extracting structured data from media, creating images from text prompts, or implementing multimodal AI features. Supports multiple models (Gemini 2.5/2.0) with context windows up to 2M tokens.
+license: MIT
+allowed-tools:
+  - Bash
+  - Read
+  - Write
+  - Edit
+---
+
+# AI Multimodal Processing Skill
+
+Process audio, images, videos, and documents, and generate images, using Google Gemini's multimodal API. This skill provides a unified interface for multimedia content understanding and generation.
+
+## Core Capabilities
+
+### Audio Processing
+- Transcription with timestamps (up to 9.5 hours)
+- Audio summarization and analysis
+- Speech understanding and speaker identification
+- Music and environmental sound analysis
+- Text-to-speech generation with controllable voice
+
+### Image Understanding
+- Image captioning and description
+- Object detection with bounding boxes (2.0+)
+- Pixel-level segmentation (2.5+)
+- Visual question answering
+- Multi-image comparison (up to 3,600 images)
+- OCR and text extraction
+
+### Video Analysis
+- Scene detection and summarization
+- Video Q&A with temporal understanding
+- Transcription with visual descriptions
+- YouTube URL support
+- Long video processing (up to 6 hours)
+- Frame-level analysis
+
+### Document Extraction
+- Native PDF vision processing (up to 1,000 pages)
+- Table and form extraction
+- Chart and diagram analysis
+- Multi-page document understanding
+- Structured data output (JSON schema)
+- Format conversion (PDF to HTML/JSON)
+
+### Image Generation
+- Text-to-image generation
+- Image editing and modification
+- Multi-image composition (up to 3 images)
+- Iterative refinement
+- Multiple aspect ratios (1:1, 16:9, 9:16, 4:3, 3:4)
+- Controllable style and quality
+
+## Capability Matrix
+
+| Task | Audio | Image | Video | Document | Generation |
+|------|:-----:|:-----:|:-----:|:--------:|:----------:|
+| Transcription | ✓ | - | ✓ | - | - |
+| Summarization | ✓ | ✓ | ✓ | ✓ | - |
+| Q&A | ✓ | ✓ | ✓ | ✓ | - |
+| Object Detection | - | ✓ | ✓ | - | - |
+| Text Extraction | - | ✓ | - | ✓ | - |
+| Structured Output | ✓ | ✓ | ✓ | ✓ | - |
+| Creation | TTS | - | - | - | ✓ |
+| Timestamps | ✓ | - | ✓ | - | - |
+| Segmentation | - | ✓ | - | - | - |
+
+## Model Selection Guide
+
+### Gemini 2.5 Series (Recommended)
+- **gemini-2.5-pro**: Highest quality, all features, 1M-2M context
+- **gemini-2.5-flash**: Best balance, all features, 1M-2M context
+- **gemini-2.5-flash-lite**: Lightweight, segmentation support
+- **gemini-2.5-flash-image**: Image generation only
+
+### Gemini 2.0 Series
+- **gemini-2.0-flash**: Fast processing, object detection
+- **gemini-2.0-flash-lite**: Lightweight option
+
+### Feature Requirements
+- **Segmentation**: Requires 2.5+ models
+- **Object Detection**: Requires 2.0+ models
+- **Multi-video**: Requires 2.5+ models
+- **Image Generation**: Requires flash-image model
+
+### Context Windows
+- **2M tokens**: ~6 hours video (low-res) or ~2 hours (default)
+- **1M tokens**: ~3 hours video (low-res) or ~1 hour (default)
+- **Audio**: 32 tokens/second (1 min = 1,920 tokens)
+- **PDF**: 258 tokens/page (fixed)
+- **Image**: 258-1,548 tokens based on size
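
The token rates listed above can be turned into a rough pre-flight estimate. The following is a minimal sketch, not part of the skill's scripts: the function and constant names are hypothetical, and the video rates (~300 tokens/second at default resolution, ~100 at low resolution) are the approximate figures quoted in the Token Rates list later in this file.

```python
# Approximate token rates quoted in this document (assumed, not read from the API).
AUDIO_TOKENS_PER_SECOND = 32
VIDEO_TOKENS_PER_SECOND = {"default": 300, "low": 100}
PDF_TOKENS_PER_PAGE = 258


def estimate_tokens(audio_seconds=0.0, video_seconds=0.0, pdf_pages=0,
                    video_resolution="default"):
    """Return a rough input-token estimate for a mixed-media request."""
    video_rate = VIDEO_TOKENS_PER_SECOND[video_resolution]
    total = (audio_seconds * AUDIO_TOKENS_PER_SECOND
             + video_seconds * video_rate
             + pdf_pages * PDF_TOKENS_PER_PAGE)
    return int(total)


def fits_context(tokens, context_window=1_000_000):
    """Check an estimate against a context window given in tokens."""
    return tokens <= context_window


if __name__ == "__main__":
    one_hour_low_res = estimate_tokens(video_seconds=3600, video_resolution="low")
    print(one_hour_low_res, fits_context(one_hour_low_res))
```

Because the rates are approximate, an estimate close to the window size is best treated as "probably too large" and trimmed or optimized first.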
+
+## Quick Start
+
+### Prerequisites
+
+**API Key Setup**: Supports both Google AI Studio and Vertex AI.
+
+The skill checks for `GEMINI_API_KEY` in this order:
+1. Process environment: `export GEMINI_API_KEY="your-key"`
+2. Project root: `.env`
+3. `.claude/.env`
+4. `.claude/skills/.env`
+5. `.claude/skills/ai-multimodal/.env`
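
The lookup order above could be implemented roughly as follows. This is an illustrative sketch using only the standard library, not the skill's actual loader (which may rely on `python-dotenv`); `find_api_key`, `read_env_value`, and the simplified `KEY=VALUE` parsing are assumptions made for the example.

```python
import os
from pathlib import Path

# Candidate .env files, relative to the project root, in the documented order.
ENV_FILE_CANDIDATES = (
    ".env",
    ".claude/.env",
    ".claude/skills/.env",
    ".claude/skills/ai-multimodal/.env",
)


def read_env_value(path, name):
    """Return the value of `name` from a simple KEY=VALUE file, or None."""
    try:
        lines = Path(path).read_text(encoding="utf-8").splitlines()
    except OSError:
        return None  # Missing or unreadable file: fall through to the next one.
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        if key.strip() == name:
            return value.strip().strip("'\"") or None
    return None


def find_api_key(project_root=".", environ=None, name="GEMINI_API_KEY"):
    """Check the process environment first, then each .env file in order."""
    environ = os.environ if environ is None else environ
    if environ.get(name):
        return environ[name]
    for relative in ENV_FILE_CANDIDATES:
        value = read_env_value(Path(project_root) / relative, name)
        if value:
            return value
    return None
```

The first match wins, so a key exported in the shell overrides any `.env` file, and a project-level `.env` overrides the skill-level ones.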
+
+**Get API key**: https://aistudio.google.com/apikey
+
+**For Vertex AI**:
+```bash
+export GEMINI_USE_VERTEX=true
+export VERTEX_PROJECT_ID=your-gcp-project-id
+export VERTEX_LOCATION=us-central1  # Optional
+```
+
+**Install SDK**:
+```bash
+pip install google-genai python-dotenv pillow
+```
+
+### Common Patterns
+
+**Transcribe Audio**:
+```bash
+python scripts/gemini_batch_process.py \
+  --files audio.mp3 \
+  --task transcribe \
+  --model gemini-2.5-flash
+```
+
+**Analyze Image**:
+```bash
+python scripts/gemini_batch_process.py \
+  --files image.jpg \
+  --task analyze \
+  --prompt "Describe this image" \
+  --output docs/assets/<output-name>.md \
+  --model gemini-2.5-flash
+```
+
+**Process Video**:
+```bash
+python scripts/gemini_batch_process.py \
+  --files video.mp4 \
+  --task analyze \
+  --prompt "Summarize key points with timestamps" \
+  --output docs/assets/<output-name>.md \
+  --model gemini-2.5-flash
+```
+
+**Extract from PDF**:
+```bash
+python scripts/gemini_batch_process.py \
+  --files document.pdf \
+  --task extract \
+  --prompt "Extract table data as JSON" \
+  --output docs/assets/<output-name>.md \
+  --format json
+```
+
+**Generate Image**:
+```bash
+python scripts/gemini_batch_process.py \
+  --task generate \
+  --prompt "A futuristic city at sunset" \
+  --output docs/assets/<output-file-name> \
+  --model gemini-2.5-flash-image \
+  --aspect-ratio 16:9
+```
+
+**Optimize Media**:
+```bash
+# Prepare large video for processing
+python scripts/media_optimizer.py \
+  --input large-video.mp4 \
+  --output docs/assets/<output-file-name> \
+  --target-size 100MB
+
+# Batch optimize multiple files
+python scripts/media_optimizer.py \
+  --input-dir ./videos \
+  --output-dir docs/assets/optimized \
+  --quality 85
+```
+
+**Convert Documents to Markdown**:
+```bash
+# Convert DOCX to Markdown
+python scripts/document_converter.py \
+  --input document.docx \
+  --output docs/assets/document.md
+
+# Extract pages
+python scripts/document_converter.py \
+  --input large.pdf \
+  --output docs/assets/chapter1.md \
+  --pages 1-20
+```
+
+## Supported Formats
+
+### Audio
+- WAV, MP3, AAC, FLAC, OGG Vorbis, AIFF
+- Max 9.5 hours per request
+- Auto-downsampled to 16 Kbps mono
+
+### Images
+- PNG, JPEG, WEBP, HEIC, HEIF
+- Max 3,600 images per request
+- Resolution: ≤384px = 258 tokens, larger = tiled
+
+### Video
+- MP4, MPEG, MOV, AVI, FLV, MPG, WebM, WMV, 3GPP
+- Max 6 hours (low-res) or 2 hours (default)
+- YouTube URLs supported (public only)
+
+### Documents
+- PDF only for vision processing
+- Max 1,000 pages
+- TXT, HTML, Markdown supported (text-only)
+
+### Size Limits
+- **Inline**: <20MB total request
+- **File API**: 2GB per file, 20GB project quota
+- **Retention**: 48 hours auto-delete
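
A small helper can make the inline versus File API decision explicit. The sketch below is hypothetical (the names are not from the skill's scripts) and assumes "MB" and "GB" mean binary units; it also routes files that will be queried repeatedly to the File API, as the best practices in this document suggest.

```python
# Size thresholds taken from the limits listed above (binary units assumed).
INLINE_LIMIT_BYTES = 20 * 1024 * 1024   # inline requests must stay under 20MB
FILE_API_LIMIT_BYTES = 2 * 1024 ** 3    # File API accepts up to 2GB per file


def choose_upload_method(file_size_bytes, reused=False):
    """Return "inline", "file_api", or "too_large" for a single file."""
    if file_size_bytes > FILE_API_LIMIT_BYTES:
        return "too_large"  # Needs splitting or compression first.
    if reused or file_size_bytes >= INLINE_LIMIT_BYTES:
        return "file_api"
    return "inline"


if __name__ == "__main__":
    print(choose_upload_method(5 * 1024 * 1024))
    print(choose_upload_method(150 * 1024 * 1024))
```

Note that the inline limit applies to the whole request, so a prompt with several attachments should compare the combined size rather than each file on its own.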
+
+## Reference Navigation
+
+For detailed implementation guidance, see:
+
+### Audio Processing
+- `references/audio-processing.md` - Transcription, analysis, TTS
+  - Timestamp handling and segment analysis
+  - Multi-speaker identification
+  - Non-speech audio analysis
+  - Text-to-speech generation
+
+### Image Understanding
+- `references/vision-understanding.md` - Captioning, detection, OCR
+  - Object detection and localization
+  - Pixel-level segmentation
+  - Visual question answering
+  - Multi-image comparison
+
+### Video Analysis
+- `references/video-analysis.md` - Scene detection, temporal understanding
+  - YouTube URL processing
+  - Timestamp-based queries
+  - Video clipping and FPS control
+  - Long video optimization
+
+### Document Extraction
+- `references/document-extraction.md` - PDF processing, structured output
+  - Table and form extraction
+  - Chart and diagram analysis
+  - JSON schema validation
+  - Multi-page handling
+
+### Image Generation
+- `references/image-generation.md` - Text-to-image, editing
+  - Prompt engineering strategies
+  - Image editing and composition
+  - Aspect ratio selection
+  - Safety settings
+
+## Cost Optimization
+
+### Token Costs
+**Model Pricing** (per 1M tokens):
+- Gemini 2.5 Flash: $1.00/1M input, $0.10/1M output
+- Gemini 2.5 Pro: $3.00/1M input, $12.00/1M output
+- Gemini 1.5 Flash: $0.70/1M input, $0.175/1M output
+
+**Token Rates**:
+- Audio: 32 tokens/second (1 min = 1,920 tokens)
+- Video: ~300 tokens/second (default) or ~100 (low-res)
+- PDF: 258 tokens/page (fixed)
+- Image: 258-1,548 tokens based on size
+
+**TTS Pricing**:
+- Flash TTS: $10/1M tokens
+- Pro TTS: $20/1M tokens
+
+### Best Practices
+1. Use `gemini-2.5-flash` for most tasks (best price/performance)
+2. Use File API for files >20MB or repeated queries
+3. Optimize media before upload (see `media_optimizer.py`)
+4. Process specific segments instead of full videos
+5. Use lower FPS for static content
+6. Implement context caching for repeated queries
+7. Batch process multiple files in parallel
+
+## Rate Limits
+
+**Free Tier**:
+- 10-15 RPM (requests per minute)
+- 1M-4M TPM (tokens per minute)
+- 1,500 RPD (requests per day)
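
One way to stay under a requests-per-minute cap is to space calls evenly on the client side. The class below is a minimal sketch under that assumption; `RequestThrottle` is a hypothetical name, and the skill's own rate limiting may work differently. The clock and sleep functions are injectable so the behavior can be checked without real waiting.

```python
import time


class RequestThrottle:
    """Enforce a minimum interval between requests (60 / RPM seconds)."""

    def __init__(self, requests_per_minute=10, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 60.0 / requests_per_minute
        self._clock = clock
        self._sleep = sleep
        self._last_request = None

    def wait(self):
        """Block until the next request is allowed, then record its start time."""
        now = self._clock()
        if self._last_request is not None:
            remaining = self.min_interval - (now - self._last_request)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last_request = now


if __name__ == "__main__":
    throttle = RequestThrottle(requests_per_minute=10)
    throttle.wait()  # First call returns immediately.
    print(f"minimum spacing: {throttle.min_interval:.1f}s")
```

Calling `throttle.wait()` before each API request keeps a single process at or below the configured rate; it does not coordinate across multiple processes or account for the per-day limit.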
+
+**YouTube Limits**:
+- Free tier: 8 hours/day
+- Paid tier: No length limits
+- Public videos only
+
+**Storage Limits**:
+- 20GB per project
+- 2GB per file
+- 48-hour retention
+
+## Error Handling
+
+Common errors and solutions:
+- **400**: Invalid format/size - validate before upload
+- **401**: Invalid API key - check configuration
+- **403**: Permission denied - verify API key restrictions
+- **404**: File not found - ensure file uploaded and active
+- **429**: Rate limit exceeded - implement exponential backoff
+- **500**: Server error - retry with backoff
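
The retry advice for 429 and 500 responses can be sketched as exponential backoff with jitter. This example is illustrative only: `ApiError` is a hypothetical exception type standing in for whatever error the SDK raises, and the delay parameters are arbitrary defaults, not values from the skill's scripts.

```python
import random
import time

# Status codes the list above says to retry; others indicate a request problem.
RETRYABLE_STATUS_CODES = {429, 500}


class ApiError(Exception):
    """Hypothetical error type carrying an HTTP status code."""

    def __init__(self, status_code, message=""):
        super().__init__(message or f"HTTP {status_code}")
        self.status_code = status_code


def call_with_backoff(func, max_attempts=5, base_delay=1.0, max_delay=30.0,
                      sleep=time.sleep):
    """Call func(), retrying retryable errors with exponentially growing delays."""
    for attempt in range(max_attempts):
        try:
            return func()
        except ApiError as error:
            last_attempt = attempt == max_attempts - 1
            if error.status_code not in RETRYABLE_STATUS_CODES or last_attempt:
                raise  # Non-retryable error, or retries exhausted.
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Jitter spreads out retries from concurrent workers.
            sleep(delay + random.uniform(0, delay / 2))
```

Errors such as 400, 401, 403, and 404 are re-raised immediately, since repeating the same request would not change the outcome.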
+
+## Scripts Overview
+
+All scripts support unified API key detection and error handling:
+
+**gemini_batch_process.py**: Batch process multiple media files
+- Supports all modalities (audio, image, video, PDF)
+- Progress tracking and error recovery
+- Output formats: JSON, Markdown, CSV
+- Rate limiting and retry logic
+- Dry-run mode
+
+**media_optimizer.py**: Prepare media for Gemini API
+- Compress videos/audio for size limits
+- Resize images appropriately
+- Split long videos into chunks
+- Format conversion
+- Quality vs size optimization
+
+**document_converter.py**: Convert documents to PDF
+- Convert DOCX, XLSX, PPTX to PDF
+- Extract page ranges
+- Optimize PDFs for Gemini
+- Extract images from PDFs
+- Batch conversion support
+
+Run any script with `--help` for detailed usage.
+
+## Resources
+
+- [Audio API Docs](https://ai.google.dev/gemini-api/docs/audio)
+- [Image API Docs](https://ai.google.dev/gemini-api/docs/image-understanding)
+- [Video API Docs](https://ai.google.dev/gemini-api/docs/video-understanding)
+- [Document API Docs](https://ai.google.dev/gemini-api/docs/document-processing)
+- [Image Gen Docs](https://ai.google.dev/gemini-api/docs/image-generation)
+- [Get API Key](https://aistudio.google.com/apikey)
+- [Pricing](https://ai.google.dev/pricing)