gh-dnvriend-aws-polly-tts-t…/skills/aws-polly-tts-tool/SKILL.md
2025-11-29 18:23:08 +08:00

name: skill-aws-polly-tts-tool
description: AWS Polly TTS CLI for text-to-speech synthesis

When to use

  • Converting text to lifelike speech using AWS Polly
  • Working with multiple voice engines and output formats
  • Tracking TTS costs and AWS billing
  • Implementing TTS in automation pipelines

AWS Polly TTS Tool Skill

Purpose

A professional AWS Polly text-to-speech CLI and Python library with an agent-friendly design. It converts text to lifelike speech using Amazon Polly's deep learning technology and supports 60+ voices in 30+ languages across four quality tiers (engines), with built-in cost tracking.

When to Use This Skill

Use this skill when:

  • You need to convert text to speech using AWS Polly
  • You want to explore available voices and engines
  • You need to track TTS costs or query billing data
  • You're building automation with TTS capabilities
  • You need SSML support for advanced speech control
  • You want to work with different audio formats

Do NOT use this skill for:

  • Non-AWS TTS services (Google, Azure, etc.)
  • Real-time streaming TTS (use AWS SDK directly)
  • Voice cloning or training (Polly doesn't support this)

CLI Tool: aws-polly-tts-tool

A professional AWS Polly TTS CLI and Python library, designed with a CLI-first philosophy for both command-line and programmatic use.

Installation

# Clone repository
git clone https://github.com/dnvriend/aws-polly-tts-tool.git
cd aws-polly-tts-tool

# Install with uv (Python 3.12)
uv tool install . --python 3.12

# Verify installation
aws-polly-tts-tool --version

Prerequisites

  • Python 3.12 (on Python 3.13+, audio playback through pydub fails because the standard-library audioop module was removed; saving to a file still works)
  • AWS credentials configured
  • ffmpeg for audio playback (not required for file output)
  • IAM permissions: polly:DescribeVoices, polly:SynthesizeSpeech, ce:GetCostAndUsage

Quick Start

# Play text with default voice
aws-polly-tts-tool synthesize "Hello world"

# Save to file
aws-polly-tts-tool synthesize "Hello world" --output speech.mp3

# List available voices
aws-polly-tts-tool list-voices

# Show pricing
aws-polly-tts-tool pricing

Progressive Disclosure

📖 Core Commands

synthesize - Convert Text to Speech

Main TTS command with full feature support including multiple engines, voices, and output formats.

Usage:

aws-polly-tts-tool synthesize "TEXT" [OPTIONS]

Arguments:

  • TEXT: Text to synthesize (required, or use --stdin)
  • --stdin / -s: Read text from stdin (enables piping)
  • --voice TEXT: Voice ID (default: Joanna)
  • --output PATH / -o PATH: Save audio to file instead of playing
  • --format TEXT / -f TEXT: Output format (mp3, ogg_vorbis, pcm) - default: mp3
  • --engine TEXT / -e TEXT: Voice engine (standard, neural, generative, long-form) - default: neural
  • --ssml: Treat input as SSML markup
  • --show-cost: Display character count and cost estimate
  • --region TEXT / -r TEXT: AWS region override
  • -V/-VV/-VVV: Verbosity (INFO/DEBUG/TRACE with AWS SDK details)

Examples:

# Basic synthesis with default voice (Joanna, neural)
aws-polly-tts-tool synthesize "Hello world"

# Use different voice and engine
aws-polly-tts-tool synthesize "Hello" --voice Matthew --engine generative

# Save to file with specific format
aws-polly-tts-tool synthesize "Hello world" --output speech.mp3 --format mp3

# Read from stdin
echo "Hello world" | aws-polly-tts-tool synthesize --stdin

# Read from file
cat article.txt | aws-polly-tts-tool synthesize --stdin --output article.mp3

# Use SSML for advanced control
aws-polly-tts-tool synthesize '<speak>Hello <break time="500ms"/> world</speak>' --ssml

# Show cost estimate
aws-polly-tts-tool synthesize "Hello world" --show-cost

# Multiple options combined with debugging
cat article.txt | aws-polly-tts-tool synthesize --stdin \
    --voice Joanna \
    --engine neural \
    --output article.mp3 \
    --show-cost \
    -VV

Output:

  • Audio played through speakers (default) or saved to file
  • Character count and cost estimate (with --show-cost)
  • Logs to stderr, keeping stdout clean for piping
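
For automation pipelines, the stdin form above can be driven from Python. The following is a minimal sketch, not part of the tool: `build_synthesize_command` is a hypothetical helper that only assembles the argument list from flags documented in this section, so it can be inspected before anything is executed.

```python
import shlex
from pathlib import Path


def build_synthesize_command(
    output: Path,
    voice: str = "Joanna",
    engine: str = "neural",
    show_cost: bool = False,
) -> list[str]:
    """Return an argv list for `aws-polly-tts-tool synthesize --stdin`.

    Hypothetical helper: it only uses flags listed in this document.
    """
    argv = [
        "aws-polly-tts-tool", "synthesize", "--stdin",
        "--voice", voice,
        "--engine", engine,
        "--output", str(output),
    ]
    if show_cost:
        argv.append("--show-cost")
    return argv


argv = build_synthesize_command(Path("article.mp3"), show_cost=True)
# Print a shell-safe rendering of the command for logging or review
print(shlex.join(argv))
```

To actually run the command, pass the text on stdin, for example with `subprocess.run(argv, input=text, text=True, check=True)`. Passing a list instead of a shell string avoids quoting problems when the text or file names contain spaces.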

list-voices - Discover Available Voices

List and filter AWS Polly voices by engine, language, and gender.

Usage:

aws-polly-tts-tool list-voices [OPTIONS]

Options:

  • --engine TEXT / -e TEXT: Filter by engine (standard, neural, generative, long-form)
  • --language TEXT / -l TEXT: Filter by language code (e.g., en-US, es-ES, fr-FR)
  • --gender TEXT / -g TEXT: Filter by gender (Female, Male)
  • --region TEXT / -r TEXT: AWS region override
  • -V/-VV/-VVV: Verbosity levels

Examples:

# List all voices
aws-polly-tts-tool list-voices

# Filter by engine
aws-polly-tts-tool list-voices --engine neural

# Filter by language
aws-polly-tts-tool list-voices --language en-US

# Combine filters
aws-polly-tts-tool list-voices --engine neural --language en --gender Female

# Use with grep for searching
aws-polly-tts-tool list-voices | grep British
aws-polly-tts-tool list-voices --engine generative | grep Spanish

Output: Table with Voice, Gender, Language, Engines (supported), and Description columns. Dynamically fetched from Polly API (always up-to-date).


list-engines - Display Voice Engines

Show all available voice engines with technology, pricing, and best use cases.

Usage:

aws-polly-tts-tool list-engines

Examples:

# Show all engines with details
aws-polly-tts-tool list-engines

Output: Table showing:

  • Standard ($4/1M chars) - Traditional concatenative TTS, 3,000 char limit
  • Neural ($16/1M chars) - Natural, human-like voices, 3,000 char limit
  • Generative ($30/1M chars) - Most advanced, emotionally engaged voices, 3,000 char limit
  • Long-form ($100/1M chars) - Optimized for audiobooks, 100,000 char limit

billing - Query AWS Costs

Query AWS Cost Explorer for actual Polly usage costs with engine breakdown.

Usage:

aws-polly-tts-tool billing [OPTIONS]

Options:

  • --days INT / -d INT: Number of days to query (default: 30)
  • --start-date TEXT: Custom start date (YYYY-MM-DD)
  • --end-date TEXT: Custom end date (YYYY-MM-DD)
  • --region TEXT / -r TEXT: AWS region for Cost Explorer
  • -V/-VV/-VVV: Verbosity levels

Examples:

# Last 30 days of Polly costs
aws-polly-tts-tool billing

# Last 7 days
aws-polly-tts-tool billing --days 7

# Custom date range
aws-polly-tts-tool billing --start-date 2025-01-01 --end-date 2025-01-31

# With verbose output
aws-polly-tts-tool billing --days 7 -V

Output: Total cost and breakdown by engine (Standard, Neural, Generative, Long-form) in USD.

Note: Requires IAM permission ce:GetCostAndUsage


pricing - Show Pricing Information

Display static pricing information for all Polly engines with cost examples.

Usage:

aws-polly-tts-tool pricing

Examples:

# Show pricing table and examples
aws-polly-tts-tool pricing

Output: Comprehensive pricing with:

  • Cost per 1M characters for each engine
  • Technology type and quality level
  • Character limits per request
  • Concurrent request limits
  • Free tier information
  • Best use cases
  • Cost examples (1,000 words, audiobooks)

info - Tool Configuration

Display AWS credentials status and tool configuration.

Usage:

aws-polly-tts-tool info

Examples:

# Verify AWS authentication and show config
aws-polly-tts-tool info

Output:

  • AWS credential status (Valid/Invalid)
  • Account ID, User ID, ARN
  • Available engines
  • Output formats
  • Useful command examples

completion - Shell Completion

Generate shell completion scripts for bash, zsh, or fish.

Usage:

aws-polly-tts-tool completion [bash|zsh|fish]

Arguments:

  • SHELL: Shell type (bash, zsh, or fish) - required

Examples:

# Generate bash completion
aws-polly-tts-tool completion bash

# Install for bash (add to ~/.bashrc)
eval "$(aws-polly-tts-tool completion bash)"

# Install for zsh (add to ~/.zshrc)
eval "$(aws-polly-tts-tool completion zsh)"

# Install for fish
aws-polly-tts-tool completion fish > ~/.config/fish/completions/aws-polly-tts-tool.fish

# File-based installation (recommended)
aws-polly-tts-tool completion bash > ~/.aws-polly-tts-tool-complete.bash
echo 'source ~/.aws-polly-tts-tool-complete.bash' >> ~/.bashrc

Output: Shell-specific completion script. After installation, restart shell or source config file.

⚙️ Advanced Features

SSML Support

Full SSML (Speech Synthesis Markup Language) support for advanced speech control.

Features:

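  • Prosody: Control rate, pitch, volume (neural voices do not support pitch; use the standard engine for it)
  • Breaks: Add pauses of specific duration
  • Emphasis: Add emphasis to words (standard engine only)
  • Speaking styles: Newscaster, conversational (neural engine, select voices)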
  • Phonemes: Control pronunciation

Examples:

# Basic pause
aws-polly-tts-tool synthesize '<speak>Hello <break time="500ms"/> world</speak>' --ssml

# Prosody control (speed, pitch); pitch is not supported by neural voices, so select the standard engine
aws-polly-tts-tool synthesize '<speak><prosody rate="slow" pitch="low">Deep voice</prosody></speak>' --ssml --engine standard

# Emphasis (standard engine only; neural voices do not support <emphasis>)
aws-polly-tts-tool synthesize '<speak>I <emphasis level="strong">really</emphasis> like this</speak>' --ssml --engine standard

# Newscaster style (neural engine, select voices such as Matthew and Joanna)
aws-polly-tts-tool synthesize '<speak><amazon:domain name="news">Breaking news today</amazon:domain></speak>' --ssml --voice Matthew

# Multiple prosody attributes (standard engine because of pitch)
aws-polly-tts-tool synthesize '<speak><prosody rate="fast" pitch="high" volume="loud">Excited announcement!</prosody></speak>' --ssml --engine standard
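
When SSML is generated from arbitrary text, reserved XML characters such as `&` and `<` must be escaped or Polly will reject the markup. The following sketch builds the basic pause example programmatically using only the Python standard library. `ssml_with_pause` is a hypothetical helper, not part of the tool.

```python
from xml.sax.saxutils import escape, quoteattr


def ssml_with_pause(before: str, after: str, pause_ms: int = 500) -> str:
    """Wrap two text fragments in <speak> with a <break> between them."""
    if pause_ms < 0:
        raise ValueError("pause_ms must be non-negative")
    # quoteattr adds the surrounding quotes and escapes the attribute value
    pause = quoteattr(f"{pause_ms}ms")
    # escape() converts &, < and > in the text to XML entities
    return f"<speak>{escape(before)} <break time={pause}/> {escape(after)}</speak>"


print(ssml_with_pause("Tom & Jerry", "start at 5 < 6"))
# <speak>Tom &amp; Jerry <break time="500ms"/> start at 5 &lt; 6</speak>
```

The resulting string can be passed to `synthesize` together with `--ssml`.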


Multi-Level Verbosity

Progressive logging detail for debugging without code changes.

Levels:

  • Default: Errors and warnings only (clean output)
  • -V (INFO): High-level operations (voice selection, file operations)
  • -VV (DEBUG): Detailed steps (validation, API calls, character counts)
  • -VVV (TRACE): Full AWS SDK internals (credentials, HTTP requests, boto3 events)

Examples:

# Default: No verbose output
aws-polly-tts-tool synthesize "Hello world" --output test.mp3

# INFO level (-V)
aws-polly-tts-tool synthesize "Hello world" -V --output test.mp3
# [INFO] Using voice: Joanna (neural engine)
# [INFO] Synthesizing audio to file: test.mp3

# DEBUG level (-VV)
aws-polly-tts-tool synthesize "Hello world" -VV --output test.mp3
# [DEBUG] Validating engine: neural
# [DEBUG] Validating output format: mp3
# [DEBUG] Initializing AWS Polly client
# [INFO] Using voice: Joanna (neural engine)
# [DEBUG] Synthesized 11 characters

# TRACE level (-VVV) - Full AWS SDK details
aws-polly-tts-tool synthesize "Hello world" -VVV --output test.mp3
# [DEBUG] Looking for credentials via: env
# [INFO] Found credentials in shared credentials file: ~/.aws/credentials
# [DEBUG] Starting new HTTPS connection (1): polly.eu-central-1.amazonaws.com:443
# [DEBUG] https://polly.eu-central-1.amazonaws.com:443 "POST /v1/speech HTTP/1.1" 200

Note: All logs go to stderr, keeping stdout clean for data/piping.
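
As an illustration of how a repeated `-V` flag can map onto Python's `logging` module, here is a minimal sketch. It is an assumption about one reasonable design, not the tool's actual implementation: `configure_logging` is hypothetical, and since Python has no built-in TRACE level, the third level is modelled as DEBUG plus unmuted SDK loggers.

```python
import logging


def configure_logging(verbosity: int) -> int:
    """Map a -V count (0 to 3) to a logging level and return the level used."""
    levels = {0: logging.WARNING, 1: logging.INFO}
    level = levels.get(verbosity, logging.DEBUG)
    # basicConfig without a stream argument logs to stderr;
    # force=True replaces handlers left over from earlier calls
    logging.basicConfig(level=level, format="[%(levelname)s] %(message)s", force=True)
    # Assumed TRACE behaviour: SDK loggers stay quiet unless -VVV is given
    sdk_level = logging.DEBUG if verbosity >= 3 else logging.WARNING
    for name in ("boto3", "botocore", "urllib3"):
        logging.getLogger(name).setLevel(sdk_level)
    return level


configure_logging(1)
logging.getLogger("example").info("Using voice: Joanna (neural engine)")
```

Keeping the SDK loggers at WARNING below the third level is what prevents credential lookups and HTTP connection messages from flooding `-VV` output.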


Library Usage

Import and use as a Python library for programmatic access.

Basic Usage:

from pathlib import Path

from aws_polly_tts_tool import (
    get_polly_client,
    synthesize_audio,
    save_speech,
    VoiceManager,
    calculate_cost,
)

# Initialize client
client = get_polly_client(region="us-east-1")

# Synthesize audio
audio_bytes, char_count = synthesize_audio(
    client=client,
    text="Hello world",
    voice_id="Joanna",
    output_format="mp3",
    engine="neural"
)

# Save to file
save_speech(
    client=client,
    text="Hello world",
    voice_id="Joanna",
    output_path=Path("output.mp3"),
    engine="neural"
)

# List voices
voice_manager = VoiceManager(client)
voices = voice_manager.list_voices(engine="neural", language="en")

# Calculate cost
cost = calculate_cost(character_count=5000, engine="neural")
print(f"Estimated cost: ${cost:.4f}")

Public API:

  • get_polly_client(region=None) - Initialize boto3 Polly client
  • synthesize_audio(client, text, voice_id, output_format, engine, text_type) - Synthesize audio
  • save_speech(client, text, voice_id, output_path, ...) - Save to file
  • play_speech(client, text, voice_id, ...) - Play through speakers
  • VoiceManager(client) - Voice discovery and management
  • calculate_cost(char_count, engine) - Cost estimation

Voice Engine Selection Guide

Standard Engine ($4/1M chars)

  • Technology: Traditional concatenative TTS
  • Quality: Basic synthetic sound
  • Limit: 3,000 chars/request
  • Best for: Cost-sensitive applications, basic announcements
  • Free tier: 5M chars/month (12 months)

Neural Engine ($16/1M chars)

  • Technology: Deep learning neural networks
  • Quality: Natural, human-like voices
  • Limit: 3,000 chars/request
  • Best for: General-purpose TTS, recommended for most use cases
  • Free tier: 1M chars/month (12 months)

Generative Engine ($30/1M chars)

  • Technology: Advanced generative AI
  • Quality: Most lifelike, emotionally engaged
  • Limit: 3,000 chars/request
  • Best for: High-quality content, brand voices, engaging experiences
  • Free tier: None

Long-form Engine ($100/1M chars)

  • Technology: Neural with long-context optimization
  • Quality: Consistent over long passages
  • Limit: 100,000 chars/request
  • Best for: Audiobooks, long articles, consistent narration
  • Free tier: None

Decision Matrix:

  • Budget-conscious → Standard
  • General use → Neural (recommended)
  • Premium quality → Generative
  • Audiobooks/articles → Long-form

Cost Tracking Strategies

Immediate Estimates:

# Use --show-cost for instant character count and cost
aws-polly-tts-tool synthesize "Text" --show-cost

Actual Billing:

# Query real AWS costs with Cost Explorer
aws-polly-tts-tool billing --days 30

Cost Optimization Tips:

  1. Use Standard engine for non-critical audio
  2. Cache synthesized audio files to avoid re-synthesis
  3. Batch process text for efficiency
  4. Use Long-form engine only for actual long content
  5. Monitor with billing command regularly
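
Tip 2 (caching) can be sketched as a deterministic file name derived from everything that influences the audio. This is a hedged illustration, not a feature of the tool: `cache_path` and the `tts-cache` directory are hypothetical names.

```python
import hashlib
from pathlib import Path


def cache_path(cache_dir: Path, text: str, voice: str, engine: str, fmt: str = "mp3") -> Path:
    """Return a deterministic file path for one synthesis request."""
    # Join fields with a NUL byte so ("ab", "c") and ("a", "bc") hash differently
    key = "\0".join((voice, engine, fmt, text)).encode("utf-8")
    digest = hashlib.sha256(key).hexdigest()
    # The format name doubles as the file extension in this sketch
    return cache_dir / f"{digest}.{fmt}"


path = cache_path(Path("tts-cache"), "Hello world", "Joanna", "neural")
if path.exists():
    print(f"cache hit: {path}")
else:
    print(f"cache miss: synthesize once and save to {path}")
```

On a cache miss, the path can be passed to `--output`. Later requests with the same text, voice, engine, and format then reuse the file instead of paying for another synthesis.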

Cost Examples:

  • 1,000 words (~5,000 chars):
    • Standard: $0.02
    • Neural: $0.08
    • Generative: $0.15
    • Long-form: $0.50
  • 50,000 word audiobook (~250,000 chars):
    • Standard: $1.00
    • Neural: $4.00
    • Generative: $7.50
    • Long-form: $25.00
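
The figures above follow from cost = characters × price per 1M characters / 1,000,000. A standalone sketch of that arithmetic, using the per-engine prices listed in this document, is shown below. `estimate_cost` is a hypothetical name and is independent of the library's own `calculate_cost`.

```python
from decimal import Decimal

# USD per 1M characters, as listed in this document
PRICE_PER_MILLION = {
    "standard": Decimal("4"),
    "neural": Decimal("16"),
    "generative": Decimal("30"),
    "long-form": Decimal("100"),
}


def estimate_cost(char_count: int, engine: str) -> Decimal:
    """Estimate the USD cost of synthesizing `char_count` characters."""
    if char_count < 0:
        raise ValueError("char_count must be non-negative")
    return PRICE_PER_MILLION[engine] * char_count / Decimal(1_000_000)


# 1,000 words at roughly 5 characters per word
for engine in PRICE_PER_MILLION:
    print(f"{engine}: ${estimate_cost(5_000, engine):.2f}")
# standard: $0.02
# neural: $0.08
# generative: $0.15
# long-form: $0.50
```

`Decimal` is used instead of `float` so that small per-request amounts add up without rounding drift.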

🔧 Troubleshooting

Common Issues

Issue: No AWS credentials found

# Symptom
Error: Unable to locate credentials

Solution:

# Configure AWS credentials
aws configure

# Or set environment variables
export AWS_ACCESS_KEY_ID="your-access-key"
export AWS_SECRET_ACCESS_KEY="your-secret-key"
export AWS_DEFAULT_REGION="us-east-1"

# Verify with
aws-polly-tts-tool info

Issue: Audio playback fails on Python 3.13+

# Symptom
Error: No module named 'audioop'

Solution:

Option 1: Use Python 3.12 (recommended)

mise use python@3.12
uv tool install . --python 3.12

Option 2: Save to file instead (works on all Python versions)

aws-polly-tts-tool synthesize "Hello" --output speech.mp3

Issue: Voice not found

# Symptom
Error: Voice 'invalid' not found

Solution:

# List available voices
aws-polly-tts-tool list-voices

# Filter by engine
aws-polly-tts-tool list-voices --engine neural

# Voice names are case-sensitive
aws-polly-tts-tool synthesize "Hello" --voice Joanna  # Correct

Issue: Engine not supported by voice

# Symptom
Error: Voice doesn't support this engine

Solution:

# Check which engines a voice supports
aws-polly-tts-tool list-voices | grep "VoiceName"

# Not all voices support all engines
# Example: a voice that only offers the standard engine cannot be used with --engine neural

Issue: Cost Explorer access denied

# Symptom
Error: AccessDeniedException when calling GetCostAndUsage

Solution: Add IAM permission ce:GetCostAndUsage:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["ce:GetCostAndUsage"],
      "Resource": "*"
    }
  ]
}

Issue: Text too long for engine

# Symptom
Error: Text exceeds character limit

Solution:

  • Standard/Neural/Generative: Max 3,000 chars per request
  • Long-form: Max 100,000 chars per request
  • Split long text into chunks or use Long-form engine
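
Splitting long text into chunks can be sketched as follows. This is a minimal illustration for plain text only (splitting SSML this way would break the markup), and `split_text` is a hypothetical helper rather than part of the tool. It uses Python's `len`, which only approximates how Polly counts billed characters.

```python
import re


def split_text(text: str, limit: int = 3000) -> list[str]:
    """Split plain text into chunks of at most `limit` characters,
    preferring to break at sentence boundaries."""
    if limit < 1:
        raise ValueError("limit must be positive")
    # Split after ., ! or ? when followed by whitespace
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        # Hard-split any single sentence that is itself longer than the limit
        while len(sentence) > limit:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:limit])
            sentence = sentence[limit:]
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= limit:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks


print(split_text("One. Two. Three.", limit=10))
# ['One. Two.', 'Three.']
```

Each chunk can then be synthesized separately, for example by piping it to `synthesize --stdin --output part-N.mp3`.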

Getting Help

# General help
aws-polly-tts-tool --help

# Command-specific help
aws-polly-tts-tool synthesize --help
aws-polly-tts-tool list-voices --help

# Show version
aws-polly-tts-tool --version

# Verify configuration
aws-polly-tts-tool info

Debug Mode

Use progressive verbosity to diagnose issues:

# Basic debug info
aws-polly-tts-tool synthesize "Hello" -V

# Detailed debug info
aws-polly-tts-tool synthesize "Hello" -VV

# Full AWS SDK trace
aws-polly-tts-tool synthesize "Hello" -VVV

Best Practices

  1. Default to Neural Engine: Best balance of quality and cost for most use cases
  2. Use SSML for Control: Add pauses, emphasis, and prosody for natural speech
  3. Cache Audio Files: Save synthesized audio to avoid repeated API calls and costs
  4. Monitor Costs: Use billing command to track actual spending
  5. Validate Voice Support: Use list-voices to check engine compatibility before synthesis
  6. Save Critical Audio: Use --output to save important audio for offline use
  7. Use Verbosity: Add -V/-VV/-VVV when debugging issues
  8. Leverage stdin: Pipe text from files or commands for automation

Resources