Initial commit

2025-11-30 08:38:41 +08:00
commit 7d5b628e7d
13 changed files with 1610 additions and 0 deletions
--- a/skills/audio-transcript-cleanup/SKILL.md
+++ b/skills/audio-transcript-cleanup/SKILL.md
@@ -0,0 +1,54 @@
+---
+name: audio-transcription-cleanup
+description: Transform messy voice transcription text into well-formatted, human-readable documents while preserving original meaning
+---
+
+# Audio Transcription Cleanup
+
+Clean up raw audio transcriptions by removing filler words, fixing errors, and adding proper structure.
+
+## Usage
+
+Use the `audio_transcript_cleanup.py` script to process transcript files:
+
+```bash
+# Use default output location (~/tmp/cleaned_transcript.md - allows overwrite)
+python scripts/audio_transcript_cleanup.py --transcript-file /path/to/transcript.txt
+
+# Specify custom output location (cannot overwrite existing files)
+python scripts/audio_transcript_cleanup.py --transcript-file /path/to/transcript.txt --output /path/to/output.md
+```
+
+## What It Does
+
+The script automatically:
+- Removes verbal artifacts (um, uh, like, you know, 呃, 啊, 那个, etc.)
+- Fixes spelling and grammar errors
+- Adds semantic paragraph breaks and section headings
+- Converts spoken fragments into complete sentences
+- Preserves all original information (no summarization)
+- Auto-detects language and maintains natural expression
+
+## Options
+
+- `--transcript-file` (required) - Path to the transcript file to clean up
+- `--output` (optional) - Custom output path (default: `~/tmp/cleaned_transcript.md`)
+
+## Output Behavior
+
+- **Default location**: `~/tmp/cleaned_transcript.md` - Allows overwrite
+- **Custom location**: Cannot overwrite existing files (raises error if file exists)
+
+## Language Support
+
+Auto-detects and works with:
+- English
+- Chinese (Mandarin, Cantonese)
+- Mixed language content
+- Multi-speaker transcriptions
+
+## Requirements
+
+- Python 3.11+
+- Claude CLI must be installed and accessible
+- Transcript file must exist at specified path
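Note on the "Output Behavior" section above: the overwrite rule can be isolated into a small function, which makes it easier to test than the inline check in `main()`. The sketch below is an illustration only; `resolve_output_path` is a hypothetical helper that does not exist in this commit, and it assumes the same semantics the README describes (the default path may be overwritten, any other existing path is rejected).

```python
from pathlib import Path
from typing import Optional


def resolve_output_path(output: Optional[str], default: Path) -> Path:
    """Return the path to write to, following the README's overwrite rule.

    Hypothetical helper, not part of the commit.
    """
    if not output:
        # No --output given: fall back to the default, which may be overwritten.
        return default
    path = Path(output).expanduser()
    # A custom location must not clobber an existing file.
    if path != default and path.exists():
        raise FileExistsError(f"Output file already exists at {path}")
    return path
```

Calling `resolve_output_path(None, default)` returns `default` unchanged, while passing the path of an existing file other than `default` raises `FileExistsError`.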
--- a/skills/audio-transcript-cleanup/scripts/audio_transcript_cleanup.py
+++ b/skills/audio-transcript-cleanup/scripts/audio_transcript_cleanup.py
@@ -0,0 +1,329 @@
+# -*- coding: utf-8 -*-
+
+"""
+Audio Transcription Cleanup Script
+
+This module transforms messy voice transcription text into well-formatted, human-readable
+markdown documents using Claude CLI. It preserves all original meaning and content while
+improving readability through intelligent text cleanup and restructuring.
+
+Key Features:
+    - Remove verbal artifacts (um, uh, like, filler words in any language)
+    - Fix spelling and grammar errors
+    - Add semantic paragraph breaks and section headings
+    - Support for single-speaker and multi-speaker content
+    - Multi-language support (English, Chinese, and more)
+    - Preserve all original information (no summarization)
+    - Default output location with overwrite capability
+    - Custom output locations with overwrite protection
+
+Default Behavior:
+    - Output: ~/tmp/cleaned_transcript.md (allows overwrite)
+    - Language: Auto-detected from source
+    - Format: Well-structured markdown with headings
+
+Custom Output:
+    - Cannot overwrite existing files (raises FileExistsError)
+    - Creates parent directories automatically
+
+Text Cleanup Operations:
+    - Remove filler words: um, uh, like, you know, 呃, 啊, 那个, 就是说
+    - Fix misrecognized nouns (names, places, technical terms) and typos
+    - Correct grammar mistakes
+    - Convert spoken fragments to complete sentences
+    - Add descriptive section headings
+    - Organize content into semantic paragraphs
+
+Example Usage:
+    # Use default output location (allows overwrite)
+    $ python audio_transcript_cleanup.py --transcript-file /path/to/transcript.txt
+
+    # Specify custom output location (cannot overwrite existing file)
+    $ python audio_transcript_cleanup.py --transcript-file /path/to/transcript.txt --output ~/Documents/cleaned.md
+
+Requirements:
+    - Claude CLI must be installed and accessible
+    - Transcript file must exist at specified path
+
+Author: sanhe
+Plugin: youtube@sanhe-claude-code-plugins
+"""
+
+import subprocess
+import argparse
+from pathlib import Path
+
+dir_home = Path.home()
+path_cleaned_transcript = dir_home / "tmp" / "cleaned_transcript.md"
+
+prompt = """
+## Task
+Transform the messy voice transcription text provided below into a well-formatted, human-readable document while preserving ALL original meaning and content.
+
+## Key Principles
+- **Preserve original meaning**: Do not summarize or omit information
+- **Fix transcription artifacts**: Remove filler words, false starts, and repetitions
+- **Improve readability**: Organize into proper paragraphs with clear structure
+- **Handle multi-speaker content**: Clearly attribute dialogue when multiple speakers are present
+- **Enhance document structure**: Add semantic paragraphing and descriptive headings
+
+## Processing Instructions
+
+### Step 1: Analyze the Input
+Examine the transcription to identify:
+- **Primary language**: Determine the dominant language of the transcription
+- **Number of speakers**: Single vs. multi-speaker content
+- **Main topics or themes**: Identify distinct topics for sectioning
+- **Transcription quality issues**: Filler words, repetitions, false starts, obvious errors
+
+### Step 2: Text Cleanup Operations
+
+1. **Remove verbal artifacts**:
+   - Filler words: "um", "uh", "like", "you know", "呃", "啊", "那个", "就是说", "嗯", "然后"
+   - False starts and self-corrections
+   - Unnecessary repetitions
+
+2. **Fix errors and improve accuracy**:
+   - **Correct obvious noun errors**: Fix misrecognized names, places, technical terms
+   - **Fix spelling mistakes**: Correct typos and transcription errors
+   - **Fix grammar errors**: Subject-verb agreement, tense consistency, word order
+   - **Clarify ambiguous terms**: Use context to determine correct words when transcription is unclear
+
+3. **Organize content structure**:
+   - **Semantic paragraphing**: Group related ideas into logical paragraphs based on meaning
+   - **Add section headings**: Create descriptive headings that summarize each section's content
+   - Convert spoken fragments into complete sentences
+   - Create logical flow between sections
+
+4. **For multi-speaker content**:
+   - Use clear speaker labels (Speaker A, B, or actual names if identified)
+   - Format as dialogue or meeting notes
+   - Preserve conversational context
+
+### Step 3: Language and Formatting
+
+**Language Selection Rules** (in priority order):
+1. If user explicitly specifies output language → use that language
+2. Otherwise → use the primary language of the original transcription
+3. Preserve technical terms and proper nouns in their original language/form
+
+**For single-speaker content**:
+```markdown
+# [Main Topic Title]
+
+## [Section 1 Heading]
+
+[Cleaned paragraph 1 content...]
+
+[Cleaned paragraph 2 content...]
+
+## [Section 2 Heading]
+
+[Cleaned content...]
+```
+
+**For multi-speaker content** - Option 1 (Dialogue format):
+```markdown
+# [Meeting/Discussion Title]
+
+## [Topic 1]
+
+**Speaker A:** [cleaned content]
+
+**Speaker B:** [cleaned content]
+
+## [Topic 2]
+
+**Speaker A:** [cleaned content]
+```
+
+**For multi-speaker content** - Option 2 (Meeting notes format):
+```markdown
+# [Meeting Title]
+
+## [Discussion Topic 1]
+
+Speaker A mentioned that [cleaned content]...
+
+Speaker B responded by explaining [cleaned content]...
+
+## [Discussion Topic 2]
+
+The group discussed [cleaned content]...
+```
+
+## Important Constraints
+- **NEVER summarize**: Include all information from original transcription
+- **NEVER add new substantive information**: Only reorganize and clarify existing content
+- **NEVER change the speaker's meaning or intent**
+- **DO add section headings**: These are structural additions that help readability
+- **Preserve technical terms and specific names** exactly as intended (fix transcription errors only)
+
+## Heading Guidelines
+- Section headings should be **concise and descriptive** (3-8 words typically)
+- Headings should reflect the **actual content** of each section
+- Use appropriate heading levels (H1 for main title, H2 for major sections, H3 for subsections)
+- Headings are added to **improve navigation**, not to interpret or editorialize
+
+## Language Support
+This skill works with transcriptions in any language:
+- **Chinese** (Mandarin, Cantonese, etc.)
+- **English**
+- **Mixed language** content (preserve code-switching naturally)
+- **Other languages** (apply language-appropriate grammar rules)
+
+## OUTPUT INSTRUCTIONS
+
+**CRITICAL**: Your response must contain ONLY the cleaned markdown document. 
+
+**DO NOT include**:
+- Any explanations before the document
+- Any explanations after the document
+- Any meta-commentary about the cleaning process
+- Any acknowledgments like "Here is the cleaned version"
+- Any markdown code fences (no ```markdown blocks)
+- Any introductory or concluding remarks
+
+**START your response immediately with the H1 title** (# [Title]) and **END immediately after the last content paragraph**.
+
+Your entire response = the cleaned document itself, nothing more.
+
+## INPUT TRANSCRIPTION:
+""".strip()
+
+
+def cleanup_transcript(path_transcript: Path) -> str:
+    """
+    Clean up an audio transcription with Claude CLI and return the cleaned markdown.
+
+    Args:
+        path_transcript: Path to the raw transcript file
+    """
+    # Read transcript content
+    content = path_transcript.read_text(encoding="utf-8")
+
+    # Call Claude CLI to process the transcript
+    args = [
+        "claude",
+        "--append-system-prompt",
+        prompt,
+        "--print",
+        content,
+    ]
+
+    result = subprocess.run(
+        args,
+        capture_output=True,
+        encoding="utf-8",  # text mode; decode as UTF-8 regardless of locale
+    )
+
+    if result.returncode != 0:
+        raise RuntimeError(f"Claude CLI failed: {result.stderr}")
+
+    # Return the cleaned output; the caller writes it to disk
+    cleaned_transcript = result.stdout
+    return cleaned_transcript
+
+
+def main():
+    """
+    Main CLI entry point for cleaning up audio transcriptions.
+
+    This script transforms messy voice transcription text into well-formatted,
+    human-readable markdown documents while preserving the original meaning.
+
+    Example usage:
+        python audio_transcript_cleanup.py --transcript-file "/path/to/transcript.txt"
+
+    What it does:
+        - Removes filler words and verbal artifacts (um, uh, like, etc.)
+        - Fixes obvious spelling and grammar errors
+        - Adds semantic paragraph breaks and section headings
+        - Preserves all original information (no summarization)
+        - Maintains the speaker's meaning and intent
+
+    Requirements:
+        - Transcript file must exist
+        - Claude CLI must be installed and accessible
+    """
+    parser = argparse.ArgumentParser(
+        description="Clean up audio transcriptions into well-formatted markdown documents",
+        formatter_class=argparse.RawDescriptionHelpFormatter,
+        epilog="""
+Examples:
+  # Use default output location (allows overwrite)
+  %(prog)s --transcript-file "~/tmp/transcript.txt"
+
+  # Specify custom output location (cannot overwrite existing file)
+  %(prog)s --transcript-file "~/tmp/transcript.txt" --output "~/Documents/cleaned.md"
+
+Cleanup Operations:
+  - Remove verbal artifacts: um, uh, like, you know, 呃, 啊, 那个
+  - Fix spelling and grammar errors
+  - Add semantic paragraphs and section headings
+  - Convert spoken fragments into complete sentences
+  - Preserve all original information
+
+Output Behavior:
+  - Default location (~/tmp/cleaned_transcript.md): Allows overwrite
+  - Custom location: Cannot overwrite existing files (will raise error)
+
+Note:
+  This script uses Claude CLI to perform intelligent transcript cleanup.
+  All original information is preserved - no summarization occurs.
+        """,
+    )
+
+    parser.add_argument(
+        "--transcript-file",
+        type=str,
+        required=True,
+        help="Path to the transcript file to clean up (required)",
+    )
+
+    parser.add_argument(
+        "--output",
+        type=str,
+        default=None,
+        help=f"Output file path for cleaned transcript (default: {path_cleaned_transcript}). Note: Default location allows overwrite, custom locations cannot overwrite existing files.",
+    )
+
+    args = parser.parse_args()
+
+    # Convert to Path objects and expand user home directory
+    path_transcript = Path(args.transcript_file).expanduser()
+
+    # Determine output path
+    if args.output:
+        path_output = Path(args.output).expanduser()
+        # For custom locations, check if file already exists
+        if (path_output != path_cleaned_transcript) and path_output.exists():
+            raise FileExistsError(
+                f"Output file already exists at {path_output}. "
+                f"Please choose a different location or remove the existing file. "
+                f"(Default location {path_cleaned_transcript} allows overwrite)"
+            )
+    else:
+        # Use default location (allows overwrite)
+        path_output = path_cleaned_transcript
+
+    # Ensure output directory exists
+    path_output.parent.mkdir(parents=True, exist_ok=True)
+
+    # Clean up the transcript
+    cleaned_transcript = cleanup_transcript(
+        path_transcript=path_transcript,
+    )
+
+    # Save to output file
+    path_output.write_text(cleaned_transcript, encoding="utf-8")
+
+    # Print success message with clickable file path
+    file_uri = path_output.resolve().as_uri()
+    print("✓ Transcript cleanup completed successfully!")
+    print(f"✓ Cleaned transcript saved to: {file_uri}")
+    print("\nClick the link above to open the document.")
+
+
+if __name__ == "__main__":
+    main()
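Note on the output handling: the prompt tells the model not to wrap its reply in a code fence, but `main()` writes `result.stdout` to disk verbatim, so a stray wrapping fence would end up in the saved file. A defensive post-processing step might look like the sketch below. `strip_wrapping_fence` is a hypothetical helper, not part of this commit, and it assumes that a valid reply starts with an H1 title, as the prompt requires, and therefore never legitimately begins with a fence.

```python
def strip_wrapping_fence(text: str) -> str:
    """Remove a single outer Markdown code fence if it wraps the whole reply.

    Hypothetical helper, not part of the commit.
    """
    lines = text.strip().splitlines()
    # Only unwrap when the first line opens a fence and the last line closes one.
    if len(lines) >= 2 and lines[0].startswith("```") and lines[-1].strip() == "```":
        lines = lines[1:-1]
    # Normalize to exactly one trailing newline.
    return "\n".join(lines).strip() + "\n"
```

If adopted, it would be applied to the return value of `cleanup_transcript` before `path_output.write_text(...)`. Replies without a wrapping fence pass through unchanged apart from whitespace normalization at the ends.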