Initial commit
54
skills/audio-transcript-cleanup/SKILL.md
Normal file
@@ -0,0 +1,54 @@
---
name: audio-transcription-cleanup
description: Transform messy voice transcription text into well-formatted, human-readable documents while preserving original meaning
---

# Audio Transcription Cleanup

Clean up raw audio transcriptions by removing filler words, fixing errors, and adding proper structure.

## Usage

Use the `audio_transcript_cleanup.py` script to process transcript files:

```bash
# Use default output location (~/tmp/cleaned_transcript.md - allows overwrite)
python scripts/audio_transcript_cleanup.py --transcript-file /path/to/transcript.txt

# Specify custom output location (cannot overwrite existing files)
python scripts/audio_transcript_cleanup.py --transcript-file /path/to/transcript.txt --output /path/to/output.md
```

## What It Does

The script automatically:
- Removes verbal artifacts (um, uh, like, you know, 呃, 啊, 那个, etc.)
- Fixes spelling and grammar errors
- Adds semantic paragraph breaks and section headings
- Converts spoken fragments into complete sentences
- Preserves all original information (no summarization)
- Auto-detects language and maintains natural expression

## Options

- `--transcript-file` (required) - Path to the transcript file to clean up
- `--output` (optional) - Custom output path (default: `~/tmp/cleaned_transcript.md`)

## Output Behavior

- **Default location**: `~/tmp/cleaned_transcript.md` - Allows overwrite
- **Custom location**: Cannot overwrite existing files (raises error if file exists)
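
The overwrite rule above can be sketched as a small function. This is an illustration under stated assumptions, not the script's exact code: `resolve_output_path` and `DEFAULT_OUTPUT` are hypothetical names, and only the default path itself comes from the text above.

```python
from pathlib import Path
from typing import Optional

# Default location as documented above.
DEFAULT_OUTPUT = Path.home() / "tmp" / "cleaned_transcript.md"


def resolve_output_path(output: Optional[str], default: Path = DEFAULT_OUTPUT) -> Path:
    """Return the path to write to, refusing to overwrite custom locations."""
    if output is None:
        # No --output given: the default location may be overwritten.
        return default
    path = Path(output).expanduser()
    if path != default and path.exists():
        # A custom location that already exists is protected.
        raise FileExistsError(f"Output file already exists at {path}")
    return path
```

Keeping the check separate from the write makes the rule easy to test, although it does not guard against a file being created between the check and the write.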

## Language Support

Auto-detects and works with:
- English
- Chinese (Mandarin, Cantonese)
- Mixed language content
- Multi-speaker transcriptions

## Requirements

- Python 3.11+
- Claude CLI must be installed and accessible
- Transcript file must exist at specified path
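
These requirements can be checked before running the script. The following is a minimal sketch, assuming the Claude CLI is available on `PATH` under the name `claude`; `check_requirements` is a hypothetical helper, not something the skill ships.

```python
import shutil
import sys
from pathlib import Path


def check_requirements(transcript: Path, cli_name: str = "claude") -> list[str]:
    """Return a list of unmet requirements (empty if everything looks fine)."""
    problems = []
    if sys.version_info < (3, 11):
        problems.append("Python 3.11 or newer is required")
    # shutil.which returns None when the executable is not on PATH.
    if shutil.which(cli_name) is None:
        problems.append(f"`{cli_name}` was not found on PATH")
    if not transcript.is_file():
        problems.append(f"transcript file not found: {transcript}")
    return problems


# Report any problems for an example transcript path.
for problem in check_requirements(Path("/path/to/transcript.txt")):
    print(f"requirement not met: {problem}")
```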
@@ -0,0 +1,329 @@
# -*- coding: utf-8 -*-

"""
Audio Transcription Cleanup Script

This module transforms messy voice transcription text into well-formatted, human-readable
markdown documents using Claude CLI. It preserves all original meaning and content while
improving readability through intelligent text cleanup and restructuring.

Key Features:
- Remove verbal artifacts (um, uh, like, filler words in any language)
- Fix spelling and grammar errors
- Add semantic paragraph breaks and section headings
- Support for single-speaker and multi-speaker content
- Multi-language support (English, Chinese, and more)
- Preserve all original information (no summarization)
- Default output location with overwrite capability
- Custom output locations with overwrite protection

Default Behavior:
- Output: ~/tmp/cleaned_transcript.md (allows overwrite)
- Language: Auto-detected from source
- Format: Well-structured markdown with headings

Custom Output:
- Cannot overwrite existing files (raises FileExistsError)
- Creates parent directories automatically

Text Cleanup Operations:
- Remove filler words: um, uh, like, you know, 呃, 啊, 那个, 就是说
- Fix obvious noun errors and typos
- Correct grammar mistakes
- Convert spoken fragments to complete sentences
- Add descriptive section headings
- Organize content into semantic paragraphs

Example Usage:
    # Use default output location (allows overwrite)
    $ python audio_transcription_cleanup.py --transcript-file /path/to/transcript.txt

    # Specify custom output location (cannot overwrite existing file)
    $ python audio_transcription_cleanup.py --transcript-file /path/to/transcript.txt --output ~/Documents/cleaned.md

Requirements:
- Claude CLI must be installed and accessible
- Transcript file must exist at specified path

Author: sanhe
Plugin: youtube@sanhe-claude-code-plugins
"""

import subprocess
import argparse
from pathlib import Path

dir_home = Path.home()
path_cleaned_transcript = dir_home / "tmp" / "cleaned_transcript.md"

prompt = """
## Task
Transform the messy voice transcription text provided below into a well-formatted, human-readable document while preserving ALL original meaning and content.

## Key Principles
- **Preserve original meaning**: Do not summarize or omit information
- **Fix transcription artifacts**: Remove filler words, false starts, and repetitions
- **Improve readability**: Organize into proper paragraphs with clear structure
- **Handle multi-speaker content**: Clearly attribute dialogue when multiple speakers are present
- **Enhance document structure**: Add semantic paragraphing and descriptive headings

## Processing Instructions

### Step 1: Analyze the Input
Examine the transcription to identify:
- **Primary language**: Determine the dominant language of the transcription
- **Number of speakers**: Single vs. multi-speaker content
- **Main topics or themes**: Identify distinct topics for sectioning
- **Transcription quality issues**: Filler words, repetitions, false starts, obvious errors

### Step 2: Text Cleanup Operations

1. **Remove verbal artifacts**:
   - Filler words: "um", "uh", "like", "you know", "呃", "啊", "那个", "就是说", "嗯", "然后"
   - False starts and self-corrections
   - Unnecessary repetitions

2. **Fix errors and improve accuracy**:
   - **Correct obvious noun errors**: Fix misrecognized names, places, technical terms
   - **Fix spelling mistakes**: Correct typos and transcription errors
   - **Fix grammar errors**: Subject-verb agreement, tense consistency, word order
   - **Clarify ambiguous terms**: Use context to determine correct words when transcription is unclear

3. **Organize content structure**:
   - **Semantic paragraphing**: Group related ideas into logical paragraphs based on meaning
   - **Add section headings**: Create descriptive headings that summarize each section's content
   - Convert spoken fragments into complete sentences
   - Create logical flow between sections

4. **For multi-speaker content**:
   - Use clear speaker labels (Speaker A, B, or actual names if identified)
   - Format as dialogue or meeting notes
   - Preserve conversational context

### Step 3: Language and Formatting

**Language Selection Rules** (in priority order):
1. If user explicitly specifies output language → use that language
2. Otherwise → use the primary language of the original transcription
3. Preserve technical terms and proper nouns in their original language/form

**For single-speaker content**:
```markdown
# [Main Topic Title]

## [Section 1 Heading]

[Cleaned paragraph 1 content...]

[Cleaned paragraph 2 content...]

## [Section 2 Heading]

[Cleaned content...]
```

**For multi-speaker content** - Option 1 (Dialogue format):
```markdown
# [Meeting/Discussion Title]

## [Topic 1]

**Speaker A:** [cleaned content]

**Speaker B:** [cleaned content]

## [Topic 2]

**Speaker A:** [cleaned content]
```

**For multi-speaker content** - Option 2 (Meeting notes format):
```markdown
# [Meeting Title]

## [Discussion Topic 1]

Speaker A mentioned that [cleaned content]...

Speaker B responded by explaining [cleaned content]...

## [Discussion Topic 2]

The group discussed [cleaned content]...
```

## Important Constraints
- **NEVER summarize**: Include all information from original transcription
- **NEVER add new substantive information**: Only reorganize and clarify existing content
- **NEVER change the speaker's meaning or intent**
- **DO add section headings**: These are structural additions that help readability
- **Preserve technical terms and specific names** exactly as intended (fix transcription errors only)

## Heading Guidelines
- Section headings should be **concise and descriptive** (3-8 words typically)
- Headings should reflect the **actual content** of each section
- Use appropriate heading levels (H1 for main title, H2 for major sections, H3 for subsections)
- Headings are added to **improve navigation**, not to interpret or editorialize

## Language Support
This skill works with transcriptions in any language:
- **Chinese** (Mandarin, Cantonese, etc.)
- **English**
- **Mixed language** content (preserve code-switching naturally)
- **Other languages** (apply language-appropriate grammar rules)

## OUTPUT INSTRUCTIONS

**CRITICAL**: Your response must contain ONLY the cleaned markdown document.

**DO NOT include**:
- Any explanations before the document
- Any explanations after the document
- Any meta-commentary about the cleaning process
- Any acknowledgments like "Here is the cleaned version"
- Any markdown code fences (no ```markdown blocks)
- Any introductory or concluding remarks

**START your response immediately with the H1 title** (# [Title]) and **END immediately after the last content paragraph**.

Your entire response = the cleaned document itself, nothing more.

## INPUT TRANSCRIPTION:
""".strip()


def cleanup_transcript(path_transcript: Path) -> str:
    """
    Clean up an audio transcription and return the cleaned text.

    Args:
        path_transcript: Path to the raw transcript file

    Returns:
        The cleaned markdown document produced by Claude CLI
    """
    # Read transcript content
    content = path_transcript.read_text(encoding="utf-8")

    # Call Claude CLI to process the transcript
    args = [
        "claude",
        "--append-system-prompt",
        prompt,
        "--print",
        content,
    ]

    result = subprocess.run(
        args,
        capture_output=True,
        text=True,
        # Decode output as UTF-8 regardless of locale (transcripts may contain Chinese)
        encoding="utf-8",
    )

    if result.returncode != 0:
        raise RuntimeError(f"Claude CLI failed: {result.stderr}")

    # Return the cleaned output; the caller is responsible for saving it
    cleaned_transcript = result.stdout
    return cleaned_transcript


def main():
    """
    Main CLI entry point for cleaning up audio transcriptions.

    This script transforms messy voice transcription text into well-formatted,
    human-readable markdown documents while preserving the original meaning.

    Example usage:
        python audio_transcription_cleanup.py --transcript-file "/path/to/transcript.txt"

    What it does:
    - Removes filler words and verbal artifacts (um, uh, like, etc.)
    - Fixes obvious spelling and grammar errors
    - Adds semantic paragraph breaks and section headings
    - Preserves all original information (no summarization)
    - Maintains the speaker's meaning and intent

    Requirements:
    - Transcript file must exist
    - Claude CLI must be installed and accessible
    """
    parser = argparse.ArgumentParser(
        description="Clean up audio transcriptions into well-formatted markdown documents",
        formatter_class=argparse.RawDescriptionHelpFormatter,
        epilog="""
Examples:
  # Use default output location (allows overwrite)
  %(prog)s --transcript-file "~/tmp/transcript.txt"

  # Specify custom output location (cannot overwrite existing file)
  %(prog)s --transcript-file "~/tmp/transcript.txt" --output "~/Documents/cleaned.md"

Cleanup Operations:
  - Remove verbal artifacts: um, uh, like, you know, 呃, 啊, 那个
  - Fix spelling and grammar errors
  - Add semantic paragraphs and section headings
  - Convert spoken fragments into complete sentences
  - Preserve all original information

Output Behavior:
  - Default location (~/tmp/cleaned_transcript.md): Allows overwrite
  - Custom location: Cannot overwrite existing files (will raise error)

Note:
  This script uses Claude CLI to perform intelligent transcript cleanup.
  All original information is preserved - no summarization occurs.
        """,
    )

    parser.add_argument(
        "--transcript-file",
        type=str,
        required=True,
        help="Path to the transcript file to clean up (required)",
    )

    parser.add_argument(
        "--output",
        type=str,
        default=None,
        help=f"Output file path for cleaned transcript (default: {path_cleaned_transcript}). Note: Default location allows overwrite, custom locations cannot overwrite existing files.",
    )

    args = parser.parse_args()

    # Convert to Path objects and expand user home directory
    path_transcript = Path(args.transcript_file).expanduser()

    # Determine output path
    if args.output:
        path_output = Path(args.output).expanduser()
        # For custom locations, check if file already exists
        if (path_output != path_cleaned_transcript) and path_output.exists():
            raise FileExistsError(
                f"Output file already exists at {path_output}. "
                f"Please choose a different location or remove the existing file. "
                f"(Default location {path_cleaned_transcript} allows overwrite)"
            )
    else:
        # Use default location (allows overwrite)
        path_output = path_cleaned_transcript

    # Ensure output directory exists
    path_output.parent.mkdir(parents=True, exist_ok=True)

    # Clean up the transcript
    cleaned_transcript = cleanup_transcript(
        path_transcript=path_transcript,
    )

    # Save to output file
    path_output.write_text(cleaned_transcript, encoding="utf-8")

    # Print success message with clickable file path
    # (as_uri() builds a properly escaped file:// URL from the absolute path)
    absolute_path = path_output.resolve()
    print("✓ Transcript cleanup completed successfully!")
    print(f"✓ Cleaned transcript saved to: {absolute_path.as_uri()}")
    print("\nClick the link above to open the document.")


if __name__ == "__main__":
    main()