# -*- coding: utf-8 -*- """ Audio Transcription Cleanup Script This module transforms messy voice transcription text into well-formatted, human-readable markdown documents using Claude CLI. It preserves all original meaning and content while improving readability through intelligent text cleanup and restructuring. Key Features: - Remove verbal artifacts (um, uh, like, filler words in any language) - Fix spelling and grammar errors - Add semantic paragraph breaks and section headings - Support for single-speaker and multi-speaker content - Multi-language support (English, Chinese, and more) - Preserve all original information (no summarization) - Default output location with overwrite capability - Custom output locations with overwrite protection Default Behavior: - Output: ~/tmp/cleaned_transcript.md (allows overwrite) - Language: Auto-detected from source - Format: Well-structured markdown with headings Custom Output: - Cannot overwrite existing files (raises FileExistsError) - Creates parent directories automatically Text Cleanup Operations: - Remove filler words: um, uh, like, you know, 呃, 啊, 那个, 就是说 - Fix obvious noun errors and typos - Correct grammar mistakes - Convert spoken fragments to complete sentences - Add descriptive section headings - Organize content into semantic paragraphs Example Usage: # Use default output location (allows overwrite) $ python audio_transcription_cleanup.py --transcript-file /path/to/transcript.txt # Specify custom output location (cannot overwrite existing file) $ python audio_transcription_cleanup.py --transcript-file /path/to/transcript.txt --output ~/Documents/cleaned.md Requirements: - Claude CLI must be installed and accessible - Transcript file must exist at specified path Author: sanhe Plugin: youtube@sanhe-claude-code-plugins """ import subprocess import argparse from pathlib import Path dir_home = Path.home() path_cleaned_transcript = dir_home / "tmp" / "cleaned_transcript.md" prompt = """ ## Task Transform the messy voice transcription text provided below into a well-formatted, human-readable document while preserving ALL original meaning and content. ## Key Principles - **Preserve original meaning**: Do not summarize or omit information - **Fix transcription artifacts**: Remove filler words, false starts, and repetitions - **Improve readability**: Organize into proper paragraphs with clear structure - **Handle multi-speaker content**: Clearly attribute dialogue when multiple speakers are present - **Enhance document structure**: Add semantic paragraphing and descriptive headings ## Processing Instructions ### Step 1: Analyze the Input Examine the transcription to identify: - **Primary language**: Determine the dominant language of the transcription - **Number of speakers**: Single vs. multi-speaker content - **Main topics or themes**: Identify distinct topics for sectioning - **Transcription quality issues**: Filler words, repetitions, false starts, obvious errors ### Step 2: Text Cleanup Operations 1. **Remove verbal artifacts**: - Filler words: "um", "uh", "like", "you know", "呃", "啊", "那个", "就是说", "嗯", "然后" - False starts and self-corrections - Unnecessary repetitions 2. **Fix errors and improve accuracy**: - **Correct obvious noun errors**: Fix misrecognized names, places, technical terms - **Fix spelling mistakes**: Correct typos and transcription errors - **Fix grammar errors**: Subject-verb agreement, tense consistency, word order - **Clarify ambiguous terms**: Use context to determine correct words when transcription is unclear 3. **Organize content structure**: - **Semantic paragraphing**: Group related ideas into logical paragraphs based on meaning - **Add section headings**: Create descriptive headings that summarize each section's content - Convert spoken fragments into complete sentences - Create logical flow between sections 4. **For multi-speaker content**: - Use clear speaker labels (Speaker A, B, or actual names if identified) - Format as dialogue or meeting notes - Preserve conversational context ### Step 3: Language and Formatting **Language Selection Rules** (in priority order): 1. If user explicitly specifies output language → use that language 2. Otherwise → use the primary language of the original transcription 3. Preserve technical terms and proper nouns in their original language/form **For single-speaker content**: ```markdown # [Main Topic Title] ## [Section 1 Heading] [Cleaned paragraph 1 content...] [Cleaned paragraph 2 content...] ## [Section 2 Heading] [Cleaned content...] ``` **For multi-speaker content** - Option 1 (Dialogue format): ```markdown # [Meeting/Discussion Title] ## [Topic 1] **Speaker A:** [cleaned content] **Speaker B:** [cleaned content] ## [Topic 2] **Speaker A:** [cleaned content] ``` **For multi-speaker content** - Option 2 (Meeting notes format): ```markdown # [Meeting Title] ## [Discussion Topic 1] Speaker A mentioned that [cleaned content]... Speaker B responded by explaining [cleaned content]... ## [Discussion Topic 2] The group discussed [cleaned content]... ``` ## Important Constraints - **NEVER summarize**: Include all information from original transcription - **NEVER add new substantive information**: Only reorganize and clarify existing content - **NEVER change the speaker's meaning or intent** - **DO add section headings**: These are structural additions that help readability - **Preserve technical terms and specific names** exactly as intended (fix transcription errors only) ## Heading Guidelines - Section headings should be **concise and descriptive** (3-8 words typically) - Headings should reflect the **actual content** of each section - Use appropriate heading levels (H1 for main title, H2 for major sections, H3 for subsections) - Headings are added to **improve navigation**, not to interpret or editorialize ## Language Support This skill works with transcriptions in any language: - **Chinese** (Mandarin, Cantonese, etc.) - **English** - **Mixed language** content (preserve code-switching naturally) - **Other languages** (apply language-appropriate grammar rules) ## OUTPUT INSTRUCTIONS **CRITICAL**: Your response must contain ONLY the cleaned markdown document. **DO NOT include**: - Any explanations before the document - Any explanations after the document - Any meta-commentary about the cleaning process - Any acknowledgments like "Here is the cleaned version" - Any markdown code fences (no ```markdown blocks) - Any introductory or concluding remarks **START your response immediately with the H1 title** (# [Title]) and **END immediately after the last content paragraph**. Your entire response = the cleaned document itself, nothing more. ## INPUT TRANSCRIPTION: """.strip() def cleanup_transcript(path_transcript: Path): """ Clean up audio transcription and save to file. Args: path_transcript: Path to the raw transcript file """ # Read transcript content content = path_transcript.read_text(encoding="utf-8") # Call Claude CLI to process the transcript args = [ "claude", "--append-system-prompt", prompt, "--print", content, ] result = subprocess.run( args, capture_output=True, text=True, ) if result.returncode != 0: raise RuntimeError(f"Claude CLI failed: {result.stderr}") # Save the cleaned output cleaned_transcript = result.stdout return cleaned_transcript def main(): """ Main CLI entry point for cleaning up audio transcriptions. This script transforms messy voice transcription text into well-formatted, human-readable markdown documents while preserving the original meaning. Example usage: python audio_transcription_cleanup.py --transcript-file "/path/to/transcript.txt" What it does: - Removes filler words and verbal artifacts (um, uh, like, etc.) - Fixes obvious spelling and grammar errors - Adds semantic paragraph breaks and section headings - Preserves all original information (no summarization) - Maintains the speaker's meaning and intent Requirements: - Transcript file must exist - Claude CLI must be installed and accessible """ parser = argparse.ArgumentParser( description="Clean up audio transcriptions into well-formatted markdown documents", formatter_class=argparse.RawDescriptionHelpFormatter, epilog=""" Examples: # Use default output location (allows overwrite) %(prog)s --transcript-file "~/tmp/transcript.txt" # Specify custom output location (cannot overwrite existing file) %(prog)s --transcript-file "~/tmp/transcript.txt" --output "~/Documents/cleaned.md" Cleanup Operations: - Remove verbal artifacts: um, uh, like, you know, 呃, 啊, 那个 - Fix spelling and grammar errors - Add semantic paragraphs and section headings - Convert spoken fragments into complete sentences - Preserve all original information Output Behavior: - Default location (~/tmp/cleaned_transcript.md): Allows overwrite - Custom location: Cannot overwrite existing files (will raise error) Note: This script uses Claude CLI to perform intelligent transcript cleanup. All original information is preserved - no summarization occurs. """, ) parser.add_argument( "--transcript-file", type=str, required=True, help=f"Path to the transcript file to clean up (required)", ) parser.add_argument( "--output", type=str, default=None, help=f"Output file path for cleaned transcript (default: {path_cleaned_transcript}). Note: Default location allows overwrite, custom locations cannot overwrite existing files.", ) args = parser.parse_args() # Convert to Path objects and expand user home directory path_transcript = Path(args.transcript_file).expanduser() # Determine output path if args.output: path_output = Path(args.output).expanduser() # For custom locations, check if file already exists if (path_output != path_cleaned_transcript) and path_output.exists(): raise FileExistsError( f"Output file already exists at {path_output}. " f"Please choose a different location or remove the existing file. " f"(Default location {path_cleaned_transcript} allows overwrite)" ) else: # Use default location (allows overwrite) path_output = path_cleaned_transcript # Ensure output directory exists path_output.parent.mkdir(parents=True, exist_ok=True) # Clean up the transcript cleaned_transcript = cleanup_transcript( path_transcript=path_transcript, ) # Save to output file path_output.write_text(cleaned_transcript, encoding="utf-8") # Print success message with clickable file path absolute_path = path_output.resolve() print(f"✓ Transcript cleanup completed successfully!") print(f"✓ Cleaned transcript saved to: file://{absolute_path}") print(f"\nClick the link above to open the document.") if __name__ == "__main__": main()