Paper2Video: Presentation Video Generation
Overview
Paper2Video transforms academic papers into presentation videos directly from their LaTeX sources. The system processes each paper through multiple specialized modules to produce a professional presentation video complete with slides, narration, cursor guidance, and an optional talking-head presenter.
Core Components
1. Slide Generation Module
- Extracts key content from paper structure
- Creates visually appealing presentation slides
- Organizes content in logical flow
- Includes figures, tables, and equations
- Optimizes text density for readability
2. Subtitle Generation Module
- Generates natural presentation script
- Synchronizes text with slide transitions
- Creates speaker notes and timing
- Supports multiple languages
- Optimizes for speech synthesis
3. Speech Synthesis Module
- Converts subtitles to natural-sounding speech
- Supports multiple voices and accents
- Controls pacing and emphasis
- Generates audio track for video
- Handles technical terminology
4. Cursor Movement Module
- Simulates presenter cursor movements
- Highlights key points on slides
- Guides viewer attention
- Creates natural presentation flow
- Synchronizes with narration
5. Talking-Head Video Generation (Optional)
- Uses Hallo2 for realistic presenter video
- Lip-syncs with generated audio
- Requires reference image or video
- GPU-intensive (NVIDIA A6000 48GB minimum)
- Creates engaging presenter presence
Usage
Basic Video Generation (Without Talking-Head)
python pipeline_light.py \
--model_name_t gpt-4.1 \
--model_name_v gpt-4.1 \
--result_dir /path/to/output \
--paper_latex_root /path/to/paper
Full Video Generation (With Talking-Head)
python pipeline_all.py \
--input-dir "path/to/papers" \
--output-dir "path/to/output" \
--model-choice 1 \
--enable-talking-head
Parameters
Model Configuration:
--model_name_t: Model for text/subtitle generation (default: gpt-4.1)
--model_name_v: Model for visual/slide generation (default: gpt-4.1)
--model-choice: Preset model configuration (1=GPT-4, 2=GPT-4.1)
Input/Output:
--paper_latex_root: Root directory of the LaTeX paper source
--result_dir or --output-dir: Output directory for generated videos
--input-dir: Directory containing multiple papers to process
Video Options:
--enable-talking-head: Enable talking-head video generation (requires GPU)
--video-duration: Target video duration in seconds (default: auto-calculated)
--slides-per-minute: Control presentation pacing (default: 2-3)
--voice: Voice selection for speech synthesis
Quality Settings:
--video-resolution: Output resolution (default: 1920x1080)
--video-fps: Frame rate (default: 30)
--audio-quality: Audio bitrate (default: 192kbps)
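These options can be combined in a single invocation. The example below is illustrative only: the exact set of flags each script accepts may differ between pipeline_light.py and pipeline_all.py, and the voice identifier is a placeholder, so confirm with --help before running.
python pipeline_all.py \
--input-dir "path/to/papers" \
--output-dir "path/to/output" \
--model-choice 2 \
--voice en-US-1 \
--video-resolution 1920x1080 \
--video-fps 30 \
--enable-talking-head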
Input Requirements
LaTeX Source Structure
paper_directory/
├── main.tex # Main paper file
├── sections/ # Section files (if split)
│ ├── introduction.tex
│ ├── methods.tex
│ └── results.tex
├── figures/ # Figure files
│ ├── fig1.pdf
│ ├── fig2.png
│ └── ...
├── tables/ # Table files
└── bibliography.bib # References
Required Elements
- Valid LaTeX source that compiles
- Proper section structure (abstract, introduction, methods, results, conclusion)
- High-quality figures (vector formats preferred)
- Complete bibliography
Optional Elements
- Author photos for talking-head generation
- Custom slide templates
- Background music or sound effects
- Institution branding assets
Output Structure
output/paper_name/video/
├── final_video.mp4 # Complete presentation video
├── slides/ # Generated slide images
│ ├── slide_001.png
│ ├── slide_002.png
│ └── ...
├── audio/ # Audio components
│ ├── narration.mp3 # Speech synthesis output
│ └── background.mp3 # Optional background audio
├── subtitles/ # Subtitle files
│ ├── subtitles.srt # Standard subtitle format
│ └── subtitles.vtt # WebVTT format
├── script/ # Presentation script
│ ├── full_script.txt # Complete narration text
│ └── slide_notes.json # Slide-by-slide notes
└── metadata/ # Video metadata
├── timings.json # Slide timing information
└── video_info.json # Video properties
Video Generation Process
Phase 1: Content Analysis
- Parse LaTeX source structure
- Extract key concepts and findings
- Identify important figures and equations
- Determine logical presentation flow
Phase 2: Slide Creation
- Design slide layouts based on content
- Allocate content across appropriate number of slides
- Incorporate figures and visual elements
- Apply consistent styling and branding
Phase 3: Script Generation
- Write natural presentation narration
- Time script sections to slides
- Add transitions and emphasis
- Optimize for speech synthesis
Phase 4: Audio Production
- Generate speech from script
- Add emphasis and pacing
- Include pauses for slide transitions
- Mix with optional background audio
Phase 5: Video Assembly
- Combine slides with timing information
- Synchronize audio track
- Add cursor movements and highlights
- Generate talking-head video (if enabled)
- Render final video file
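As a rough illustration of Phase 5, the sketch below stitches the generated slide images and narration track into a single video with ffmpeg (assumed to be installed, with glob pattern support). It uses a fixed 20 seconds per slide for simplicity; the actual pipeline drives per-slide durations from the timing metadata and additionally layers in cursor movements and optional talking-head footage.
import subprocess

# Minimal assembly sketch: slides at a fixed 20 s each, narration as the
# audio track, H.264 video at 30 fps with AAC audio.
subprocess.run([
    "ffmpeg",
    "-framerate", "1/20",                       # one slide every 20 seconds
    "-pattern_type", "glob", "-i", "slides/slide_*.png",
    "-i", "audio/narration.mp3",
    "-c:v", "libx264", "-r", "30",              # re-encode at 30 fps output
    "-pix_fmt", "yuv420p",
    "-c:a", "aac", "-b:a", "192k",
    "-shortest",                                # stop at the shorter stream
    "final_video.mp4",
], check=True)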
Customization Options
Presentation Style
- Academic: Formal, detailed, comprehensive
- Conference: Focused on key findings, faster pace
- Public: Simplified language, engaging storytelling
- Tutorial: Step-by-step explanation, educational focus
Voice Configuration
Available voice options (via speech synthesis):
- Multiple languages and accents
- Male/female voice selection
- Speaking rate adjustment
- Pitch and tone customization
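The speech-synthesis backend is not named here, so the sketch below uses the generic pyttsx3 library purely to illustrate the kinds of controls listed above (voice selection, speaking rate); the actual pipeline exposes these through --voice and related flags.
import pyttsx3

# Generic speech-synthesis sketch (not necessarily the backend Paper2Video
# uses) showing rate and voice selection.
engine = pyttsx3.init()
engine.setProperty("rate", 160)                  # words per minute
voices = engine.getProperty("voices")
engine.setProperty("voice", voices[0].id)        # pick an installed voice
engine.save_to_file("We introduce our method and its main results.",
                    "narration_sample.wav")
engine.runAndWait()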
Visual Themes
- Institution branding colors
- Conference template matching
- Custom backgrounds and fonts
- Dark mode presentations
Quality Assessment
Content Quality Metrics
- Completeness: Coverage of paper content
- Clarity: Explanation quality and coherence
- Flow: Logical progression of ideas
- Engagement: Visual appeal and pacing
Technical Quality Metrics
- Audio quality: Speech clarity and naturalness
- Video quality: Resolution and encoding
- Synchronization: Audio-visual alignment
- Timing: Appropriate slide duration
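One simple, scriptable proxy for the synchronization metric above is to compare the narration length against the total slide time recorded by the pipeline. A minimal sketch, assuming ffprobe is installed and that timings.json stores per-slide "start"/"end" fields (an assumption; check the actual schema in your output):
import json
import subprocess

def audio_duration(path):
    # Query the container duration in seconds via ffprobe.
    out = subprocess.run(
        ["ffprobe", "-v", "error", "-show_entries", "format=duration",
         "-of", "default=noprint_wrappers=1:nokey=1", path],
        capture_output=True, text=True, check=True)
    return float(out.stdout.strip())

with open("output/paper_name/video/metadata/timings.json") as f:
    timings = json.load(f)

# Field names "start"/"end" are assumed for illustration.
slide_total = sum(t["end"] - t["start"] for t in timings)
narration = audio_duration("output/paper_name/video/audio/narration.mp3")
print(f"slides: {slide_total:.1f}s, narration: {narration:.1f}s, "
      f"drift: {abs(slide_total - narration):.1f}s")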
Advanced Features
Multi-Language Support
- Generate presentations in multiple languages
- Automatic translation of script
- Language-appropriate voice selection
- Cultural adaptation of presentation style
Talking-Head Generation with Hallo2
Requires:
- NVIDIA A6000 GPU (48GB minimum)
- Reference image or short video of presenter
- Additional processing time (2-3x longer)
Benefits:
- More engaging presentation
- Professional presenter appearance
- Natural gestures and expressions
- Lip-sync accuracy
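Because talking-head generation fails quickly when VRAM is insufficient, a short preflight check is cheap insurance. A minimal sketch, assuming PyTorch with CUDA support is installed:
import torch

# Preflight check before enabling --enable-talking-head: the requirements
# above call for roughly 48 GB of GPU memory.
REQUIRED_GB = 48

if not torch.cuda.is_available():
    raise SystemExit("No CUDA GPU detected; run without talking-head generation.")

total_gb = torch.cuda.get_device_properties(0).total_memory / 1024**3
if total_gb < REQUIRED_GB:
    raise SystemExit(f"GPU has {total_gb:.0f} GB; {REQUIRED_GB} GB is recommended "
                     "for talking-head generation.")
print(f"GPU OK: {torch.cuda.get_device_name(0)} with {total_gb:.0f} GB")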
Interactive Elements
- Embedded clickable links
- Navigation menu
- Chapter markers
- Supplementary material links
Best Practices
Input Preparation
- Clean LaTeX source: Remove unnecessary comments and artifacts
- High-quality figures: Use vector formats when possible
- Clear structure: Well-organized sections and subsections
- Complete content: Include all necessary files and references
Model Selection
- Text generation (model_name_t): GPT-4.1 for best script quality
- Visual generation (model_name_v): GPT-4.1 for optimal slide design
- For faster processing with acceptable quality: GPT-3.5-turbo
Video Optimization
- Target duration: 10-15 minutes for conference talks, 30-45 minutes for detailed presentations
- Pacing: 2-3 slides per minute for technical content
- Resolution: 1920x1080 for standard, 3840x2160 for high-quality
- Audio: 192kbps minimum for clear speech
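The pacing guidelines above imply a quick sanity check on slide count. For example, a 12-minute conference talk at 2.5 slides per minute comes to about 30 slides of roughly 24 seconds each:
# Back-of-the-envelope pacing check based on the guidelines above.
target_minutes = 12          # conference talk in the 10-15 minute range
slides_per_minute = 2.5      # middle of the recommended 2-3 range

slide_count = round(target_minutes * slides_per_minute)
seconds_per_slide = (target_minutes * 60) / slide_count
print(f"{slide_count} slides, ~{seconds_per_slide:.0f}s each")
# -> 30 slides, ~24s each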
Quality Review
Before finalizing:
- Watch entire video for content accuracy
- Check audio synchronization with slides
- Verify figure quality and readability
- Test subtitle accuracy and timing
- Review cursor movements for natural flow
Performance Considerations
Processing Time
- Without talking-head: 10-30 minutes per paper (depending on length)
- With talking-head: 30-120 minutes per paper
- Factors: Paper length, figure count, model speed, GPU availability
Resource Requirements
- CPU: Multi-core recommended for parallel processing
- RAM: 16GB minimum, 32GB for large papers
- GPU: Optional for standard, required for talking-head (A6000 48GB)
- Storage: 1-5GB per video depending on length and quality
Troubleshooting
Common Issues
1. LaTeX parsing errors
- Ensure LaTeX source compiles successfully
- Check for special packages or custom commands
- Verify all referenced files are present
2. Speech synthesis problems
- Check audio quality settings
- Verify text is properly formatted
- Test with different voice options
3. Video rendering failures
- Check available disk space
- Verify all dependencies are installed
- Review error logs for specific issues
4. Talking-head generation errors
- Confirm GPU memory (48GB required)
- Check CUDA drivers are up to date
- Verify reference image quality and format
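Several of these failures can be caught before a long run. A minimal preflight sketch, assuming pdflatex is on the PATH, the paper and output directories already exist, and paths follow the structure shown earlier:
import shutil
import subprocess
from pathlib import Path

paper = Path("/path/to/paper")           # directory containing main.tex
output = Path("/path/to/output")

# Issue 1: confirm the LaTeX source compiles before starting the pipeline.
result = subprocess.run(
    ["pdflatex", "-interaction=nonstopmode", "-halt-on-error", "main.tex"],
    cwd=paper, capture_output=True, text=True)
if result.returncode != 0:
    print("LaTeX compilation failed:")
    print(result.stdout[-2000:])         # tail of the pdflatex log

# Issue 3: confirm there is enough disk space for rendering (1-5 GB per video).
free_gb = shutil.disk_usage(output).free / 1024**3
if free_gb < 5:
    print(f"Only {free_gb:.1f} GB free in the output directory; rendering may fail.")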
Integration with Other Components
Combine Paper2Video with:
- Paper2Web: Embed video in generated website
- Paper2Poster: Use matching visual style
- AutoPR: Create promotional clips from full video