LLM Documentation Pitfalls: Problems and Countermeasures
This reference documents systematic quality issues in LLM-generated technical documentation, based on 2024-2025 research across academic papers, industry studies, and developer communities.
Overview
Research shows that even advanced models like GPT-4 and Claude-3 produce correct code only 65.2% and 56.7% of the time, respectively. Over 40% of AI-generated code contains security vulnerabilities. Beyond code quality, LLM documentation suffers from stylistic tells, factual hallucinations, missing context, and formulaic patterns that reduce trust and usability.
Key insight: These issues are systematic, not random. They stem from fundamental LLM characteristics: probabilistic generation, training data limitations, and lack of verification mechanisms.
CATEGORY 1: Accuracy and Factual Issues
1.1 Hallucinations
Problem: LLMs confidently generate false information—non-existent APIs, fabricated parameters, invented research papers.
Why it occurs:
- Predict plausible text sequences, not verified facts
- 19.7% package hallucination rate (205,474+ unique fake packages)
- 50%+ failures from non-existent API invocations
- Training data contamination with outdated/incorrect code
Examples:
# HALLUCINATED - pandas.DataFrame.merge_smart() doesn't exist
df1.merge_smart(df2, on='id')
# HALLUCINATED - requests.get_json() isn't a method
data = requests.get_json('https://api.example.com')
Detection strategies:
- Semantic entropy analysis (most robust method)
- Multi-sampling consistency checks
- Automated validation against package registries
- API stub verification
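As an illustration of the multi-sampling consistency check above, the sketch below measures how often independent samples of the same prompt agree on the most common answer. The sample strings are hypothetical; low agreement is a signal to review, not proof of a hallucination.
# Minimal sketch: agreement across repeated samples of the same prompt.
from collections import Counter

def consistency(samples):
    """Fraction of samples that agree with the most common answer."""
    if not samples:
        return 0.0
    most_common_count = Counter(samples).most_common(1)[0][1]
    return most_common_count / len(samples)

# Hypothetical answers from five samples of the same prompt:
samples = ["df.merge(df2, on='id')"] * 3 + ["df.merge_smart(df2, on='id')"] * 2
print(consistency(samples))  # 0.6 -> low agreement, flag for review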
Countermeasures:
- Use RAG (Retrieval-Augmented Generation) with current docs
- Verify all APIs against official documentation
- Test all code examples before publication
- Flag low-frequency APIs for extra review
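The registry-validation countermeasure above can be a short script. This minimal sketch asks PyPI's public JSON API whether each candidate package name exists; the candidate names are hypothetical, and it assumes the import name matches the package name (real projects need an alias map, e.g. cv2 maps to opencv-python).
# Minimal sketch: flag package names that do not resolve on PyPI.
import urllib.request
import urllib.error

def package_exists_on_pypi(name):
    """Return True if PyPI's JSON API knows this package name."""
    url = f"https://pypi.org/pypi/{name}/json"
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            return resp.status == 200
    except urllib.error.HTTPError:
        return False  # 404 usually means a hallucinated or misspelled package

for candidate in ["pandas", "requests", "pandas-smart-merge"]:
    print(candidate, package_exists_on_pypi(candidate))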
1.2 Outdated and Version-Specific Information
Problem: Training data cutoffs cause documentation to reference deprecated methods and obsolete patterns.
Why it occurs:
- Static training data (e.g., GPT-4o trained on DuckDB 0.10.2 while current is 1.3.2)
- Frequency bias: deprecated high-frequency APIs get generated preferentially
- 70-90% deprecation usage rate when context is outdated
- "3-month-old documentation can be completely outdated"
Examples:
# Deprecated in Python 3.2, removed in 3.9
for child in root.getchildren():  # LLMs still suggest this
    process(child)

# Modern approach:
for child in root:
    process(child)
Countermeasures:
- Always specify exact versions in documentation
- Use RAG to fetch current API documentation
- Run linters that flag deprecated usage
- Include version compatibility matrices
- Regular documentation audits for version drift
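One lightweight way to apply the linter countermeasure above, sketched here under the assumption that documentation examples are available as Python callables: promote DeprecationWarning to an error while running each example, so deprecated calls fail the docs build instead of passing silently.
# Minimal sketch: fail documentation examples that touch deprecated APIs.
import warnings

def run_example_strictly(example):
    """Run an example with DeprecationWarning promoted to an error."""
    with warnings.catch_warnings():
        warnings.simplefilter("error", DeprecationWarning)
        example()

def example():
    import xml.etree.ElementTree as ET
    root = ET.fromstring("<root><child/></root>")
    for child in root:  # root.getchildren() would fail this check (and is gone in 3.9+)
        print(child.tag)

run_example_strictly(example)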
1.3 Mixed Accurate/Inaccurate Information
Problem: Most dangerous pattern—correct elements mask errors, making detection difficult even for experts.
Why it occurs:
- LLMs blend information from multiple sources
- High-confidence patterns from different versions get mixed
- Temporal conflation (can't distinguish "2020 code" vs "2024 code")
Examples:
# Mixing Python 2 and Python 3
from __future__ import print_function # Python 2 compatibility
response = urllib2.urlopen(url) # Python 2 (removed in Python 3)
data = response.read().decode('utf-8') # Python 3 style
Countermeasures:
- Test entire code examples, not just parts
- Cross-reference multiple authoritative sources
- Use version-pinned dependency files in context
- Automated compatibility checking
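Version pinning can be enforced mechanically. As a sketch, the check below compares installed versions (via importlib.metadata) against the pins supplied to the model, so examples are generated and tested against the same versions; the pinned versions shown are placeholders.
# Minimal sketch: detect drift between pinned and installed versions.
from importlib.metadata import version, PackageNotFoundError

pins = {"pandas": "2.2.3", "requests": "2.32.3"}  # placeholder pins

for name, pinned in pins.items():
    try:
        installed = version(name)
    except PackageNotFoundError:
        print(f"{name}: not installed")
        continue
    status = "OK" if installed == pinned else f"MISMATCH (installed {installed})"
    print(f"{name}=={pinned}: {status}")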
CATEGORY 2: Code Quality Issues
2.1 Code That Doesn't Work
Problem: 35-70% failure rate depending on the model; generated code contains syntax errors, logic flaws, and off-by-one errors.
Why it occurs:
- Token-by-token prediction doesn't ensure correctness
- Training objective optimizes for pattern matching, not accuracy
- No execution verification during generation
Examples:
# Logic error: Missing duplicate detection
def find_unique(items):
    result = []
    for item in items:
        result.append(item)  # Should check if already in result
    return result

# Math error: Wrong operation
average = sum(values) * len(values)  # Should divide, not multiply

# Off-by-one error: the middle element is arr[len(arr) // 2]
middle = arr[len(arr) // 2 + 1]
Countermeasures:
- Execute all examples in sandboxed environments
- Automated testing with comprehensive test suites
- Static analysis (linters, type checkers)
- Iterative refinement (LlmFix-style approaches show 9.5% improvement)
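A sandboxed-execution harness can be as small as the sketch below: write each snippet to a temporary file, run it in a subprocess with a timeout, and fail the docs build on a non-zero exit. A real pipeline would extract the snippets from the docs and run them in an isolated environment (container or fresh virtualenv); this sketch assumes the snippets are already available as strings.
# Minimal sketch: run a documentation snippet and report whether it succeeded.
import subprocess
import sys
import tempfile

def run_snippet(code, timeout=30):
    """Return True if the snippet runs to completion without error."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    result = subprocess.run([sys.executable, path], capture_output=True,
                            text=True, timeout=timeout)
    if result.returncode != 0:
        print(result.stderr.strip())
    return result.returncode == 0

print(run_snippet("values = [1, 2, 3]\nprint(sum(values) / len(values))"))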
2.2 Incomplete Code Examples
Problem: Missing imports, setup steps, configuration, and contextual information needed for execution.
Why it occurs:
- Context window limitations
- Training data fragmentation (snippets lack full context)
- Implicit knowledge assumption
Examples:
# Generated code
df = pd.read_csv('data.csv')
# Missing:
# import pandas as pd
# pip install pandas
Countermeasures:
- Prompt for "complete, runnable code with all imports"
- Dependency checking against registries (npm, PyPI, etc.)
- Template-based generation with required structure
- "Copy-paste ready" standard
2.3 Security Vulnerabilities
Problem: 40%+ of AI-generated code contains security flaws. Java shows 70%+ security failure rate.
Why it occurs:
- Training on vulnerable code from open-source datasets
- Security not prioritized unless explicitly prompted
- Simpler insecure patterns appear more frequently
Examples:
# SQL Injection vulnerability
def get_user(username):
    query = f"SELECT * FROM users WHERE username = '{username}'"
    return db.execute(query)
# Attack: username = "admin' OR '1'='1"

# Secure version:
def get_user(username):
    query = "SELECT * FROM users WHERE username = ?"
    return db.execute(query, (username,))

# Unsafe YAML loading
config = yaml.load(f)  # Allows arbitrary code execution
# Secure:
config = yaml.safe_load(f)
Countermeasures:
- Security-focused prompts ("secure code with input validation")
- Static security analysis (SAST tools: Snyk, Semgrep, CodeQL)
- Dependency vulnerability scanning
- Security review for critical code paths
- Default to secure patterns in templates
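Static security analysis can run in the same docs pipeline. As a sketch, and assuming Bandit (a Python SAST tool) is installed, the wrapper below scans the extracted example files and gates on the result; equivalent wrappers apply to Semgrep or CodeQL.
# Minimal sketch: run Bandit over the extracted example files and gate on the result.
# Assumes Bandit is installed (pip install bandit); it exits non-zero when issues are found.
import subprocess

def security_scan(path):
    result = subprocess.run(["bandit", "-r", path], capture_output=True, text=True)
    if result.returncode != 0:
        print(result.stdout)
    return result.returncode == 0

print(security_scan("docs/examples/"))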
2.4 Missing Error Handling and Edge Cases
Problem: Implements "happy path" only. Systematically overlooks null values, boundary conditions, empty collections.
Why it occurs:
- Training data bias toward success cases
- Error handling adds verbosity
- Edge cases underrepresented in training data
Examples:
# Crashes on empty input
def get_average(numbers):
    return sum(numbers) / len(numbers)  # ZeroDivisionError when numbers = []

# Missing None checks
def get_user_name(user):
    return user.profile.name.upper()  # AttributeError if any part is None

# No boundary checking
def get_percentage(value, total):
    return (value / total) * 100  # Crashes if total = 0
Countermeasures:
- Explicit prompts: "Handle null values, empty arrays, and boundary conditions"
- Test-driven approach with edge case test suites
- Defensive programming templates
- Automated edge case testing (null, empty, max values, Unicode)
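As a sketch of the defensive template the list above calls for, here are the happy-path helpers from the examples rewritten with empty input and zero denominators handled explicitly. Returning None is one policy; raising a descriptive error is another.
# Minimal sketch: the happy-path helpers above, rewritten defensively.
from typing import Optional, Sequence

def get_average(numbers: Sequence[float]) -> Optional[float]:
    """Return the mean, or None for an empty sequence instead of raising."""
    if not numbers:
        return None
    return sum(numbers) / len(numbers)

def get_percentage(value: float, total: float) -> Optional[float]:
    """Return value as a percentage of total, or None when total is zero."""
    if total == 0:
        return None
    return (value / total) * 100

print(get_average([]))       # None, not ZeroDivisionError
print(get_percentage(5, 0))  # None, not ZeroDivisionError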
2.5 Inconsistent Code Style
Problem: 24 types of style inconsistencies across 5 dimensions, affecting maintainability despite functional correctness.
Why it occurs:
- Training data diversity (varied conventions)
- Context-dependent generation
- Systematic biases (avoiding blank lines, avoiding comprehensions)
Examples:
# Inconsistent spacing
def process1(items):
    result=[]  # No spaces
def process2(items):
    result = []  # With spaces

// Inconsistent naming (JavaScript)
const userId = getCurrentUser();          // camelCase
const user_data = fetchUserData(userId);  // snake_case
const UserSettings = loadSettings();      // PascalCase
Countermeasures:
- Specify style guides in prompts ("Follow PEP 8")
- Automated formatters (Black, Prettier, gofmt)
- Linter enforcement
- Few-shot examples with consistent style
CATEGORY 3: Writing Style Problems
3.1 Verbal Tics and Repetitive Patterns
Problem: Overuse of specific phrases makes AI authorship instantly recognizable.
Common tells:
- "It's worth noting that"
- "Keep in mind"
- "Delve into"
- "In the realm of"
- "Tapestry"
- "Landscape"
- "Leverage"
- "Robust"
- "Pivotal"
Why it occurs: Statistical attractors—phrases that co-occur frequently in training data.
Examples:
<!-- LLM-generated -->
It's worth noting that we should delve into the pivotal role of machine learning
in the realm of automation. That being said, it's important to note that this
cutting-edge technology leverages sophisticated algorithms.
<!-- Better -->
Machine learning plays a crucial role in automation. This advanced technology
uses sophisticated algorithms.
Countermeasures:
- Scan for telltale phrases and replace with plain language
- Use varied vocabulary
- Avoid formulaic transitions
- Edit for conciseness
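The phrase scan in the first countermeasure is easy to automate. The sketch below counts occurrences of a small, illustrative phrase list in a draft; the list is not exhaustive and should be tuned to your style guide.
# Minimal sketch: count telltale phrases in a draft before editing.
import re

TELLTALE_PHRASES = [
    "it's worth noting that", "delve into", "in the realm of",
    "it is important to note", "leverage", "tapestry", "pivotal",
]

def find_telltales(text):
    """Return (phrase, count) pairs for phrases that appear in the text."""
    lowered = text.lower()
    hits = []
    for phrase in TELLTALE_PHRASES:
        count = len(re.findall(re.escape(phrase), lowered))
        if count:
            hits.append((phrase, count))
    return hits

draft = "It's worth noting that we should delve into the pivotal role of ML."
print(find_telltales(draft))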
3.2 Over-Explanation and Verbosity
Problem: Padding with unnecessary words, redundant explanations, obvious statements.
Why it occurs:
- RLHF reward models favor longer outputs
- "Verbosity compensation" when uncertain
- No concrete experience to draw on, so filler takes its place
Examples:
<!-- Verbose -->
In today's fast-paced business environment, it's important to note that companies
need to engage in the implementation of optimization strategies in order to
streamline their operations in a way that improves efficiency.
<!-- Concise -->
Companies should implement optimization strategies to streamline operations and
improve efficiency.
Countermeasures:
- Cut redundant exposition
- Remove generic opening padding
- Use concise constructions ("to" not "in order to")
- Delete "It goes without saying"
3.3 Hedging and Uncertainty Language Overuse
Problem: Excessive qualifying statements make content feel tentative and non-committal.
Common patterns:
- "Generally speaking"
- "Typically"
- "Could potentially"
- "To some extent"
- "It could be argued that"
Why it occurs:
- Risk avoidance (safe language)
- Training on academic content
- Probabilistic nature creates uncertainty
Examples:
<!-- Over-hedged -->
It could be argued that machine learning tends to generally improve efficiency
in most automation scenarios, and typically provides somewhat better results.
<!-- Direct -->
Machine learning improves efficiency in automation and provides better results.
Countermeasures:
- Make direct claims where appropriate
- Remove unnecessary qualifiers
- Use confident language for established facts
3.4 Generic, Vague Descriptions
Problem: Replaces specific facts with generic descriptions that could apply to anything, a "regression to the mean."
Common patterns:
- "Significant impact"
- "Crucial role"
- "Comprehensive approach"
- "Game-changing"
- "Innovative solutions"
Why it occurs: Generic statements are statistically common; specific facts are rare.
Examples:
<!-- Vague -->
The comprehensive AI implementation provided significant improvements across
various aspects of the business, delivering transformative results.
<!-- Specific -->
The AI implementation increased sales by 23%, reduced processing time from
4 hours to 30 minutes, and identified $2M in cost savings.
Countermeasures:
- Use concrete numbers, dates, names
- Provide specific examples
- Cite actual sources
- Replace "research shows" with named studies
3.5 Lack of Authentic Voice
Problem: Missing emotional depth, personal perspective, humor, and memorable communication.
Why it occurs:
- No lived experience
- Training for neutrality
- Pattern matching without unique perspective
Missing elements:
- Personal anecdotes
- Humor or wit
- Sensory details
- Emotional language
- Opinions
- Conversational asides
Examples:
<!-- Generic -->
The Amalfi Coast is a beautiful destination with scenic views and pleasant weather.
<!-- Authentic -->
The Amalfi Coast hits you with the scent of lemon groves before you even see
the cliffs. The air tastes like salt and possibility, and the stone steps are
worn smooth by centuries of footsteps.
Countermeasures:
- Add personal stories and experiences
- Include sensory details
- Use conversational tone
- Share opinions and perspectives
- Show vulnerability and humor
3.6 Formulaic Structures
Problem: Recognizable templates create "fill-in-the-blank" feel.
Common patterns:
- Rule of three everywhere
- "Not only...but also" overuse
- Participial phrase endings ("-ing" phrases)
- Gerund subjects ("Developing the new data...")
- Every section has exactly 3-5 bullets
Examples:
<!-- Formulaic -->
Not only does machine learning improve accuracy, but it also enhances speed.
Implementing this technology provides significant benefits, marking a pivotal
moment in automation history. Whether you're a startup, a mid-sized company,
or an enterprise...
<!-- Natural -->
Machine learning improves both accuracy and speed. This technology transforms
automation. Companies of all sizes can benefit.
Countermeasures:
- Vary sentence structure
- Break the rule of three
- Use natural flow instead of templates
- Vary list lengths
CATEGORY 4: Content and Context Issues
4.1 Missing "Why" and Context
Problem: Documents what code does but not why decisions were made.
Why it occurs: "Code lacks business or product logic that went into coding something a specific way."
What's missing:
- Reasoning behind design decisions
- Business logic context
- Alternatives considered
- Project-specific constraints
- Abandoned approaches (leading to re-exploration)
Examples:
# What's documented (what)
def process_batch(items, size=1000):
    for batch in chunks(items, size):
        process(batch)

# What's missing (why)
# Process in batches of 1000 to avoid memory exhaustion on large datasets.
# Batch size of 1000 determined through load testing - smaller batches
# increased overhead, larger caused OOM errors on production hardware.
# Alternative approaches tried: streaming (too slow), full load (OOM).
Countermeasures:
- Explain reasoning, not just functionality
- Document alternatives considered
- Include business/product context
- Explain constraints and trade-offs
4.2 Inconsistent Terminology
Problem: Using different terms for the same concept creates confusion.
Why it occurs: LLMs must infer whether terms are synonyms or distinct concepts, making probabilistic guesses.
Examples:
- "API key," "access token," "auth credential" used interchangeably
- "User," "account," "profile" without clear distinctions
- Mixing technical terms across sections
Impact: "Inconsistency compounds over time, creating increasingly unreliable AI experience that erodes developer trust."
Countermeasures:
- Create and enforce terminology glossary
- Use consistent naming throughout
- Define distinctions between similar terms
- Review for terminology drift
4.3 Unrealistic or Toy Examples
Problem: Oversimplified scenarios that don't reflect production usage.
Why it occurs:
- Training data bias toward tutorials
- Happy path optimization
- Simplification tendency
Common issues:
- Hard-coded credentials
- Missing error handling, retries, timeouts
- Unrealistic scale (5 items instead of millions)
- No authentication or authorization
- Missing logging, monitoring, rate limiting
Examples:
# Toy example
def get_user_data(user_id):
    response = requests.get(f"https://api.example.com/users/{user_id}")
    return response.json()

# Production needs:
# - Error handling (network failures, 404s, 500s)
# - Timeout configuration
# - Retry logic with exponential backoff
# - Logging/monitoring
# - Request cancellation
# - Rate limiting
# - Authentication
Countermeasures:
- Prompt for "production-ready code"
- Security checklists (authentication, validation)
- Real-world test scenarios
- Flag hard-coded credentials
- Require error handling
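For contrast with the toy example above, here is a minimal sketch of what "production-ready" usually implies, using the requests library and a placeholder base URL: timeouts, retries with backoff, explicit error handling, and logging. Authentication and rate limiting are still omitted and would be layered on top.
# Minimal sketch: the same lookup with timeouts, retries, and error handling.
import logging
from typing import Optional

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

log = logging.getLogger(__name__)

def build_session() -> requests.Session:
    retry = Retry(total=3, backoff_factor=0.5,
                  status_forcelist=[500, 502, 503, 504])
    session = requests.Session()
    session.mount("https://", HTTPAdapter(max_retries=retry))
    return session

def get_user_data(session: requests.Session, user_id: str) -> Optional[dict]:
    try:
        resp = session.get(f"https://api.example.com/users/{user_id}", timeout=5)
        resp.raise_for_status()
        return resp.json()
    except requests.RequestException as exc:
        log.error("Failed to fetch user %s: %s", user_id, exc)
        return None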
4.4 Lack of Warnings and Caveats
Problem: Presents information confidently without appropriate uncertainty markers or warnings.
Why it occurs: LLMs are trained to sound authoritative.
Missing elements:
- Security warnings
- Version compatibility notes
- Deprecation warnings
- Performance implications
- Prerequisites
Countermeasures:
- Explicitly prompt for warnings
- Add security review step
- Include version compatibility matrices
- Document prerequisites clearly
CATEGORY 5: Documentation Structure Issues
5.1 Over-Comprehensive Coverage
Problem: Tries to cover every aspect equally, creating information overload.
Why it occurs: "AI writing sounds comprehensive, balanced, covers every angle equally."
Impact:
- Difficulty identifying what's important
- Tedious reading
- Key information buried
Countermeasures:
- Prioritize common use cases
- Mark advanced sections clearly
- Use progressive disclosure
- Focus on "80% of what users need"
5.2 Surface-Level Coverage
Problem: Provides broad, high-level explanations that avoid depth.
Why it occurs:
- Limited understanding
- Avoids taking strong stances
- Cannot grasp sophisticated algorithms
Countermeasures:
- Prompt for specific depth level
- Request concrete examples
- Ask for trade-off analysis
- Include advanced sections for complex topics
LLM-Specific Quality Checklist
Before publishing LLM-assisted documentation, verify:
Code Quality
- All code examples tested and working
- No hallucinated APIs or libraries
- All dependencies verified in registries
- Versions specified and compatible
- No deprecated methods used
- Security vulnerabilities scanned (SAST)
- Error handling for edge cases
- Input validation present
- No hard-coded credentials
- Style consistent throughout
Content Quality
- No telltale LLM phrases ("delve," "leverage," "realm")
- Specific examples with concrete details
- "Why" explained, not just "what"
- Terminology consistent throughout
- Authentic voice, not generic
- Production-ready examples
- Warnings and caveats included
- Context provided for decisions
Structural Quality
- Not over-comprehensive (focused on common needs)
- Not surface-level (sufficient depth)
- Progressive complexity
- Clear prerequisites
- Multiple learning pathways
Detection and Remediation Workflow
1. Automated Detection
- Scan for telltale phrases
- Run code through linters and SAST tools
- Verify APIs against official documentation
- Check dependencies in registries
- Test code examples
- Validate version compatibility
2. Manual Review
- Check for missing "why" explanations
- Verify terminology consistency
- Assess authenticity of voice
- Validate security practices
- Review edge case handling
3. Remediation
- Replace LLM verbal tics with plain language
- Add context and reasoning
- Make examples production-ready
- Add error handling and security
- Include specific, concrete details
- Break formulaic templates
- Test all code thoroughly
Best Practices for LLM-Assisted Documentation
Generation Phase
- Use RAG with current documentation
- Specify exact versions in prompts
- Request security and error handling explicitly
- Provide complete context (dependencies, constraints)
- Ask for production-ready code
- Request "why" explanations, not just "what"
Editing Phase
- Remove telltale phrases and verbal tics
- Add specificity: Replace generic with concrete
- Verify all APIs against official docs
- Test all code in target environment
- Scan for security vulnerabilities
- Add personal voice and experience
- Include edge cases and error handling
- Ensure consistency in terminology and style
Review Phase
- Security review for critical paths
- Expert review for technical accuracy
- User testing with actual developers
- Version compatibility validation
- Accessibility check (alt text, headings)
Key Takeaways
Most Critical Issues to Address
- Hallucinations (19.7% package rate, 50%+ API failures)
- Security vulnerabilities (40%+ prevalence)
- Missing edge case handling (systematic)
- Outdated information (training cutoffs)
- Telltale phrases (instant recognition)
- Missing context ("why" not documented)
Universal Principles
- AI as assistant, not replacement - Human review essential
- Test everything - Never publish untested code
- Verify facts - Check all APIs and versions
- Add context - Explain reasoning and trade-offs
- Edit ruthlessly - Remove artificial patterns
- Prioritize security - Scan and review
Success Formula
RAG + Explicit Prompts + Automated Testing + Human Review + Iterative Editing = Quality Documentation
Sources
This research synthesizes findings from:
Academic Research:
- arXiv:2411.01414 (Code generation mistakes)
- arXiv:2407.09726 (API hallucinations)
- arXiv:2409.20550 (Practical hallucinations)
- arXiv:2407.00456 (Style inconsistencies)
- arXiv:2304.10778 (Code quality evaluation)
- Nature publication on semantic entropy
Industry Studies:
- Endor Labs (security vulnerabilities)
- Stanford study (AI coding assistants)
- Amazon Science (package hallucinations)
Technical Writing Resources:
- Wikipedia (AI writing signs)
- Grammarly blog (AI phrases)
- Write the Docs community
- Technical Writer HQ
- Multiple developer blogs and forums
Statistics Referenced:
- GPT-4: 65.2% correctness
- Claude-3: 56.7% correctness
- Security failures: 40%+ (70%+ Java)
- Package hallucinations: 19.7%
- API failures: 50%+ for low-frequency APIs