775 lines
22 KiB
Markdown
775 lines
22 KiB
Markdown
# LLM Documentation Pitfalls: Problems and Countermeasures
|
|
|
|
This reference documents systematic quality issues in LLM-generated technical documentation, based on 2024-2025 research across academic papers, industry studies, and developer communities.
|
|
|
|
## Overview
|
|
|
|
Research shows that even advanced models like GPT-4 and Claude-3 produce correct code only 65.2% and 56.7% of the time respectively. Over 40% of AI-generated code contains security vulnerabilities. Beyond code quality, LLM documentation suffers from stylistic tells, factual hallucinations, missing context, and formulaic patterns that reduce trust and usability.
|
|
|
|
**Key insight**: These issues are systematic, not random. They stem from fundamental LLM characteristics: probabilistic generation, training data limitations, and lack of verification mechanisms.
|
|
|
|
---
|
|
|
|
## CATEGORY 1: Accuracy and Factual Issues
|
|
|
|
### 1.1 Hallucinations
|
|
|
|
**Problem**: LLMs confidently generate false information—non-existent APIs, fabricated parameters, invented research papers.
|
|
|
|
**Why it occurs**:
|
|
- Predict plausible text sequences, not verified facts
|
|
- 19.7% package hallucination rate (205,474+ unique fake packages)
|
|
- 50%+ failures from non-existent API invocations
|
|
- Training data contamination with outdated/incorrect code
|
|
|
|
**Examples**:
|
|
```python
|
|
# HALLUCINATED - pandas.DataFrame.merge_smart() doesn't exist
|
|
df1.merge_smart(df2, on='id')
|
|
|
|
# HALLUCINATED - requests.get_json() isn't a method
|
|
data = requests.get_json('https://api.example.com')
|
|
```
|
|
|
|
**Detection strategies**:
|
|
- Semantic entropy analysis (most robust method)
|
|
- Multi-sampling consistency checks
|
|
- Automated validation against package registries
|
|
- API stub verification
|
|
|
|
**Countermeasures**:
|
|
- Use RAG (Retrieval-Augmented Generation) with current docs
|
|
- Verify all APIs against official documentation
|
|
- Test all code examples before publication
|
|
- Flag low-frequency APIs for extra review
|
|
|
|
---
|
|
|
|
### 1.2 Outdated and Version-Specific Information
|
|
|
|
**Problem**: Training data cutoffs cause documentation to reference deprecated methods and obsolete patterns.
|
|
|
|
**Why it occurs**:
|
|
- Static training data (e.g., GPT-4o trained on DuckDB 0.10.2 while current is 1.3.2)
|
|
- Frequency bias: deprecated high-frequency APIs get generated preferentially
|
|
- 70-90% deprecation usage rate when context is outdated
|
|
- "3-month-old documentation can be completely outdated"
|
|
|
|
**Examples**:
|
|
```python
|
|
# Deprecated in Python 3.2, removed in 3.9
|
|
for child in root.getchildren(): # LLMs still suggest this
|
|
process(child)
|
|
|
|
# Modern approach:
|
|
for child in root:
|
|
process(child)
|
|
```
|
|
|
|
**Countermeasures**:
|
|
- Always specify exact versions in documentation
|
|
- Use RAG to fetch current API documentation
|
|
- Run linters that flag deprecated usage
|
|
- Include version compatibility matrices
|
|
- Regular documentation audits for version drift
|
|
|
|
---
|
|
|
|
### 1.3 Mixed Accurate/Inaccurate Information
|
|
|
|
**Problem**: Most dangerous pattern—correct elements mask errors, making detection difficult even for experts.
|
|
|
|
**Why it occurs**:
|
|
- LLMs blend information from multiple sources
|
|
- High-confidence patterns from different versions get mixed
|
|
- Temporal conflation (can't distinguish "2020 code" vs "2024 code")
|
|
|
|
**Examples**:
|
|
```python
|
|
# Mixing Python 2 and Python 3
|
|
from __future__ import print_function # Python 2 compatibility
|
|
response = urllib2.urlopen(url) # Python 2 (removed in Python 3)
|
|
data = response.read().decode('utf-8') # Python 3 style
|
|
```
|
|
|
|
**Countermeasures**:
|
|
- Test entire code examples, not just parts
|
|
- Cross-reference multiple authoritative sources
|
|
- Use version-pinned dependency files in context
|
|
- Automated compatibility checking
|
|
|
|
---
|
|
|
|
## CATEGORY 2: Code Quality Issues
|
|
|
|
### 2.1 Code That Doesn't Work
|
|
|
|
**Problem**: 35-70% failure rate depending on model. Contains syntax errors, logic flaws, off-by-one errors.
|
|
|
|
**Why it occurs**:
|
|
- Token-by-token prediction doesn't ensure correctness
|
|
- Training objective optimizes for pattern matching, not accuracy
|
|
- No execution verification during generation
|
|
|
|
**Examples**:
|
|
```python
|
|
# Logic error: Missing duplicate detection
|
|
def find_unique(items):
|
|
result = []
|
|
for item in items:
|
|
result.append(item) # Should check if already in result
|
|
return result
|
|
|
|
# Math error: Wrong operation
|
|
average = sum(values) * len(values) # Should divide, not multiply
|
|
|
|
# Off-by-one error
|
|
middle = arr[len(arr) // 2 + 1] # Wrong for even-length arrays
|
|
```
|
|
|
|
**Countermeasures**:
|
|
- Execute all examples in sandboxed environments
|
|
- Automated testing with comprehensive test suites
|
|
- Static analysis (linters, type checkers)
|
|
- Iterative refinement (LlmFix-style approaches show 9.5% improvement)
|
|
|
|
---
|
|
|
|
### 2.2 Incomplete Code Examples
|
|
|
|
**Problem**: Missing imports, setup steps, configuration, and contextual information needed for execution.
|
|
|
|
**Why it occurs**:
|
|
- Context window limitations
|
|
- Training data fragmentation (snippets lack full context)
|
|
- Implicit knowledge assumption
|
|
|
|
**Examples**:
|
|
```python
|
|
# Generated code
|
|
df = pd.read_csv('data.csv')
|
|
|
|
# Missing:
|
|
# import pandas as pd
|
|
# pip install pandas
|
|
```
|
|
|
|
**Countermeasures**:
|
|
- Prompt for "complete, runnable code with all imports"
|
|
- Dependency checking against registries (npm, PyPI, etc.)
|
|
- Template-based generation with required structure
|
|
- "Copy-paste ready" standard
|
|
|
|
---
|
|
|
|
### 2.3 Security Vulnerabilities
|
|
|
|
**Problem**: 40%+ of AI-generated code contains security flaws. Java shows 70%+ security failure rate.
|
|
|
|
**Why it occurs**:
|
|
- Training on vulnerable code from open-source datasets
|
|
- Security not prioritized unless explicitly prompted
|
|
- Simpler insecure patterns appear more frequently
|
|
|
|
**Examples**:
|
|
```python
|
|
# SQL Injection vulnerability
|
|
def get_user(username):
|
|
query = f"SELECT * FROM users WHERE username = '{username}'"
|
|
return db.execute(query)
|
|
# Attack: username = "admin' OR '1'='1"
|
|
|
|
# Secure version:
|
|
def get_user(username):
|
|
query = "SELECT * FROM users WHERE username = ?"
|
|
return db.execute(query, (username,))
|
|
```
|
|
|
|
```python
|
|
# Unsafe YAML loading
|
|
config = yaml.load(f) # Allows arbitrary code execution
|
|
|
|
# Secure:
|
|
config = yaml.safe_load(f)
|
|
```
|
|
|
|
**Countermeasures**:
|
|
- Security-focused prompts ("secure code with input validation")
|
|
- Static security analysis (SAST tools: Snyk, Semgrep, CodeQL)
|
|
- Dependency vulnerability scanning
|
|
- Security review for critical code paths
|
|
- Default to secure patterns in templates
|
|
|
|
---
|
|
|
|
### 2.4 Missing Error Handling and Edge Cases
|
|
|
|
**Problem**: Implements "happy path" only. Systematically overlooks null values, boundary conditions, empty collections.
|
|
|
|
**Why it occurs**:
|
|
- Training data bias toward success cases
|
|
- Error handling adds verbosity
|
|
- Edge cases underrepresented in training data
|
|
|
|
**Examples**:
|
|
```python
|
|
# Crashes on empty input
|
|
def get_average(numbers):
|
|
return sum(numbers) / len(numbers) # ZeroDivisionError when numbers = []
|
|
|
|
# Missing null checks
|
|
def getUserName(user):
|
|
return user.profile.name.toUpperCase() # Crashes if any part is null
|
|
|
|
# No boundary checking
|
|
def get_percentage(value, total):
|
|
return (value / total) * 100 # Crashes if total = 0
|
|
```
|
|
|
|
**Countermeasures**:
|
|
- Explicit prompts: "Handle null values, empty arrays, and boundary conditions"
|
|
- Test-driven approach with edge case test suites
|
|
- Defensive programming templates
|
|
- Automated edge case testing (null, empty, max values, Unicode)
|
|
|
|
---
|
|
|
|
### 2.5 Inconsistent Code Style
|
|
|
|
**Problem**: 24 types of style inconsistencies across 5 dimensions, affecting maintainability despite functional correctness.
|
|
|
|
**Why it occurs**:
|
|
- Training data diversity (varied conventions)
|
|
- Context-dependent generation
|
|
- Systematic biases (avoiding blank lines, avoiding comprehensions)
|
|
|
|
**Examples**:
|
|
```python
|
|
# Inconsistent spacing
|
|
def process1(items):
|
|
result=[] # No spaces
|
|
|
|
def process2(items):
|
|
result = [] # With spaces
|
|
|
|
# Inconsistent naming
|
|
const userId = getCurrentUser(); # camelCase
|
|
const user_data = fetchUserData(userId); # snake_case
|
|
const UserSettings = loadSettings(); # PascalCase
|
|
```
|
|
|
|
**Countermeasures**:
|
|
- Specify style guides in prompts ("Follow PEP 8")
|
|
- Automated formatters (Black, Prettier, gofmt)
|
|
- Linter enforcement
|
|
- Few-shot examples with consistent style
|
|
|
|
---
|
|
|
|
## CATEGORY 3: Writing Style Problems
|
|
|
|
### 3.1 Verbal Tics and Repetitive Patterns
|
|
|
|
**Problem**: Overuse of specific phrases makes AI authorship instantly recognizable.
|
|
|
|
**Common tells**:
|
|
- "It's worth noting that"
|
|
- "Keep in mind"
|
|
- "Delve into"
|
|
- "In the realm of"
|
|
- "Tapestry"
|
|
- "Landscape"
|
|
- "Leverage"
|
|
- "Robust"
|
|
- "Pivotal"
|
|
|
|
**Why it occurs**: Statistical attractors—phrases that co-occur frequently in training data.
|
|
|
|
**Examples**:
|
|
```markdown
|
|
<!-- LLM-generated -->
|
|
It's worth noting that we should delve into the pivotal role of machine learning
|
|
in the realm of automation. That being said, it's important to note that this
|
|
cutting-edge technology leverages sophisticated algorithms.
|
|
|
|
<!-- Better -->
|
|
Machine learning plays a crucial role in automation. This advanced technology
|
|
uses sophisticated algorithms.
|
|
```
|
|
|
|
**Countermeasures**:
|
|
- Scan for telltale phrases and replace with plain language
|
|
- Use varied vocabulary
|
|
- Avoid formulaic transitions
|
|
- Edit for conciseness
|
|
|
|
---
|
|
|
|
### 3.2 Over-Explanation and Verbosity
|
|
|
|
**Problem**: Padding with unnecessary words, redundant explanations, obvious statements.
|
|
|
|
**Why it occurs**:
|
|
- RLHF reward models favor longer outputs
|
|
- "Verbosity compensation" when uncertain
|
|
- Lack of experience to fill space
|
|
|
|
**Examples**:
|
|
```markdown
|
|
<!-- Verbose -->
|
|
In today's fast-paced business environment, it's important to note that companies
|
|
need to engage in the implementation of optimization strategies in order to
|
|
streamline their operations in a way that improves efficiency.
|
|
|
|
<!-- Concise -->
|
|
Companies should implement optimization strategies to streamline operations and
|
|
improve efficiency.
|
|
```
|
|
|
|
**Countermeasures**:
|
|
- Cut redundant exposition
|
|
- Remove generic opening padding
|
|
- Use concise constructions ("to" not "in order to")
|
|
- Delete "It goes without saying"
|
|
|
|
---
|
|
|
|
### 3.3 Hedging and Uncertainty Language Overuse
|
|
|
|
**Problem**: Excessive qualifying statements make content feel tentative and non-committal.
|
|
|
|
**Common patterns**:
|
|
- "Generally speaking"
|
|
- "Typically"
|
|
- "Could potentially"
|
|
- "To some extent"
|
|
- "It could be argued that"
|
|
|
|
**Why it occurs**:
|
|
- Risk avoidance (safe language)
|
|
- Training on academic content
|
|
- Probabilistic nature creates uncertainty
|
|
|
|
**Examples**:
|
|
```markdown
|
|
<!-- Over-hedged -->
|
|
It could be argued that machine learning tends to generally improve efficiency
|
|
in most automation scenarios, and typically provides somewhat better results.
|
|
|
|
<!-- Direct -->
|
|
Machine learning improves efficiency in automation and provides better results.
|
|
```
|
|
|
|
**Countermeasures**:
|
|
- Make direct claims where appropriate
|
|
- Remove unnecessary qualifiers
|
|
- Use confident language for established facts
|
|
|
|
---
|
|
|
|
### 3.4 Generic, Vague Descriptions
|
|
|
|
**Problem**: Replaces specific facts with generic descriptions that could apply to anything. "Regression to the mean."
|
|
|
|
**Common patterns**:
|
|
- "Significant impact"
|
|
- "Crucial role"
|
|
- "Comprehensive approach"
|
|
- "Game-changing"
|
|
- "Innovative solutions"
|
|
|
|
**Why it occurs**: Generic statements are statistically common; specific facts are rare.
|
|
|
|
**Examples**:
|
|
```markdown
|
|
<!-- Vague -->
|
|
The comprehensive AI implementation provided significant improvements across
|
|
various aspects of the business, delivering transformative results.
|
|
|
|
<!-- Specific -->
|
|
The AI implementation increased sales by 23%, reduced processing time from
|
|
4 hours to 30 minutes, and identified $2M in cost savings.
|
|
```
|
|
|
|
**Countermeasures**:
|
|
- Use concrete numbers, dates, names
|
|
- Provide specific examples
|
|
- Cite actual sources
|
|
- Replace "research shows" with named studies
|
|
|
|
---
|
|
|
|
### 3.5 Lack of Authentic Voice
|
|
|
|
**Problem**: Missing emotional depth, personal perspective, humor, and memorable communication.
|
|
|
|
**Why it occurs**:
|
|
- No lived experience
|
|
- Training for neutrality
|
|
- Pattern matching without unique perspective
|
|
|
|
**Missing elements**:
|
|
- Personal anecdotes
|
|
- Humor or wit
|
|
- Sensory details
|
|
- Emotional language
|
|
- Opinions
|
|
- Conversational asides
|
|
|
|
**Examples**:
|
|
```markdown
|
|
<!-- Generic -->
|
|
The Amalfi Coast is a beautiful destination with scenic views and pleasant weather.
|
|
|
|
<!-- Authentic -->
|
|
The Amalfi Coast hits you with the scent of lemon groves before you even see
|
|
the cliffs. The air tastes like salt and possibility, and the stone steps are
|
|
worn smooth by centuries of footsteps.
|
|
```
|
|
|
|
**Countermeasures**:
|
|
- Add personal stories and experiences
|
|
- Include sensory details
|
|
- Use conversational tone
|
|
- Share opinions and perspectives
|
|
- Show vulnerability and humor
|
|
|
|
---
|
|
|
|
### 3.6 Formulaic Structures
|
|
|
|
**Problem**: Recognizable templates create "fill-in-the-blank" feel.
|
|
|
|
**Common patterns**:
|
|
- Rule of three everywhere
|
|
- "Not only...but also" overuse
|
|
- Participial phrase endings ("-ing" phrases)
|
|
- Gerund subjects ("Developing the new data...")
|
|
- Every section has exactly 3-5 bullets
|
|
|
|
**Examples**:
|
|
```markdown
|
|
<!-- Formulaic -->
|
|
Not only does machine learning improve accuracy, but it also enhances speed.
|
|
Implementing this technology provides significant benefits, marking a pivotal
|
|
moment in automation history. Whether you're a startup, a mid-sized company,
|
|
or an enterprise...
|
|
|
|
<!-- Natural -->
|
|
Machine learning improves both accuracy and speed. This technology transforms
|
|
automation. Companies of all sizes can benefit.
|
|
```
|
|
|
|
**Countermeasures**:
|
|
- Vary sentence structure
|
|
- Break the rule of three
|
|
- Use natural flow instead of templates
|
|
- Vary list lengths
|
|
|
|
---
|
|
|
|
## CATEGORY 4: Content and Context Issues
|
|
|
|
### 4.1 Missing "Why" and Context
|
|
|
|
**Problem**: Documents *what* code does but not *why* decisions were made.
|
|
|
|
**Why it occurs**: "Code lacks business or product logic that went into coding something a specific way."
|
|
|
|
**What's missing**:
|
|
- Reasoning behind design decisions
|
|
- Business logic context
|
|
- Alternatives considered
|
|
- Project-specific constraints
|
|
- Abandoned approaches (leading to re-exploration)
|
|
|
|
**Examples**:
|
|
```python
|
|
# What's documented (what)
|
|
def process_batch(items, size=1000):
|
|
for batch in chunks(items, size):
|
|
process(batch)
|
|
|
|
# What's missing (why)
|
|
# Process in batches of 1000 to avoid memory exhaustion on large datasets.
|
|
# Batch size of 1000 determined through load testing - smaller batches
|
|
# increased overhead, larger caused OOM errors on production hardware.
|
|
# Alternative approaches tried: streaming (too slow), full load (OOM).
|
|
```
|
|
|
|
**Countermeasures**:
|
|
- Explain reasoning, not just functionality
|
|
- Document alternatives considered
|
|
- Include business/product context
|
|
- Explain constraints and trade-offs
|
|
|
|
---
|
|
|
|
### 4.2 Inconsistent Terminology
|
|
|
|
**Problem**: Using different terms for the same concept creates confusion.
|
|
|
|
**Why it occurs**: LLMs must infer whether terms are synonyms or distinct concepts, making probabilistic guesses.
|
|
|
|
**Examples**:
|
|
- "API key," "access token," "auth credential" used interchangeably
|
|
- "User," "account," "profile" without clear distinctions
|
|
- Mixing technical terms across sections
|
|
|
|
**Impact**: "Inconsistency compounds over time, creating increasingly unreliable AI experience that erodes developer trust."
|
|
|
|
**Countermeasures**:
|
|
- Create and enforce terminology glossary
|
|
- Use consistent naming throughout
|
|
- Define distinctions between similar terms
|
|
- Review for terminology drift
|
|
|
|
---
|
|
|
|
### 4.3 Unrealistic or Toy Examples
|
|
|
|
**Problem**: Oversimplified scenarios that don't reflect production usage.
|
|
|
|
**Why it occurs**:
|
|
- Training data bias toward tutorials
|
|
- Happy path optimization
|
|
- Simplification tendency
|
|
|
|
**Common issues**:
|
|
- Hard-coded credentials
|
|
- Missing error handling, retries, timeouts
|
|
- Unrealistic scale (5 items instead of millions)
|
|
- No authentication or authorization
|
|
- Missing logging, monitoring, rate limiting
|
|
|
|
**Examples**:
|
|
```python
|
|
# Toy example
|
|
def getUserData(userId):
|
|
response = await fetch(`/api/users/${userId}`);
|
|
return response.json();
|
|
|
|
# Production needs:
|
|
# - Error handling (network failures, 404s, 500s)
|
|
# - Timeout configuration
|
|
# - Retry logic with exponential backoff
|
|
# - Logging/monitoring
|
|
# - Request cancellation
|
|
# - Rate limiting
|
|
# - Authentication
|
|
```
|
|
|
|
**Countermeasures**:
|
|
- Prompt for "production-ready code"
|
|
- Security checklists (authentication, validation)
|
|
- Real-world test scenarios
|
|
- Flag hard-coded credentials
|
|
- Require error handling
|
|
|
|
---
|
|
|
|
### 4.4 Lack of Warnings and Caveats
|
|
|
|
**Problem**: Presents information confidently without appropriate uncertainty markers or warnings.
|
|
|
|
**Why it occurs**: LLMs trained to sound authoritative.
|
|
|
|
**Missing elements**:
|
|
- Security warnings
|
|
- Version compatibility notes
|
|
- Deprecation warnings
|
|
- Performance implications
|
|
- Prerequisites
|
|
|
|
**Countermeasures**:
|
|
- Explicitly prompt for warnings
|
|
- Add security review step
|
|
- Include version compatibility matrices
|
|
- Document prerequisites clearly
|
|
|
|
---
|
|
|
|
## CATEGORY 5: Documentation Structure Issues
|
|
|
|
### 5.1 Over-Comprehensive Coverage
|
|
|
|
**Problem**: Tries to cover every aspect equally, creating information overload.
|
|
|
|
**Why it occurs**: "AI writing sounds comprehensive, balanced, covers every angle equally."
|
|
|
|
**Impact**:
|
|
- Difficulty identifying what's important
|
|
- Tedious reading
|
|
- Key information buried
|
|
|
|
**Countermeasures**:
|
|
- Prioritize common use cases
|
|
- Mark advanced sections clearly
|
|
- Use progressive disclosure
|
|
- Focus on "80% of what users need"
|
|
|
|
---
|
|
|
|
### 5.2 Surface-Level Coverage
|
|
|
|
**Problem**: Provides broad, high-level explanations that avoid depth.
|
|
|
|
**Why it occurs**:
|
|
- Limited understanding
|
|
- Avoids taking strong stances
|
|
- Cannot grasp sophisticated algorithms
|
|
|
|
**Countermeasures**:
|
|
- Prompt for specific depth level
|
|
- Request concrete examples
|
|
- Ask for trade-off analysis
|
|
- Include advanced sections for complex topics
|
|
|
|
---
|
|
|
|
## LLM-Specific Quality Checklist
|
|
|
|
Before publishing LLM-assisted documentation, verify:
|
|
|
|
### Code Quality
|
|
- [ ] All code examples tested and working
|
|
- [ ] No hallucinated APIs or libraries
|
|
- [ ] All dependencies verified in registries
|
|
- [ ] Versions specified and compatible
|
|
- [ ] No deprecated methods used
|
|
- [ ] Security vulnerabilities scanned (SAST)
|
|
- [ ] Error handling for edge cases
|
|
- [ ] Input validation present
|
|
- [ ] No hard-coded credentials
|
|
- [ ] Style consistent throughout
|
|
|
|
### Content Quality
|
|
- [ ] No telltale LLM phrases ("delve," "leverage," "realm")
|
|
- [ ] Specific examples with concrete details
|
|
- [ ] "Why" explained, not just "what"
|
|
- [ ] Terminology consistent throughout
|
|
- [ ] Authentic voice, not generic
|
|
- [ ] Production-ready examples
|
|
- [ ] Warnings and caveats included
|
|
- [ ] Context provided for decisions
|
|
|
|
### Structural Quality
|
|
- [ ] Not over-comprehensive (focused on common needs)
|
|
- [ ] Not surface-level (sufficient depth)
|
|
- [ ] Progressive complexity
|
|
- [ ] Clear prerequisites
|
|
- [ ] Multiple learning pathways
|
|
|
|
---
|
|
|
|
## Detection and Remediation Workflow
|
|
|
|
### 1. Automated Detection
|
|
- Scan for telltale phrases
|
|
- Run code through linters and SAST tools
|
|
- Verify APIs against official documentation
|
|
- Check dependencies in registries
|
|
- Test code examples
|
|
- Validate version compatibility
|
|
|
|
### 2. Manual Review
|
|
- Check for missing "why" explanations
|
|
- Verify terminology consistency
|
|
- Assess authenticity of voice
|
|
- Validate security practices
|
|
- Review edge case handling
|
|
|
|
### 3. Remediation
|
|
- Replace LLM verbal tics with plain language
|
|
- Add context and reasoning
|
|
- Make examples production-ready
|
|
- Add error handling and security
|
|
- Include specific, concrete details
|
|
- Break formulaic templates
|
|
- Test all code thoroughly
|
|
|
|
---
|
|
|
|
## Best Practices for LLM-Assisted Documentation
|
|
|
|
### Generation Phase
|
|
1. **Use RAG** with current documentation
|
|
2. **Specify exact versions** in prompts
|
|
3. **Request security** and error handling explicitly
|
|
4. **Provide complete context** (dependencies, constraints)
|
|
5. **Ask for production-ready** code
|
|
6. **Request "why" explanations**, not just "what"
|
|
|
|
### Editing Phase
|
|
1. **Remove telltale phrases** and verbal tics
|
|
2. **Add specificity**: Replace generic with concrete
|
|
3. **Verify all APIs** against official docs
|
|
4. **Test all code** in target environment
|
|
5. **Scan for security** vulnerabilities
|
|
6. **Add personal voice** and experience
|
|
7. **Include edge cases** and error handling
|
|
8. **Ensure consistency** in terminology and style
|
|
|
|
### Review Phase
|
|
1. **Security review** for critical paths
|
|
2. **Expert review** for technical accuracy
|
|
3. **User testing** with actual developers
|
|
4. **Version compatibility** validation
|
|
5. **Accessibility check** (alt text, headings)
|
|
|
|
---
|
|
|
|
## Key Takeaways
|
|
|
|
### Most Critical Issues to Address
|
|
1. **Hallucinations** (19.7% package rate, 50%+ API failures)
|
|
2. **Security vulnerabilities** (40%+ prevalence)
|
|
3. **Missing edge case handling** (systematic)
|
|
4. **Outdated information** (training cutoffs)
|
|
5. **Telltale phrases** (instant recognition)
|
|
6. **Missing context** ("why" not documented)
|
|
|
|
### Universal Principles
|
|
- **AI as assistant, not replacement** - Human review essential
|
|
- **Test everything** - Never publish untested code
|
|
- **Verify facts** - Check all APIs and versions
|
|
- **Add context** - Explain reasoning and trade-offs
|
|
- **Edit ruthlessly** - Remove artificial patterns
|
|
- **Prioritize security** - Scan and review
|
|
|
|
### Success Formula
|
|
**RAG + Explicit Prompts + Automated Testing + Human Review + Iterative Editing = Quality Documentation**
|
|
|
|
---
|
|
|
|
## Sources
|
|
|
|
This research synthesizes findings from:
|
|
|
|
**Academic Research**:
|
|
- arxiv:2411.01414 (Code generation mistakes)
|
|
- arxiv:2407.09726 (API hallucinations)
|
|
- arxiv:2409.20550 (Practical hallucinations)
|
|
- arxiv:2407.00456 (Style inconsistencies)
|
|
- arxiv:2304.10778 (Code quality evaluation)
|
|
- Nature publication on semantic entropy
|
|
|
|
**Industry Studies**:
|
|
- Endor Labs (security vulnerabilities)
|
|
- Stanford study (AI coding assistants)
|
|
- Amazon Science (package hallucinations)
|
|
|
|
**Technical Writing Resources**:
|
|
- Wikipedia (AI writing signs)
|
|
- Grammarly blog (AI phrases)
|
|
- Write the Docs community
|
|
- Technical Writer HQ
|
|
- Multiple developer blogs and forums
|
|
|
|
**Statistics Referenced**:
|
|
- GPT-4: 65.2% correctness
|
|
- Claude-3: 56.7% correctness
|
|
- Security failures: 40%+ (70%+ Java)
|
|
- Package hallucinations: 19.7%
|
|
- API failures: 50%+ for low-frequency APIs
|