# PDF Form Processing Guide

Complete guide for processing PDF forms in production environments.

## Table of contents

- Form analysis
- Form filling workflows
- Field types and handling
- Validation strategies
- Multi-page forms
- Flattening forms
- Error handling patterns
- Production examples
- Best practices
- Troubleshooting

## Form analysis

### Analyze form structure

Use `analyze_form.py` to extract complete form information:

```bash
python scripts/analyze_form.py application.pdf --output schema.json
```

Output format:

```json
{
  "full_name": {
    "type": "text",
    "required": true,
    "max_length": 100,
    "x": 120.5,
    "y": 450.2,
    "width": 300,
    "height": 20
  },
  "date_of_birth": {
    "type": "text",
    "required": true,
    "format": "MM/DD/YYYY",
    "x": 120.5,
    "y": 400.8,
    "width": 150,
    "height": 20
  },
  "email_newsletter": {
    "type": "checkbox",
    "required": false,
    "x": 120.5,
    "y": 350.4,
    "width": 15,
    "height": 15
  },
  "preferred_contact": {
    "type": "radio",
    "required": true,
    "options": ["email", "phone", "mail"],
    "x": 120.5,
    "y": 300.0,
    "width": 200,
    "height": 60
  }
}
```

### Programmatic analysis

```python
from pypdf import PdfReader

reader = PdfReader("form.pdf")
# get_fields() returns None when the PDF has no form fields
fields = reader.get_fields() or {}

for field_name, field_info in fields.items():
    print(f"Field: {field_name}")
    print(f"  Type: {field_info.get('/FT')}")
    print(f"  Value: {field_info.get('/V')}")
    print(f"  Flags: {field_info.get('/Ff', 0)}")
    print()
```
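
The `/Ff` value printed above is a bit field, so the raw integer is hard to read. A minimal sketch of decoding it follows, assuming the bit positions from the PDF specification's field-flag tables; `FieldFlag` and `describe_flags` are illustrative names, not part of pypdf or the included scripts. Only a subset of flags is listed, and some of them are meaningful only for one field type.

```python
from enum import IntFlag


class FieldFlag(IntFlag):
    """Selected /Ff bits (bit n in the PDF spec has the value 2 ** (n - 1))."""

    READ_ONLY = 1 << 0    # bit 1, all field types
    REQUIRED = 1 << 1     # bit 2, all field types
    NO_EXPORT = 1 << 2    # bit 3, all field types
    MULTILINE = 1 << 12   # bit 13, text fields
    PASSWORD = 1 << 13    # bit 14, text fields
    RADIO = 1 << 15       # bit 16, button fields
    PUSHBUTTON = 1 << 16  # bit 17, button fields
    COMBO = 1 << 17       # bit 18, choice fields


def describe_flags(ff: int) -> list[str]:
    """Return the names of the known flags that are set in an /Ff value."""
    return [flag.name for flag in FieldFlag if ff & flag]


print(describe_flags(2))         # a required field
print(describe_flags(4096 + 2))  # a required multi-line text field
```

In the analysis loop, `describe_flags(int(field_info.get('/Ff', 0)))` could replace the raw flag output.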

## Form filling workflows

### Basic workflow

```bash
# 1. Analyze form
python scripts/analyze_form.py template.pdf --output schema.json

# 2. Prepare data
cat > data.json << EOF
{
  "full_name": "John Doe",
  "date_of_birth": "01/15/1990",
  "email": "john@example.com",
  "email_newsletter": true,
  "preferred_contact": "email"
}
EOF

# 3. Validate data
python scripts/validate_form.py data.json schema.json

# 4. Fill form
python scripts/fill_form.py template.pdf data.json filled.pdf

# 5. Flatten (optional - makes fields non-editable)
python scripts/flatten_form.py filled.pdf final.pdf
```

### Programmatic filling

```python
from pypdf import PdfReader, PdfWriter

reader = PdfReader("template.pdf")
writer = PdfWriter()

# Copy all pages together with the form (AcroForm) definition;
# add_page() alone copies the pages but not the document-level form dictionary
writer.append(reader)

# Fill form fields
writer.update_page_form_field_values(
    writer.pages[0],
    {
        "full_name": "John Doe",
        "date_of_birth": "01/15/1990",
        "email": "john@example.com",
        "email_newsletter": "/Yes",     # Checkbox value
        "preferred_contact": "/email",  # Radio value
    },
)

# Save filled form
with open("filled.pdf", "wb") as output:
    writer.write(output)
```

## Field types and handling

### Text fields

```python
# Values are collected in a dict keyed by field name
field_values = {}

# Simple text
field_values["customer_name"] = "Jane Smith"

# Formatted text (dates)
field_values["date"] = "12/25/2024"

# Numbers (passed as strings)
field_values["amount"] = "1234.56"

# Multi-line text
field_values["comments"] = "Line 1\nLine 2\nLine 3"
```

### Checkboxes

Checkboxes typically use `/Yes` for checked and `/Off` for unchecked:

```python
# Check checkbox
field_values["agree_to_terms"] = "/Yes"

# Uncheck checkbox
field_values["newsletter_opt_out"] = "/Off"
```

**Note**: Some PDFs use a different "on" value. Check the `on_value` reported by `analyze_form.py`:

```json
{
  "some_checkbox": {
    "type": "checkbox",
    "on_value": "/On",
    "off_value": "/Off"
  }
}
```
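
Because the "on" value varies between PDFs, it can help to derive it from the schema rather than hard-coding `/Yes`. The sketch below assumes the schema entry carries the `on_value` and `off_value` keys shown above; `checkbox_value` is a hypothetical helper, not part of the included scripts.

```python
def checkbox_value(checked: bool, field_schema: dict) -> str:
    """Map a boolean to the checkbox value named in the schema entry."""
    # Fall back to the common defaults when the schema omits the keys
    on_value = field_schema.get("on_value", "/Yes")
    off_value = field_schema.get("off_value", "/Off")
    return on_value if checked else off_value


schema = {
    "some_checkbox": {"type": "checkbox", "on_value": "/On", "off_value": "/Off"}
}
field_values = {
    "some_checkbox": checkbox_value(True, schema["some_checkbox"]),
}
print(field_values)
```

This keeps boolean input data (such as `"email_newsletter": true` in `data.json`) separate from the PDF-specific export values.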

### Radio buttons

Radio buttons are mutually exclusive options:

```python
# Select one option from radio group
field_values["preferred_contact"] = "/email"

# Other options in same group
# field_values["preferred_contact"] = "/phone"
# field_values["preferred_contact"] = "/mail"
```

### Dropdown/List boxes

```python
# Single selection
field_values["country"] = "United States"

# The available options are listed in the schema (JSON excerpt):
# "country": {
#   "type": "dropdown",
#   "options": ["United States", "Canada", "Mexico", ...]
# }
```
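
Radio groups and dropdowns both accept only values from a fixed list, so a selection can be checked against the schema's `options` before filling. This is a sketch under the conventions used in this guide (radio values are written with a leading slash, dropdown values as plain text); `choice_value` is a hypothetical helper.

```python
def choice_value(value: str, field_schema: dict) -> str:
    """Validate a radio or dropdown selection and return the value to fill."""
    options = field_schema.get("options", [])
    if value not in options:
        raise ValueError(f"{value!r} is not one of {options}")
    # Radio selections are filled as names ("/email"); dropdowns as plain text
    return f"/{value}" if field_schema.get("type") == "radio" else value


radio_schema = {"type": "radio", "options": ["email", "phone", "mail"]}
dropdown_schema = {"type": "dropdown", "options": ["United States", "Canada"]}

print(choice_value("email", radio_schema))
print(choice_value("Canada", dropdown_schema))
```

Raising on an unknown option surfaces bad input before the PDF is written, instead of leaving the field silently unselected.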

## Validation strategies

### Schema-based validation

```python
import json
import sys

# Load schema from analyze_form.py output
with open("schema.json") as f:
    schema = json.load(f)

# Load form data
with open("data.json") as f:
    data = json.load(f)

# Validate all fields
errors = []

for field_name, field_schema in schema.items():
    value = data.get(field_name)
    # Only None and "" count as missing, so False (unchecked) stays valid
    is_missing = value is None or value == ""

    # Check required fields
    if field_schema.get("required") and is_missing:
        errors.append(f"Missing required field: {field_name}")

    if is_missing:
        continue

    # Check field type
    if field_schema.get("type") == "text" and not isinstance(value, str):
        errors.append(f"Field {field_name} must be string")

    # Check max length
    max_length = field_schema.get("max_length")
    if max_length and len(str(value)) > max_length:
        errors.append(f"Field {field_name} exceeds max length {max_length}")

    # Check format (dates, emails, etc.)
    # validate_format() is defined under "Format validation"
    format_type = field_schema.get("format")
    if format_type and not validate_format(str(value), format_type):
        errors.append(f"Field {field_name} has invalid format")

if errors:
    print("Validation errors:")
    for error in errors:
        print(f"  - {error}")
    sys.exit(1)

print("Validation passed")
```

### Format validation

```python
import re
from datetime import datetime


def validate_format(value, format_type):
    """Validate field format."""
    # fullmatch() is used instead of match() with "$", which would
    # also accept a value that ends in a trailing newline

    if format_type == "email":
        pattern = r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}'
        return re.fullmatch(pattern, value) is not None

    elif format_type == "phone":
        # US phone: (555) 123-4567 or 555-123-4567
        pattern = r'\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}'
        return re.fullmatch(pattern, value) is not None

    elif format_type == "MM/DD/YYYY":
        try:
            datetime.strptime(value, "%m/%d/%Y")
            return True
        except ValueError:
            return False

    elif format_type == "SSN":
        # XXX-XX-XXXX
        pattern = r'\d{3}-\d{2}-\d{4}'
        return re.fullmatch(pattern, value) is not None

    elif format_type == "ZIP":
        # XXXXX or XXXXX-XXXX
        pattern = r'\d{5}(-\d{4})?'
        return re.fullmatch(pattern, value) is not None

    return True  # Unknown format, skip validation
```
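
Upstream systems often store dates as ISO 8601 strings or `date` objects rather than in the `MM/DD/YYYY` form the schema expects. A small normalization step before validation avoids spurious format errors. The helper name `to_form_date` is an assumption for illustration.

```python
from datetime import date, datetime


def to_form_date(value) -> str:
    """Convert a date, datetime, or ISO 8601 date string to MM/DD/YYYY."""
    if isinstance(value, datetime):
        value = value.date()
    elif isinstance(value, str):
        # Raises ValueError on malformed input such as "1990-13-01"
        value = date.fromisoformat(value)
    return value.strftime("%m/%d/%Y")


print(to_form_date("1990-01-15"))
print(to_form_date(date(2024, 12, 25)))
```

The result can then be passed through the `MM/DD/YYYY` format check like any other value.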

## Multi-page forms

### Handling multi-page forms

```python
from pypdf import PdfReader, PdfWriter

reader = PdfReader("multi_page_form.pdf")
writer = PdfWriter()

# Copy all pages together with the form (AcroForm) definition
writer.append(reader)

# Fill fields on page 1
writer.update_page_form_field_values(
    writer.pages[0],
    {
        "name_page1": "John Doe",
        "email_page1": "john@example.com",
    },
)

# Fill fields on page 2
writer.update_page_form_field_values(
    writer.pages[1],
    {
        "address_page2": "123 Main St",
        "city_page2": "Springfield",
    },
)

# Fill fields on page 3
writer.update_page_form_field_values(
    writer.pages[2],
    {
        "signature_page3": "John Doe",
        "date_page3": "12/25/2024",
    },
)

with open("filled_multi_page.pdf", "wb") as output:
    writer.write(output)
```

### Identifying page-specific fields

```python
from pypdf import PdfReader

reader = PdfReader("multi_page_form.pdf")

# Analyze which fields are on which pages
for page_num, page in enumerate(reader.pages, 1):
    if "/Annots" not in page:
        continue

    print(f"\nPage {page_num} fields:")
    for annot_ref in page["/Annots"]:
        annot = annot_ref.get_object()
        # /Annots also holds links and comments; form fields are widgets
        if annot.get("/Subtype") != "/Widget":
            continue
        field_name = annot.get("/T", "Unknown")
        print(f"  - {field_name}")
```
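
Once the page of each field is known, a flat data dictionary can be split per page so that each `update_page_form_field_values` call receives only the values for its page. This is a sketch; `field_pages` is a hypothetical `{field_name: zero_based_page_index}` mapping that you would build from a walk like the one above.

```python
from collections import defaultdict


def group_values_by_page(data: dict, field_pages: dict) -> dict:
    """Split {field: value} into {page_index: {field: value}}."""
    grouped = defaultdict(dict)
    for name, value in data.items():
        # Names missing from the mapping are skipped rather than guessed
        if name in field_pages:
            grouped[field_pages[name]][name] = value
    return dict(grouped)


field_pages = {"name_page1": 0, "email_page1": 0, "city_page2": 1}
data = {"name_page1": "John Doe", "city_page2": "Springfield", "extra": "x"}

for index, values in group_values_by_page(data, field_pages).items():
    print(index, values)
    # With a writer in scope, each group would be filled with:
    # writer.update_page_form_field_values(writer.pages[index], values)
```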

## Flattening forms

### Why flatten

Flattening converts form fields into static page content, so the values can no longer be edited as fields:

- **Security**: Prevent casual modification of submitted values
- **Distribution**: Share read-only forms
- **Printing**: Ensure correct appearance
- **Archival**: Long-term storage
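
### Flatten with pypdf

```python
from pypdf import PdfReader, PdfWriter

reader = PdfReader("filled.pdf")
writer = PdfWriter()

# Copy all pages together with the form definition
writer.append(reader)

# Values already stored in the form
fields = reader.get_fields() or {}
values = {name: f.value for name, f in fields.items() if f.value is not None}

# Flatten: render each value into the page content, then drop the widgets.
# Assumes a recent pypdf release whose update_page_form_field_values()
# accepts flatten=True; check the documentation for your installed version.
for page in writer.pages:
    if "/Annots" in page:
        writer.update_page_form_field_values(
            page, values, auto_regenerate=False, flatten=True
        )
writer.remove_annotations(subtypes="/Widget")

# Save flattened PDF
with open("flattened.pdf", "wb") as output:
    writer.write(output)
```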

### Using included script

```bash
python scripts/flatten_form.py filled.pdf flattened.pdf
```

## Error handling patterns

### Robust form filling

```python
import logging
from pathlib import Path

from pypdf import PdfReader, PdfWriter
from pypdf.errors import PdfReadError

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

REQUIRED_FLAG = 2  # bit 2 of /Ff marks a field as required


def fill_form_safe(template_path, data, output_path):
    """Fill form with comprehensive error handling."""

    try:
        # Validate inputs
        template = Path(template_path)
        if not template.exists():
            raise FileNotFoundError(f"Template not found: {template_path}")

        # Read template
        logger.info(f"Reading template: {template_path}")
        reader = PdfReader(template_path)

        if not reader.pages:
            raise ValueError("PDF has no pages")

        # Check if form has fields
        fields = reader.get_fields()
        if not fields:
            logger.warning("PDF has no form fields")
            return False

        # Create writer; append() copies the pages and the form definition
        writer = PdfWriter()
        writer.append(reader)

        # Check required fields
        missing_required = [
            name
            for name, info in fields.items()
            if info.get("/Ff", 0) & REQUIRED_FLAG and name not in data
        ]
        if missing_required:
            raise ValueError(f"Missing required fields: {missing_required}")

        # Report data keys that do not match any field in the PDF
        invalid_fields = [name for name in data if name not in fields]
        if invalid_fields:
            logger.warning(f"Ignoring unknown fields: {invalid_fields}")

        # Fill fields on every page that has annotations, not only the first
        logger.info("Filling form fields")
        for page in writer.pages:
            if "/Annots" in page:
                writer.update_page_form_field_values(page, data)

        # Write output
        logger.info(f"Writing output: {output_path}")
        with open(output_path, "wb") as output:
            writer.write(output)

        logger.info("Form filled successfully")
        return True

    except PdfReadError as e:
        logger.error(f"PDF read error: {e}")
        return False

    except FileNotFoundError as e:
        logger.error(f"File error: {e}")
        return False

    except ValueError as e:
        logger.error(f"Validation error: {e}")
        return False

    except Exception as e:
        logger.error(f"Unexpected error: {e}")
        return False


# Usage (guarded so that importing this module does not run it)
if __name__ == "__main__":
    success = fill_form_safe(
        "template.pdf",
        {"name": "John", "email": "john@example.com"},
        "filled.pdf",
    )

    if not success:
        raise SystemExit(1)
```

## Production examples

### Example 1: Batch form processing

```python
import json
from pathlib import Path

# Assumes fill_form_safe() from "Robust form filling" is saved as fill_form_safe.py
from fill_form_safe import fill_form_safe

# Process multiple submissions
submissions_dir = Path("submissions")
template = "application_template.pdf"
output_dir = Path("completed")
output_dir.mkdir(exist_ok=True)

for submission_file in submissions_dir.glob("*.json"):
    print(f"Processing: {submission_file.name}")

    # Load submission data
    with open(submission_file) as f:
        data = json.load(f)

    # Fill form
    applicant_id = data.get("id", "unknown")
    output_file = output_dir / f"application_{applicant_id}.pdf"

    success = fill_form_safe(template, data, output_file)

    if success:
        print(f"  ✓ Completed: {output_file}")
    else:
        print(f"  ✗ Failed: {submission_file.name}")
```
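
In the batch loop, `applicant_id` comes straight from submitted data and ends up in a file name, so an ID containing path separators could write outside `completed/`. A minimal sketch of sanitizing it first is shown below; `safe_filename_part` is a hypothetical helper, and the allowed character set is an assumption you may want to adjust.

```python
import re


def safe_filename_part(value, default: str = "unknown") -> str:
    """Reduce an arbitrary ID to characters that are safe inside a file name."""
    # Collapse every run of disallowed characters into a single underscore
    cleaned = re.sub(r"[^A-Za-z0-9_-]+", "_", str(value)).strip("_")
    return cleaned or default


print(safe_filename_part("A-123"))
print(safe_filename_part("../../etc/passwd"))
```

The output path would then be built as `output_dir / f"application_{safe_filename_part(applicant_id)}.pdf"`.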

### Example 2: Form with conditional logic

```python
def prepare_form_data(raw_data):
    """Prepare form data with conditional logic."""

    form_data = {}

    # Basic fields
    form_data["full_name"] = raw_data["name"]
    form_data["email"] = raw_data["email"]

    # Conditional fields
    if raw_data.get("is_student"):
        form_data["student_id"] = raw_data["student_id"]
        form_data["school_name"] = raw_data["school"]
    else:
        form_data["employer"] = raw_data.get("employer", "")

    # Checkbox logic
    form_data["newsletter"] = "/Yes" if raw_data.get("opt_in") else "/Off"

    # Calculated fields
    total = sum(raw_data.get("items", []))
    form_data["total_amount"] = f"${total:.2f}"

    return form_data


# Usage
raw_input = {
    "name": "Jane Smith",
    "email": "jane@example.com",
    "is_student": True,
    "student_id": "12345",
    "school": "State University",
    "opt_in": True,
    "items": [10.00, 25.50, 15.75]
}

form_data = prepare_form_data(raw_input)
fill_form_safe("template.pdf", form_data, "output.pdf")
```

## Best practices

1. **Always analyze before filling**: Use `analyze_form.py` to understand structure
2. **Validate early**: Check data before attempting to fill
3. **Use logging**: Track operations for debugging
4. **Handle errors gracefully**: Don't crash on invalid data
5. **Test with samples**: Verify with small datasets first
6. **Flatten when distributing**: Make read-only for recipients
7. **Keep templates versioned**: Track form template changes
8. **Document field mappings**: Maintain data-to-field documentation

## Troubleshooting

### Fields not filling

1. Check field names match exactly (case-sensitive)
2. Verify checkbox/radio values (`/Yes`, `/On`, etc.)
3. Ensure PDF is not encrypted or protected
4. Check if form uses XFA format (not supported by pypdf)
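
For the first check in the list above, name mismatches are easier to spot when the data keys are compared against the PDF's field names automatically. The sketch below uses the standard library's `difflib`; `suggest_field_names` is a hypothetical helper, and the 0.6 similarity cutoff is an arbitrary starting point.

```python
from difflib import get_close_matches


def suggest_field_names(data_keys, pdf_field_names) -> dict:
    """Map each unmatched data key to the closest PDF field names, if any."""
    known = list(pdf_field_names)
    lowered = {name.lower(): name for name in known}
    suggestions = {}
    for key in data_keys:
        if key in known:
            continue  # exact match, nothing to report
        if key.lower() in lowered:
            # Differs only by case
            suggestions[key] = [lowered[key.lower()]]
        else:
            suggestions[key] = get_close_matches(key, known, n=3, cutoff=0.6)
    return suggestions


pdf_fields = ["full_name", "email"]
print(suggest_field_names(["Full_Name", "email", "fullname"], pdf_fields))
```

With pypdf, the second argument could be the keys of `reader.get_fields() or {}`.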

### Encoding issues

```python
# Python strings are Unicode, so special characters can be assigned directly;
# whether they render depends on the field's font containing the glyphs
field_values["name"] = "José García"
```

### Large batch processing

```python
# Process in chunks to avoid memory issues
# (`submissions` is the full list of inputs; `process_batch` is your own handler)
chunk_size = 100

for i in range(0, len(submissions), chunk_size):
    chunk = submissions[i:i + chunk_size]
    process_batch(chunk)
```
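
Slicing requires the whole list in memory. When submissions come from a generator, such as a directory walk or a database cursor, the same idea can be expressed lazily. This is a sketch with an assumed helper name, `chunked`, and requires Python 3.8 or later for the `:=` operator.

```python
from itertools import islice


def chunked(iterable, size):
    """Yield successive lists of at most `size` items from any iterable."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(iterable)
    # islice() pulls up to `size` items; an empty list ends the loop
    while chunk := list(islice(iterator, size)):
        yield chunk


for batch in chunked(range(5), 2):
    print(batch)
```

Each `batch` can be handed to the same per-chunk handler used in the slicing version.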