PDF Form Processing Guide
Complete guide for processing PDF forms in production environments.
Table of contents
- Form analysis and field detection
- Form filling workflows
- Validation strategies
- Field types and handling
- Multi-page forms
- Flattening and finalization
- Error handling patterns
- Production examples
Form analysis
Analyze form structure
Use analyze_form.py to extract complete form information:
python scripts/analyze_form.py application.pdf --output schema.json
Output format:
{
"full_name": {
"type": "text",
"required": true,
"max_length": 100,
"x": 120.5,
"y": 450.2,
"width": 300,
"height": 20
},
"date_of_birth": {
"type": "text",
"required": true,
"format": "MM/DD/YYYY",
"x": 120.5,
"y": 400.8,
"width": 150,
"height": 20
},
"email_newsletter": {
"type": "checkbox",
"required": false,
"x": 120.5,
"y": 350.4,
"width": 15,
"height": 15
},
"preferred_contact": {
"type": "radio",
"required": true,
"options": ["email", "phone", "mail"],
"x": 120.5,
"y": 300.0,
"width": 200,
"height": 60
}
}
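A schema in this shape can be queried before any filling happens. The following is a minimal sketch, assuming the JSON layout shown above; the helper name `summarize_schema` is hypothetical and not part of the included scripts.

```python
from collections import defaultdict


def summarize_schema(schema):
    """Return (required field names, field names grouped by type)."""
    required = [name for name, spec in schema.items() if spec.get("required")]
    by_type = defaultdict(list)
    for name, spec in schema.items():
        by_type[spec.get("type", "unknown")].append(name)
    return required, dict(by_type)


# Trimmed copy of the sample output above (coordinates omitted)
schema = {
    "full_name": {"type": "text", "required": True, "max_length": 100},
    "date_of_birth": {"type": "text", "required": True, "format": "MM/DD/YYYY"},
    "email_newsletter": {"type": "checkbox", "required": False},
    "preferred_contact": {"type": "radio", "required": True,
                          "options": ["email", "phone", "mail"]},
}

required, by_type = summarize_schema(schema)
print("Required:", required)
print("By type:", by_type)
```

In practice the dictionary would come from `json.load()` on schema.json rather than being written inline.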
Programmatic analysis
from pypdf import PdfReader
reader = PdfReader("form.pdf")
# get_fields() returns None when the PDF has no interactive form
fields = reader.get_fields() or {}
for field_name, field_info in fields.items():
    print(f"Field: {field_name}")
    print(f"  Type: {field_info.get('/FT')}")
    print(f"  Value: {field_info.get('/V')}")
    print(f"  Flags: {field_info.get('/Ff', 0)}")
    print()
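The `/Ff` value printed above is a bit field, so the raw integer is hard to read. The sketch below decodes it into flag names using the bit positions from the PDF specification's field-flag tables; the function name `decode_field_flags` is an illustrative assumption, and only a subset of the defined flags is listed.

```python
# Bit n in the PDF specification corresponds to the value 1 << (n - 1).
COMMON_FLAGS = {1 << 0: "ReadOnly", 1 << 1: "Required", 1 << 2: "NoExport"}
TYPE_FLAGS = {
    "/Tx": {1 << 12: "Multiline", 1 << 13: "Password"},
    "/Btn": {1 << 14: "NoToggleToOff", 1 << 15: "Radio", 1 << 16: "Pushbutton"},
    "/Ch": {1 << 17: "Combo", 1 << 18: "Edit", 1 << 21: "MultiSelect"},
}


def decode_field_flags(flags, field_type=None):
    """Translate a /Ff integer into a list of flag names, lowest bit first."""
    table = {**COMMON_FLAGS, **TYPE_FLAGS.get(field_type, {})}
    return [name for bit, name in sorted(table.items()) if flags & bit]


print(decode_field_flags(2, "/Tx"))        # a required text field
print(decode_field_flags(4098, "/Tx"))     # required and multi-line
print(decode_field_flags(32770, "/Btn"))   # a required radio group
```

Passing the field's `/FT` value as `field_type` matters because bits above the first three mean different things for text, button, and choice fields.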
Form filling workflows
Basic workflow
# 1. Analyze form
python scripts/analyze_form.py template.pdf --output schema.json
# 2. Prepare data
cat > data.json << EOF
{
"full_name": "John Doe",
"date_of_birth": "01/15/1990",
"email": "john@example.com",
"email_newsletter": true,
"preferred_contact": "email"
}
EOF
# 3. Validate data
python scripts/validate_form.py data.json schema.json
# 4. Fill form
python scripts/fill_form.py template.pdf data.json filled.pdf
# 5. Flatten (optional - makes fields non-editable)
python scripts/flatten_form.py filled.pdf final.pdf
Programmatic filling
from pypdf import PdfReader, PdfWriter
reader = PdfReader("template.pdf")
writer = PdfWriter()
# Copy all pages together with the form definition (/AcroForm);
# add_page() alone copies the pages but not the form dictionary
writer.append(reader)
# Fill form fields
writer.update_page_form_field_values(
    writer.pages[0],
    {
        "full_name": "John Doe",
        "date_of_birth": "01/15/1990",
        "email": "john@example.com",
        "email_newsletter": "/Yes",  # Checkbox value
        "preferred_contact": "/email"  # Radio value
    }
)
# Save filled form
with open("filled.pdf", "wb") as output:
    writer.write(output)
Field types and handling
Text fields
# Simple text
field_values["customer_name"] = "Jane Smith"
# Formatted text (dates)
field_values["date"] = "12/25/2024"
# Numbers
field_values["amount"] = "1234.56"
# Multi-line text
field_values["comments"] = "Line 1\nLine 2\nLine 3"
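Text fields take strings, so dates and numbers need converting before they are assigned. A minimal sketch of such a converter follows; the name `to_text_value`, the MM/DD/YYYY date layout, and the two-decimal number format are assumptions taken from the examples above, not rules imposed by the PDF format.

```python
from datetime import date
from decimal import Decimal


def to_text_value(value):
    """Convert a Python value to the string a text field expects."""
    if value is None:
        return ""
    if isinstance(value, date):
        # Matches the MM/DD/YYYY layout used elsewhere in this guide
        return value.strftime("%m/%d/%Y")
    if isinstance(value, (float, Decimal)):
        return f"{value:.2f}"
    return str(value)


field_values = {
    "customer_name": to_text_value("Jane Smith"),
    "date": to_text_value(date(2024, 12, 25)),
    "amount": to_text_value(1234.56),
}
print(field_values)
```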
Checkboxes
Checkboxes typically use /Yes for checked, /Off for unchecked:
# Check checkbox
field_values["agree_to_terms"] = "/Yes"
# Uncheck checkbox
field_values["newsletter_opt_out"] = "/Off"
Note: Some PDFs use different values. Check with analyze_form.py:
{
"some_checkbox": {
"type": "checkbox",
"on_value": "/On", # ← Check this
"off_value": "/Off"
}
}
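Because the on and off values vary between PDFs, it can help to translate booleans through the schema instead of hard-coding `/Yes`. This is a small sketch, assuming the schema entry carries `on_value` and `off_value` keys as in the note above; `checkbox_value` is a hypothetical helper name.

```python
def checkbox_value(checked, field_schema):
    """Map a boolean to the export value this particular checkbox uses."""
    on_value = field_schema.get("on_value", "/Yes")   # common default
    off_value = field_schema.get("off_value", "/Off")
    return on_value if checked else off_value


some_checkbox = {"type": "checkbox", "on_value": "/On", "off_value": "/Off"}
print(checkbox_value(True, some_checkbox))
print(checkbox_value(False, some_checkbox))
print(checkbox_value(True, {"type": "checkbox"}))
```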
Radio buttons
Radio buttons are mutually exclusive options:
# Select one option from radio group
field_values["preferred_contact"] = "/email"
# Other options in same group
# field_values["preferred_contact"] = "/phone"
# field_values["preferred_contact"] = "/mail"
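Since only one option may be selected, a guard that rejects values outside the group catches typos before filling. The sketch below assumes the schema's `options` list holds the option names without the leading slash, as in the sample schema; `radio_value` is a hypothetical helper, and a given PDF may use export values that differ from its option labels.

```python
def radio_value(choice, field_schema):
    """Validate a radio choice against the schema and return its export value."""
    options = field_schema.get("options", [])
    if choice not in options:
        raise ValueError(f"{choice!r} is not one of {options}")
    return "/" + choice


preferred_contact = {"type": "radio", "options": ["email", "phone", "mail"]}
print(radio_value("phone", preferred_contact))
```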
Dropdown/List boxes
# Single selection
field_values["country"] = "United States"
# List of available options in schema
"country": {
"type": "dropdown",
"options": ["United States", "Canada", "Mexico", ...]
}
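Dropdown values must match one of the listed options exactly. The following sketch checks a value against the schema's `options` and, on failure, suggests the closest option using the standard library's `difflib`; `check_dropdown` is an illustrative name, not part of the included scripts.

```python
import difflib


def check_dropdown(value, options):
    """Return value if it is a listed option, else raise with a suggestion."""
    if value in options:
        return value
    suggestions = difflib.get_close_matches(value, options, n=1)
    hint = f" (did you mean {suggestions[0]!r}?)" if suggestions else ""
    raise ValueError(f"{value!r} is not a valid option{hint}")


countries = ["United States", "Canada", "Mexico"]
print(check_dropdown("Canada", countries))
try:
    check_dropdown("Untied States", countries)
except ValueError as exc:
    print(exc)
```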
Validation strategies
Schema-based validation
import json
import sys
# Load schema from analyze_form.py output
with open("schema.json") as f:
    schema = json.load(f)
# Load form data
with open("data.json") as f:
    data = json.load(f)
# Validate all fields
errors = []
for field_name, field_schema in schema.items():
    value = data.get(field_name)
    # Check required fields (False is a valid checkbox value, so test for absence)
    if field_schema.get("required") and value in (None, ""):
        errors.append(f"Missing required field: {field_name}")
    # Check field type
    if value and field_schema.get("type") == "text":
        if not isinstance(value, str):
            errors.append(f"Field {field_name} must be string")
    # Check max length
    max_length = field_schema.get("max_length")
    if value and max_length and len(str(value)) > max_length:
        errors.append(f"Field {field_name} exceeds max length {max_length}")
    # Check format (dates, emails, etc.) with validate_format() from "Format validation"
    format_type = field_schema.get("format")
    if value and format_type:
        if not validate_format(value, format_type):
            errors.append(f"Field {field_name} has invalid format")
if errors:
    print("Validation errors:")
    for error in errors:
        print(f"  - {error}")
    sys.exit(1)
print("Validation passed")
Format validation
import re
from datetime import datetime
def validate_format(value, format_type):
    """Validate field format."""
    if format_type == "email":
        pattern = r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$'
        return re.match(pattern, value) is not None
    elif format_type == "phone":
        # US phone: (555) 123-4567 or 555-123-4567
        pattern = r'^\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}$'
        return re.match(pattern, value) is not None
    elif format_type == "MM/DD/YYYY":
        try:
            datetime.strptime(value, "%m/%d/%Y")
            return True
        except ValueError:
            return False
    elif format_type == "SSN":
        # XXX-XX-XXXX
        pattern = r'^\d{3}-\d{2}-\d{4}$'
        return re.match(pattern, value) is not None
    elif format_type == "ZIP":
        # XXXXX or XXXXX-XXXX
        pattern = r'^\d{5}(-\d{4})?$'
        return re.match(pattern, value) is not None
    return True  # Unknown format, skip validation
Multi-page forms
Handling multi-page forms
from pypdf import PdfReader, PdfWriter
reader = PdfReader("multi_page_form.pdf")
writer = PdfWriter()
# Copy all pages together with the form definition (/AcroForm);
# add_page() alone copies the pages but not the form dictionary
writer.append(reader)
# Fill fields on page 1
writer.update_page_form_field_values(
    writer.pages[0],
    {
        "name_page1": "John Doe",
        "email_page1": "john@example.com"
    }
)
# Fill fields on page 2
writer.update_page_form_field_values(
    writer.pages[1],
    {
        "address_page2": "123 Main St",
        "city_page2": "Springfield"
    }
)
# Fill fields on page 3
writer.update_page_form_field_values(
    writer.pages[2],
    {
        "signature_page3": "John Doe",
        "date_page3": "12/25/2024"
    }
)
with open("filled_multi_page.pdf", "wb") as output:
    writer.write(output)
Identifying page-specific fields
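# Analyze which fields are on which pages
# (reuses `reader` from the previous example)
for page_num, page in enumerate(reader.pages, 1):
    annotations = page["/Annots"] if "/Annots" in page else []
    names = []
    for annot_ref in annotations:
        annot = annot_ref.get_object()
        # /Annots also holds links and comments; only widgets are form fields
        if annot.get("/Subtype") != "/Widget":
            continue
        # Radio-button widgets keep their name on the parent field
        parent = annot.get("/Parent")
        name = annot.get("/T") or (parent.get_object().get("/T") if parent else None)
        names.append(name or "Unknown")
    if names:
        print(f"\nPage {page_num} fields:")
        for name in names:
            print(f"  - {name}")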
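Once the field-to-page relationship is known, a flat data dictionary can be split into one dictionary per page, which is the shape the per-page fill calls expect. The sketch below is pure Python and assumes a hypothetical `field_pages` mapping (field name to zero-based page index) built from a loop like the one above.

```python
from collections import defaultdict


def split_values_by_page(values, field_pages):
    """Group field values by page index; also return values with no known page."""
    per_page = defaultdict(dict)
    unplaced = {}
    for name, value in values.items():
        if name in field_pages:
            per_page[field_pages[name]][name] = value
        else:
            unplaced[name] = value
    return dict(per_page), unplaced


# Hypothetical mapping, using the field names from the multi-page example
field_pages = {"name_page1": 0, "email_page1": 0, "address_page2": 1}
values = {
    "name_page1": "John Doe",
    "address_page2": "123 Main St",
    "nickname": "JD",  # not present in the form
}
per_page, unplaced = split_values_by_page(values, field_pages)
print(per_page)
print(unplaced)
```

Anything returned in `unplaced` is worth logging, since it usually points at a misspelled field name.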
Flattening forms
Why flatten
Flattening makes form fields non-editable, embedding values permanently:
- Security: Prevent modifications
- Distribution: Share read-only forms
- Printing: Ensure correct appearance
- Archival: Long-term storage
Flatten with pypdf
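from pypdf import PdfReader, PdfWriter
reader = PdfReader("filled.pdf")
writer = PdfWriter()
# Copy all pages together with the form definition
writer.append(reader)
# Collect the values already stored in the form
values = {
    name: field["/V"]
    for name, field in (reader.get_fields() or {}).items()
    if "/V" in field
}
# Flatten all form fields. Assumption: pypdf does not appear to provide
# PdfWriter.flatten_fields(); recent releases instead accept flatten=True here,
# which draws each field's appearance into the page content.
# Check the API of your installed version.
for page in writer.pages:
    writer.update_page_form_field_values(page, values, auto_regenerate=False, flatten=True)
# Drop the interactive widgets so the values can no longer be edited
writer.remove_annotations(subtypes="/Widget")
# Save flattened PDF
with open("flattened.pdf", "wb") as output:
    writer.write(output)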
Using included script
python scripts/flatten_form.py filled.pdf flattened.pdf
Error handling patterns
Robust form filling
import logging
import sys
from pathlib import Path
from pypdf import PdfReader, PdfWriter
from pypdf.errors import PdfReadError
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)
def fill_form_safe(template_path, data, output_path):
    """Fill form with comprehensive error handling."""
    try:
        # Validate inputs
        template = Path(template_path)
        if not template.exists():
            raise FileNotFoundError(f"Template not found: {template_path}")
        # Read template
        logger.info(f"Reading template: {template_path}")
        reader = PdfReader(template_path)
        if not reader.pages:
            raise ValueError("PDF has no pages")
        # Check if form has fields
        fields = reader.get_fields()
        if not fields:
            logger.warning("PDF has no form fields")
            return False
        # Create writer; append() copies the pages and the form definition
        writer = PdfWriter()
        writer.append(reader)
        # Validate data against the form's own field list
        missing_required = []
        for field_name, field_info in fields.items():
            # Check required fields (bit 2 of /Ff is the Required flag)
            is_required = field_info.get("/Ff", 0) & 2 == 2
            if is_required and field_name not in data:
                missing_required.append(field_name)
        # Report data keys that match no field in the form
        invalid_fields = [name for name in data if name not in fields]
        if invalid_fields:
            logger.warning(f"Ignoring unknown fields: {invalid_fields}")
        if missing_required:
            raise ValueError(f"Missing required fields: {missing_required}")
        # Fill fields on every page, not only the first
        logger.info("Filling form fields")
        for page in writer.pages:
            writer.update_page_form_field_values(page, data)
        # Write output
        logger.info(f"Writing output: {output_path}")
        with open(output_path, "wb") as output:
            writer.write(output)
        logger.info("Form filled successfully")
        return True
    except PdfReadError as e:
        logger.error(f"PDF read error: {e}")
        return False
    except FileNotFoundError as e:
        logger.error(f"File error: {e}")
        return False
    except ValueError as e:
        logger.error(f"Validation error: {e}")
        return False
    except Exception as e:
        logger.error(f"Unexpected error: {e}")
        return False
# Usage
success = fill_form_safe(
    "template.pdf",
    {"name": "John", "email": "john@example.com"},
    "filled.pdf"
)
if not success:
    sys.exit(1)
Production examples
Example 1: Batch form processing
import json
from pathlib import Path
# Assumes fill_form_safe() from the previous section is saved as fill_form_safe.py
from fill_form_safe import fill_form_safe
# Process multiple submissions
submissions_dir = Path("submissions")
template = "application_template.pdf"
output_dir = Path("completed")
output_dir.mkdir(exist_ok=True)
for submission_file in submissions_dir.glob("*.json"):
    print(f"Processing: {submission_file.name}")
    # Load submission data
    with open(submission_file) as f:
        data = json.load(f)
    # Fill form
    applicant_id = data.get("id", "unknown")
    output_file = output_dir / f"application_{applicant_id}.pdf"
    success = fill_form_safe(template, data, output_file)
    if success:
        print(f"  ✓ Completed: {output_file}")
    else:
        print(f"  ✗ Failed: {submission_file.name}")
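For larger batches it is often more useful to collect results than to print them one by one. The sketch below is one possible variant of the loop above: it takes the fill step as a callable so that it can be tried without pypdf, tolerates unreadable submission files, and returns lists of completed and failed items. The name `run_batch` and its signature are assumptions made for illustration.

```python
import json
from pathlib import Path


def run_batch(submission_files, fill, output_dir):
    """Fill one form per submission file; return (completed, failed) lists."""
    completed, failed = [], []
    for path in submission_files:
        try:
            data = json.loads(Path(path).read_text(encoding="utf-8"))
        except (OSError, json.JSONDecodeError) as exc:
            failed.append((str(path), f"unreadable input: {exc}"))
            continue
        if not isinstance(data, dict):
            failed.append((str(path), "submission is not a JSON object"))
            continue
        output_file = Path(output_dir) / f"application_{data.get('id', 'unknown')}.pdf"
        # fill(data, output_file) is expected to return True on success
        if fill(data, output_file):
            completed.append(str(output_file))
        else:
            failed.append((str(path), "fill failed"))
    return completed, failed
```

With the function from the error-handling section, the `fill` argument could be `lambda data, out: fill_form_safe("application_template.pdf", data, out)`.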
Example 2: Form with conditional logic
def prepare_form_data(raw_data):
    """Prepare form data with conditional logic."""
    form_data = {}
    # Basic fields
    form_data["full_name"] = raw_data["name"]
    form_data["email"] = raw_data["email"]
    # Conditional fields
    if raw_data.get("is_student"):
        form_data["student_id"] = raw_data["student_id"]
        form_data["school_name"] = raw_data["school"]
    else:
        form_data["employer"] = raw_data.get("employer", "")
    # Checkbox logic
    form_data["newsletter"] = "/Yes" if raw_data.get("opt_in") else "/Off"
    # Calculated fields
    total = sum(raw_data.get("items", []))
    form_data["total_amount"] = f"${total:.2f}"
    return form_data
# Usage
raw_input = {
    "name": "Jane Smith",
    "email": "jane@example.com",
    "is_student": True,
    "student_id": "12345",
    "school": "State University",
    "opt_in": True,
    "items": [10.00, 25.50, 15.75]
}
form_data = prepare_form_data(raw_input)
fill_form_safe("template.pdf", form_data, "output.pdf")
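The calculated total above sums floats, which is fine for the sample values but can drift by a cent on other inputs. Where the amounts are money, one option is to sum with `decimal.Decimal`; the sketch below assumes half-up rounding to two places, and `format_total` is a hypothetical helper name.

```python
from decimal import Decimal, ROUND_HALF_UP


def format_total(items):
    """Sum monetary amounts exactly and format them as a dollar string."""
    # str() first so that floats such as 0.1 become Decimal("0.1")
    total = sum((Decimal(str(item)) for item in items), Decimal("0"))
    rounded = total.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)
    return f"${rounded}"


print(format_total([10.00, 25.50, 15.75]))
print(format_total([0.1, 0.2]))
```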
Best practices
- Always analyze before filling: Use analyze_form.py to understand structure
- Validate early: Check data before attempting to fill
- Use logging: Track operations for debugging
- Handle errors gracefully: Don't crash on invalid data
- Test with samples: Verify with small datasets first
- Flatten when distributing: Make read-only for recipients
- Keep templates versioned: Track form template changes
- Document field mappings: Maintain data-to-field documentation
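A field mapping kept in code doubles as the documentation the last item asks for. The following is a minimal sketch, assuming a hypothetical `FIELD_MAP` from source-record keys to PDF field names; the specific keys are illustrative and would come from your own data model and analyze_form.py output.

```python
# Source key -> PDF field name (hypothetical example mapping)
FIELD_MAP = {
    "name": "full_name",
    "dob": "date_of_birth",
    "email": "email",
}


def map_to_form_fields(record, field_map):
    """Rename record keys to PDF field names; report keys with no mapping."""
    form_data = {field_map[key]: value for key, value in record.items() if key in field_map}
    unmapped = sorted(key for key in record if key not in field_map)
    return form_data, unmapped


form_data, unmapped = map_to_form_fields(
    {"name": "Jane Smith", "dob": "01/15/1990", "internal_id": 7},
    FIELD_MAP,
)
print(form_data)
print(unmapped)
```

Keeping the mapping in one versioned place means a template change only needs an update there, not in every caller.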
Troubleshooting
Fields not filling
- Check field names match exactly (case-sensitive)
- Verify checkbox/radio values (/Yes, /On, etc.)
- Ensure PDF is not encrypted or protected
- Check if form uses XFA format (not supported by pypdf)
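The first item in the list, mismatched names, can be checked mechanically. This sketch compares the keys in your data with the field names found in the PDF (for example the keys of `reader.get_fields()`) and separates case-only mismatches from names that do not exist at all; `find_name_mismatches` is a hypothetical helper.

```python
def find_name_mismatches(data_keys, pdf_field_names):
    """Map each unmatched data key to its case-insensitive match, or None."""
    by_lower = {name.lower(): name for name in pdf_field_names}
    report = {}
    for key in data_keys:
        if key in pdf_field_names:
            continue  # exact match, nothing to report
        # None means the form has no such field even when ignoring case
        report[key] = by_lower.get(key.lower())
    return report


print(find_name_mismatches(
    ["Full_Name", "email", "phone"],
    ["full_name", "email"],
))
```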
Encoding issues
# Handle special characters
field_values["name"] = "José García" # UTF-8 encoded
Large batch processing
# Process in chunks to avoid memory issues
# (submissions is the full list of inputs; process_batch is your own handler)
chunk_size = 100
for i in range(0, len(submissions), chunk_size):
    chunk = submissions[i:i + chunk_size]
    process_batch(chunk)
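Slicing as above requires the whole list of submissions to be in memory already. When the inputs come from an iterator, such as a directory scan, a generator can produce the chunks lazily. The sketch below uses only the standard library; `chunked` is an illustrative name.

```python
from itertools import islice


def chunked(iterable, size):
    """Yield lists of up to `size` items without materializing the input."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(iterable)
    # islice() pulls at most `size` items; an empty list ends the loop
    while chunk := list(islice(iterator, size)):
        yield chunk


for batch in chunked(range(7), 3):
    print(batch)
```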