zhongwei/gh-henkisdabro-wookstar-claude-code-plugins-documents-toolkit

Fork 0

Files

Zhongwei Li 7822766a14 Initial commit

2025-11-29 18:32:37 +08:00

14 KiB

Raw Permalink Blame History

PDF Form Processing Guide

Complete guide for processing PDF forms in production environments.

Form analysis and field detection
Form filling workflows
Validation strategies
Field types and handling
Multi-page forms
Flattening and finalization
Error handling patterns
Production examples

Form analysis

Analyze form structure

Use analyze_form.py to extract complete form information:

python scripts/analyze_form.py application.pdf --output schema.json

Output format:

{
  "full_name": {
    "type": "text",
    "required": true,
    "max_length": 100,
    "x": 120.5,
    "y": 450.2,
    "width": 300,
    "height": 20
  },
  "date_of_birth": {
    "type": "text",
    "required": true,
    "format": "MM/DD/YYYY",
    "x": 120.5,
    "y": 400.8,
    "width": 150,
    "height": 20
  },
  "email_newsletter": {
    "type": "checkbox",
    "required": false,
    "x": 120.5,
    "y": 350.4,
    "width": 15,
    "height": 15
  },
  "preferred_contact": {
    "type": "radio",
    "required": true,
    "options": ["email", "phone", "mail"],
    "x": 120.5,
    "y": 300.0,
    "width": 200,
    "height": 60
  }
}

Programmatic analysis

from pypdf import PdfReader

reader = PdfReader("form.pdf")
fields = reader.get_fields()

for field_name, field_info in fields.items():
    print(f"Field: {field_name}")
    print(f"  Type: {field_info.get('/FT')}")
    print(f"  Value: {field_info.get('/V')}")
    print(f"  Flags: {field_info.get('/Ff', 0)}")
    print()

Form filling workflows

Basic workflow

# 1. Analyze form
python scripts/analyze_form.py template.pdf --output schema.json

# 2. Prepare data
cat > data.json << EOF
{
  "full_name": "John Doe",
  "date_of_birth": "01/15/1990",
  "email": "john@example.com",
  "email_newsletter": true,
  "preferred_contact": "email"
}
EOF

# 3. Validate data
python scripts/validate_form.py data.json schema.json

# 4. Fill form
python scripts/fill_form.py template.pdf data.json filled.pdf

# 5. Flatten (optional - makes fields non-editable)
python scripts/flatten_form.py filled.pdf final.pdf

Programmatic filling

from pypdf import PdfReader, PdfWriter

reader = PdfReader("template.pdf")
writer = PdfWriter()

# Clone all pages
for page in reader.pages:
    writer.add_page(page)

# Fill form fields
writer.update_page_form_field_values(
    writer.pages[0],
    {
        "full_name": "John Doe",
        "date_of_birth": "01/15/1990",
        "email": "john@example.com",
        "email_newsletter": "/Yes",  # Checkbox value
        "preferred_contact": "/email"  # Radio value
    }
)

# Save filled form
with open("filled.pdf", "wb") as output:
    writer.write(output)

Field types and handling

Text fields

# Simple text
field_values["customer_name"] = "Jane Smith"

# Formatted text (dates)
field_values["date"] = "12/25/2024"

# Numbers
field_values["amount"] = "1234.56"

# Multi-line text
field_values["comments"] = "Line 1\nLine 2\nLine 3"

Checkboxes

Checkboxes typically use /Yes for checked, /Off for unchecked:

# Check checkbox
field_values["agree_to_terms"] = "/Yes"

# Uncheck checkbox
field_values["newsletter_opt_out"] = "/Off"

Note: Some PDFs use different values. Check with analyze_form.py:

{
  "some_checkbox": {
    "type": "checkbox",
    "on_value": "/On",   # ← Check this
    "off_value": "/Off"
  }
}

Radio buttons

Radio buttons are mutually exclusive options:

# Select one option from radio group
field_values["preferred_contact"] = "/email"

# Other options in same group
# field_values["preferred_contact"] = "/phone"
# field_values["preferred_contact"] = "/mail"

Dropdown/List boxes

# Single selection
field_values["country"] = "United States"

# List of available options in schema
"country": {
  "type": "dropdown",
  "options": ["United States", "Canada", "Mexico", ...]
}

Validation strategies

Schema-based validation

import json
from jsonschema import validate, ValidationError

# Load schema from analyze_form.py output
with open("schema.json") as f:
    schema = json.load(f)

# Load form data
with open("data.json") as f:
    data = json.load(f)

# Validate all fields
errors = []

for field_name, field_schema in schema.items():
    value = data.get(field_name)

    # Check required fields
    if field_schema.get("required") and not value:
        errors.append(f"Missing required field: {field_name}")

    # Check field type
    if value and field_schema.get("type") == "text":
        if not isinstance(value, str):
            errors.append(f"Field {field_name} must be string")

    # Check max length
    max_length = field_schema.get("max_length")
    if value and max_length and len(str(value)) > max_length:
        errors.append(f"Field {field_name} exceeds max length {max_length}")

    # Check format (dates, emails, etc)
    format_type = field_schema.get("format")
    if value and format_type:
        if not validate_format(value, format_type):
            errors.append(f"Field {field_name} has invalid format")

if errors:
    print("Validation errors:")
    for error in errors:
        print(f"  - {error}")
    exit(1)

print("Validation passed")

Format validation

import re
from datetime import datetime

def validate_format(value, format_type):
    """Validate field format."""

    if format_type == "email":
        pattern = r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$'
        return re.match(pattern, value) is not None

    elif format_type == "phone":
        # US phone: (555) 123-4567 or 555-123-4567
        pattern = r'^\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}$'
        return re.match(pattern, value) is not None

    elif format_type == "MM/DD/YYYY":
        try:
            datetime.strptime(value, "%m/%d/%Y")
            return True
        except ValueError:
            return False

    elif format_type == "SSN":
        # XXX-XX-XXXX
        pattern = r'^\d{3}-\d{2}-\d{4}$'
        return re.match(pattern, value) is not None

    elif format_type == "ZIP":
        # XXXXX or XXXXX-XXXX
        pattern = r'^\d{5}(-\d{4})?$'
        return re.match(pattern, value) is not None

    return True  # Unknown format, skip validation

Multi-page forms

Handling multi-page forms

from pypdf import PdfReader, PdfWriter

reader = PdfReader("multi_page_form.pdf")
writer = PdfWriter()

# Clone all pages
for page in reader.pages:
    writer.add_page(page)

# Fill fields on page 1
writer.update_page_form_field_values(
    writer.pages[0],
    {
        "name_page1": "John Doe",
        "email_page1": "john@example.com"
    }
)

# Fill fields on page 2
writer.update_page_form_field_values(
    writer.pages[1],
    {
        "address_page2": "123 Main St",
        "city_page2": "Springfield"
    }
)

# Fill fields on page 3
writer.update_page_form_field_values(
    writer.pages[2],
    {
        "signature_page3": "John Doe",
        "date_page3": "12/25/2024"
    }
)

with open("filled_multi_page.pdf", "wb") as output:
    writer.write(output)

Identifying page-specific fields

# Analyze which fields are on which pages
for page_num, page in enumerate(reader.pages, 1):
    fields = page.get("/Annots", [])

    if fields:
        print(f"\nPage {page_num} fields:")
        for field_ref in fields:
            field = field_ref.get_object()
            field_name = field.get("/T", "Unknown")
            print(f"  - {field_name}")

Flattening forms

Why flatten

Flattening makes form fields non-editable, embedding values permanently:

Security: Prevent modifications
Distribution: Share read-only forms
Printing: Ensure correct appearance
Archival: Long-term storage

Flatten with pypdf

from pypdf import PdfReader, PdfWriter

reader = PdfReader("filled.pdf")
writer = PdfWriter()

# Add all pages
for page in reader.pages:
    writer.add_page(page)

# Flatten all form fields
writer.flatten_fields()

# Save flattened PDF
with open("flattened.pdf", "wb") as output:
    writer.write(output)

Using included script

python scripts/flatten_form.py filled.pdf flattened.pdf

Error handling patterns

Robust form filling

import logging
from pathlib import Path
from pypdf import PdfReader, PdfWriter
from pypdf.errors import PdfReadError

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

def fill_form_safe(template_path, data, output_path):
    """Fill form with comprehensive error handling."""

    try:
        # Validate inputs
        template = Path(template_path)
        if not template.exists():
            raise FileNotFoundError(f"Template not found: {template_path}")

        # Read template
        logger.info(f"Reading template: {template_path}")
        reader = PdfReader(template_path)

        if not reader.pages:
            raise ValueError("PDF has no pages")

        # Check if form has fields
        fields = reader.get_fields()
        if not fields:
            logger.warning("PDF has no form fields")
            return False

        # Create writer
        writer = PdfWriter()
        for page in reader.pages:
            writer.add_page(page)

        # Validate data against schema
        missing_required = []
        invalid_fields = []

        for field_name, field_info in fields.items():
            # Check required fields
            is_required = field_info.get("/Ff", 0) & 2 == 2
            if is_required and field_name not in data:
                missing_required.append(field_name)

            # Check invalid field names in data
            if field_name in data:
                value = data[field_name]
                # Add type validation here if needed

        if missing_required:
            raise ValueError(f"Missing required fields: {missing_required}")

        # Fill fields
        logger.info("Filling form fields")
        writer.update_page_form_field_values(
            writer.pages[0],
            data
        )

        # Write output
        logger.info(f"Writing output: {output_path}")
        with open(output_path, "wb") as output:
            writer.write(output)

        logger.info("Form filled successfully")
        return True

    except PdfReadError as e:
        logger.error(f"PDF read error: {e}")
        return False

    except FileNotFoundError as e:
        logger.error(f"File error: {e}")
        return False

    except ValueError as e:
        logger.error(f"Validation error: {e}")
        return False

    except Exception as e:
        logger.error(f"Unexpected error: {e}")
        return False

# Usage
success = fill_form_safe(
    "template.pdf",
    {"name": "John", "email": "john@example.com"},
    "filled.pdf"
)

if not success:
    exit(1)

Production examples

Example 1: Batch form processing

import json
import glob
from pathlib import Path
from fill_form_safe import fill_form_safe

# Process multiple submissions
submissions_dir = Path("submissions")
template = "application_template.pdf"
output_dir = Path("completed")
output_dir.mkdir(exist_ok=True)

for submission_file in submissions_dir.glob("*.json"):
    print(f"Processing: {submission_file.name}")

    # Load submission data
    with open(submission_file) as f:
        data = json.load(f)

    # Fill form
    applicant_id = data.get("id", "unknown")
    output_file = output_dir / f"application_{applicant_id}.pdf"

    success = fill_form_safe(template, data, output_file)

    if success:
        print(f"  ✓ Completed: {output_file}")
    else:
        print(f"  ✗ Failed: {submission_file.name}")

Example 2: Form with conditional logic

def prepare_form_data(raw_data):
    """Prepare form data with conditional logic."""

    form_data = {}

    # Basic fields
    form_data["full_name"] = raw_data["name"]
    form_data["email"] = raw_data["email"]

    # Conditional fields
    if raw_data.get("is_student"):
        form_data["student_id"] = raw_data["student_id"]
        form_data["school_name"] = raw_data["school"]
    else:
        form_data["employer"] = raw_data.get("employer", "")

    # Checkbox logic
    form_data["newsletter"] = "/Yes" if raw_data.get("opt_in") else "/Off"

    # Calculated fields
    total = sum(raw_data.get("items", []))
    form_data["total_amount"] = f"${total:.2f}"

    return form_data

# Usage
raw_input = {
    "name": "Jane Smith",
    "email": "jane@example.com",
    "is_student": True,
    "student_id": "12345",
    "school": "State University",
    "opt_in": True,
    "items": [10.00, 25.50, 15.75]
}

form_data = prepare_form_data(raw_input)
fill_form_safe("template.pdf", form_data, "output.pdf")

Best practices

Always analyze before filling: Use analyze_form.py to understand structure
Validate early: Check data before attempting to fill
Use logging: Track operations for debugging
Handle errors gracefully: Don't crash on invalid data
Test with samples: Verify with small datasets first
Flatten when distributing: Make read-only for recipients
Keep templates versioned: Track form template changes
Document field mappings: Maintain data-to-field documentation

Troubleshooting

Fields not filling

Check field names match exactly (case-sensitive)
Verify checkbox/radio values (/Yes, /On, etc.)
Ensure PDF is not encrypted or protected
Check if form uses XFA format (not supported by pypdf)

Encoding issues

# Handle special characters
field_values["name"] = "José García"  # UTF-8 encoded

Large batch processing

# Process in chunks to avoid memory issues
chunk_size = 100

for i in range(0, len(submissions), chunk_size):
    chunk = submissions[i:i + chunk_size]
    process_batch(chunk)

14 KiB Raw Permalink Blame History