Initial commit

2025-11-29 18:32:37 +08:00
commit 7822766a14
70 changed files with 27471 additions and 0 deletions
--- a/skills/pdf-processing-pro/FORMS.md
+++ b/skills/pdf-processing-pro/FORMS.md
@@ -0,0 +1,610 @@
+# PDF Form Processing Guide
+
+Complete guide for processing PDF forms in production environments.
+
+## Table of contents
+
+- Form analysis and field detection
+- Form filling workflows
+- Validation strategies
+- Field types and handling
+- Multi-page forms
+- Flattening and finalization
+- Error handling patterns
+- Production examples
+
+## Form analysis
+
+### Analyze form structure
+
+Use `analyze_form.py` to extract complete form information:
+
+```bash
+python scripts/analyze_form.py application.pdf --output schema.json
+```
+
+Output format:
+
+```json
+{
+  "full_name": {
+    "type": "text",
+    "required": true,
+    "max_length": 100,
+    "x": 120.5,
+    "y": 450.2,
+    "width": 300,
+    "height": 20
+  },
+  "date_of_birth": {
+    "type": "text",
+    "required": true,
+    "format": "MM/DD/YYYY",
+    "x": 120.5,
+    "y": 400.8,
+    "width": 150,
+    "height": 20
+  },
+  "email_newsletter": {
+    "type": "checkbox",
+    "required": false,
+    "x": 120.5,
+    "y": 350.4,
+    "width": 15,
+    "height": 15
+  },
+  "preferred_contact": {
+    "type": "radio",
+    "required": true,
+    "options": ["email", "phone", "mail"],
+    "x": 120.5,
+    "y": 300.0,
+    "width": 200,
+    "height": 60
+  }
+}
+```
+
+### Programmatic analysis
+
+```python
+from pypdf import PdfReader
+
+reader = PdfReader("form.pdf")
+fields = reader.get_fields()
+
+for field_name, field_info in fields.items():
+    print(f"Field: {field_name}")
+    print(f"  Type: {field_info.get('/FT')}")
+    print(f"  Value: {field_info.get('/V')}")
+    print(f"  Flags: {field_info.get('/Ff', 0)}")
+    print()
+```
+
+## Form filling workflows
+
+### Basic workflow
+
+```bash
+# 1. Analyze form
+python scripts/analyze_form.py template.pdf --output schema.json
+
+# 2. Prepare data
+cat > data.json << EOF
+{
+  "full_name": "John Doe",
+  "date_of_birth": "01/15/1990",
+  "email": "john@example.com",
+  "email_newsletter": true,
+  "preferred_contact": "email"
+}
+EOF
+
+# 3. Validate data
+python scripts/validate_form.py data.json schema.json
+
+# 4. Fill form
+python scripts/fill_form.py template.pdf data.json filled.pdf
+
+# 5. Flatten (optional - makes fields non-editable)
+python scripts/flatten_form.py filled.pdf final.pdf
+```
+
+### Programmatic filling
+
+```python
+from pypdf import PdfReader, PdfWriter
+
+reader = PdfReader("template.pdf")
+writer = PdfWriter()
+
+# Clone all pages
+for page in reader.pages:
+    writer.add_page(page)
+
+# Fill form fields
+writer.update_page_form_field_values(
+    writer.pages[0],
+    {
+        "full_name": "John Doe",
+        "date_of_birth": "01/15/1990",
+        "email": "john@example.com",
+        "email_newsletter": "/Yes",  # Checkbox value
+        "preferred_contact": "/email"  # Radio value
+    }
+)
+
+# Save filled form
+with open("filled.pdf", "wb") as output:
+    writer.write(output)
+```
+
+## Field types and handling
+
+### Text fields
+
+```python
+# Simple text
+field_values["customer_name"] = "Jane Smith"
+
+# Formatted text (dates)
+field_values["date"] = "12/25/2024"
+
+# Numbers
+field_values["amount"] = "1234.56"
+
+# Multi-line text
+field_values["comments"] = "Line 1\nLine 2\nLine 3"
+```
+
+### Checkboxes
+
+Checkboxes typically use `/Yes` for checked, `/Off` for unchecked:
+
+```python
+# Check checkbox
+field_values["agree_to_terms"] = "/Yes"
+
+# Uncheck checkbox
+field_values["newsletter_opt_out"] = "/Off"
+```
+
+**Note**: Some PDFs use different values. Check with `analyze_form.py`:
+
+```json
+{
+  "some_checkbox": {
+    "type": "checkbox",
+    "on_value": "/On",   # ← Check this
+    "off_value": "/Off"
+  }
+}
+```
+
+### Radio buttons
+
+Radio buttons are mutually exclusive options:
+
+```python
+# Select one option from radio group
+field_values["preferred_contact"] = "/email"
+
+# Other options in same group
+# field_values["preferred_contact"] = "/phone"
+# field_values["preferred_contact"] = "/mail"
+```
+
+### Dropdown/List boxes
+
+```python
+# Single selection
+field_values["country"] = "United States"
+
+# List of available options in schema
+"country": {
+  "type": "dropdown",
+  "options": ["United States", "Canada", "Mexico", ...]
+}
+```
+
+## Validation strategies
+
+### Schema-based validation
+
+```python
+import json
+from jsonschema import validate, ValidationError
+
+# Load schema from analyze_form.py output
+with open("schema.json") as f:
+    schema = json.load(f)
+
+# Load form data
+with open("data.json") as f:
+    data = json.load(f)
+
+# Validate all fields
+errors = []
+
+for field_name, field_schema in schema.items():
+    value = data.get(field_name)
+
+    # Check required fields
+    if field_schema.get("required") and not value:
+        errors.append(f"Missing required field: {field_name}")
+
+    # Check field type
+    if value and field_schema.get("type") == "text":
+        if not isinstance(value, str):
+            errors.append(f"Field {field_name} must be string")
+
+    # Check max length
+    max_length = field_schema.get("max_length")
+    if value and max_length and len(str(value)) > max_length:
+        errors.append(f"Field {field_name} exceeds max length {max_length}")
+
+    # Check format (dates, emails, etc)
+    format_type = field_schema.get("format")
+    if value and format_type:
+        if not validate_format(value, format_type):
+            errors.append(f"Field {field_name} has invalid format")
+
+if errors:
+    print("Validation errors:")
+    for error in errors:
+        print(f"  - {error}")
+    exit(1)
+
+print("Validation passed")
+```
+
+### Format validation
+
+```python
+import re
+from datetime import datetime
+
+def validate_format(value, format_type):
+    """Validate field format."""
+
+    if format_type == "email":
+        pattern = r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$'
+        return re.match(pattern, value) is not None
+
+    elif format_type == "phone":
+        # US phone: (555) 123-4567 or 555-123-4567
+        pattern = r'^\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}$'
+        return re.match(pattern, value) is not None
+
+    elif format_type == "MM/DD/YYYY":
+        try:
+            datetime.strptime(value, "%m/%d/%Y")
+            return True
+        except ValueError:
+            return False
+
+    elif format_type == "SSN":
+        # XXX-XX-XXXX
+        pattern = r'^\d{3}-\d{2}-\d{4}$'
+        return re.match(pattern, value) is not None
+
+    elif format_type == "ZIP":
+        # XXXXX or XXXXX-XXXX
+        pattern = r'^\d{5}(-\d{4})?$'
+        return re.match(pattern, value) is not None
+
+    return True  # Unknown format, skip validation
+```
+
+## Multi-page forms
+
+### Handling multi-page forms
+
+```python
+from pypdf import PdfReader, PdfWriter
+
+reader = PdfReader("multi_page_form.pdf")
+writer = PdfWriter()
+
+# Clone all pages
+for page in reader.pages:
+    writer.add_page(page)
+
+# Fill fields on page 1
+writer.update_page_form_field_values(
+    writer.pages[0],
+    {
+        "name_page1": "John Doe",
+        "email_page1": "john@example.com"
+    }
+)
+
+# Fill fields on page 2
+writer.update_page_form_field_values(
+    writer.pages[1],
+    {
+        "address_page2": "123 Main St",
+        "city_page2": "Springfield"
+    }
+)
+
+# Fill fields on page 3
+writer.update_page_form_field_values(
+    writer.pages[2],
+    {
+        "signature_page3": "John Doe",
+        "date_page3": "12/25/2024"
+    }
+)
+
+with open("filled_multi_page.pdf", "wb") as output:
+    writer.write(output)
+```
+
+### Identifying page-specific fields
+
+```python
+# Analyze which fields are on which pages
+for page_num, page in enumerate(reader.pages, 1):
+    fields = page.get("/Annots", [])
+
+    if fields:
+        print(f"\nPage {page_num} fields:")
+        for field_ref in fields:
+            field = field_ref.get_object()
+            field_name = field.get("/T", "Unknown")
+            print(f"  - {field_name}")
+```
+
+## Flattening forms
+
+### Why flatten
+
+Flattening makes form fields non-editable, embedding values permanently:
+
+- **Security**: Prevent modifications
+- **Distribution**: Share read-only forms
+- **Printing**: Ensure correct appearance
+- **Archival**: Long-term storage
+
+### Flatten with pypdf
+
+```python
+from pypdf import PdfReader, PdfWriter
+
+reader = PdfReader("filled.pdf")
+writer = PdfWriter()
+
+# Add all pages
+for page in reader.pages:
+    writer.add_page(page)
+
+# Flatten all form fields
+writer.flatten_fields()
+
+# Save flattened PDF
+with open("flattened.pdf", "wb") as output:
+    writer.write(output)
+```
+
+### Using included script
+
+```bash
+python scripts/flatten_form.py filled.pdf flattened.pdf
+```
+
+## Error handling patterns
+
+### Robust form filling
+
+```python
+import logging
+from pathlib import Path
+from pypdf import PdfReader, PdfWriter
+from pypdf.errors import PdfReadError
+
+logging.basicConfig(level=logging.INFO)
+logger = logging.getLogger(__name__)
+
+def fill_form_safe(template_path, data, output_path):
+    """Fill form with comprehensive error handling."""
+
+    try:
+        # Validate inputs
+        template = Path(template_path)
+        if not template.exists():
+            raise FileNotFoundError(f"Template not found: {template_path}")
+
+        # Read template
+        logger.info(f"Reading template: {template_path}")
+        reader = PdfReader(template_path)
+
+        if not reader.pages:
+            raise ValueError("PDF has no pages")
+
+        # Check if form has fields
+        fields = reader.get_fields()
+        if not fields:
+            logger.warning("PDF has no form fields")
+            return False
+
+        # Create writer
+        writer = PdfWriter()
+        for page in reader.pages:
+            writer.add_page(page)
+
+        # Validate data against schema
+        missing_required = []
+        invalid_fields = []
+
+        for field_name, field_info in fields.items():
+            # Check required fields
+            is_required = field_info.get("/Ff", 0) & 2 == 2
+            if is_required and field_name not in data:
+                missing_required.append(field_name)
+
+            # Check invalid field names in data
+            if field_name in data:
+                value = data[field_name]
+                # Add type validation here if needed
+
+        if missing_required:
+            raise ValueError(f"Missing required fields: {missing_required}")
+
+        # Fill fields
+        logger.info("Filling form fields")
+        writer.update_page_form_field_values(
+            writer.pages[0],
+            data
+        )
+
+        # Write output
+        logger.info(f"Writing output: {output_path}")
+        with open(output_path, "wb") as output:
+            writer.write(output)
+
+        logger.info("Form filled successfully")
+        return True
+
+    except PdfReadError as e:
+        logger.error(f"PDF read error: {e}")
+        return False
+
+    except FileNotFoundError as e:
+        logger.error(f"File error: {e}")
+        return False
+
+    except ValueError as e:
+        logger.error(f"Validation error: {e}")
+        return False
+
+    except Exception as e:
+        logger.error(f"Unexpected error: {e}")
+        return False
+
+# Usage
+success = fill_form_safe(
+    "template.pdf",
+    {"name": "John", "email": "john@example.com"},
+    "filled.pdf"
+)
+
+if not success:
+    exit(1)
+```
+
+## Production examples
+
+### Example 1: Batch form processing
+
+```python
+import json
+import glob
+from pathlib import Path
+from fill_form_safe import fill_form_safe
+
+# Process multiple submissions
+submissions_dir = Path("submissions")
+template = "application_template.pdf"
+output_dir = Path("completed")
+output_dir.mkdir(exist_ok=True)
+
+for submission_file in submissions_dir.glob("*.json"):
+    print(f"Processing: {submission_file.name}")
+
+    # Load submission data
+    with open(submission_file) as f:
+        data = json.load(f)
+
+    # Fill form
+    applicant_id = data.get("id", "unknown")
+    output_file = output_dir / f"application_{applicant_id}.pdf"
+
+    success = fill_form_safe(template, data, output_file)
+
+    if success:
+        print(f"  ✓ Completed: {output_file}")
+    else:
+        print(f"  ✗ Failed: {submission_file.name}")
+```
+
+### Example 2: Form with conditional logic
+
+```python
+def prepare_form_data(raw_data):
+    """Prepare form data with conditional logic."""
+
+    form_data = {}
+
+    # Basic fields
+    form_data["full_name"] = raw_data["name"]
+    form_data["email"] = raw_data["email"]
+
+    # Conditional fields
+    if raw_data.get("is_student"):
+        form_data["student_id"] = raw_data["student_id"]
+        form_data["school_name"] = raw_data["school"]
+    else:
+        form_data["employer"] = raw_data.get("employer", "")
+
+    # Checkbox logic
+    form_data["newsletter"] = "/Yes" if raw_data.get("opt_in") else "/Off"
+
+    # Calculated fields
+    total = sum(raw_data.get("items", []))
+    form_data["total_amount"] = f"${total:.2f}"
+
+    return form_data
+
+# Usage
+raw_input = {
+    "name": "Jane Smith",
+    "email": "jane@example.com",
+    "is_student": True,
+    "student_id": "12345",
+    "school": "State University",
+    "opt_in": True,
+    "items": [10.00, 25.50, 15.75]
+}
+
+form_data = prepare_form_data(raw_input)
+fill_form_safe("template.pdf", form_data, "output.pdf")
+```
+
+## Best practices
+
+1. **Always analyze before filling**: Use `analyze_form.py` to understand structure
+2. **Validate early**: Check data before attempting to fill
+3. **Use logging**: Track operations for debugging
+4. **Handle errors gracefully**: Don't crash on invalid data
+5. **Test with samples**: Verify with small datasets first
+6. **Flatten when distributing**: Make read-only for recipients
+7. **Keep templates versioned**: Track form template changes
+8. **Document field mappings**: Maintain data-to-field documentation
+
+## Troubleshooting
+
+### Fields not filling
+
+1. Check field names match exactly (case-sensitive)
+2. Verify checkbox/radio values (`/Yes`, `/On`, etc.)
+3. Ensure PDF is not encrypted or protected
+4. Check if form uses XFA format (not supported by pypdf)
+
+### Encoding issues
+
+```python
+# Handle special characters
+field_values["name"] = "José García"  # UTF-8 encoded
+```
+
+### Large batch processing
+
+```python
+# Process in chunks to avoid memory issues
+chunk_size = 100
+
+for i in range(0, len(submissions), chunk_size):
+    chunk = submissions[i:i + chunk_size]
+    process_batch(chunk)
+```