2025-11-30 08:25:58 +08:00

Non-fillable PDF Forms - Extraction Guide

You'll need to visually determine where the data should be added as text annotations. Follow the steps below exactly. You MUST perform all of these steps to ensure that the form is accurately completed. Details for each step follow.

  • Convert the PDF to PNG images and determine field bounding boxes.
  • Create a JSON file with field information and validation images showing the bounding boxes.
  • Validate the bounding boxes.
  • Convert the validated image coordinates to PDF coordinates.

Process Overview

@startuml Nonfillable-Forms
title Non-fillable PDF Forms - Extraction Workflow
start
:Convert PDF to PNG images;
:Visual analysis & determine bounding boxes
in IMAGE coordinates;
:Create .scan.json;
repeat
  :Automated intersection check
  on image coordinates;
  if (Automated check passes?) then (yes)
    :Create validation images
    (overlay on PNGs);
    :Manual image inspection;
    if (Manual check passes?) then (yes)
    else (no)
      :Fix bounding boxes in .scan.json;
    endif
  else (no)
    :Fix bounding boxes in .scan.json;
  endif
repeat while (Both checks pass?) is (no)
->yes;
:Convert coordinates
(.scan.json → .form.json);
:**✓ NON-FILLABLE EXTRACTION COMPLETE**
.form.json ready with PDF coordinates;
stop
@enduml

Extraction Process

Step 1: Visual Analysis (REQUIRED)

  • Convert the PDF to PNG images. Run this script from this skill's directory:
python scripts/convert_pdf_to_images.py <basename>.pdf <basename>.chatfield/

The script will create a PNG image for each page.

  • Read and analyze the .form.md file, which is a Markdown text preview of the PDF content.
  • Carefully examine each PNG image and identify all form fields and areas where the user should enter data. For each form field where the user should enter information, determine bounding boxes, in the image coordinate system, for both the field label and the input entry area. The label and entry bounding boxes MUST NOT INTERSECT; the text entry box should only include the area where data should be entered. Usually this area will be immediately to the side, above, or below its label. Entry bounding boxes must be tall and wide enough to contain their text.

These are some examples of form structures that you might see (in English, but the form can be any language):

Label inside box

┌────────────────────────┐
│ Name:                  │
└────────────────────────┘

The input area should be to the right of the "Name" label and extend to the edge of the box.
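
As a rough illustration of the "label inside box" rule, the sketch below derives an entry rectangle from a label rectangle and its enclosing box. The function name `entry_right_of_label` and the `gap` parameter are hypothetical; they are not part of this skill's scripts.

```python
def entry_right_of_label(label_rect, box_rect, gap=5):
    """Return an entry rect that starts just right of the label and ends at the box edge.

    Both rects are [x1, y1, x2, y2] in image coordinates (top-left origin).
    `gap` is an assumed pixel margin that keeps the two rects from touching.
    """
    _, _, label_x2, _ = label_rect
    _, box_y1, box_x2, box_y2 = box_rect
    entry_x1 = label_x2 + gap
    if entry_x1 >= box_x2:
        raise ValueError("no room for an entry area to the right of the label")
    return [entry_x1, box_y1, box_x2, box_y2]


# Illustrative values: a label ending at x=175 inside a box ending at x=550.
print(entry_right_of_label([50, 200, 175, 220], [45, 200, 550, 220]))
# prints [180, 200, 550, 220]
```

The gap keeps the entry box clear of the label so the two rectangles cannot intersect.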

Label before line

Email: _______________________

The input area should sit directly above the line, to the right of the label, and span the line's entire width.

Label under line

_________________________
Name

The input area should be above the line and include the entire width of the line. This is common for signature and date fields.

Label above line

Please enter any special requests:
________________________________________________

The input area should extend from the bottom of the label to the line, and should include the entire width of the line.

Checkboxes

Are you a US citizen? Yes □  No □

For checkboxes:

  • Look for small square boxes (□) - these are the actual checkboxes to target. They may be to the left or right of their labels.
  • Distinguish between label text ("Yes", "No") and the clickable checkbox squares.
  • The entry bounding box should cover ONLY the small square, not the text label.

Step 2: Create .scan.json

Create <basename>.chatfield/<basename>.scan.json formatted like the example below. Rectangle values are IMAGE coordinates (what you see directly in the PNG, top-left origin).

[
  {
    "field_id": "full_name",
    "type": "text",
    "page": 1,
    "rect": [180, 200, 550, 220],
    "label_text": "Full Name:",
    "label_rect": [50, 200, 175, 220]
  },
  {
    "field_id": "is_citizen",
    "type": "checkbox",
    "page": 1,
    "rect": [60, 320, 75, 335],
    "label_text": "US Citizen",
    "label_rect": [80, 320, 150, 335],
    "checked_value": "X",
    "unchecked_value": ""
  }
]

Field structure:

  • field_id - Unique identifier (will be used in chatfield definition)
    • CRITICAL: Every field MUST have a unique field_id with no collisions
    • Field IDs are internal identifiers, not user-facing
  • type - "text" or "checkbox"
  • page - Page number (1-indexed)
  • rect - Entry area bounding box [x1, y1, x2, y2] where data will be written
  • label_text - Optional label text for this field
  • label_rect - Optional label bounding box [x1, y1, x2, y2]
  • For checkboxes only:
    • checked_value - String to write when checked (typically "X" or "✓")
    • unchecked_value - String to write when unchecked (typically "")
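
The field rules above can be checked mechanically before running the skill's own scripts. The following is a minimal sketch, not the logic of `check_bounding_boxes.py`: the function name `check_field_structure` is hypothetical, and treating `checked_value` and `unchecked_value` as required for checkboxes is an assumption.

```python
import json

ALLOWED_TYPES = {"text", "checkbox"}


def check_field_structure(fields):
    """Return a list of human-readable problems found in a list of field dicts."""
    problems = []
    seen_ids = set()
    for index, field in enumerate(fields):
        field_id = field.get("field_id")
        name = field_id or f"<entry {index}>"
        if not field_id:
            problems.append(f"{name}: missing field_id")
        elif field_id in seen_ids:
            problems.append(f"{name}: duplicate field_id")
        seen_ids.add(field_id)
        if field.get("type") not in ALLOWED_TYPES:
            problems.append(f"{name}: type must be 'text' or 'checkbox'")
        page = field.get("page")
        if not isinstance(page, int) or page < 1:
            problems.append(f"{name}: page must be a 1-indexed integer")
        rect = field.get("rect")
        if not (isinstance(rect, list) and len(rect) == 4):
            problems.append(f"{name}: rect must be [x1, y1, x2, y2]")
        if field.get("type") == "checkbox":
            # Assumption: both checkbox strings must be present.
            for key in ("checked_value", "unchecked_value"):
                if key not in field:
                    problems.append(f"{name}: checkbox is missing {key}")
    return problems


sample = json.loads(
    '[{"field_id": "full_name", "type": "text", "page": 1, "rect": [180, 200, 550, 220]}]'
)
print(check_field_structure(sample))  # prints []
```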

Bounding box coordinates (IMAGE COORDINATES):

  • Image coordinate system: Origin (0,0) at top-left
  • X increases to the right, Y increases downward
  • Format: [x1, y1, x2, y2] where (x1,y1) is top-left corner, (x2,y2) is bottom-right corner
  • These are the pixel coordinates you see directly in the PNG image
  • Entry boxes (rect) must be tall and wide enough to contain text
  • Label boxes (label_rect) should contain the label text
  • Entry and label boxes MUST NOT overlap
  • Checkbox entry boxes should be roughly 10-20 pixels square, and never smaller than 10 pixels on a side
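
The coordinate rules above translate into a few arithmetic checks. This is a hedged sketch with hypothetical names (`rect_problems`, `MIN_CHECKBOX_SIDE`); the 10-pixel minimum takes the lower bound of the checkbox guidance, and the thresholds the skill's scripts actually use are not shown in this guide.

```python
MIN_CHECKBOX_SIDE = 10  # assumption: lower bound of the 10-20 pixel guidance


def rect_problems(rect, min_width=1, min_height=1):
    """Return a list of problems with one [x1, y1, x2, y2] image-coordinate rect."""
    x1, y1, x2, y2 = rect
    width, height = x2 - x1, y2 - y1
    problems = []
    if x1 < 0 or y1 < 0:
        problems.append("coordinates must not be negative")
    # (x1, y1) is the top-left corner, so x2 and y2 must be larger.
    if width <= 0:
        problems.append("x2 must be greater than x1")
    elif width < min_width:
        problems.append(f"width {width} is below the minimum {min_width}")
    if height <= 0:
        problems.append("y2 must be greater than y1")
    elif height < min_height:
        problems.append(f"height {height} is below the minimum {min_height}")
    return problems


print(rect_problems([60, 320, 75, 335], MIN_CHECKBOX_SIDE, MIN_CHECKBOX_SIDE))  # prints []
print(rect_problems([180, 220, 550, 200]))  # prints ['y2 must be greater than y1']
```

A common mistake this catches is swapping y1 and y2 out of habit from bottom-left coordinate systems.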

Step 3: Validate Bounding Boxes (REQUIRED)

This is a two-stage validation process. You must pass the automated check before proceeding to manual inspection.

Stage 1: Automated intersection check

Run the automated check script:

python scripts/check_bounding_boxes.py <basename>.chatfield/<basename>.scan.json

What it checks:

  • Label/entry bounding box intersections (must not overlap)
  • Boxes too small to contain text
  • Missing required fields
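
To make the intersection rule concrete, here is a minimal sketch of an overlap test on `[x1, y1, x2, y2]` rectangles. It is not the source of `check_bounding_boxes.py`; the function names are hypothetical, and two choices are assumptions: rectangles that only share an edge count as non-intersecting, and every box on a page is compared against every other box on that page.

```python
def rects_intersect(a, b):
    """True when two [x1, y1, x2, y2] rects share any interior area."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    return ax1 < bx2 and bx1 < ax2 and ay1 < by2 and by1 < ay2


def find_intersections(fields):
    """Return pairs of box names that overlap on the same page."""
    boxes = []
    for field in fields:
        field_id, page = field["field_id"], field["page"]
        boxes.append((page, f"{field_id}.rect", field["rect"]))
        if "label_rect" in field:  # label_rect is optional
            boxes.append((page, f"{field_id}.label_rect", field["label_rect"]))
    clashes = []
    for i, (page_a, name_a, rect_a) in enumerate(boxes):
        for page_b, name_b, rect_b in boxes[i + 1:]:
            if page_a == page_b and rects_intersect(rect_a, rect_b):
                clashes.append((name_a, name_b))
    return clashes


fields = [
    {"field_id": "full_name", "page": 1,
     "rect": [180, 200, 550, 220], "label_rect": [50, 200, 185, 220]},
]
print(find_intersections(fields))
# prints [('full_name.rect', 'full_name.label_rect')]
```

In this example the label extends to x=185 while the entry box starts at x=180, so the pair is reported.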

If there are errors: Fix the bounding boxes in .scan.json and re-run the automated check. Iterate until there are no remaining errors.

Only proceed to Stage 2 once all automated checks pass.

Stage 2: Manual image inspection

Create validation images for each page:

# For each page (e.g., if you have 3 pages)
python scripts/create_validation_image.py 1 <basename>.chatfield/<basename>.scan.json <basename>.chatfield/page_1.png <basename>.chatfield/page_1_validation.png
python scripts/create_validation_image.py 2 <basename>.chatfield/<basename>.scan.json <basename>.chatfield/page_2.png <basename>.chatfield/page_2_validation.png
python scripts/create_validation_image.py 3 <basename>.chatfield/<basename>.scan.json <basename>.chatfield/page_3.png <basename>.chatfield/page_3_validation.png

This overlays colored rectangles (red for entry boxes, blue for labels) on the PNG images to visualize bounding boxes.
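
For forms with many pages, the per-page commands above can be generated rather than typed out. The sketch below only builds the argument lists, using the argument order and file names shown above; the helper name `validation_commands` is hypothetical.

```python
import sys


def validation_commands(basename, page_count):
    """Build one create_validation_image.py command per page (1-indexed)."""
    workdir = f"{basename}.chatfield"
    commands = []
    for page in range(1, page_count + 1):
        commands.append([
            sys.executable,  # the Python interpreter running this sketch
            "scripts/create_validation_image.py",
            str(page),
            f"{workdir}/{basename}.scan.json",
            f"{workdir}/page_{page}.png",
            f"{workdir}/page_{page}_validation.png",
        ])
    return commands


for command in validation_commands("application", 3):
    print(" ".join(command[1:]))
```

Each list can be passed to `subprocess.run(command, check=True)` from this skill's directory to actually produce the validation images.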

CRITICAL: Visually inspect validation images

Remember: label (blue) bounding boxes should contain text labels; entry (red) boxes should not.

  • Red rectangles must ONLY cover input areas
  • Red rectangles MUST NOT contain any text or labels
  • Blue rectangles should contain label text
  • For checkboxes:
    • Red rectangle MUST be centered on the checkbox square
    • Blue rectangle should cover the text label for the checkbox

If any rectangles look wrong: Fix bounding boxes in .scan.json, then return to Stage 1 (automated check gate). You must pass both stages again.

Step 4: Convert to PDF Coordinates

Once all validation passes, convert the image coordinates in .scan.json to PDF coordinates, which are written to .form.json:

python scripts/convert_coordinates.py <basename>.chatfield/<basename>.scan.json <basename>.pdf
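
The internals of `convert_coordinates.py` are not shown in this guide, but the underlying idea can be sketched. PDF pages by default use points with the origin at the bottom-left, so the conversion scales pixels to points and flips the Y axis. The function below is a hypothetical illustration that assumes an unrotated page whose origin is at (0, 0); the output ordering used in .form.json may differ.

```python
def image_rect_to_pdf(rect, image_size, pdf_size):
    """Map [x1, y1, x2, y2] in image pixels (top-left origin) to PDF points.

    Returns [left, bottom, right, top] with a bottom-left origin.
    """
    x1, y1, x2, y2 = rect
    image_width, image_height = image_size
    pdf_width, pdf_height = pdf_size
    scale_x = pdf_width / image_width
    scale_y = pdf_height / image_height
    left = x1 * scale_x
    right = x2 * scale_x
    # Image Y grows downward; PDF Y grows upward, so flip against the page height.
    top = pdf_height - y1 * scale_y
    bottom = pdf_height - y2 * scale_y
    return [left, bottom, right, top]


# A 612x792 point page (US Letter) rendered at twice that size in pixels.
print(image_rect_to_pdf([180, 200, 550, 220], (1224, 1584), (612, 792)))
# prints [90.0, 682.0, 275.0, 692.0]
```

Note that the image's top edge (y1) becomes the larger PDF Y value, which is why the bounding boxes must be validated in image coordinates before conversion.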

Troubleshooting

Bounding boxes don't align in validation images:

  • Review the validation image carefully
  • Adjust coordinates in .scan.json
  • Remember: You're using IMAGE coordinates (origin at top-left, Y downward)
  • Re-run validation after changes

Text gets cut off:

  • Increase bounding box height and/or width in .scan.json
  • Entry boxes should have extra space for text

Validation script errors:

  • Ensure all page images exist in <basename>.chatfield/
  • Verify JSON syntax in .scan.json
  • Check that page numbers are 1-indexed
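
The three checks above can be run as a quick preflight before the validation scripts. This is a minimal sketch under stated assumptions: the function name `preflight` is hypothetical, page images are expected next to the .scan.json file and named `page_<N>.png` as in the commands in Step 3, and the file is assumed to hold a JSON array of objects.

```python
import json
from pathlib import Path


def preflight(scan_path):
    """Return a list of problems likely to make the validation scripts fail."""
    scan_path = Path(scan_path)
    try:
        fields = json.loads(scan_path.read_text(encoding="utf-8"))
    except OSError as exc:
        return [f"cannot read {scan_path.name}: {exc}"]
    except json.JSONDecodeError as exc:
        return [f"invalid JSON in {scan_path.name}: line {exc.lineno}, column {exc.colno}"]
    if not isinstance(fields, list):
        return [f"{scan_path.name} must contain a JSON array of fields"]

    problems = []
    pages = {field.get("page") for field in fields if isinstance(field, dict)}
    for page in sorted(pages, key=str):
        if not isinstance(page, int) or page < 1:
            problems.append(f"page value {page!r} is not a 1-indexed integer")
            continue
        # Assumed layout: page images live beside the .scan.json file.
        image = scan_path.parent / f"page_{page}.png"
        if not image.is_file():
            problems.append(f"missing page image {image.name}")
    return problems
```

Calling `preflight("<basename>.chatfield/<basename>.scan.json")` returns an empty list when the JSON parses, every page number is 1 or greater, and each referenced page image exists.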

See Also:

  • ../../filling-pdf-forms/references/Converting-PDF-To-Chatfield.md - How the main skill builds the interview
  • ./Fillable-Forms.md - Alternative extraction for fillable PDFs
  • ../../filling-pdf-forms/references/populating.md - How bounding boxes are used during PDF population