Initial commit

Zhongwei Li
2025-11-30 08:25:58 +08:00
commit e13b6ff259
31 changed files with 3185 additions and 0 deletions

@@ -0,0 +1,2 @@
*.png
*.plantuml

@@ -0,0 +1,117 @@
---
name: extracting-form-fields
description: Extract form field data from PDFs as a first step to filling PDF forms
allowed-tools: Read, Write, Edit, Glob, Bash
version: 1.0.0a2
license: Apache 2.0
---
# Extracting Form Fields
Prepare working directory and extract field data from PDF forms.
<purpose>
This skill extracts PDF form information into useful JSON.
- Detects fillable vs. non-fillable PDFs
- Extracts PDF content as readable Markdown
- Creates field metadata in common JSON format
</purpose>
## Inputs
- **PDF path**: Path to PDF file (e.g., `/home/user/input.pdf`)
## Process Overview
```plantuml
@startuml SKILL
title Extracting Form Fields - High-Level Workflow
start
:Create working directory;
:Copy interview template;
:Extract PDF content as Markdown;
:Check Fillability;
if (PDF has fillable fields?) then (yes)
:Fillable workflow
(see Fillable-Forms.md);
else (no)
:Non-fillable workflow
(see Nonfillable-Forms.md);
endif
:**✓ EXTRACTION COMPLETE**;
:Ready for Form Data Model creation;
stop
@enduml
```
## Process
### 1. Create Working Directory
```bash
mkdir <basename>.chatfield
```
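The `<basename>` placeholder is used throughout this skill without a formal definition. A minimal sketch, assuming it means the PDF file name without its extension and that the working directory is created in the current directory (`artifact_paths` is a hypothetical helper, not part of the skill's scripts):

```python
from pathlib import Path


def artifact_paths(pdf_path):
    """Derive the working-directory layout for a PDF (hypothetical helper)."""
    # Assumption: <basename> is the PDF file name without its extension.
    base = Path(pdf_path).stem
    workdir = Path(f"{base}.chatfield")
    return {
        "workdir": workdir,
        "interview": workdir / "interview.py",
        "form_md": workdir / f"{base}.form.md",
        "form_json": workdir / f"{base}.form.json",
    }


for name, path in artifact_paths("/home/user/input.pdf").items():
    print(f"{name}: {path.as_posix()}")
```

The file names mirror the ones used in the commands of this skill; only the helper itself is invented for illustration.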
### 2. Copy Interview Template
Copy the interview template from the included `filling-pdf-forms` skill. The path below is relative to this skill's directory.
```bash
cp ../filling-pdf-forms/scripts/chatfield_interview_template.py <basename>.chatfield/interview.py
```
### 3. Extract PDF Content
```bash
markitdown <pdf_path> > <basename>.chatfield/<basename>.form.md
```
### 4. Check Fillability
```bash
python scripts/check_fillable_fields.py <pdf_path>
```
**Output:**
- `"This PDF has fillable form fields"` → use fillable workflow
- `"This PDF does not have fillable form fields"` → use non-fillable workflow
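Because both messages contain the phrase "fillable form fields", a substring test can pick the wrong branch. A minimal sketch of the branching decision, assuming the script's stdout is captured as a string (`choose_workflow` is a hypothetical helper):

```python
FILLABLE_MSG = "This PDF has fillable form fields"
NON_FILLABLE_MSG = "This PDF does not have fillable form fields"


def choose_workflow(script_output):
    """Map the checker's stdout to the reference guide to follow."""
    text = script_output.strip()
    # Compare whole messages: both contain "fillable form fields".
    if text == FILLABLE_MSG:
        return "./references/Fillable-Forms.md"
    if text == NON_FILLABLE_MSG:
        return "./references/Nonfillable-Forms.md"
    raise ValueError(f"Unrecognized fillability output: {text!r}")


print(choose_workflow("This PDF does not have fillable form fields\n"))
```

Raising on unrecognized output keeps a crashed script from being silently treated as a non-fillable PDF.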
### 5. Branch Based on Fillability
#### If Fillable:
Follow ./references/Fillable-Forms.md
#### If Non-fillable:
Follow ./references/Nonfillable-Forms.md
## Output Format
### Fillable PDFs - .form.json
```json
[
  {
    "field_id": "topmostSubform[0].Page1[0].f1_01[0]",
    "type": "text",
    "page": 1,
    "rect": [100, 200, 300, 220],
    "tooltip": "Enter your full legal name",
    "max_length": null
  },
  {
    "field_id": "checkbox_over_18",
    "type": "checkbox",
    "page": 1,
    "rect": [150, 250, 165, 265],
    "checked_value": "/1",
    "unchecked_value": "/Off"
  }
]
```
## References
- ./references/Fillable-Forms.md - Fillable PDF extraction workflow
- ./references/Nonfillable-Forms.md - Non-fillable PDF extraction workflow

@@ -0,0 +1,29 @@
# Fillable PDF Forms - Extraction Guide
This guide is for the "extracting-form-fields" agent performing extraction on fillable PDFs.
## Process Overview
```plantuml
@startuml Fillable-Forms
title Fillable PDF Forms - Extraction Workflow
start
:Extract form field metadata;
:**✓ FILLABLE EXTRACTION COMPLETE**;
stop
@enduml
```
## Extraction Process
### 1. Extract Form Field Metadata
```bash
python scripts/extract_form_field_info.py input.pdf input.chatfield/input.form.json
```
This creates a JSON file with field metadata.
## Completion Report
After extraction, simply state "Done". If there is an unrecoverable error, halt and report the error verbatim.

@@ -0,0 +1,218 @@
# Non-fillable PDF Forms - Extraction Guide
You'll need to visually determine where the data should be added as text annotations. Follow the steps below *exactly*. You MUST perform all of these steps to ensure that the form is accurately completed. Details for each step are below.
- Convert the PDF to PNG images and determine field bounding boxes.
- Create a JSON file with field information and validation images showing the bounding boxes.
- Validate the bounding boxes.
## Process Overview
```plantuml
@startuml Nonfillable-Forms
title Non-fillable PDF Forms - Extraction Workflow
start
:Convert PDF to PNG images;
:Visual analysis & determine bounding boxes
in IMAGE coordinates;
:Create .scan.json;
repeat
:Automated intersection check
on image coordinates;
if (Automated check passes?) then (yes)
:Create validation images
(overlay on PNGs);
:Manual image inspection;
if (Manual check passes?) then (yes)
else (no)
:Fix bounding boxes in .scan.json;
endif
else (no)
:Fix bounding boxes in .scan.json;
endif
repeat while (Both checks pass?) is (no)
->yes;
:Convert coordinates
(.scan.json → .form.json);
:**✓ NON-FILLABLE EXTRACTION COMPLETE**
.form.json ready with PDF coordinates;
stop
@enduml
```
## Extraction Process
## Step 1: Visual Analysis (REQUIRED)
- Convert the PDF to PNG images. Run this script from this skill's directory:
```bash
python scripts/convert_pdf_to_images.py <basename>.pdf <basename>.chatfield/
```
The script will create a PNG image for each page.
- Read and analyze the .form.md file, which is a Markdown text preview of the PDF content.
- Carefully examine each PNG image and identify all form fields and areas where the user should enter data. For each form field where the user should enter information, determine bounding boxes, in the image coordinate system, for both the field label and the input entry area. The label and entry bounding boxes MUST NOT INTERSECT; the text entry box should only include the area where data should be entered. Usually this area will be immediately to the side, above, or below its label. Entry bounding boxes must be tall and wide enough to contain their text.
These are some examples of form structures that you might see (in English, but the form can be any language):
*Label inside box*
```
┌────────────────────────┐
│ Name:                  │
└────────────────────────┘
```
The input area should be to the right of the "Name" label and extend to the edge of the box.
*Label before line*
```
Email: _______________________
```
The input area should be above the line and include its entire width.
*Label under line*
```
_________________________
Name
```
The input area should be above the line and include the entire width of the line. This is common for signature and date fields.
*Label above line*
```
Please enter any special requests:
________________________________________________
```
The input area should extend from the bottom of the label to the line, and should include the entire width of the line.
*Checkboxes*
```
Are you a US citizen? Yes □ No □
```
For checkboxes:
- Look for small square boxes (□) - these are the actual checkboxes to target. They may be to the left or right of their labels.
- Distinguish between label text ("Yes", "No") and the clickable checkbox squares.
- The entry bounding box should cover ONLY the small square, not the text label.
## Step 2: Create .scan.json
Create `<basename>.chatfield/<basename>.scan.json` formatted like the example below. Rectangle values are **IMAGE coordinates** (what you see directly in the PNG, top-left origin).
```json
[
  {
    "field_id": "full_name",
    "type": "text",
    "page": 1,
    "rect": [180, 200, 550, 220],
    "label_text": "Full Name:",
    "label_rect": [50, 200, 175, 220]
  },
  {
    "field_id": "is_citizen",
    "type": "checkbox",
    "page": 1,
    "rect": [60, 320, 75, 335],
    "label_text": "US Citizen",
    "label_rect": [80, 320, 150, 335],
    "checked_value": "X",
    "unchecked_value": ""
  }
]
```
**Field structure:**
- `field_id` - Unique identifier (will be used in chatfield definition)
  - **CRITICAL:** Every field MUST have a unique field_id with no collisions
  - Field IDs are internal identifiers, not user-facing
- `type` - "text" or "checkbox"
- `page` - Page number (1-indexed)
- `rect` - Entry area bounding box [x1, y1, x2, y2] where data will be written
- `label_text` - Optional label text for this field
- `label_rect` - Optional label bounding box [x1, y1, x2, y2]
- For checkboxes only:
  - `checked_value` - String to write when checked (typically "X" or "✓")
  - `unchecked_value` - String to write when unchecked (typically "")
**Bounding box coordinates (IMAGE COORDINATES):**
- Image coordinate system: Origin (0,0) at top-left
- X increases to the right, Y increases downward
- Format: `[x1, y1, x2, y2]` where (x1,y1) is top-left corner, (x2,y2) is bottom-right corner
- These are the pixel coordinates you see directly in the PNG image
- Entry boxes (`rect`) must be tall and wide enough to contain text
- Label boxes (`label_rect`) should contain the label text
- Entry and label boxes MUST NOT overlap
- Checkboxes should be at least 10-20 pixels square
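The structural rules above can be checked before running the geometric validation. The following is a minimal sketch, not one of the skill's scripts: `scan_field_problems` is a hypothetical helper, and treating `[0, 0, 0, 0]` as "no label" is an assumption borrowed from how `check_bounding_boxes.py` skips empty label rects.

```python
def scan_field_problems(fields):
    """Return a list of structural problems found in .scan.json field dicts."""
    problems = []
    seen_ids = set()
    for index, field in enumerate(fields):
        field_id = field.get("field_id")
        name = field_id if field_id else f"field #{index}"
        if not field_id:
            problems.append(f"{name}: missing field_id")
        elif field_id in seen_ids:
            problems.append(f"{name}: duplicate field_id")
        seen_ids.add(field_id)

        if field.get("type") not in ("text", "checkbox"):
            problems.append(f"{name}: type must be 'text' or 'checkbox'")

        page = field.get("page")
        if not isinstance(page, int) or page < 1:
            problems.append(f"{name}: page must be a 1-indexed integer")

        for key in ("rect", "label_rect"):
            box = field.get(key)
            # label_rect is optional; [0, 0, 0, 0] is treated as "no label".
            if box is None or (key == "label_rect" and box == [0, 0, 0, 0]):
                if key == "rect":
                    problems.append(f"{name}: missing rect")
                continue
            # Image coordinates: (x1, y1) is top-left, (x2, y2) is bottom-right.
            if len(box) != 4 or not (box[0] < box[2] and box[1] < box[3]):
                problems.append(
                    f"{name}: {key} must be [x1, y1, x2, y2] with x1 < x2 and y1 < y2"
                )
    return problems


example = [
    {"field_id": "full_name", "type": "text", "page": 1,
     "rect": [180, 200, 550, 220], "label_rect": [50, 200, 175, 220]},
    # Deliberately wrong: reused field_id and x1 > x2.
    {"field_id": "full_name", "type": "checkbox", "page": 1,
     "rect": [75, 320, 60, 335]},
]
for problem in scan_field_problems(example):
    print(problem)
```

This only covers the schema; overlap and size checks remain the job of the validation step that follows.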
## Step 3: Validate Bounding Boxes (REQUIRED)
This is a two-stage validation process. You must pass the automated check before proceeding to manual inspection.
### Stage 1: Automated intersection check
Run the automated check script:
```bash
python scripts/check_bounding_boxes.py <basename>.chatfield/<basename>.scan.json
```
**What it checks:**
- Label/entry bounding box intersections (must not overlap)
- Boxes too small to contain text
- Missing required fields
**If there are errors:** Fix the bounding boxes in `.scan.json` and re-run the automated check. Iterate until there are no remaining errors.
**Only proceed to Stage 2 once all automated checks pass.**
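To see what counts as an intersection, the sketch below reuses the comparison from `scripts/check_bounding_boxes.py` on the label and entry boxes of the `full_name` example. The variations on the entry box are illustrative values, not taken from a real form.

```python
def rects_intersect(r1, r2):
    """Same comparison as check_bounding_boxes.py: shared edges do not count."""
    disjoint_horizontal = r1[0] >= r2[2] or r1[2] <= r2[0]
    disjoint_vertical = r1[1] >= r2[3] or r1[3] <= r2[1]
    return not (disjoint_horizontal or disjoint_vertical)


label = [50, 200, 175, 220]

print(rects_intersect(label, [180, 200, 550, 220]))  # False: 5 px gap
print(rects_intersect(label, [175, 200, 550, 220]))  # False: edges only touch
print(rects_intersect(label, [170, 200, 550, 220]))  # True: 5 px overlap
```

Boxes that merely share an edge pass the automated check, so an entry box may start exactly where its label ends.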
### Stage 2: Manual image inspection
Create validation images for each page:
```bash
# For each page (e.g., if you have 3 pages)
python scripts/create_validation_image.py 1 <basename>.chatfield/<basename>.scan.json <basename>.chatfield/page_1.png <basename>.chatfield/page_1_validation.png
python scripts/create_validation_image.py 2 <basename>.chatfield/<basename>.scan.json <basename>.chatfield/page_2.png <basename>.chatfield/page_2_validation.png
python scripts/create_validation_image.py 3 <basename>.chatfield/<basename>.scan.json <basename>.chatfield/page_3.png <basename>.chatfield/page_3_validation.png
```
This overlays colored rectangles (red for entry boxes, blue for labels) on the PNG images to visualize bounding boxes.
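For forms with many pages, the per-page commands can be generated rather than typed. A minimal sketch, assuming the page images follow the `page_<n>.png` naming produced by `convert_pdf_to_images.py` (`validation_commands` is a hypothetical helper):

```python
from pathlib import Path


def validation_commands(workdir, basename, page_count):
    """Build one create_validation_image.py argument list per page (1-indexed)."""
    workdir = Path(workdir)
    scan_json = workdir / f"{basename}.scan.json"
    commands = []
    for page in range(1, page_count + 1):
        commands.append([
            "python",
            "scripts/create_validation_image.py",
            str(page),
            scan_json.as_posix(),
            (workdir / f"page_{page}.png").as_posix(),
            (workdir / f"page_{page}_validation.png").as_posix(),
        ])
    return commands


for command in validation_commands("form.chatfield", "form", 3):
    print(" ".join(command))
```

Each list can be passed to `subprocess.run(command, check=True)` from the skill's directory.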
**CRITICAL: Visually inspect validation images**
Remember: label (blue) bounding boxes should contain text labels; entry (red) boxes should not.
- Red rectangles must ONLY cover input areas
- Red rectangles MUST NOT contain any text or labels
- Blue rectangles should contain label text
- For checkboxes:
  - Red rectangle MUST be centered on the checkbox square
  - Blue rectangle should cover the text label for the checkbox
**If any rectangles look wrong:** Fix bounding boxes in `.scan.json`, then return to Stage 1 (automated check gate). You must pass both stages again.
## Step 4: Convert to PDF Coordinates
Once all validation passes, convert the image coordinates to PDF coordinates. The script derives the output path by replacing `.scan.json` with `.form.json`:
```bash
python scripts/convert_coordinates.py <basename>.chatfield/<basename>.scan.json <basename>.pdf
```
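The conversion scales both axes and flips Y, because image coordinates start at the top-left while PDF coordinates start at the bottom-left. Below is a worked sketch using the same arithmetic as `scripts/convert_coordinates.py`. The 850x1100 pixel image and the 612x792 point (US Letter) page are assumed values chosen for round numbers.

```python
def image_to_pdf_coords(image_bbox, image_width, image_height, pdf_width, pdf_height):
    """Scale an image-space box to PDF points and flip the Y axis."""
    x_scale = pdf_width / image_width
    y_scale = pdf_height / image_height
    x1, y1, x2, y2 = image_bbox
    return [
        x1 * x_scale,
        (image_height - y2) * y_scale,  # image bottom edge becomes the lower PDF Y
        x2 * x_scale,
        (image_height - y1) * y_scale,  # image top edge becomes the higher PDF Y
    ]


# Assumed sizes: 850x1100 px page image, 612x792 pt PDF page (scale 0.72).
rect = image_to_pdf_coords([180, 200, 550, 220], 850, 1100, 612, 792)
print([round(v, 1) for v in rect])  # [129.6, 633.6, 396.0, 648.0]
```

A box near the top of the image (y from 200 to 220) ends up with large Y values in the PDF (633.6 to 648.0), which is the expected effect of the flip.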
## Troubleshooting
**Bounding boxes don't align in validation images:**
- Review the validation image carefully
- Adjust coordinates in `.scan.json`
- Remember: You're using IMAGE coordinates (origin at top-left, Y downward)
- Re-run validation after changes
**Text gets cut off:**
- Increase bounding box height and/or width in `.scan.json`
- Entry boxes should have extra space for text
**Validation script errors:**
- Ensure all page images exist in `<basename>.chatfield/`
- Verify JSON syntax in `.scan.json`
- Check that page numbers are 1-indexed
---
**See Also:**
- ../../filling-pdf-forms/references/Converting-PDF-To-Chatfield.md - How the main skill builds the interview
- ./Fillable-Forms.md - Alternative extraction for fillable PDFs
- ../../filling-pdf-forms/references/populating.md - How bounding boxes are used during PDF population

@@ -0,0 +1,78 @@
from dataclasses import dataclass
import json
import sys


# Script to check that bounding boxes in a JSON file do not overlap or have other issues.
# Works with any coordinate system since it only checks geometric relationships.


@dataclass
class RectAndField:
    rect: list[float]
    rect_type: str
    field: dict


# Returns a list of messages that are printed to stdout for Claude to read.
def get_bounding_box_messages(fields_json_stream) -> list[str]:
    messages = []
    fields = json.load(fields_json_stream)
    messages.append(f"Read {len(fields)} fields")

    def rects_intersect(r1, r2):
        disjoint_horizontal = r1[0] >= r2[2] or r1[2] <= r2[0]
        disjoint_vertical = r1[1] >= r2[3] or r1[3] <= r2[1]
        return not (disjoint_horizontal or disjoint_vertical)

    rects_and_fields = []
    for f in fields:
        # Skip empty label rects (used for fields without labels)
        label_rect = f.get('label_rect', [0, 0, 0, 0])
        if label_rect != [0, 0, 0, 0]:
            rects_and_fields.append(RectAndField(label_rect, "label", f))
        rects_and_fields.append(RectAndField(f['rect'], "entry", f))

    has_error = False
    for i, ri in enumerate(rects_and_fields):
        # This is O(N^2); we can optimize if it becomes a problem.
        for j in range(i + 1, len(rects_and_fields)):
            rj = rects_and_fields[j]
            if ri.field['page'] == rj.field['page'] and rects_intersect(ri.rect, rj.rect):
                has_error = True
                if ri.field is rj.field:
                    messages.append(f"FAILURE: intersection between label and entry bounding boxes for `{ri.field['field_id']}` ({ri.rect}, {rj.rect})")
                else:
                    messages.append(f"FAILURE: intersection between {ri.rect_type} bounding box for `{ri.field['field_id']}` ({ri.rect}) and {rj.rect_type} bounding box for `{rj.field['field_id']}` ({rj.rect})")
                if len(messages) >= 20:
                    messages.append("Aborting further checks; fix bounding boxes and try again")
                    return messages
        if ri.rect_type == "entry":
            if "entry_text" in ri.field:
                font_size = ri.field["entry_text"].get("font_size", 14)
                entry_height = ri.rect[3] - ri.rect[1]
                if entry_height < font_size:
                    has_error = True
                    messages.append(f"FAILURE: entry bounding box height ({entry_height}) for `{ri.field['field_id']}` is too short for the text content (font size: {font_size}). Increase the box height or decrease the font size.")
                    if len(messages) >= 20:
                        messages.append("Aborting further checks; fix bounding boxes and try again")
                        return messages

    if not has_error:
        messages.append("SUCCESS: All bounding boxes are valid")
    return messages


if __name__ == "__main__":
    if len(sys.argv) != 2:
        print("Usage: check_bounding_boxes.py [fields.json or scan.json]")
        print()
        print("Examples:")
        print("  python check_bounding_boxes.py form.chatfield/form.scan.json")
        print("  python check_bounding_boxes.py form.chatfield/form.form.json")
        sys.exit(1)
    # Input file can be .scan.json (image coords) or .form.json (PDF coords)
    # The geometry checks work the same either way
    with open(sys.argv[1]) as f:
        messages = get_bounding_box_messages(f)
    for msg in messages:
        print(msg)

@@ -0,0 +1,226 @@
import unittest
import json
import io

from check_bounding_boxes import get_bounding_box_messages


# Currently this is not run automatically in CI; it's just for documentation and manual checking.
# Fixtures use the flat list schema that check_bounding_boxes.py reads
# (field_id, page, rect, label_rect).
class TestGetBoundingBoxMessages(unittest.TestCase):

    def create_json_stream(self, data):
        """Helper to create a JSON stream from data"""
        return io.StringIO(json.dumps(data))

    def test_no_intersections(self):
        """Test case with no bounding box intersections"""
        data = [
            {
                "field_id": "Name",
                "page": 1,
                "label_rect": [10, 10, 50, 30],
                "rect": [60, 10, 150, 30]
            },
            {
                "field_id": "Email",
                "page": 1,
                "label_rect": [10, 40, 50, 60],
                "rect": [60, 40, 150, 60]
            }
        ]

        stream = self.create_json_stream(data)
        messages = get_bounding_box_messages(stream)
        self.assertTrue(any("SUCCESS" in msg for msg in messages))
        self.assertFalse(any("FAILURE" in msg for msg in messages))

    def test_label_entry_intersection_same_field(self):
        """Test intersection between label and entry of the same field"""
        data = [
            {
                "field_id": "Name",
                "page": 1,
                "label_rect": [10, 10, 60, 30],
                "rect": [50, 10, 150, 30]  # Overlaps with label
            }
        ]

        stream = self.create_json_stream(data)
        messages = get_bounding_box_messages(stream)
        self.assertTrue(any("FAILURE" in msg and "intersection" in msg for msg in messages))
        self.assertFalse(any("SUCCESS" in msg for msg in messages))

    def test_intersection_between_different_fields(self):
        """Test intersection between bounding boxes of different fields"""
        data = [
            {
                "field_id": "Name",
                "page": 1,
                "label_rect": [10, 10, 50, 30],
                "rect": [60, 10, 150, 30]
            },
            {
                "field_id": "Email",
                "page": 1,
                "label_rect": [40, 20, 80, 40],  # Overlaps with Name's boxes
                "rect": [160, 10, 250, 30]
            }
        ]

        stream = self.create_json_stream(data)
        messages = get_bounding_box_messages(stream)
        self.assertTrue(any("FAILURE" in msg and "intersection" in msg for msg in messages))
        self.assertFalse(any("SUCCESS" in msg for msg in messages))

    def test_different_pages_no_intersection(self):
        """Test that boxes on different pages don't count as intersecting"""
        data = [
            {
                "field_id": "Name",
                "page": 1,
                "label_rect": [10, 10, 50, 30],
                "rect": [60, 10, 150, 30]
            },
            {
                "field_id": "Email",
                "page": 2,
                "label_rect": [10, 10, 50, 30],  # Same coordinates but different page
                "rect": [60, 10, 150, 30]
            }
        ]

        stream = self.create_json_stream(data)
        messages = get_bounding_box_messages(stream)
        self.assertTrue(any("SUCCESS" in msg for msg in messages))
        self.assertFalse(any("FAILURE" in msg for msg in messages))

    def test_entry_height_too_small(self):
        """Test that entry box height is checked against font size"""
        data = [
            {
                "field_id": "Name",
                "page": 1,
                "label_rect": [10, 10, 50, 30],
                "rect": [60, 10, 150, 20],  # Height is 10
                "entry_text": {
                    "font_size": 14  # Font size larger than height
                }
            }
        ]

        stream = self.create_json_stream(data)
        messages = get_bounding_box_messages(stream)
        self.assertTrue(any("FAILURE" in msg and "height" in msg for msg in messages))
        self.assertFalse(any("SUCCESS" in msg for msg in messages))

    def test_entry_height_adequate(self):
        """Test that adequate entry box height passes"""
        data = [
            {
                "field_id": "Name",
                "page": 1,
                "label_rect": [10, 10, 50, 30],
                "rect": [60, 10, 150, 30],  # Height is 20
                "entry_text": {
                    "font_size": 14  # Font size smaller than height
                }
            }
        ]

        stream = self.create_json_stream(data)
        messages = get_bounding_box_messages(stream)
        self.assertTrue(any("SUCCESS" in msg for msg in messages))
        self.assertFalse(any("FAILURE" in msg for msg in messages))

    def test_default_font_size(self):
        """Test that default font size is used when not specified"""
        data = [
            {
                "field_id": "Name",
                "page": 1,
                "label_rect": [10, 10, 50, 30],
                "rect": [60, 10, 150, 20],  # Height is 10
                "entry_text": {}  # No font_size specified, should use default 14
            }
        ]

        stream = self.create_json_stream(data)
        messages = get_bounding_box_messages(stream)
        self.assertTrue(any("FAILURE" in msg and "height" in msg for msg in messages))
        self.assertFalse(any("SUCCESS" in msg for msg in messages))

    def test_no_entry_text(self):
        """Test that missing entry_text doesn't cause height check"""
        data = [
            {
                "field_id": "Name",
                "page": 1,
                "label_rect": [10, 10, 50, 30],
                "rect": [60, 10, 150, 20]  # Small height but no entry_text
            }
        ]

        stream = self.create_json_stream(data)
        messages = get_bounding_box_messages(stream)
        self.assertTrue(any("SUCCESS" in msg for msg in messages))
        self.assertFalse(any("FAILURE" in msg for msg in messages))

    def test_multiple_errors_limit(self):
        """Test that error messages are limited to prevent excessive output"""
        fields = []
        # Create many overlapping fields
        for i in range(25):
            fields.append({
                "field_id": f"Field{i}",
                "page": 1,
                "label_rect": [10, 10, 50, 30],  # All overlap
                "rect": [20, 15, 60, 35]  # All overlap
            })

        stream = self.create_json_stream(fields)
        messages = get_bounding_box_messages(stream)
        # Should abort after ~20 messages
        self.assertTrue(any("Aborting" in msg for msg in messages))
        # Should have some FAILURE messages but not hundreds
        failure_count = sum(1 for msg in messages if "FAILURE" in msg)
        self.assertGreater(failure_count, 0)
        self.assertLess(len(messages), 30)  # Should be limited

    def test_edge_touching_boxes(self):
        """Test that boxes touching at edges don't count as intersecting"""
        data = [
            {
                "field_id": "Name",
                "page": 1,
                "label_rect": [10, 10, 50, 30],
                "rect": [50, 10, 150, 30]  # Touches at x=50
            }
        ]

        stream = self.create_json_stream(data)
        messages = get_bounding_box_messages(stream)
        self.assertTrue(any("SUCCESS" in msg for msg in messages))
        self.assertFalse(any("FAILURE" in msg for msg in messages))


if __name__ == '__main__':
    unittest.main()

@@ -0,0 +1,12 @@
import sys

from pypdf import PdfReader


# Script for Claude to run to determine whether a PDF has fillable form fields. See forms.md.


reader = PdfReader(sys.argv[1])
if reader.get_fields():
    print("This PDF has fillable form fields")
else:
    print("This PDF does not have fillable form fields")

@@ -0,0 +1,179 @@
#!/usr/bin/env python3
"""
Converts bounding box coordinates from image coordinates to PDF coordinates.

This script takes a .scan.json file (with image coordinates) and converts all
bounding boxes to PDF coordinates, producing a .form.json file.

Image coordinates: Origin at top-left, Y increases downward
PDF coordinates: Origin at bottom-left, Y increases upward

Usage:
    python convert_coordinates.py <scan.json> <pdf_file>
"""

import json
import sys
from pathlib import Path

from PIL import Image
from pypdf import PdfReader


def image_to_pdf_coords(image_bbox, image_width, image_height, pdf_width, pdf_height):
    """
    Convert bounding box from image coordinates to PDF coordinates.

    Args:
        image_bbox: [x1, y1, x2, y2] in image coordinates (top-left origin)
        image_width: Width of the image in pixels
        image_height: Height of the image in pixels
        pdf_width: Width of the PDF page in points
        pdf_height: Height of the PDF page in points

    Returns:
        [x1, y1, x2, y2] in PDF coordinates (bottom-left origin)
    """
    x_scale = pdf_width / image_width
    y_scale = pdf_height / image_height

    # Convert X coordinates (simple scaling, same origin)
    pdf_x1 = image_bbox[0] * x_scale
    pdf_x2 = image_bbox[2] * x_scale

    # Convert Y coordinates (flip vertical axis)
    # Image: y1 is top, y2 is bottom (y1 < y2 in image coords)
    # PDF: need to flip - what's at top of image is high Y in PDF
    pdf_y1 = (image_height - image_bbox[3]) * y_scale  # Bottom in PDF (was bottom in image)
    pdf_y2 = (image_height - image_bbox[1]) * y_scale  # Top in PDF (was top in image)

    return [pdf_x1, pdf_y1, pdf_x2, pdf_y2]


def get_image_dimensions(images_dir, page_number):
    """Get dimensions of the PNG image for a specific page."""
    image_path = Path(images_dir) / f"page_{page_number}.png"
    if not image_path.exists():
        raise FileNotFoundError(f"Image not found: {image_path}")

    with Image.open(image_path) as img:
        return img.width, img.height


def convert_scan_to_form(scan_json_path, pdf_path, output_json_path):
    """
    Convert .scan.json (image coords) to .form.json (PDF coords).

    Args:
        scan_json_path: Path to input .scan.json file
        pdf_path: Path to the PDF file
        output_json_path: Path to output .form.json file
    """
    # Load scan data
    with open(scan_json_path, 'r') as f:
        fields = json.load(f)

    # Get PDF dimensions
    reader = PdfReader(pdf_path)

    # Determine images directory (same directory as scan.json)
    scan_path = Path(scan_json_path)
    images_dir = scan_path.parent

    if not images_dir.exists():
        raise FileNotFoundError(
            f"Images directory not found: {images_dir}\n"
            f"Expected to find page images in {images_dir}"
        )

    # Convert each field
    converted_fields = []
    for field in fields:
        page_num = field.get('page', 1)

        # Get dimensions for this page
        page = reader.pages[page_num - 1]  # Convert to 0-indexed
        pdf_width = float(page.mediabox.width)
        pdf_height = float(page.mediabox.height)

        image_width, image_height = get_image_dimensions(images_dir, page_num)

        # Create converted field
        converted_field = field.copy()

        # Convert main rect
        if 'rect' in field:
            converted_field['rect'] = image_to_pdf_coords(
                field['rect'],
                image_width, image_height,
                pdf_width, pdf_height
            )

        # Convert label_rect if present
        if 'label_rect' in field:
            converted_field['label_rect'] = image_to_pdf_coords(
                field['label_rect'],
                image_width, image_height,
                pdf_width, pdf_height
            )

        # Convert radio button options if present
        if 'radio_options' in field:
            converted_options = []
            for option in field['radio_options']:
                converted_option = option.copy()
                if 'rect' in option:
                    converted_option['rect'] = image_to_pdf_coords(
                        option['rect'],
                        image_width, image_height,
                        pdf_width, pdf_height
                    )
                converted_options.append(converted_option)
            converted_field['radio_options'] = converted_options

        converted_fields.append(converted_field)

    # Write output
    with open(output_json_path, 'w') as f:
        json.dump(converted_fields, f, indent=2)

    print(f"Converted {len(converted_fields)} fields from image to PDF coordinates")
    print(f"Input: {scan_json_path}")
    print(f"Output: {output_json_path}")

    # Show an example conversion
    if converted_fields:
        print("\nExample conversion (first field):")
        orig = fields[0]
        conv = converted_fields[0]
        print(f"  Field: {orig.get('field_id')}")
        print(f"  Image rect: {orig.get('rect')}")
        print(f"  PDF rect: {conv.get('rect')}")


if __name__ == "__main__":
    if len(sys.argv) != 3:
        print("Usage: convert_coordinates.py <scan.json> <pdf_file>")
        print()
        print("Example:")
        print("  python convert_coordinates.py my_form.chatfield/my_form.scan.json my_form.pdf")
        print()
        print("Output filename is automatically computed by replacing .scan.json with .form.json")
        sys.exit(1)

    scan_json_path = sys.argv[1]
    pdf_path = sys.argv[2]

    # Compute output filename by replacing .scan.json with .form.json
    scan_path = Path(scan_json_path)
    if not scan_path.name.endswith('.scan.json'):
        print(f"Error: Input file must end with .scan.json, got: {scan_path.name}", file=sys.stderr)
        sys.exit(1)

    output_json_path = str(scan_path.parent / scan_path.name.replace('.scan.json', '.form.json'))

    try:
        convert_scan_to_form(scan_json_path, pdf_path, output_json_path)
    except Exception as e:
        print(f"Error: {e}", file=sys.stderr)
        sys.exit(1)

@@ -0,0 +1,35 @@
import os
import sys

from pdf2image import convert_from_path


# Converts each page of a PDF to a PNG image.


def convert(pdf_path, output_dir, max_dim=1000):
    images = convert_from_path(pdf_path, dpi=200)

    for i, image in enumerate(images):
        # Scale image if needed to keep width/height under `max_dim`
        width, height = image.size
        if width > max_dim or height > max_dim:
            scale_factor = min(max_dim / width, max_dim / height)
            new_width = int(width * scale_factor)
            new_height = int(height * scale_factor)
            image = image.resize((new_width, new_height))

        image_path = os.path.join(output_dir, f"page_{i+1}.png")
        image.save(image_path)
        print(f"Saved page {i+1} as {image_path} (size: {image.size})")

    print(f"Converted {len(images)} pages to PNG images")


if __name__ == "__main__":
    if len(sys.argv) != 3:
        print("Usage: convert_pdf_to_images.py [input pdf] [output directory]")
        sys.exit(1)
    pdf_path = sys.argv[1]
    output_directory = sys.argv[2]
    convert(pdf_path, output_directory)

@@ -0,0 +1,59 @@
import json
import sys

from PIL import Image, ImageDraw


# Creates "validation" images with rectangles for the bounding box information that
# Claude creates when determining where to add text annotations in PDFs.
# This version works with IMAGE coordinates (from .scan.json files).


def create_validation_image(page_number, fields_json_path, input_path, output_path):
    """
    Create a validation image with bounding boxes overlaid.

    Args:
        page_number: Page number (1-indexed)
        fields_json_path: Path to .scan.json file (IMAGE coordinates)
        input_path: Path to input PNG image
        output_path: Path to output validation image
    """
    # Input file should be in the .scan.json format with IMAGE coordinates
    with open(fields_json_path, 'r') as f:
        fields = json.load(f)

    img = Image.open(input_path)
    draw = ImageDraw.Draw(img)
    num_boxes = 0

    for field in fields:
        if field['page'] == page_number:
            # Coordinates are already in image space - use them directly!
            entry_box_img = field['rect']
            label_box_img = field.get('label_rect', [0, 0, 0, 0])

            # Draw red rectangle over entry bounding box
            draw.rectangle(entry_box_img, outline='red', width=2)
            num_boxes += 1

            if label_box_img != [0, 0, 0, 0]:
                draw.rectangle(label_box_img, outline='blue', width=2)
                num_boxes += 1

    img.save(output_path)
    print(f"Created validation image at {output_path} with {num_boxes} bounding boxes")


if __name__ == "__main__":
    if len(sys.argv) != 5:
        print("Usage: create_validation_image.py [page number] [scan.json file] [input image path] [output image path]")
        print()
        print("Example:")
        print("  python create_validation_image.py 1 form.chatfield/form.scan.json form.chatfield/page_1.png form.chatfield/page_1_validation.png")
        sys.exit(1)
    page_number = int(sys.argv[1])
    fields_json_path = sys.argv[2]
    input_image_path = sys.argv[3]
    output_image_path = sys.argv[4]
    create_validation_image(page_number, fields_json_path, input_image_path, output_image_path)

@@ -0,0 +1,158 @@
import json
import sys

from pypdf import PdfReader


# Extracts data for the fillable form fields in a PDF and outputs JSON that
# Claude uses to fill the fields. See forms.md.


# This matches the format used by PdfReader `get_fields` and `update_page_form_field_values` methods.
def get_full_annotation_field_id(annotation):
    components = []
    while annotation:
        field_name = annotation.get('/T')
        if field_name:
            components.append(field_name)
        annotation = annotation.get('/Parent')
    return ".".join(reversed(components)) if components else None


def make_field_dict(field, field_id):
    field_dict = {"field_id": field_id}
    ft = field.get('/FT')
    if ft == "/Tx":
        field_dict["type"] = "text"
    elif ft == "/Btn":
        field_dict["type"] = "checkbox"  # radio groups handled separately
        states = field.get("/_States_", [])
        if len(states) == 2:
            # "/Off" seems to always be the unchecked value, as suggested by
            # https://opensource.adobe.com/dc-acrobat-sdk-docs/standards/pdfstandards/pdf/PDF32000_2008.pdf#page=448
            # It can be either first or second in the "/_States_" list.
            if "/Off" in states:
                field_dict["checked_value"] = states[0] if states[0] != "/Off" else states[1]
                field_dict["unchecked_value"] = "/Off"
            else:
                print(f"Unexpected state values for checkbox `{field_id}`. Its checked and unchecked values may not be correct; if you're trying to check it, visually verify the results.")
                field_dict["checked_value"] = states[0]
                field_dict["unchecked_value"] = states[1]
    elif ft == "/Ch":
        field_dict["type"] = "choice"
        states = field.get("/_States_", [])
        field_dict["choice_options"] = [{
            "value": state[0],
            "text": state[1],
        } for state in states]
    else:
        field_dict["type"] = f"unknown ({ft})"

    # Extract tooltip (TU = tooltip/user-facing text)
    tooltip = field.get('/TU')
    if tooltip:
        field_dict["tooltip"] = tooltip

    return field_dict


# Returns a list of fillable PDF fields:
# [
#   {
#     "field_id": "name",
#     "page": 1,
#     "type": ("text", "checkbox", "radio_group", or "choice")
#     // Per-type additional fields described in forms.md
#   },
# ]
def get_field_info(reader: PdfReader):
    fields = reader.get_fields()

    field_info_by_id = {}
    possible_radio_names = set()

    for field_id, field in fields.items():
        # Skip if this is a container field with children, except that it might be
        # a parent group for radio button options.
        if field.get("/Kids"):
            if field.get("/FT") == "/Btn":
                possible_radio_names.add(field_id)
            continue
        field_info_by_id[field_id] = make_field_dict(field, field_id)

    # Bounding rects are stored in annotations in page objects.

    # Radio button options have a separate annotation for each choice;
    # all choices have the same field name.
    # See https://westhealth.github.io/exploring-fillable-forms-with-pdfrw.html
    radio_fields_by_id = {}

    for page_index, page in enumerate(reader.pages):
        annotations = page.get('/Annots', [])
        for ann in annotations:
            field_id = get_full_annotation_field_id(ann)
            if field_id in field_info_by_id:
                field_info_by_id[field_id]["page"] = page_index + 1
                field_info_by_id[field_id]["rect"] = ann.get('/Rect')
            elif field_id in possible_radio_names:
                try:
                    # ann['/AP']['/N'] should have two items. One of them is '/Off',
                    # the other is the active value.
                    on_values = [v for v in ann["/AP"]["/N"] if v != "/Off"]
                except KeyError:
                    continue
                if len(on_values) == 1:
                    rect = ann.get("/Rect")
                    if field_id not in radio_fields_by_id:
                        radio_fields_by_id[field_id] = {
                            "field_id": field_id,
                            "type": "radio_group",
                            "page": page_index + 1,
                            "radio_options": [],
                        }
                    # Note: at least on macOS 15.7, Preview.app doesn't show selected
                    # radio buttons correctly. (It does if you remove the leading slash
                    # from the value, but that causes them not to appear correctly in
                    # Chrome/Firefox/Acrobat/etc).
                    radio_fields_by_id[field_id]["radio_options"].append({
                        "value": on_values[0],
                        "rect": rect,
                    })

    # Some PDFs have form field definitions without corresponding annotations,
    # so we can't tell where they are. Ignore these fields for now.
    fields_with_location = []
    for field_info in field_info_by_id.values():
        if "page" in field_info:
            fields_with_location.append(field_info)
        else:
            print(f"Unable to determine location for field id: {field_info.get('field_id')}, ignoring")

    # Sort by page number, then Y position (flipped in PDF coordinate system), then X.
    def sort_key(f):
        if "radio_options" in f:
            rect = f["radio_options"][0]["rect"] or [0, 0, 0, 0]
        else:
            rect = f.get("rect") or [0, 0, 0, 0]
        adjusted_position = [-rect[1], rect[0]]
        return [f.get("page"), adjusted_position]

    sorted_fields = fields_with_location + list(radio_fields_by_id.values())
    sorted_fields.sort(key=sort_key)

    return sorted_fields


def write_field_info(pdf_path: str, json_output_path: str):
    reader = PdfReader(pdf_path)
    field_info = get_field_info(reader)
    with open(json_output_path, "w") as f:
        json.dump(field_info, f, indent=2)
    print(f"Wrote {len(field_info)} fields to {json_output_path}")


if __name__ == "__main__":
    if len(sys.argv) != 3:
        print("Usage: extract_form_field_info.py [input pdf] [output json]")
        sys.exit(1)
    write_field_info(sys.argv[1], sys.argv[2])