Initial commit

Zhongwei Li
2025-11-30 08:25:58 +08:00
commit e13b6ff259
31 changed files with 3185 additions and 0 deletions


@@ -0,0 +1,2 @@
*.png
*.plantuml


@@ -0,0 +1,117 @@
---
name: extracting-form-fields
description: Extract form field data from PDFs as a first step to filling PDF forms
allowed-tools: Read, Write, Edit, Glob, Bash
version: 1.0.0a2
license: Apache 2.0
---
# Extracting Form Fields
Prepare working directory and extract field data from PDF forms.
<purpose>
This skill extracts PDF form information into useful JSON.
- Detects fillable vs. non-fillable PDFs
- Extracts PDF content as readable Markdown
- Creates field metadata in common JSON format
</purpose>
## Inputs
- **PDF path**: Path to PDF file (e.g., `/home/user/input.pdf`)
## Process Overview
```plantuml
@startuml SKILL
title Extracting Form Fields - High-Level Workflow
start
:Create working directory;
:Copy interview template;
:Extract PDF content as Markdown;
:Check Fillability;
if (PDF has fillable fields?) then (yes)
:Fillable workflow
(see Fillable-Forms.md);
else (no)
:Non-fillable workflow
(see Nonfillable-Forms.md);
endif
:**✓ EXTRACTION COMPLETE**;
:Ready for Form Data Model creation;
stop
@enduml
```
## Process
### 1. Create Working Directory
```bash
mkdir <basename>.chatfield
```
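As a rough illustration of the naming convention, the sketch below derives the working directory from a PDF path. It assumes `<basename>` means the PDF file name without its `.pdf` extension and that the directory sits next to the PDF; `working_dir_for` is a hypothetical helper, not one of this skill's scripts.

```python
from pathlib import Path


def working_dir_for(pdf_path: str) -> Path:
    """Return the <basename>.chatfield directory for a PDF (hypothetical helper)."""
    pdf = Path(pdf_path)
    # Assumption: <basename> is the file name minus its final extension.
    return pdf.parent / f"{pdf.stem}.chatfield"


if __name__ == "__main__":
    print(working_dir_for("/home/user/input.pdf"))
```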
### 2. Copy Interview Template
Copy the interview template from the included `filling-pdf-forms` skill. The example path below is relative to this skill's directory.
```bash
cp ../filling-pdf-forms/scripts/chatfield_interview_template.py <basename>.chatfield/interview.py
```
### 3. Extract PDF Content
```bash
markitdown <pdf_path> > <basename>.chatfield/<basename>.form.md
```
### 4. Check Fillability
```bash
python scripts/check_fillable_fields.py <pdf_path>
```
**Output:**
- `"This PDF has fillable form fields"` → use fillable workflow
- `"This PDF does not have fillable form fields"` → use non-fillable workflow
### 5. Branch Based on Fillability
#### If Fillable:
Follow ./references/Fillable-Forms.md
#### If Non-fillable:
Follow ./references/Nonfillable-Forms.md
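The branch in steps 4 and 5 can be sketched as a lookup from the checker's output to the guide to follow. This is only an illustration: `choose_workflow` is a hypothetical helper, and it assumes the two messages are printed exactly as listed under **Output**.

```python
FILLABLE_MESSAGE = "This PDF has fillable form fields"
NON_FILLABLE_MESSAGE = "This PDF does not have fillable form fields"


def choose_workflow(script_output: str) -> str:
    """Map the checker's stdout to the reference guide to follow (hypothetical helper)."""
    text = script_output.strip()
    if text == FILLABLE_MESSAGE:
        return "./references/Fillable-Forms.md"
    if text == NON_FILLABLE_MESSAGE:
        return "./references/Nonfillable-Forms.md"
    # Anything else (for example a traceback) should not be guessed at.
    raise ValueError(f"Unrecognized output: {text!r}")


if __name__ == "__main__":
    print(choose_workflow("This PDF has fillable form fields\n"))
```

Matching the whole message avoids guessing when the script fails and prints something else.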
## Output Format
### Fillable PDFs - .form.json
```json
[
  {
    "field_id": "topmostSubform[0].Page1[0].f1_01[0]",
    "type": "text",
    "page": 1,
    "rect": [100, 200, 300, 220],
    "tooltip": "Enter your full legal name",
    "max_length": null
  },
  {
    "field_id": "checkbox_over_18",
    "type": "checkbox",
    "page": 1,
    "rect": [150, 250, 165, 265],
    "checked_value": "/1",
    "unchecked_value": "/Off"
  }
]
```
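A minimal sketch of how a consumer might read this format, using a shortened copy of the example above as inline data. The helper names `count_field_types` and `checkbox_values` are hypothetical and only illustrate the shape of the JSON.

```python
import json
from collections import Counter

# Shortened copy of the example above (assumption: real files have the same keys).
SAMPLE = """
[
  {"field_id": "topmostSubform[0].Page1[0].f1_01[0]", "type": "text",
   "page": 1, "rect": [100, 200, 300, 220]},
  {"field_id": "checkbox_over_18", "type": "checkbox", "page": 1,
   "rect": [150, 250, 165, 265], "checked_value": "/1", "unchecked_value": "/Off"}
]
"""


def count_field_types(form_json_text: str) -> Counter:
    """Count how many fields of each type the file declares."""
    return Counter(field["type"] for field in json.loads(form_json_text))


def checkbox_values(form_json_text: str) -> dict:
    """Map each checkbox field_id to its (checked, unchecked) values."""
    return {
        field["field_id"]: (field["checked_value"], field["unchecked_value"])
        for field in json.loads(form_json_text)
        if field["type"] == "checkbox"
    }


if __name__ == "__main__":
    print(count_field_types(SAMPLE))
    print(checkbox_values(SAMPLE))
```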
## References
- ./references/Fillable-Forms.md - Fillable PDF extraction workflow
- ./references/Nonfillable-Forms.md - Non-fillable PDF extraction workflow


@@ -0,0 +1,29 @@
# Fillable PDF Forms - Extraction Guide
This guide is for the "extracting-form-fields" agent performing extraction on fillable PDFs.
## Process Overview
```plantuml
@startuml Fillable-Forms
title Fillable PDF Forms - Extraction Workflow
start
:Extract form field metadata;
:**✓ FILLABLE EXTRACTION COMPLETE**;
stop
@enduml
```
## Extraction Process
### 1. Extract Form Field Metadata
```bash
python scripts/extract_form_field_info.py input.pdf input.chatfield/input.form.json
```
This creates a JSON file containing the field metadata.
## Completion Report
After extraction, simply state "Done". If there is an unrecoverable error, halt and report the error verbatim.


@@ -0,0 +1,218 @@
# Non-fillable PDF Forms - Extraction Guide
You'll need to visually determine where the data should be added as text annotations. Follow the steps below *exactly*. You MUST perform all of these steps to ensure that the form is accurately completed. Details for each step are below.
- Convert the PDF to PNG images and determine field bounding boxes.
- Create a JSON file with field information and validation images showing the bounding boxes.
- Validate the bounding boxes.
## Process Overview
```plantuml
@startuml Nonfillable-Forms
title Non-fillable PDF Forms - Extraction Workflow
start
:Convert PDF to PNG images;
:Visual analysis & determine bounding boxes
in IMAGE coordinates;
:Create .scan.json;
repeat
:Automated intersection check
on image coordinates;
if (Automated check passes?) then (yes)
:Create validation images
(overlay on PNGs);
:Manual image inspection;
if (Manual check passes?) then (yes)
else (no)
:Fix bounding boxes in .scan.json;
endif
else (no)
:Fix bounding boxes in .scan.json;
endif
repeat while (Both checks pass?) is (no)
->yes;
:Convert coordinates
(.scan.json → .form.json);
:**✓ NON-FILLABLE EXTRACTION COMPLETE**
.form.json ready with PDF coordinates;
stop
@enduml
```
## Extraction Process
## Step 1: Visual Analysis (REQUIRED)
- Convert the PDF to PNG images. Run this script from this skill's directory:
```bash
python scripts/convert_pdf_to_images.py <basename>.pdf <basename>.chatfield/
```
The script will create a PNG image for each page.
- Read and analyze the `.form.md` file, which is a Markdown text preview of the PDF content.
- Carefully examine each PNG image and identify all form fields and areas where the user should enter data. For each form field where the user should enter information, determine bounding boxes, in the image coordinate system, for both the field label and the input entry area. The label and entry bounding boxes MUST NOT INTERSECT; the text entry box should only include the area where data should be entered. Usually this area will be immediately to the side of, above, or below its label. Entry bounding boxes must be tall and wide enough to contain their text.
These are some examples of form structures that you might see (in English, but the form can be any language):
*Label inside box*
```
┌────────────────────────┐
│ Name: │
└────────────────────────┘
```
The input area should be to the right of the "Name" label and extend to the edge of the box.
*Label before line*
```
Email: _______________________
```
The input area should be above the line and include its entire width.
*Label under line*
```
_________________________
Name
```
The input area should be above the line and include the entire width of the line. This is common for signature and date fields.
*Label above line*
```
Please enter any special requests:
________________________________________________
```
The input area should extend from the bottom of the label to the line, and should include the entire width of the line.
*Checkboxes*
```
Are you a US citizen? Yes □ No □
```
For checkboxes:
- Look for small square boxes (□) - these are the actual checkboxes to target. They may be to the left or right of their labels.
- Distinguish between label text ("Yes", "No") and the clickable checkbox squares.
- The entry bounding box should cover ONLY the small square, not the text label.
## Step 2: Create .scan.json
Create `<basename>.chatfield/<basename>.scan.json` formatted like the example below. Rectangle values are **IMAGE coordinates** (what you see directly in the PNG, top-left origin).
```json
[
  {
    "field_id": "full_name",
    "type": "text",
    "page": 1,
    "rect": [180, 200, 550, 220],
    "label_text": "Full Name:",
    "label_rect": [50, 200, 175, 220]
  },
  {
    "field_id": "is_citizen",
    "type": "checkbox",
    "page": 1,
    "rect": [60, 320, 75, 335],
    "label_text": "US Citizen",
    "label_rect": [80, 320, 150, 335],
    "checked_value": "X",
    "unchecked_value": ""
  }
]
```
**Field structure:**
- `field_id` - Unique identifier (will be used in chatfield definition)
  - **CRITICAL:** Every field MUST have a unique field_id with no collisions
  - Field IDs are internal identifiers, not user-facing
- `type` - "text" or "checkbox"
- `page` - Page number (1-indexed)
- `rect` - Entry area bounding box [x1, y1, x2, y2] where data will be written
- `label_text` - Optional label text for this field
- `label_rect` - Optional label bounding box [x1, y1, x2, y2]
- For checkboxes only:
  - `checked_value` - String to write when checked (typically "X" or "✓")
  - `unchecked_value` - String to write when unchecked (typically "")
**Bounding box coordinates (IMAGE COORDINATES):**
- Image coordinate system: Origin (0,0) at top-left
- X increases to the right, Y increases downward
- Format: `[x1, y1, x2, y2]` where (x1,y1) is top-left corner, (x2,y2) is bottom-right corner
- These are the pixel coordinates you see directly in the PNG image
- Entry boxes (`rect`) must be tall and wide enough to contain text
- Label boxes (`label_rect`) should contain the label text
- Entry and label boxes MUST NOT overlap
- Checkboxes should be at least 10-20 pixels square
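Two of the rules in this step, unique `field_id` values and rectangles ordered as top-left then bottom-right, are easy to pre-check before running the validation scripts. The sketch below is illustrative only: `scan_problems` is a hypothetical helper, and it assumes the `.scan.json` content has already been loaded into a list of dicts.

```python
from collections import Counter


def scan_problems(fields: list) -> list:
    """Return human-readable problems found in .scan.json fields (hypothetical helper)."""
    problems = []

    # Rule: every field_id must be unique.
    counts = Counter(field["field_id"] for field in fields)
    for field_id, count in sorted(counts.items()):
        if count > 1:
            problems.append(f"duplicate field_id: {field_id}")

    # Rule: rect is [x1, y1, x2, y2] with (x1, y1) top-left and (x2, y2) bottom-right.
    for field in fields:
        x1, y1, x2, y2 = field["rect"]
        if not (x1 < x2 and y1 < y2):
            problems.append(f"rect for {field['field_id']} is not ordered top-left to bottom-right")
    return problems


if __name__ == "__main__":
    fields = [
        {"field_id": "full_name", "rect": [180, 200, 550, 220]},
        {"field_id": "full_name", "rect": [550, 220, 180, 200]},
    ]
    for problem in scan_problems(fields):
        print(problem)
```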
## Step 3: Validate Bounding Boxes (REQUIRED)
This is a two-stage validation process. You must pass the automated check before proceeding to manual inspection.
### Stage 1: Automated intersection check
Run the automated check script:
```bash
python scripts/check_bounding_boxes.py <basename>.chatfield/<basename>.scan.json
```
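**What it checks:**
- Intersections between any two bounding boxes (label or entry) on the same page; boxes that only touch along an edge pass
- Entry boxes shorter than their font size, for fields that carry an `entry_text` entry (default font size: 14)
- Required keys are not reported as `FAILURE` messages; a field missing `rect` or `page` typically makes the script stop with a `KeyError`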
**If there are errors:** Fix the bounding boxes in `.scan.json` and re-run the automated check. Iterate until there are no remaining errors.
**Only proceed to Stage 2 once all automated checks pass.**
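The intersection rule can be tried in isolation. The function below mirrors the test used in `check_bounding_boxes.py`; the sample rectangles come from the `.scan.json` example and from the script's unit tests, and the printed values follow from the comparisons shown.

```python
def rects_intersect(r1, r2):
    """True when two [x1, y1, x2, y2] boxes overlap; touching edges do not count."""
    disjoint_horizontal = r1[0] >= r2[2] or r1[2] <= r2[0]
    disjoint_vertical = r1[1] >= r2[3] or r1[3] <= r2[1]
    return not (disjoint_horizontal or disjoint_vertical)


if __name__ == "__main__":
    # Label ends at x=175, entry starts at x=180: separate boxes.
    print(rects_intersect([50, 200, 175, 220], [180, 200, 550, 220]))  # False
    # Label runs to x=60 but entry starts at x=50: they overlap.
    print(rects_intersect([10, 10, 60, 30], [50, 10, 150, 30]))  # True
    # Boxes share only the edge at x=50: allowed.
    print(rects_intersect([10, 10, 50, 30], [50, 10, 150, 30]))  # False
```

Because only relative positions are compared, the same check applies to image coordinates and PDF coordinates alike.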
### Stage 2: Manual image inspection
Create validation images for each page:
```bash
# For each page (e.g., if you have 3 pages)
python scripts/create_validation_image.py 1 <basename>.chatfield/<basename>.scan.json <basename>.chatfield/page_1.png <basename>.chatfield/page_1_validation.png
python scripts/create_validation_image.py 2 <basename>.chatfield/<basename>.scan.json <basename>.chatfield/page_2.png <basename>.chatfield/page_2_validation.png
python scripts/create_validation_image.py 3 <basename>.chatfield/<basename>.scan.json <basename>.chatfield/page_3.png <basename>.chatfield/page_3_validation.png
```
This overlays colored rectangles (red for entry boxes, blue for labels) on the PNG images to visualize bounding boxes.
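For forms with many pages, the per-page commands can be generated rather than typed. A minimal sketch, assuming the page images follow the `page_<n>.png` naming produced by `convert_pdf_to_images.py`; `validation_commands` is a hypothetical helper that only builds the argument lists and does not run them.

```python
def validation_commands(basename: str, num_pages: int) -> list[list[str]]:
    """Build one create_validation_image.py command per page (hypothetical helper)."""
    workdir = f"{basename}.chatfield"
    commands = []
    for page in range(1, num_pages + 1):  # pages are 1-indexed
        commands.append([
            "python", "scripts/create_validation_image.py", str(page),
            f"{workdir}/{basename}.scan.json",
            f"{workdir}/page_{page}.png",
            f"{workdir}/page_{page}_validation.png",
        ])
    return commands


if __name__ == "__main__":
    for command in validation_commands("form", 3):
        print(" ".join(command))
```

Each list can be passed to `subprocess.run(command, check=True)` from this skill's directory.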
**CRITICAL: Visually inspect validation images**
Remember: label (blue) bounding boxes should contain text labels; entry (red) boxes should not.
- Red rectangles must ONLY cover input areas
- Red rectangles MUST NOT contain any text or labels
- Blue rectangles should contain label text
- For checkboxes:
  - Red rectangle MUST be centered on the checkbox square
  - Blue rectangle should cover the text label for the checkbox
**If any rectangles look wrong:** Fix bounding boxes in `.scan.json`, then return to Stage 1 (automated check gate). You must pass both stages again.
## Step 4: Convert to PDF Coordinates
Once all validation passes, convert the image coordinates to PDF coordinates:
```bash
python scripts/convert_coordinates.py <basename>.chatfield/<basename>.scan.json <basename>.pdf
```
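A worked example of the conversion may help when checking results by hand. The function below follows the same arithmetic as `image_to_pdf_coords` in `convert_coordinates.py`: scale X and Y, then flip the Y axis. The page and image sizes are assumptions chosen for round numbers (a 612 x 792 pt page rendered at 850 x 1100 px), not values taken from a real form.

```python
def image_to_pdf_coords(image_bbox, image_width, image_height, pdf_width, pdf_height):
    """Convert [x1, y1, x2, y2] from image space (top-left origin) to PDF space."""
    x_scale = pdf_width / image_width
    y_scale = pdf_height / image_height
    x1, y1, x2, y2 = image_bbox
    return [
        x1 * x_scale,
        (image_height - y2) * y_scale,  # image bottom edge becomes the lower PDF y
        x2 * x_scale,
        (image_height - y1) * y_scale,  # image top edge becomes the higher PDF y
    ]


if __name__ == "__main__":
    # Assumed sizes: 850 x 1100 px image of a 612 x 792 pt page (scale 0.72).
    rect = image_to_pdf_coords([100, 200, 300, 250], 850, 1100, 612, 792)
    print([round(value, 2) for value in rect])  # roughly [72.0, 612.0, 216.0, 648.0]
```

The box is 50 px tall in the image and 36 pt tall in the PDF, and it sits 200 px from the top of the image, which corresponds to 648 pt from the bottom of the page.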
## Troubleshooting
**Bounding boxes don't align in validation images:**
- Review the validation image carefully
- Adjust coordinates in `.scan.json`
- Remember: You're using IMAGE coordinates (origin at top-left, Y downward)
- Re-run validation after changes
**Text gets cut off:**
- Increase bounding box height and/or width in `.scan.json`
- Entry boxes should have extra space for text
**Validation script errors:**
- Ensure all page images exist in `<basename>.chatfield/`
- Verify JSON syntax in `.scan.json`
- Check that page numbers are 1-indexed
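The three checks above can be combined into a small pre-flight step. This is a sketch under assumptions: `preflight` is a hypothetical helper, the `.scan.json` file is expected to hold a list of field objects, and page images are expected to sit beside it as `page_<n>.png`.

```python
import json
import tempfile
from pathlib import Path


def preflight(scan_json_path) -> list:
    """Return problems that would trip the validation scripts (hypothetical helper)."""
    scan_path = Path(scan_json_path)
    try:
        fields = json.loads(scan_path.read_text())
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc}"]

    problems = []
    for field in fields:
        page = field.get("page")
        if not isinstance(page, int) or page < 1:
            problems.append(f"{field.get('field_id')}: page must be a 1-indexed integer, got {page!r}")
            continue
        image = scan_path.parent / f"page_{page}.png"
        if not image.exists():
            problems.append(f"{field.get('field_id')}: missing {image.name}")
    return problems


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        scan = Path(tmp) / "form.scan.json"
        scan.write_text(json.dumps([{"field_id": "a", "page": 1, "rect": [0, 0, 10, 10]}]))
        # page_1.png was never created here, so one problem is reported.
        print(preflight(scan))
```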
---
**See Also:**
- ../../filling-pdf-forms/references/Converting-PDF-To-Chatfield.md - How the main skill builds the interview
- ./Fillable-Forms.md - Alternative extraction for fillable PDFs
- ../../filling-pdf-forms/references/populating.md - How bounding boxes are used during PDF population


@@ -0,0 +1,78 @@
from dataclasses import dataclass
import json
import sys


# Script to check that bounding boxes in a JSON file do not overlap or have other issues.
# Works with any coordinate system since it only checks geometric relationships.


@dataclass
class RectAndField:
    rect: list[float]
    rect_type: str
    field: dict


# Returns a list of messages that are printed to stdout for Claude to read.
def get_bounding_box_messages(fields_json_stream) -> list[str]:
    messages = []
    fields = json.load(fields_json_stream)
    messages.append(f"Read {len(fields)} fields")

    def rects_intersect(r1, r2):
        disjoint_horizontal = r1[0] >= r2[2] or r1[2] <= r2[0]
        disjoint_vertical = r1[1] >= r2[3] or r1[3] <= r2[1]
        return not (disjoint_horizontal or disjoint_vertical)

    rects_and_fields = []
    for f in fields:
        # Skip empty label rects (used for fields without labels)
        label_rect = f.get('label_rect', [0, 0, 0, 0])
        if label_rect != [0, 0, 0, 0]:
            rects_and_fields.append(RectAndField(label_rect, "label", f))
        rects_and_fields.append(RectAndField(f['rect'], "entry", f))

    has_error = False
    for i, ri in enumerate(rects_and_fields):
        # This is O(N^2); we can optimize if it becomes a problem.
        for j in range(i + 1, len(rects_and_fields)):
            rj = rects_and_fields[j]
            if ri.field['page'] == rj.field['page'] and rects_intersect(ri.rect, rj.rect):
                has_error = True
                if ri.field is rj.field:
                    messages.append(f"FAILURE: intersection between label and entry bounding boxes for `{ri.field['field_id']}` ({ri.rect}, {rj.rect})")
                else:
                    messages.append(f"FAILURE: intersection between {ri.rect_type} bounding box for `{ri.field['field_id']}` ({ri.rect}) and {rj.rect_type} bounding box for `{rj.field['field_id']}` ({rj.rect})")
                if len(messages) >= 20:
                    messages.append("Aborting further checks; fix bounding boxes and try again")
                    return messages
        if ri.rect_type == "entry":
            if "entry_text" in ri.field:
                font_size = ri.field["entry_text"].get("font_size", 14)
                entry_height = ri.rect[3] - ri.rect[1]
                if entry_height < font_size:
                    has_error = True
                    messages.append(f"FAILURE: entry bounding box height ({entry_height}) for `{ri.field['field_id']}` is too short for the text content (font size: {font_size}). Increase the box height or decrease the font size.")
                    if len(messages) >= 20:
                        messages.append("Aborting further checks; fix bounding boxes and try again")
                        return messages

    if not has_error:
        messages.append("SUCCESS: All bounding boxes are valid")
    return messages


if __name__ == "__main__":
    if len(sys.argv) != 2:
        print("Usage: check_bounding_boxes.py [fields.json or scan.json]")
        print()
        print("Examples:")
        print("  python check_bounding_boxes.py form.chatfield/form.scan.json")
        print("  python check_bounding_boxes.py form.chatfield/form.form.json")
        sys.exit(1)
    # Input file can be .scan.json (image coords) or .form.json (PDF coords)
    # The geometry checks work the same either way
    with open(sys.argv[1]) as f:
        messages = get_bounding_box_messages(f)
    for msg in messages:
        print(msg)


@@ -0,0 +1,226 @@
import unittest
import json
import io

from check_bounding_boxes import get_bounding_box_messages


# Currently this is not run automatically in CI; it's just for documentation and manual checking.
# The test data uses the same schema that get_bounding_box_messages reads:
# a top-level list of fields with `field_id`, `page`, `rect` and optional `label_rect`.
class TestGetBoundingBoxMessages(unittest.TestCase):

    def create_json_stream(self, data):
        """Helper to create a JSON stream from data"""
        return io.StringIO(json.dumps(data))

    def test_no_intersections(self):
        """Test case with no bounding box intersections"""
        data = [
            {
                "field_id": "name",
                "page": 1,
                "label_rect": [10, 10, 50, 30],
                "rect": [60, 10, 150, 30]
            },
            {
                "field_id": "email",
                "page": 1,
                "label_rect": [10, 40, 50, 60],
                "rect": [60, 40, 150, 60]
            }
        ]

        stream = self.create_json_stream(data)
        messages = get_bounding_box_messages(stream)
        self.assertTrue(any("SUCCESS" in msg for msg in messages))
        self.assertFalse(any("FAILURE" in msg for msg in messages))

    def test_label_entry_intersection_same_field(self):
        """Test intersection between label and entry of the same field"""
        data = [
            {
                "field_id": "name",
                "page": 1,
                "label_rect": [10, 10, 60, 30],
                "rect": [50, 10, 150, 30]  # Overlaps with label
            }
        ]

        stream = self.create_json_stream(data)
        messages = get_bounding_box_messages(stream)
        self.assertTrue(any("FAILURE" in msg and "intersection" in msg for msg in messages))
        self.assertFalse(any("SUCCESS" in msg for msg in messages))

    def test_intersection_between_different_fields(self):
        """Test intersection between bounding boxes of different fields"""
        data = [
            {
                "field_id": "name",
                "page": 1,
                "label_rect": [10, 10, 50, 30],
                "rect": [60, 10, 150, 30]
            },
            {
                "field_id": "email",
                "page": 1,
                "label_rect": [40, 20, 80, 40],  # Overlaps with Name's boxes
                "rect": [160, 10, 250, 30]
            }
        ]

        stream = self.create_json_stream(data)
        messages = get_bounding_box_messages(stream)
        self.assertTrue(any("FAILURE" in msg and "intersection" in msg for msg in messages))
        self.assertFalse(any("SUCCESS" in msg for msg in messages))

    def test_different_pages_no_intersection(self):
        """Test that boxes on different pages don't count as intersecting"""
        data = [
            {
                "field_id": "name",
                "page": 1,
                "label_rect": [10, 10, 50, 30],
                "rect": [60, 10, 150, 30]
            },
            {
                "field_id": "email",
                "page": 2,
                "label_rect": [10, 10, 50, 30],  # Same coordinates but different page
                "rect": [60, 10, 150, 30]
            }
        ]

        stream = self.create_json_stream(data)
        messages = get_bounding_box_messages(stream)
        self.assertTrue(any("SUCCESS" in msg for msg in messages))
        self.assertFalse(any("FAILURE" in msg for msg in messages))

    def test_entry_height_too_small(self):
        """Test that entry box height is checked against font size"""
        data = [
            {
                "field_id": "name",
                "page": 1,
                "label_rect": [10, 10, 50, 30],
                "rect": [60, 10, 150, 20],  # Height is 10
                "entry_text": {
                    "font_size": 14  # Font size larger than height
                }
            }
        ]

        stream = self.create_json_stream(data)
        messages = get_bounding_box_messages(stream)
        self.assertTrue(any("FAILURE" in msg and "height" in msg for msg in messages))
        self.assertFalse(any("SUCCESS" in msg for msg in messages))

    def test_entry_height_adequate(self):
        """Test that adequate entry box height passes"""
        data = [
            {
                "field_id": "name",
                "page": 1,
                "label_rect": [10, 10, 50, 30],
                "rect": [60, 10, 150, 30],  # Height is 20
                "entry_text": {
                    "font_size": 14  # Font size smaller than height
                }
            }
        ]

        stream = self.create_json_stream(data)
        messages = get_bounding_box_messages(stream)
        self.assertTrue(any("SUCCESS" in msg for msg in messages))
        self.assertFalse(any("FAILURE" in msg for msg in messages))

    def test_default_font_size(self):
        """Test that default font size is used when not specified"""
        data = [
            {
                "field_id": "name",
                "page": 1,
                "label_rect": [10, 10, 50, 30],
                "rect": [60, 10, 150, 20],  # Height is 10
                "entry_text": {}  # No font_size specified, should use default 14
            }
        ]

        stream = self.create_json_stream(data)
        messages = get_bounding_box_messages(stream)
        self.assertTrue(any("FAILURE" in msg and "height" in msg for msg in messages))
        self.assertFalse(any("SUCCESS" in msg for msg in messages))

    def test_no_entry_text(self):
        """Test that missing entry_text doesn't cause height check"""
        data = [
            {
                "field_id": "name",
                "page": 1,
                "label_rect": [10, 10, 50, 30],
                "rect": [60, 10, 150, 20]  # Small height but no entry_text
            }
        ]

        stream = self.create_json_stream(data)
        messages = get_bounding_box_messages(stream)
        self.assertTrue(any("SUCCESS" in msg for msg in messages))
        self.assertFalse(any("FAILURE" in msg for msg in messages))

    def test_multiple_errors_limit(self):
        """Test that error messages are limited to prevent excessive output"""
        fields = []
        # Create many overlapping fields
        for i in range(25):
            fields.append({
                "field_id": f"field_{i}",
                "page": 1,
                "label_rect": [10, 10, 50, 30],  # All overlap
                "rect": [20, 15, 60, 35]  # All overlap
            })

        stream = self.create_json_stream(fields)
        messages = get_bounding_box_messages(stream)
        # Should abort after ~20 messages
        self.assertTrue(any("Aborting" in msg for msg in messages))
        # Should have some FAILURE messages but not hundreds
        failure_count = sum(1 for msg in messages if "FAILURE" in msg)
        self.assertGreater(failure_count, 0)
        self.assertLess(len(messages), 30)  # Should be limited

    def test_edge_touching_boxes(self):
        """Test that boxes touching at edges don't count as intersecting"""
        data = [
            {
                "field_id": "name",
                "page": 1,
                "label_rect": [10, 10, 50, 30],
                "rect": [50, 10, 150, 30]  # Touches at x=50
            }
        ]

        stream = self.create_json_stream(data)
        messages = get_bounding_box_messages(stream)
        self.assertTrue(any("SUCCESS" in msg for msg in messages))
        self.assertFalse(any("FAILURE" in msg for msg in messages))


if __name__ == '__main__':
    unittest.main()


@@ -0,0 +1,12 @@
import sys

from pypdf import PdfReader


# Script for Claude to run to determine whether a PDF has fillable form fields. See forms.md.


reader = PdfReader(sys.argv[1])
if reader.get_fields():
    print("This PDF has fillable form fields")
else:
    print("This PDF does not have fillable form fields")


@@ -0,0 +1,179 @@
#!/usr/bin/env python3
"""
Converts bounding box coordinates from image coordinates to PDF coordinates.

This script takes a .scan.json file (with image coordinates) and converts all
bounding boxes to PDF coordinates, producing a .form.json file.

Image coordinates: Origin at top-left, Y increases downward
PDF coordinates: Origin at bottom-left, Y increases upward

Usage:
    python convert_coordinates.py <scan.json> <pdf_file>
"""

import json
import sys
from pathlib import Path

from PIL import Image
from pypdf import PdfReader


def image_to_pdf_coords(image_bbox, image_width, image_height, pdf_width, pdf_height):
    """
    Convert bounding box from image coordinates to PDF coordinates.

    Args:
        image_bbox: [x1, y1, x2, y2] in image coordinates (top-left origin)
        image_width: Width of the image in pixels
        image_height: Height of the image in pixels
        pdf_width: Width of the PDF page in points
        pdf_height: Height of the PDF page in points

    Returns:
        [x1, y1, x2, y2] in PDF coordinates (bottom-left origin)
    """
    x_scale = pdf_width / image_width
    y_scale = pdf_height / image_height

    # Convert X coordinates (simple scaling, same origin)
    pdf_x1 = image_bbox[0] * x_scale
    pdf_x2 = image_bbox[2] * x_scale

    # Convert Y coordinates (flip vertical axis)
    # Image: y1 is top, y2 is bottom (y1 < y2 in image coords)
    # PDF: need to flip - what's at top of image is high Y in PDF
    pdf_y1 = (image_height - image_bbox[3]) * y_scale  # Bottom in PDF (was bottom in image)
    pdf_y2 = (image_height - image_bbox[1]) * y_scale  # Top in PDF (was top in image)

    return [pdf_x1, pdf_y1, pdf_x2, pdf_y2]


def get_image_dimensions(images_dir, page_number):
    """Get dimensions of the PNG image for a specific page."""
    image_path = Path(images_dir) / f"page_{page_number}.png"
    if not image_path.exists():
        raise FileNotFoundError(f"Image not found: {image_path}")

    with Image.open(image_path) as img:
        return img.width, img.height


def convert_scan_to_form(scan_json_path, pdf_path, output_json_path):
    """
    Convert .scan.json (image coords) to .form.json (PDF coords).

    Args:
        scan_json_path: Path to input .scan.json file
        pdf_path: Path to the PDF file
        output_json_path: Path to output .form.json file
    """
    # Load scan data
    with open(scan_json_path, 'r') as f:
        fields = json.load(f)

    # Get PDF dimensions
    reader = PdfReader(pdf_path)

    # Determine images directory (same directory as scan.json)
    scan_path = Path(scan_json_path)
    images_dir = scan_path.parent
    if not images_dir.exists():
        raise FileNotFoundError(
            f"Images directory not found: {images_dir}\n"
            f"Expected to find page images in {images_dir}"
        )

    # Convert each field
    converted_fields = []
    for field in fields:
        page_num = field.get('page', 1)

        # Get dimensions for this page
        page = reader.pages[page_num - 1]  # Convert to 0-indexed
        pdf_width = float(page.mediabox.width)
        pdf_height = float(page.mediabox.height)
        image_width, image_height = get_image_dimensions(images_dir, page_num)

        # Create converted field
        converted_field = field.copy()

        # Convert main rect
        if 'rect' in field:
            converted_field['rect'] = image_to_pdf_coords(
                field['rect'],
                image_width, image_height,
                pdf_width, pdf_height
            )

        # Convert label_rect if present
        if 'label_rect' in field:
            converted_field['label_rect'] = image_to_pdf_coords(
                field['label_rect'],
                image_width, image_height,
                pdf_width, pdf_height
            )

        # Convert radio button options if present
        if 'radio_options' in field:
            converted_options = []
            for option in field['radio_options']:
                converted_option = option.copy()
                if 'rect' in option:
                    converted_option['rect'] = image_to_pdf_coords(
                        option['rect'],
                        image_width, image_height,
                        pdf_width, pdf_height
                    )
                converted_options.append(converted_option)
            converted_field['radio_options'] = converted_options

        converted_fields.append(converted_field)

    # Write output
    with open(output_json_path, 'w') as f:
        json.dump(converted_fields, f, indent=2)

    print(f"Converted {len(converted_fields)} fields from image to PDF coordinates")
    print(f"Input: {scan_json_path}")
    print(f"Output: {output_json_path}")

    # Show an example conversion
    if converted_fields:
        print("\nExample conversion (first field):")
        orig = fields[0]
        conv = converted_fields[0]
        print(f"  Field: {orig.get('field_id')}")
        print(f"  Image rect: {orig.get('rect')}")
        print(f"  PDF rect: {conv.get('rect')}")


if __name__ == "__main__":
    if len(sys.argv) != 3:
        print("Usage: convert_coordinates.py <scan.json> <pdf_file>")
        print()
        print("Example:")
        print("  python convert_coordinates.py my_form.chatfield/my_form.scan.json my_form.pdf")
        print()
        print("Output filename is automatically computed by replacing .scan.json with .form.json")
        sys.exit(1)

    scan_json_path = sys.argv[1]
    pdf_path = sys.argv[2]

    # Compute output filename by replacing .scan.json with .form.json
    scan_path = Path(scan_json_path)
    if not scan_path.name.endswith('.scan.json'):
        print(f"Error: Input file must end with .scan.json, got: {scan_path.name}", file=sys.stderr)
        sys.exit(1)
    output_json_path = str(scan_path.parent / scan_path.name.replace('.scan.json', '.form.json'))

    try:
        convert_scan_to_form(scan_json_path, pdf_path, output_json_path)
    except Exception as e:
        print(f"Error: {e}", file=sys.stderr)
        sys.exit(1)


@@ -0,0 +1,35 @@
import os
import sys

from pdf2image import convert_from_path


# Converts each page of a PDF to a PNG image.


def convert(pdf_path, output_dir, max_dim=1000):
    images = convert_from_path(pdf_path, dpi=200)

    for i, image in enumerate(images):
        # Scale image if needed to keep width/height under `max_dim`
        width, height = image.size
        if width > max_dim or height > max_dim:
            scale_factor = min(max_dim / width, max_dim / height)
            new_width = int(width * scale_factor)
            new_height = int(height * scale_factor)
            image = image.resize((new_width, new_height))

        image_path = os.path.join(output_dir, f"page_{i+1}.png")
        image.save(image_path)
        print(f"Saved page {i+1} as {image_path} (size: {image.size})")

    print(f"Converted {len(images)} pages to PNG images")


if __name__ == "__main__":
    if len(sys.argv) != 3:
        print("Usage: convert_pdf_to_images.py [input pdf] [output directory]")
        sys.exit(1)
    pdf_path = sys.argv[1]
    output_directory = sys.argv[2]
    convert(pdf_path, output_directory)


@@ -0,0 +1,59 @@
import json
import sys

from PIL import Image, ImageDraw


# Creates "validation" images with rectangles for the bounding box information that
# Claude creates when determining where to add text annotations in PDFs.
# This version works with IMAGE coordinates (from .scan.json files).


def create_validation_image(page_number, fields_json_path, input_path, output_path):
    """
    Create a validation image with bounding boxes overlaid.

    Args:
        page_number: Page number (1-indexed)
        fields_json_path: Path to .scan.json file (IMAGE coordinates)
        input_path: Path to input PNG image
        output_path: Path to output validation image
    """
    # Input file should be in the .scan.json format with IMAGE coordinates
    with open(fields_json_path, 'r') as f:
        fields = json.load(f)

    img = Image.open(input_path)
    draw = ImageDraw.Draw(img)
    num_boxes = 0

    for field in fields:
        if field['page'] == page_number:
            # Coordinates are already in image space - use them directly!
            entry_box_img = field['rect']
            label_box_img = field.get('label_rect', [0, 0, 0, 0])

            # Draw red rectangle over entry bounding box
            draw.rectangle(entry_box_img, outline='red', width=2)
            num_boxes += 1

            # Draw blue rectangle over label bounding box, if the field has one
            if label_box_img != [0, 0, 0, 0]:
                draw.rectangle(label_box_img, outline='blue', width=2)
                num_boxes += 1

    img.save(output_path)
    print(f"Created validation image at {output_path} with {num_boxes} bounding boxes")


if __name__ == "__main__":
    if len(sys.argv) != 5:
        print("Usage: create_validation_image.py [page number] [scan.json file] [input image path] [output image path]")
        print()
        print("Example:")
        print("  python create_validation_image.py 1 form.chatfield/form.scan.json form.chatfield/page_1.png form.chatfield/page_1_validation.png")
        sys.exit(1)
    page_number = int(sys.argv[1])
    fields_json_path = sys.argv[2]
    input_image_path = sys.argv[3]
    output_image_path = sys.argv[4]
    create_validation_image(page_number, fields_json_path, input_image_path, output_image_path)


@@ -0,0 +1,158 @@
import json
import sys
from pypdf import PdfReader
# Extracts data for the fillable form fields in a PDF and outputs JSON that
# Claude uses to fill the fields. See forms.md.
# This matches the format used by PdfReader `get_fields` and `update_page_form_field_values` methods.
def get_full_annotation_field_id(annotation):
components = []
while annotation:
field_name = annotation.get('/T')
if field_name:
components.append(field_name)
annotation = annotation.get('/Parent')
return ".".join(reversed(components)) if components else None
def make_field_dict(field, field_id):
field_dict = {"field_id": field_id}
ft = field.get('/FT')
if ft == "/Tx":
field_dict["type"] = "text"
elif ft == "/Btn":
field_dict["type"] = "checkbox" # radio groups handled separately
states = field.get("/_States_", [])
if len(states) == 2:
# "/Off" seems to always be the unchecked value, as suggested by
# https://opensource.adobe.com/dc-acrobat-sdk-docs/standards/pdfstandards/pdf/PDF32000_2008.pdf#page=448
# It can be either first or second in the "/_States_" list.
if "/Off" in states:
field_dict["checked_value"] = states[0] if states[0] != "/Off" else states[1]
field_dict["unchecked_value"] = "/Off"
else:
print(f"Unexpected state values for checkbox `${field_id}`. Its checked and unchecked values may not be correct; if you're trying to check it, visually verify the results.")
field_dict["checked_value"] = states[0]
field_dict["unchecked_value"] = states[1]
elif ft == "/Ch":
field_dict["type"] = "choice"
states = field.get("/_States_", [])
field_dict["choice_options"] = [{
"value": state[0],
"text": state[1],
} for state in states]
else:
field_dict["type"] = f"unknown ({ft})"
# Extract tooltip (TU = tooltip/user-facing text)
tooltip = field.get('/TU')
if tooltip:
field_dict["tooltip"] = tooltip
return field_dict
# Returns a list of fillable PDF fields:
# [
#   {
#     "field_id": "name",
#     "page": 1,
#     "type": ("text", "checkbox", "radio_group", or "choice")
#     // Per-type additional fields described in forms.md
#   },
# ]
def get_field_info(reader: PdfReader):
    fields = reader.get_fields()

    field_info_by_id = {}
    possible_radio_names = set()

    for field_id, field in fields.items():
        # Skip if this is a container field with children, except that it might be
        # a parent group for radio button options.
        if field.get("/Kids"):
            if field.get("/FT") == "/Btn":
                possible_radio_names.add(field_id)
            continue
        field_info_by_id[field_id] = make_field_dict(field, field_id)

    # Bounding rects are stored in annotations in page objects.
    # Radio button options have a separate annotation for each choice;
    # all choices have the same field name.
    # See https://westhealth.github.io/exploring-fillable-forms-with-pdfrw.html
    radio_fields_by_id = {}

    for page_index, page in enumerate(reader.pages):
        annotations = page.get('/Annots', [])
        for ann in annotations:
            field_id = get_full_annotation_field_id(ann)
            if field_id in field_info_by_id:
                field_info_by_id[field_id]["page"] = page_index + 1
                field_info_by_id[field_id]["rect"] = ann.get('/Rect')
            elif field_id in possible_radio_names:
                try:
                    # ann['/AP']['/N'] should have two items. One of them is '/Off',
                    # the other is the active value.
                    on_values = [v for v in ann["/AP"]["/N"] if v != "/Off"]
                except KeyError:
                    continue
                if len(on_values) == 1:
                    rect = ann.get("/Rect")
                    if field_id not in radio_fields_by_id:
                        radio_fields_by_id[field_id] = {
                            "field_id": field_id,
                            "type": "radio_group",
                            "page": page_index + 1,
                            "radio_options": [],
                        }
                    # Note: at least on macOS 15.7, Preview.app doesn't show selected
                    # radio buttons correctly. (It does if you remove the leading slash
                    # from the value, but that causes them not to appear correctly in
                    # Chrome/Firefox/Acrobat/etc).
                    radio_fields_by_id[field_id]["radio_options"].append({
                        "value": on_values[0],
                        "rect": rect,
                    })

    # Some PDFs have form field definitions without corresponding annotations,
    # so we can't tell where they are. Ignore these fields for now.
    fields_with_location = []
    for field_info in field_info_by_id.values():
        if "page" in field_info:
            fields_with_location.append(field_info)
        else:
            print(f"Unable to determine location for field id: {field_info.get('field_id')}, ignoring")

    # Sort by page number, then Y position (flipped in PDF coordinate system), then X.
    def sort_key(f):
        if "radio_options" in f:
            rect = f["radio_options"][0]["rect"] or [0, 0, 0, 0]
        else:
            rect = f.get("rect") or [0, 0, 0, 0]
        adjusted_position = [-rect[1], rect[0]]
        return [f.get("page"), adjusted_position]

    sorted_fields = fields_with_location + list(radio_fields_by_id.values())
    sorted_fields.sort(key=sort_key)

    return sorted_fields
def write_field_info(pdf_path: str, json_output_path: str):
    reader = PdfReader(pdf_path)
    field_info = get_field_info(reader)
    with open(json_output_path, "w") as f:
        json.dump(field_info, f, indent=2)
    print(f"Wrote {len(field_info)} fields to {json_output_path}")


if __name__ == "__main__":
    if len(sys.argv) != 3:
        print("Usage: extract_form_field_info.py [input pdf] [output json]")
        sys.exit(1)
    write_field_info(sys.argv[1], sys.argv[2])

2
skills/filling-pdf-forms/.gitignore vendored Normal file
View File

@@ -0,0 +1,2 @@
*.png
*.plantuml

View File

@@ -0,0 +1,30 @@
© 2025 Anthropic, PBC. All rights reserved.
LICENSE: Use of these materials (including all code, prompts, assets, files,
and other components of this Skill) is governed by your agreement with
Anthropic regarding use of Anthropic's services. If no separate agreement
exists, use is governed by Anthropic's Consumer Terms of Service or
Commercial Terms of Service, as applicable:
https://www.anthropic.com/legal/consumer-terms
https://www.anthropic.com/legal/commercial-terms
Your applicable agreement is referred to as the "Agreement." "Services" are
as defined in the Agreement.
ADDITIONAL RESTRICTIONS: Notwithstanding anything in the Agreement to the
contrary, users may not:
- Extract these materials from the Services or retain copies of these
materials outside the Services
- Reproduce or copy these materials, except for temporary copies created
automatically during authorized use of the Services
- Create derivative works based on these materials
- Distribute, sublicense, or transfer these materials to any third party
- Make, offer to sell, sell, or import any inventions embodied in these
materials
- Reverse engineer, decompile, or disassemble these materials
The receipt, viewing, or possession of these materials does not convey or
imply any license or right beyond those expressly granted above.
Anthropic retains all right, title, and interest in these materials,
including all copyrights, patents, and other intellectual property rights.

View File

@@ -0,0 +1,134 @@
---
name: filling-pdf-forms
description: Complete PDF forms by collecting data through conversational interviews and populating form fields. Use when filling forms, completing documents, or when the user mentions PDFs, forms, form completion, or document population.
allowed-tools: Read, Write, Edit, Glob, Bash, Task
version: 1.0.0a2
license: Apache 2.0
---
# Filling PDF Forms
Complete PDF forms by collecting required data through conversational interviews and populating form fields.
<purpose>
Use when completing PDF forms with user-provided data. Goal: produce `.done.pdf` populated with user information by following this process exactly.
</purpose>
## Process Overview
```plantuml
@startuml SKILL
title Filling PDF Forms - High-Level Workflow
|User|
start
:User provides PDF form to complete;
|filling-pdf-forms skill|
:Step 0: Initialize Chatfield;
:Step 1: Form Extraction;
:Step 2: Build Form Data Model;
:Step 3: Translation Decision;
if (User language is form language?) then (yes)
:Use base Form Data Model;
else (no)
:Translation Setup;
endif
:Step 4: Run Interview Loop;
partition "Chatfield CLI Interview Loop" {
:Initialize: Run CLI without message;
repeat
:CLI outputs question to stdout;
:Present question to user via AskUserQuestion();
|User|
:User provides response;
|filling-pdf-forms skill|
:Run CLI with user's message;
repeat while (CLI indicates complete?) is (no)
->yes;
}
:Inspect collected data via CLI --inspect;
:Step 5: Populate PDF;
if (Fillable form?) then (yes)
:Populate fillable fields
(see Populating-Fillable.md);
else (no)
:Populate non-fillable fields
(see Populating-Nonfillable.md);
endif
|User|
:**✓ SUCCESS**;
:Receive completed PDF <basename>.done.pdf;
stop
@enduml
```
## Workflow
### Step 0: Initialize Chatfield
Test: `python -c "import pypdf; import pdf2image; import markitdown; import chatfield"`.
Install via `pip` if needed; exceptions:
- `markitdown` → `pip install "markitdown[pdf]"`
- `chatfield` → `pip install ./scripts/chatfield-1.0.0a2-py3-none-any.whl` (relative to this .md)
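The import test above can also be run as a small script that reports which packages are missing. This is a minimal sketch, not part of the skill; the helper name `missing_modules` is hypothetical, and the module list is copied from the test command.

```python
import importlib.util

# Module names taken from the import test in Step 0.
REQUIRED_MODULES = ["pypdf", "pdf2image", "markitdown", "chatfield"]


def missing_modules(names):
    """Return the subset of `names` that cannot be found on the import path."""
    missing = []
    for name in names:
        # find_spec returns None when a top-level module is not installed.
        if importlib.util.find_spec(name) is None:
            missing.append(name)
    return missing


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All required modules are importable")
```

Any name it reports can then be installed with `pip`, keeping the two exceptions listed above in mind.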
### Step 1: Form Extraction
Extract PDF form using `extracting-form-fields` sub-agent:
```python
Task(
subagent_type="general-purpose",
description="Extract PDF form fields",
prompt=f"Extract form field data from PDF: {pdf_path}\n\nUse the extracting-form-fields skill."
)
```
**Task reports**: "fillable" or "non-fillable" (needed for Step 5)
**Creates** (for `input.pdf`):
- `input.chatfield/input.form.md` - PDF as Markdown
- `input.chatfield/input.form.json` - Field definitions
- `input.chatfield/interview.py` - Template Form Data Model
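The file layout above can be derived from the PDF path. The following is a sketch under two assumptions that the skill does not state: that `<basename>` is the file name without its final extension, and that the working directory sits next to the PDF. The helper name `working_paths` is illustrative.

```python
from pathlib import Path


def working_paths(pdf_path):
    """Derive the working-directory layout listed above from a PDF path."""
    pdf = Path(pdf_path)
    basename = pdf.stem  # "input" for "input.pdf"
    # Assumption: the working directory is created beside the PDF.
    workdir = pdf.with_name(f"{basename}.chatfield")
    return {
        "workdir": workdir,
        "form_md": workdir / f"{basename}.form.md",
        "form_json": workdir / f"{basename}.form.json",
        "interview": workdir / "interview.py",
    }
```

For example, `working_paths("/home/user/input.pdf")["form_json"]` points at `/home/user/input.chatfield/input.form.json`.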
### Step 2: Build Form Data Model
1. Read entirely: `./references/Data-Model-API.md` - Learn Chatfield API
2. Read entirely: `./references/Converting-PDF-To-Chatfield.md` - PDF→Chatfield Form Data Model guidance
3. Edit `<basename>.chatfield/interview.py` - Define Form Data Model
**Result**: The **Form Data Model**, a faithful representation of PDF form using Chatfield API.
### Step 3: Translation
Determine if translation is needed. Translation is needed either:
- **Explicit**: User states "I need to fill this Spanish form but I only speak English"
- **Implicit**: User request is in language X, but PDF is in language Y
Example: "Help me complete form.es.pdf" (English request, Spanish form)
State to the user whether you will translate. Either:
- Claude: This form uses <common-language>
- or Claude: This form uses <form-language> so I will set up translation to <user-language>
**To apply translation, see:** ./references/Translating.md
Translation creates `interview_<lang>.py`, which **re-defines** the Form Data Model and takes over that role from `interview.py`. Henceforth, use the translated file as the Form Data Model.
### Step 4: Run Interview Loop via CLI
**CRITICAL**: See `./references/CLI-Interview-Loop.md` for complete MANDATORY execution rules.
### Step 5: Populate PDF
Parse `--inspect` output and populate the PDF.
#### If Fillable:
**See:** ./references/Populating-Fillable.md
#### If Non-fillable:
**See:** ./references/Populating-Nonfillable.md
**Result**: `<basename>.done.pdf`

View File

@@ -0,0 +1,181 @@
# AskUserQuestion using chatfield.cli strategy
**CRITICAL: Strict adherence required. No deviations permitted.**
This document defines MANDATORY patterns for using `AskUserQuestion` with `chatfield.cli` interviews. Assumes you already know the AskUserQuestion tool signature.
---
## MANDATORY Pattern for EVERY Question
**REQUIRED - EXACT structure:**
```python
AskUserQuestion(
questions=[{
"question": "<chatfield.cli's exact question>", # No paraphrasing
"header": "<12 chars max>",
"multiSelect": <True/False>, # Based on data model
"options": [
# POSITION 1: REQUIRED
{"label": "Skip", "description": "Skip (N/A, blank, negative, etc)"},
# POSITION 2: REQUIRED
{"label": "Delegate", "description": "Ask Claude to look up the needed information using all available resources"},
# POSITION 3: First option from chatfield.cli (if present)
{"label": "<First from chatfield.cli>", "description": "..."},
# POSITION 4: Second option from chatfield.cli (if present)
{"label": "<Second from chatfield.cli>", "description": "..."}
]
}]
)
# POSITION 5 (implicit): "Other" - auto-added for free text
```
---
## Determine multiSelect
**Check `interview.py` Form Data Model (Chatfield builder API):**
| Data Model | multiSelect |
|------------|-------------|
| `.as_multi()` or `.one_or_more()` | `True` |
| `.as_one()` or `.as_nullable_one()` | `False` |
| Plain `.field()` (no cardinality) | `False` |
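
The table can be read as a simple text check on a field's builder chain. This is a rough sketch that assumes you already have the source text of one field's chain from `interview.py`; the helper `is_multi_select` is hypothetical and does not use the Chatfield library itself.

```python
# Cardinality calls that map to multiSelect=True, per the table above.
MULTI_MARKERS = (".as_multi(", ".one_or_more(")


def is_multi_select(field_source):
    """Return True if a field's builder chain uses a multi-choice cardinality."""
    return any(marker in field_source for marker in MULTI_MARKERS)
```

Anything not matched, including a plain `.field()` with no cardinality call, falls through to `False`, which mirrors the last two rows of the table.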
---
## Parse chatfield.cli Options
**If chatfield.cli output contains options, extract and prioritize:**
**Recognize patterns:**
- `"Status? (Single, Married, Divorced)"`
- `"Choose: A, B, C, D"`
- `"Preference: Red | Blue | Green"`
Add **first TWO** as positions 3-4
**Example:**
```
chatfield.cli: "Status? (Single, Married, Divorced, Widowed)"
Options:
1. Skip
2. Delegate
3. Single ← First from chatfield.cli
4. Married ← Second from chatfield.cli
"Other": User can type "Divorced" or "Widowed"
```
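The three patterns above could be recognized with a small heuristic parser. This is only a sketch: the function names are illustrative, the regular expression covers just the listed shapes, and real chatfield.cli output may need judgment that a parser cannot supply. Using the option text as its own description is also an assumption.

```python
import re


def extract_options(message):
    """Pull a list of choices out of a question, or return [] if none are found."""
    # Pattern 1: trailing parenthesized list, e.g. "Status? (Single, Married)"
    match = re.search(r"\(([^()]*)\)\s*$", message)
    if match:
        raw = match.group(1)
    elif ":" in message:
        # Patterns 2 and 3: everything after the first colon
        raw = message.split(":", 1)[1]
    else:
        return []
    # Split on "|" when present, otherwise on commas.
    separator = "|" if "|" in raw else ","
    options = [part.strip() for part in raw.split(separator)]
    options = [opt for opt in options if opt]
    # A single item is not a choice list.
    return options if len(options) >= 2 else []


def build_options(message):
    """Skip and Delegate first, then at most the first two extracted options."""
    options = [
        {"label": "Skip", "description": "Skip (N/A, blank, negative, etc)"},
        {"label": "Delegate", "description": "Ask Claude to look up the needed information using all available resources"},
    ]
    for choice in extract_options(message)[:2]:
        # Assumption: reuse the option text as its description.
        options.append({"label": choice, "description": choice})
    return options
```

Options beyond the first two are intentionally dropped here, since the user can still type them through "Other".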
---
## Handle Responses
| Selection | Action |
|-----------|--------|
| Types via "Other" | If starts with `'`: strip prefix and pass verbatim to chatfield.cli. Otherwise: judge if it's a direct answer or instruction to Claude. Direct answer → pass to chatfield.cli; Request for Claude → research/process, then respond to chatfield.cli |
| "Skip" | Context-aware response: Yes/No questions → "No"; Optional/nullable fields → "N/A"; Other fields → "Skip" |
| "Delegate" | Research & provide answer |
| Option 3-4 | Pass selection to CLI |
| Multi-select | Join the selections with commas (e.g., "Email, Phone") and pass the result to chatfield.cli on the next iteration |
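
The mechanical rows of this table can be sketched in a few lines. The helper names and the `is_yes_no` / `is_optional` flags are hypothetical; deciding those flags, handling "Delegate", and judging free text typed via "Other" still require reading the question and the Form Data Model.

```python
def skip_message(is_yes_no, is_optional):
    """Map a "Skip" selection to the text sent to chatfield.cli."""
    if is_yes_no:
        return "No"
    if is_optional:
        return "N/A"
    return "Skip"


def selection_to_message(selected_labels, is_yes_no=False, is_optional=False):
    """Turn one or more selected option labels into a single message."""
    if selected_labels == ["Skip"]:
        return skip_message(is_yes_no, is_optional)
    # Multi-select answers are joined with commas, e.g. "Email, Phone".
    return ", ".join(selected_labels)


def strip_verbatim_prefix(text):
    """Return (message, forced); forced is True when text starts with an apostrophe."""
    if text.startswith("'"):
        return text[1:], True
    return text, False
```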
## Distinguishing Direct Answers from Claude Requests
**When user types via "Other", judge intent:**
**Direct answers** (pass to chatfield.cli):
- "Find new customers in new markets" ← answer to "What is your business strategy?"
- "123 Main St, Boston MA" ← answer to "What is your address?"
- "Python and TypeScript" ← answer to "What programming languages?"
**Requests for Claude** (research first):
- "look up my SSN" ← asking Claude to find something
- "research the population" ← asking Claude to look something up
- "what's today's date" ← asking Claude a question
**Edge case:** `'` prefix forces verbatim pass-through regardless of content
---
## Delegation Pattern
**When user selects "Delegate":**
1. Parse question to understand needed info
2. Treat this as if the user directly asked, "Help me find out ..."
3. Use ALL tools available to you
4. Pass the result to chatfield.cli as if user typed it
5. If not found, ask user
---
## Quick Examples (RULES 1-7)
**Note:** Skip handling is context-aware per "Handle Responses" table above.
### RULE 1: Free Text
```
# chatfield.cli: "What is your name?"
# multiSelect: False
# Options: Skip, Delegate
```
### RULE 2: Yes/No
```
# chatfield.cli: "Are you employed?"
# multiSelect: False
# Options: Skip, Delegate, Yes, No
```
### RULE 3: Single-Select Choice
```
# chatfield.cli: "Status? (Single, Married, Divorced, Widowed)"
# multiSelect: False
# Extract: ["Single", "Married", "Divorced", "Widowed"]
# Options: Skip, Delegate, Single, Married
# Via Other: "Divorced", "Widowed"
```
### RULE 4: Multi-Select Choice
```
# chatfield.cli: "Contact? (Email, Phone, Text, Mail)"
# Data model: .as_multi(...)
# multiSelect: True
# Extract: ["Email", "Phone", "Text", "Mail"]
# Options: Skip, Delegate, Email, Phone
# Via Other: "Text", "Mail"
```
### RULE 5: Numeric
```
# chatfield.cli: "How many dependents?"
# multiSelect: False
# Options: Skip, Delegate (optionally: "0", "1-2")
# Via Other: Exact number
```
### RULE 6: Complex/Address
```
# chatfield.cli: "Mailing address?"
# multiSelect: False
# Options: Skip, Delegate
# Via Other: Full address
```
### RULE 7: Date
```
# chatfield.cli: "Date of birth?"
# multiSelect: False
# Options: Skip, Delegate (optionally: "Today", "Tomorrow")
# Via Other: Specific date
```
---
## MANDATORY Checklist
**EVERY question MUST:**
- [ ] Be based on chatfield.cli's stdout message
- [ ] Include "Skip" as option 1
- [ ] Include "Delegate" as option 2
- [ ] Check Form Data Model for multiSelect
- [ ] Add first TWO chatfield.cli options as 3-4 (if present)

View File

@@ -0,0 +1,74 @@
# CLI Interview Loop
**CRITICAL: Strict adherence required. No deviations permitted.**
Run `chatfield.cli` iteratively, presenting its output messages via AskUserQuestion(), passing responses back, repeating until complete.
**Files:**
- State: `<basename>.chatfield/interview.db`
- Interview: `<basename>.chatfield/interview.py` (or `interview_<lang>.py` if translated)
## Workflow Overview
```plantuml
@startuml CLI-Interview-Loop
title CLI Interview Loop
start
:Initialize chatfield.cli (no message);
:chatfield.cli outputs first question;
repeat
:Understand the chatfield.cli message;
:Consider the Form Data Model for multiSelect;
:Build AskUserQuestion;
:Present to user via AskUserQuestion();
:Call chatfield.cli with the result as a message;
:chatfield.cli outputs next question/response;
repeat while (Complete?) is (no)
->yes;
:Run chatfield.cli --inspect;
:Parse collected data;
stop
@enduml
```
## CLI Command Reference
```bash
# Initialize (NO user message)
python -m chatfield.cli --state=<state> --interview=<interview>
# Continue (WITH message)
python -m chatfield.cli --state=<state> --interview=<interview> "user response"
# Inspect (when complete, or any time to troubleshoot)
python -m chatfield.cli --state=<state> --interview=<interview> --inspect
```
In all cases, chatfield.cli will print to its stdout a message for the user.
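If the loop is driven from Python rather than typed by hand, the three invocations can be assembled as argument lists. This is a sketch, not part of the skill: `build_cli_command` and `run_cli` are hypothetical helpers, and using `sys.executable` in place of a bare `python` is an assumption. Only the flags shown in the reference above are used.

```python
import subprocess
import sys


def build_cli_command(state, interview, message=None, inspect=False):
    """Assemble the argument list for one chatfield.cli invocation."""
    command = [
        sys.executable, "-m", "chatfield.cli",
        f"--state={state}",
        f"--interview={interview}",
    ]
    if inspect:
        command.append("--inspect")
    elif message is not None:
        # Passed as a single argv item, so no shell quoting is needed.
        command.append(message)
    return command


def run_cli(state, interview, message=None, inspect=False):
    """Run the CLI and return what it printed to stdout."""
    result = subprocess.run(
        build_cli_command(state, interview, message, inspect),
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()
```

Passing the user's response as its own list element avoids quoting problems when the answer contains spaces or quotation marks.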
## Interview Loop Process
**CRITICAL**: When building AskUserQuestion from chatfield.cli's message, you MUST strictly follow ./AskUserQuestion-Rules.md
1. Initialize: `python -m chatfield.cli --state=<state> --interview=<interview>` (NO message)
2. Read chatfield.cli's stdout message
3. Recall or look up Form Data Model for multiSelect (`.as_multi()`, `.one_or_more()` → True)
4. Build AskUserQuestion per mandatory rules: ./AskUserQuestion-Rules.md
5. Present AskUserQuestion to user
6. Handle response:
   - "Other" text → pass to chatfield.cli
   - "Skip" → Context-aware response: Yes/No questions → "No"; Optional/nullable fields → "N/A"; Other fields → "Skip"
   - "Delegate" → research answer, pass to chatfield.cli
   - Options 3-4 → pass selection to chatfield.cli
   - Multi-select → join with commas, pass to chatfield.cli
7. Call: `python -m chatfield.cli --state=<state> --interview=<interview> "user response"`
8. Repeat steps 2-7 until completion signal
9. Run: `python -m chatfield.cli --state=<state> --interview=<interview> --inspect`
## Completion Signals
Watch for:
- "Thank you! I have all the information I need."
- "complete" / "done"
When Chatfield mentions the conversation is complete, stop the loop. The CLI Interview loop is done.
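A first-pass check for these signals might look like the sketch below. It is a heuristic only: the function name is hypothetical, and a question that merely contains the word "done" would also match, so the message should still be read before stopping the loop.

```python
import re

# Phrases listed under "Completion Signals"; matching is case-insensitive.
_COMPLETION_PATTERN = re.compile(
    r"i have all the information i need|\b(?:complete|done)\b",
    re.IGNORECASE,
)


def looks_complete(message):
    """Return True if the message contains one of the completion phrases."""
    return _COMPLETION_PATTERN.search(message) is not None
```

The word boundaries keep words such as "incomplete" from counting as a completion signal.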

View File

@@ -0,0 +1,332 @@
# Converting PDF Forms to Chatfield Interviews
<purpose>
This guide covers how to build a Chatfield interview definition from PDF form data. This is the core transformation step that converts a static PDF form into a conversational interview.
</purpose>
<important>
**Read complete API reference**: See ./Data-Model-API.md for all builder methods, transformations, and validation rules.
</important>
## Process Overview
```plantuml
@startuml Converting-PDF-To-Chatfield
title Converting PDF Forms to Chatfield Interviews
start
:Prerequisites: Form extraction complete;
partition "Read Input Files" {
:Read <basename>.form.md;
:Read <basename>.form.json;
}
:Build Interview Definition;
repeat
:Validate Form Data Model
(see validation checklist);
if (All checks pass?) then (yes)
else (no)
:Fix issues identified in validation;
endif
repeat while (All checks pass?) is (no)
->yes;
:**✓ FORM DATA MODEL COMPLETE**;
:interview.py ready for next step;
stop
@enduml
```
## The Form Data Model
<definition>
The **Form Data Model** is the `interview.py` file in the `.chatfield/` working directory. This file contains the chatfield builder definition that faithfully represents the PDF form.
</definition>
## Critical Principle: Faithfulness to Original PDF
<critical_principle>
**The Form Data Model must be as accurate and faithful as possible to the source PDF.**
**Why?** Downstream code will NOT see the PDF anymore. The interview must create the "illusion" that the AI agent has full access to the form, speaking to the user, writing information - all from the Form Data Model alone.
This means every field, every instruction, every validation rule from the PDF must be captured in the interview definition.
</critical_principle>
## Language Matching Rule
**CRITICAL: Only pass English-language strings to the chatfield builder API for English-language forms.**
The chatfield object strings should virtually always match the PDF's primary language:
- `.type()` - Use short identifier (e.g., "DHFS_FoodBusinessLicense"), not full official name. **HARD LIMIT: 64 characters maximum**
- `.desc()` - Use form's language
- `.trait()` - Use form's language for Background content
- `.hint()` - Use form's language
**Translation happens LATER** (see ./Translating.md), not during initial definition.
## Key Rules
These fundamental rules apply to all Form Data Models:
1. **Faithfulness to PDF**: The interview definition must accurately represent the source PDF form
2. **Short type identifiers**: Top-level `.type()` should be a short "class name" identifier (e.g., "W9_TIN", "DHFS_FoodBusinessLicense"), not the full official form name. **HARD LIMIT: 64 characters maximum**
3. **Direct mapping default**: Use PDF field_ids directly from `.form.json` unless using fan-out patterns
4. **Fan-out patterns**: Use `.as_*()` casts to populate multiple PDF fields from single collected value
5. **Exact field_ids**: Keep field IDs from `.form.json` unchanged (use as cast names or direct field names)
6. **Extract knowledge**: ALL form instructions go into Alice traits/hints
7. **Format flexibility**: Never specify format in `.desc()` - Alice accepts variations
8. **Validation vs transformation**: `.must()` for content constraints (use SPARINGLY), `.as_*()` for formatting (use LIBERALLY). Alice NEVER mentions format requirements to Bob
9. **Language matching**: All strings (`.desc()`, `.trait()`, `.hint()`) must match the PDF's language
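Rule 2's hard limit is easy to check before running anything. A minimal sketch, with a hypothetical helper name; it checks only the length limit and emptiness, not whether the identifier is well chosen:

```python
MAX_TYPE_LENGTH = 64  # hard limit stated in rule 2


def check_type_identifier(type_name):
    """Return a list of problems with a top-level .type() value (empty if fine)."""
    problems = []
    if len(type_name) > MAX_TYPE_LENGTH:
        problems.append(
            f"{len(type_name)} characters exceeds the {MAX_TYPE_LENGTH}-character limit"
        )
    if not type_name.strip():
        problems.append("identifier is empty")
    return problems
```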
## Reading Input Files
Your inputs from the `extracting-form-fields` skill:
- **`<basename>.chatfield/<basename>.form.md`** - PDF content as Markdown (use this for form knowledge)
- **`<basename>.chatfield/<basename>.form.json`** - Field IDs, types, and metadata
## Extracting Form Knowledge
From `.form.md`, extract ONLY actionable knowledge:
- Form purpose (1-2 sentences)
- Key term definitions
- Field completion instructions
- Valid options/codes
- Decision logic ("If X then Y")
**Do NOT extract:**
- Decorative text
- Repeated boilerplate
- Page numbers, footers
Place extracted knowledge in interview:
- **Form-level** → Alice traits: `.trait("Background: [context]...")`
- **Field-level** → Field hints: `.hint("Background: [guidance]")`
## Builder API Patterns
### Direct Mapping (Default)
One PDF field_id → one question
```python
.field("topmostSubform[0].Page1[0].f1_01[0]")
.desc("What is your full legal name?") # English .desc() for English form
.hint("Background: Should match official records")
```
### Fan-out Pattern
Collect once, populate multiple PDF fields via `.as_*()` casts
```python
.field("age")
.desc("What is your age in years?")
.as_int("age_years", "Age as integer")
.as_bool("over_18", "True if 18 or older")
.as_str("age_display", "Age formatted for display")
```
**CRITICAL**: For fan-out, cast names MUST be exact PDF field_ids from `.form.json`
#### Re-representation Sub-pattern
When PDF has multiple fields for the same value in different formats (numeric vs words, date vs formatted date, etc.), collect ONCE and use casts:
```python
.field("amount")
.desc("What is the payment amount?")
.as_int("amount_numeric", "Amount as number")
.as_str("amount_in_words", "Amount spelled out in words (e.g., 'One hundred')")
.field("event_date")
.desc("When did the event occur?")
.as_str("date_iso", "Date in ISO format (YYYY-MM-DD)")
.as_str("date_display", "Date formatted as 'January 15, 2025'")
```
**Key principle**: Eliminate duplicate questions about the same underlying information.
### Discriminate + Split Pattern
Mutually-exclusive fields
```python
.field("tin")
.desc("Is your taxpayer ID an EIN or SSN, and what is the number?")
.must("be exactly 9 digits")
.must("indicate SSN or EIN type")
.as_str("ssn_part1", "First 3 of SSN, or empty if N/A")
.as_str("ssn_part2", "Middle 2 of SSN, or empty if N/A")
.as_str("ssn_part3", "Last 4 of SSN, or empty if N/A")
.as_str("ein_full", "Full 9-digit EIN, or empty if N/A")
```
### Expand Pattern
Multiple checkboxes from single field
```python
.field("preferences")
.desc("What are your communication preferences?")
.as_bool("email_ok", "True if wants email")
.as_bool("phone_ok", "True if wants phone calls")
.as_bool("mail_ok", "True if wants postal mail")
```
## `.must()` vs `.as_*()` Usage
**`.must()`** - CONTENT constraints (use SPARINGLY):
- Only when field MUST contain specific information
- Creates hard blocking constraint
- Example: `.must("match tax return exactly")`
**`.as_*()`** - TYPE/FORMAT transformations (use LIBERALLY):
- For any type casting, formatting, derived values
- Alice accepts variations, computes transformation
- Example: `.as_int()`, `.as_bool()`, `.as_str("name", "desc")`
**Rule of thumb**: Expect MORE `.as_*()` calls than `.must()` calls.
## Field Types
- **Text** → `.field("id").desc("question")`
- **Checkbox** → `.field("id").desc("question").as_bool()`
- **Radio/choice (required)** → `.field("id").desc("question").as_one("opt1", "opt2")`
- **Radio/choice (optional)** → `.field("id").desc("question").as_nullable_one("opt1", "opt2")`
## Optional Fields
```python
.field("middle_name")
.desc("Middle name")
.hint("Background: Optional per form instructions")
```
## Hint Conventions
All hints must have a prefix:
- **"Background:"** - Internal notes for Alice only
  - Alice uses these for formatting, conversions, context without mentioning to Bob
  - Example: `.hint("Background: Convert to Buddhist calendar by adding 543 years")`
- **"Tooltip:"** - May be shared with Bob if helpful
  - Example: `.hint("Tooltip: Your employer provides this number")`
**See ./Data-Model-API.md** for complete list of transformations (`.as_int()`, `.as_bool()`, etc.) and cardinality options (`.as_one()`, `.as_multi()`, etc.).
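The prefix convention for hints can be checked mechanically once the hint strings are collected. This sketch assumes you have gathered them into a list yourself; `invalid_hints` is a hypothetical helper, not a Chatfield function.

```python
HINT_PREFIXES = ("Background:", "Tooltip:")


def invalid_hints(hints):
    """Return the hints that lack one of the required prefixes."""
    # str.startswith accepts a tuple and matches any of its members.
    return [hint for hint in hints if not hint.startswith(HINT_PREFIXES)]
```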
## When to Use `.conclude()`
Only when derived field depends on multiple previous fields OR complex logic that can't be expressed in a single field's casts.
## Additional Guidance from PDF Forms
**Extract Knowledge Wisely:**
- Extract actionable knowledge ONLY from PDF
- Form purpose (1-2 sentences max)
- Key term definitions
- Field completion instructions
- Valid options/codes
- Decision logic ("If X then Y")
- **Do NOT extract**: Decorative text, repeated boilerplate, page numbers, footers
**Alice Traits for Format Flexibility:**
```python
.alice()
.type("Form Assistant")
.trait("Collects information content naturally, handling all formatting invisibly")
.trait("Accepts format variations (SSN with/without hyphens)")
.trait("Background: [extracted form knowledge goes here]")
```
**Default to Direct Mapping:**
PDF field_ids are internal - users only see `.desc()`. Use field IDs directly unless using fan-out patterns.
**Format Flexibility:**
Never specify format in `.desc()` - Alice accepts variations. Use `.as_*()` for formatting requirements.
## Complete Example
```python
from chatfield import chatfield
interview = (chatfield()
.type("W9_TIN")
.desc("Form to provide TIN to entities paying income")
.alice()
.type("Tax Form Assistant")
.trait("Collects information content naturally, handling all formatting invisibly")
.trait("Accepts format variations (SSN with/without hyphens)")
.trait("Background: W-9 used to provide TIN to entities paying income")
.trait("Background: EIN for business entities, SSN for individuals")
.bob()
.type("Taxpayer completing W-9 form")
.trait("Speaks naturally and freely")
.field("name")
.desc("What is your full legal name as shown on your tax return?")
.hint("Background: Must match IRS records exactly")
.field("business_name")
.desc("Business name or disregarded entity name, if different from above")
.hint("Background: Optional - only if applicable")
.field("tin")
.desc("What is your taxpayer identification number (SSN or EIN)?")
.must("be exactly 9 digits")
.must("indicate whether SSN or EIN")
.as_str("ssn_part1", "First 3 digits of SSN, or empty if using EIN")
.as_str("ssn_part2", "Middle 2 digits of SSN, or empty if using EIN")
.as_str("ssn_part3", "Last 4 digits of SSN, or empty if using EIN")
.as_str("ein_part1", "First 2 digits of EIN, or empty if using SSN")
.as_str("ein_part2", "Last 7 digits of EIN, or empty if using SSN")
.field("address")
.desc("What is your address (number, street, apt/suite)?")
.field("city_state_zip")
.desc("What is your city, state, and ZIP code?")
.as_str("city", "City name")
.as_str("state", "State abbreviation (2 letters)")
.as_str("zip", "ZIP code")
.build()
)
```
## Validation Checklist
Before proceeding, validate the interview definition:
<validation_checklist>
```
Interview Validation Checklist:
- [ ] All field_ids from .form.json are mapped
- [ ] No field_ids duplicated or missing
- [ ] Re-representations (amount/amount_in_words, date/date_formatted, etc.) use single field with casts, not duplicate questions
- [ ] .desc() describes WHAT information is needed (content), never HOW it should be formatted
- [ ] .hint() provides context about content (e.g., "Optional", "Must match passport"), never formatting instructions
- [ ] All formatting requirements (dates, codes, number formats, etc.) use .as_*() transformations exclusively
- [ ] Fan-out patterns use .as_*() with PDF field_ids as cast names
- [ ] Split patterns use .as_*() with "or empty/0 if N/A" descriptions
- [ ] Discriminate + split uses .as_*() for mutually-exclusive fields
- [ ] Expand pattern uses .as_*() casts on single field
- [ ] .conclude() used only when necessary (multi-field dependencies)
- [ ] Alice traits include extracted form knowledge
- [ ] Field hints provide context from PDF instructions
- [ ] Optional fields explicitly marked with hint("Background: Optional...")
- [ ] .must() used sparingly (only true content requirements)
- [ ] Field .desc() questions are natural and user-friendly (no technical field_ids)
- [ ] ALL STRINGS match the PDF's primary language
```
</validation_checklist>
If any items fail:
1. Review the specific issue
2. Fix the interview definition
3. Re-run validation checklist
4. Proceed only when all items pass
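The first checklist item lends itself to a quick textual check. The sketch below assumes `.form.json` is a list of objects that each carry a `field_id` key, and that every id appears in `interview.py` as a quoted string literal; the helper name is illustrative, and the check is a complement to the manual review, not a replacement.

```python
import json


def unmapped_field_ids(form_json_text, interview_source):
    """Return field_ids from .form.json that never appear quoted in interview.py."""
    fields = json.loads(form_json_text)
    missing = []
    for entry in fields:
        field_id = entry["field_id"]
        # Look for the id as a double- or single-quoted string literal.
        double_quoted = f'"{field_id}"'
        single_quoted = f"'{field_id}'"
        if double_quoted not in interview_source and single_quoted not in interview_source:
            missing.append(field_id)
    return missing
```

An empty result means every field id is mentioned somewhere; it does not prove that each one is used correctly as a field name or cast name.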
## The Result: Form Data Model
When validation passes, you have successfully created the **Form Data Model** in `<basename>.chatfield/interview.py`.

View File

@@ -0,0 +1,216 @@
# Conversational Form API Reference
**Library:** `chatfield` Python package
API reference for building conversational form interviews. Powered by the Chatfield library.
## Contents
- Quick Start
- Builder API
- Interview Configuration
- Roles
- Fields
- Validation
- Special Field Types
- Transformations
- Cardinality
- Field Access
- Optional Fields
---
## Quick Start
```python
from chatfield import chatfield, Interviewer
# Define
interview = (chatfield()
.field("name")
.desc("What is your full name?")
.must("include first and last")
.field("age")
.desc("Your age?")
.as_int()
.must("be between 18 and 120")
.build())
# Run
interviewer = Interviewer(interview)
user_input = None
while not interview._done:
    message = interviewer.go(user_input)
    print(f"Assistant: {message}")
    if not interview._done:
        user_input = input("You: ").strip()
# Access
print(interview.name, interview.age.as_int)
```
---
## Builder API
### Interview Configuration
```python
interview = (chatfield()
.type("Job Application") # Interview type
.desc("Collect applicant info") # Description
.build())
```
### Roles
```python
.alice() # Configure AI assistant
.type("Tax Assistant")
.trait("Professional and accurate")
.trait("Never provides tax advice")
.bob() # Configure user
.type("Taxpayer")
.trait("Speaks colloquially")
```
### Fields
```python
.field("email") # Define field (becomes interview.email)
.desc("What is your email?") # User-facing question
```
**All fields mandatory to populate** (must be non-`None` for `._done`). Content can be empty string `""`.
Exception: `.as_one()`, `.as_multi()`, and fields with strict validation require non-empty values.
### Validation
```python
.field("email")
.must("be valid email format") # Requirement (AND logic)
.must("not be disposable")
.reject("profanity") # Block pattern
.hint("Background: Company email preferred") # Advisory (not enforced)
```
### Hints
Hints provide context and guidance to Alice. **All hints must start with "Background:" or "Tooltip:"**
```python
# Background hints: Internal notes for Alice only (not mentioned to Bob)
.hint("Background: Convert Gregorian to Buddhist calendar (+543 years)")
.hint("Background: Optional per form instructions")
# Tooltip hints: May be shared with Bob if helpful
.hint("Tooltip: Your employer should provide this number")
.hint("Tooltip: Ask your supervisor if unsure")
```
**Background hints** are for Alice's internal use - she handles formatting/conversions transparently without mentioning them to Bob.
**Tooltip hints** may be shared with Bob to help clarify what information is needed.
### Special Field Types
```python
.field("sentiment_score")
.confidential() # Track silently, never ask Bob
.field("summary")
.conclude() # Compute after regular fields (auto-confidential)
```
### Transformations
LLM computes during collection. Access via `interview.field.as_*`
```python
.field("age").as_int() # → interview.age.as_int = 25
.field("price").as_float() # → interview.price.as_float = 99.99
.field("citizen").as_bool() # → interview.citizen.as_bool = True
.field("hobbies").as_list() # → interview.hobbies.as_list = ["reading", "coding"]
.field("config").as_json() # → interview.config.as_json = {"theme": "dark"}
.field("progress").as_percent() # → interview.progress.as_percent = 0.75
.field("greeting").as_lang("fr") # → interview.greeting.as_lang_fr = "Bonjour"
# Optional descriptions guide edge cases
.field("has_partners")
.as_bool("true if you have partners; false if not or N/A")
.field("quantity")
.as_int("parse as integer, ignore units")
# Named string casts for formatting
.field("ssn")
.must("be exactly 9 digits")
.as_str("formatted", "Format as ###-##-####")
# Access: interview.ssn.as_str_formatted → "123-45-6789"
```
**Validation vs. Casts:**
- **Validation** (`.must()`): Check content ("9 digits", "valid email")
- **Casts** (`.as_*()`): Provide format (hyphens, capitalization)
### Choice Cardinality
Select from predefined options:
```python
.field("tax_class")
.as_one("Individual", "C Corp", "S Corp") # Exactly one choice required
.field("dietary")
.as_nullable_one("Vegetarian", "Vegan") # Zero or one
.field("languages")
.as_multi("Python", "JavaScript", "Go") # One or more choices required
.field("interests")
.as_nullable_multi("ML", "Web Dev", "DevOps") # Zero or more
```
### Build
```python
.build() # Return Interview instance
```
---
## Field Access
**Dot notation** (regular fields):
```python
interview.name
interview.age.as_int
```
**Bracket notation** (special characters):
```python
interview["topmostSubform[0].Page1[0].f1_01[0]"] # PDF form fields
interview["user.name"] # Dots
interview["full name"] # Spaces
interview["class"] # Reserved words
```
---
## Optional Fields
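For fields known to be optional (from the PDF tooltip, nearby context, or form instructions), give Alice a trait for handling blanks and record the optionality in a Background hint: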
```python
.alice()
    .trait("Records optional fields as empty string when user says blank/none/skip")
.field("middle_name")
    .desc("Middle name")
    .hint("Background: Optional per form instructions")
.field("extension")
    .desc("Phone extension")
    .hint("Background: Leave blank if none")
```
For optional **choices**, use `.as_nullable_one()` or `.as_nullable_multi()` (see examples above).

View File

@@ -0,0 +1,100 @@
# Populating Fillable PDF Forms
<purpose>
After collecting data via Chatfield interview, populate fillable PDF forms with the results.
</purpose>
## Process Overview
```plantuml
@startuml Populating-Fillable
title Populating Fillable PDF Forms
start
:Parse Chatfield output;
:Read <basename>.form.json for metadata;
:Create <basename>.values.json;
repeat
:Validate .values.json
(see validation checklist);
if (All checks pass?) then (yes)
else (no)
:Fix .values.json;
endif
repeat while (All checks pass?) is (no)
->yes;
:Execute fill_fillable_fields.py;
:**✓ PDF POPULATION COMPLETE**;
stop
@enduml
```
## Process
### 1. Parse Chatfield Output
Run Chatfield with `--inspect` for a final summary of all collected data:
```bash
python -m chatfield.cli --state='<basename>.chatfield/interview.db' --interview='<basename>.chatfield/interview.py' --inspect
```
Extract `field_id` and the proper value for each field.
### 2. Create `.values.json`
Create `<basename>.values.json` in the `<basename>.chatfield/` directory with the collected field values:
```json
[
  {"field_id": "name", "page": 1, "value": "John Doe"},
  {"field_id": "age_years", "page": 1, "value": 25},
  {"field_id": "age_display", "page": 1, "value": "25"},
  {"field_id": "checkbox_over_18", "page": 1, "value": "/1"}
]
```
**Value selection priority:**
- **CRITICAL**: If a language cast exists for a field (e.g., `.as_lang_es`, `.as_lang_fr`), **always prefer it** over the raw value
- This ensures forms are populated in the form's language, not the conversation language
- The language cast name matches the form's language code (e.g., `as_lang_es` for Spanish forms)
- Only use the raw value if no language cast exists
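The priority rule above can be sketched as a small helper. This is an illustration only, not part of Chatfield: it assumes each field's collected data has already been parsed into a plain dict with a `value` key for the raw answer and optional `as_lang_<code>` keys for language casts. That dict shape and the name `select_value` are hypothetical.

```python
def select_value(field_data, form_lang):
    """Return the language cast for form_lang if one exists, else the raw value.

    Assumption: field_data looks like {"value": ..., "as_lang_es": ...}.
    This shape is hypothetical; adapt it to however you parse the --inspect output.
    """
    cast_key = f"as_lang_{form_lang}"
    if field_data.get(cast_key) is not None:
        return field_data[cast_key]
    return field_data["value"]


# Hypothetical collected data for a Spanish-language form
collected = {
    "occupation": {"value": "teacher", "as_lang_es": "maestra"},
    "passport_number": {"value": "X1234567"},  # no language cast requested
}
chosen = {field_id: select_value(data, "es") for field_id, data in collected.items()}
print(chosen)
```

Fields without a language cast fall through to the raw value, which matches the last bullet above.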
**Boolean conversion for checkboxes:**
- Read `.form.json` for `checked_value` and `unchecked_value`
- Typically: `"/1"` or `"/On"` for checked, `"/Off"` for unchecked
- Convert Python `True`/`False` → PDF checkbox values
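A minimal sketch of that conversion, assuming the checkbox's metadata entry has been loaded from `.form.json` and carries the `checked_value` and `unchecked_value` keys written by `extract_form_field_info.py`. The helper name `checkbox_value` is made up for this example.

```python
def checkbox_value(answer, field_meta):
    """Map a Python bool onto the PDF's own checked/unchecked values."""
    return field_meta["checked_value"] if answer else field_meta["unchecked_value"]


# Example metadata entry for one checkbox, in the .form.json format
field_meta = {
    "field_id": "checkbox_over_18",
    "page": 1,
    "type": "checkbox",
    "checked_value": "/1",
    "unchecked_value": "/Off",
}

# Build the corresponding .values.json entry for an answer of True
entry = {
    "field_id": field_meta["field_id"],
    "page": field_meta["page"],
    "value": checkbox_value(True, field_meta),
}
print(entry)
```

Reading the values from the metadata, rather than hard-coding `"/1"` or `"/On"`, keeps the entry valid for PDFs that use a different checked state.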
### 3. Validate `.values.json`
**Before running the population script**, validate the `.values.json` file against the validation checklist below:
- Verify all field_ids from `.form.json` are present
- Check checkbox values match `checked_value`/`unchecked_value` from `.form.json`
- Ensure numeric fields use numbers, not strings
- Confirm language cast values are used when available
If validation fails, fix the `.values.json` file and re-validate until all checks pass.
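Two of these checks (missing field IDs and checkbox values) are mechanical and can be sketched in a few lines. The function below is a hypothetical helper, not one of the skill's scripts; it assumes both files have already been loaded with `json.load` into lists of dicts. It does not cover the numeric-type or language-cast checks, which depend on the interview definition.

```python
def validate_values(form_fields, values):
    """Return a list of problems; an empty list means these checks passed."""
    problems = []
    form_by_id = {f["field_id"]: f for f in form_fields}
    seen_ids = {v["field_id"] for v in values}

    # Every field_id from .form.json should appear in .values.json.
    for field_id in form_by_id:
        if field_id not in seen_ids:
            problems.append(f"missing field_id: {field_id}")

    for entry in values:
        meta = form_by_id.get(entry["field_id"])
        if meta is None:
            problems.append(f"unknown field_id: {entry['field_id']}")
        elif meta.get("type") == "checkbox":
            # Checkbox values must be one of the two states from .form.json.
            allowed = (meta.get("checked_value"), meta.get("unchecked_value"))
            if entry.get("value") not in allowed:
                problems.append(
                    f"checkbox {entry['field_id']}: {entry.get('value')!r} not in {allowed!r}"
                )
    return problems


form_fields = [
    {"field_id": "name", "page": 1, "type": "text"},
    {"field_id": "checkbox_over_18", "page": 1, "type": "checkbox",
     "checked_value": "/1", "unchecked_value": "/Off"},
]
values = [
    {"field_id": "checkbox_over_18", "page": 1, "value": "/Yes"},
]
for problem in validate_values(form_fields, values):
    print(problem)
```

With the sample data, the helper reports the missing `name` entry and the invalid `/Yes` checkbox value.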
### 4. Populate PDF
Once validation passes, run the population script. Note that the `scripts` directory is relative to this skill's base directory:
```bash
python scripts/fill_fillable_fields.py <basename>.pdf <basename>.chatfield/<basename>.values.json <basename>.done.pdf
```
## Validation Checklist
<validation_checklist>
**Missing fields:**
- Check that all field_ids from `.form.json` are in `.values.json`
- Verify field_id spelling matches exactly
**Wrong checkbox values:**
- Check `checked_value`/`unchecked_value` in `.form.json`
- Common values: `/1`, `/On`, `/Yes` for checked; `/Off`, `/No` for unchecked
**Type errors:**
- Ensure numeric fields use numbers, not strings: `25` not `"25"`
- Ensure boolean checkboxes use proper values from `.form.json`
**Language translation (for translated forms):**
- Ensure language cast value is used when it exists (e.g., `as_lang_es` for Spanish forms)
</validation_checklist>

View File

@@ -0,0 +1,121 @@
# Populating Non-fillable PDF Forms
<purpose>
After collecting data via Chatfield interview, populate the non-fillable PDF with text annotations.
</purpose>
## Process Overview
```plantuml
@startuml Populating-Nonfillable
title Populating Non-fillable PDF Forms
start
:Parse Chatfield output;
:Create .values.json with field values;
:Add annotations to PDF;
:**✓ PDF POPULATION COMPLETE**;
stop
@enduml
```
## Process
### 1. Parse Chatfield Output
Run Chatfield with `--inspect` for a final summary of all collected data:
```bash
python -m chatfield.cli --state='<basename>.chatfield/interview.db' --interview='<basename>.chatfield/interview.py' --inspect
```
Extract `field_id` and value for each field from the interview results.
### 2. Create `.values.json`
Create `<basename>.chatfield/<basename>.values.json` with the collected field values in the format expected by the annotation script:
```json
{
  "fields": [
    {
      "field_id": "full_name",
      "page": 1,
      "value": "John Doe"
    },
    {
      "field_id": "is_over_18",
      "page": 2,
      "value": "X"
    }
  ]
}
```
**Value selection priority:**
- **CRITICAL**: If a language cast exists for a field (e.g., `.as_lang_es`, `.as_lang_fr`), **always prefer it** over the raw value
- This ensures forms are populated in the form's language, not the conversation language
- The language cast name matches the form's language code (e.g., `as_lang_es` for Spanish forms)
- Only use the raw value if no language cast exists
**Boolean conversion for checkboxes:**
- Read `.form.json` for `checked_value` and `unchecked_value`
- Typically: `"X"` or `"✓"` for checked, `""` (empty string) for unchecked
- Convert Python `True`/`False` → checkbox display values
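As a rough sketch, the conversion and the `{"fields": [...]}` wrapper can be produced together. The helpers `display_value` and `build_values_document` are hypothetical, and the `"X"` and `""` defaults are only the typical values mentioned above; prefer whatever `.form.json` specifies for the form at hand.

```python
import json


def display_value(value, checked="X", unchecked=""):
    """Convert booleans to checkbox display text; pass other values through."""
    if isinstance(value, bool):
        return checked if value else unchecked
    return value


def build_values_document(answers, pages):
    """Build the {"fields": [...]} structure expected by the annotation script."""
    return {
        "fields": [
            {"field_id": field_id, "page": pages[field_id], "value": display_value(value)}
            for field_id, value in answers.items()
        ]
    }


# Hypothetical collected answers and the page number of each field
answers = {"full_name": "John Doe", "is_over_18": True}
pages = {"full_name": 1, "is_over_18": 2}
document = build_values_document(answers, pages)
print(json.dumps(document, indent=2, ensure_ascii=False))
```

The result can be written to `<basename>.chatfield/<basename>.values.json` with `json.dump`.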
### 3. Add Annotations to PDF
Run the annotation script to create the filled PDF:
```bash
python scripts/fill_nonfillable_fields.py <basename>.pdf <basename>.chatfield/<basename>.values.json <basename>.done.pdf
```
This script:
- Reads the `.values.json` file with field values
- Reads the `.form.json` file (from extraction) with bounding box information
- Adds text annotations at the specified bounding boxes
- Creates the output PDF with all annotations
**Verification:**
- Verify `<basename>.done.pdf` exists
- Spot-check a few fields to ensure values are correctly placed
**Result**: `<basename>.done.pdf`
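The existence check can be automated with a small helper. This is a sketch under the assumption that a quick sanity check is enough: it only confirms that the file exists, is non-empty, and begins with the standard `%PDF-` header. It does not verify that values are placed correctly, so the visual spot-check is still needed. The name `looks_like_pdf` is made up for this example.

```python
from pathlib import Path


def looks_like_pdf(path):
    """Return True if path is a non-empty file that starts with the PDF header."""
    p = Path(path)
    if not p.is_file() or p.stat().st_size == 0:
        return False
    with p.open("rb") as f:
        return f.read(5) == b"%PDF-"


# Usage (replace the placeholder path with your actual output file):
# looks_like_pdf("<basename>.done.pdf")
```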
## Validation Checklist
<validation_checklist>
```
Non-fillable Population Validation:
- [ ] All field values extracted from CLI output
- [ ] Language casts used when available (not raw values)
- [ ] Boolean values converted to checkbox display values
- [ ] .values.json created with correct format
- [ ] fill_nonfillable_fields.py executed successfully
- [ ] Output PDF exists at expected location
- [ ] Spot-checked fields contain correct values
- [ ] Text is visible and properly positioned
```
</validation_checklist>
## Troubleshooting
**Text not visible:**
- Check the font color set in `fill_nonfillable_fields.py` (it defaults to `"000000"`, black, and is not read from `.form.json`)
- Verify bounding boxes are correct size
- Ensure font size is appropriate for the bounding box
**Text cut off:**
- Bounding boxes may be too small
- Review validation images from extraction phase
- Consider adjusting bounding boxes and re-running extraction validation
**Wrong language:**
- Verify you're using language cast values (e.g., `as_lang_es`) not raw values
- Check that language casts were properly requested in the Form Data Model
---
**See Also:**
- ./Populating-Fillable.md - Population workflow for fillable PDFs
- ../extracting-form-fields/references/Nonfillable-Forms.md - How bounding boxes were created
- ./Converting-PDF-To-Chatfield.md - How the Form Data Model was built

View File

@@ -0,0 +1,218 @@
# Translating Forms for Users
<purpose>
Use this guide when the PDF form is in a language different from the user's language. This enables cross-language form completion where the user speaks one language and the form is in another.
</purpose>
## Process Overview
```plantuml
@startuml Translating
title Translating Forms for Users
start
:Prerequisites: Form Data Model created\n(form language already determined);
partition "1. Copy Form Data Model" {
:Create language-specific .py file;
}
partition "2. Edit Language-Specific Version" {
:Edit interview_<lang>.py;
partition "3. Alice Translation Traits" {
:Add translation traits to Alice;
}
partition "4. Bob Language Traits" {
:Add language trait to Bob;
}
partition "5. Field Language Casts" {
:Add .as_lang("<lang>") to all text fields;
}
}
repeat
:Validate translation setup
(see validation checklist);
if (All checks pass?) then (yes)
else (no)
:Fix issues;
endif
repeat while (All checks pass?) is (no)
->yes;
:**✓ TRANSLATION COMPLETE**;
:Re-define Form Data Model as interview_<lang>.py;
stop
@enduml
```
## Critical Principle
<critical_principle>
The **Form Data Model** (`interview.py`) was already created with the form's language.
**DO NOT recreate it.** Instead, ADAPT it for translation.
The form definition stays in the form's language. Only Alice's behavior and Bob's profile are modified to enable translation.
</critical_principle>
## Process
### 1. Copy Form Data Model
Create a language-specific .py file. Use ISO 639-1 language codes: `en`, `es`, `fr`, `de`, `zh`, `ja`, etc.
```bash
# If user speaks Spanish
cp input.chatfield/interview.py input.chatfield/interview_es.py
```
### 2. Edit Language-Specific Version
Edit `interview_<lang>.py` to add translation traits.
**What to change:**
- ✅ Alice traits - Add translation instructions
- ✅ Bob traits - Add language preference
- ✅ Text fields - Add `.as_lang("<form-lang-code>")` for translation (e.g., "es" for Spanish)
**What NOT to change:**
- ❌ Form `.type()` or `.desc()` - Keep form's language
- ❌ Field definitions - Keep all field IDs unchanged
- ❌ Field `.desc()` - Keep form's language
- ❌ Background hints - Keep form's language
- ❌ Any field IDs or cast names
### 3. Alice Translation Traits
Add these traits to Alice:
```python
.alice()
    # Keep existing .type()
    .trait("Conducts this conversation in [USER_LANGUAGE]")
    .trait("Translates [USER_LANGUAGE] responses into [FORM_LANGUAGE] for the form")
    .trait("Explains [FORM_LANGUAGE] terms in [USER_LANGUAGE]")
    # Keep all existing .trait() calls
```
### 4. Bob Language Traits
Add this trait to Bob:
```python
.bob()
    # Keep existing .type()
    .trait("Speaks [USER_LANGUAGE] only")
    # Keep all existing .trait() calls
```
### 5. Field Language Casts
Add `.as_lang("<form-lang-code>")` to **all text fields** so that values are translated into the form's language. Use ISO 639-1 language codes (es, fr, th, de, etc.):
```python
.field("field_name")
    .desc("...")
    .as_lang("es")  # For Spanish form, use "fr" for French, "th" for Thai, etc.
    # Keep all existing casts
```
## Complete Example
**Original Form Data Model** (`interview.py`):
```python
from chatfield import chatfield

interview = (chatfield()
    .type("Solicitud de Visa")
    .desc("Formulario de solicitud de visa de turista")
    .alice()
        .type("Asistente de Formularios")
        .trait("Usa lenguaje claro y natural")
        .trait("Acepta variaciones de formato")
    .bob()
        .type("Solicitante de visa")
        .trait("Habla de forma natural y libre")
    .field("nombre_completo")
        .desc("¿Cuál es su nombre completo?")
        .hint("Background: Debe coincidir con el pasaporte")
    .field("fecha_nacimiento")
        .desc("¿Cuál es su fecha de nacimiento?")
        .as_str("dia", "Día (DD)")
        .as_str("mes", "Mes (MM)")
        .as_str("anio", "Año (YYYY)")
    .build()
)
```
**Translated Version** (`interview_en.py` for English-speaking user):
```python
from chatfield import chatfield

interview = (chatfield()
    .type("Solicitud de Visa")  # Unchanged - form's language
    .desc("Formulario de solicitud de visa de turista")  # Unchanged
    .alice()
        .type("Asistente de Formularios")  # Unchanged
        .trait("Conducts this conversation in English")  # ADDED
        .trait("Translates English responses into Spanish for the form")  # ADDED
        .trait("Explains Spanish terms in English")  # ADDED
        .trait("Usa lenguaje claro y natural")  # Keep existing
        .trait("Acepta variaciones de formato")  # Keep existing
    .bob()
        .type("Solicitante de visa")  # Unchanged
        .trait("Speaks English only")  # ADDED
        .trait("Habla de forma natural y libre")  # Keep existing
    .field("nombre_completo")  # Unchanged
        .desc("¿Cuál es su nombre completo?")  # Unchanged - form's language
        .hint("Background: Debe coincidir con el pasaporte")  # Unchanged
        .as_lang("es")  # ADDED - translate to Spanish
    .field("fecha_nacimiento")  # Unchanged
        .desc("¿Cuál es su fecha de nacimiento?")  # Unchanged
        .as_str("dia", "Día (DD)")  # Unchanged
        .as_str("mes", "Mes (MM)")  # Unchanged
        .as_str("anio", "Año (YYYY)")  # Unchanged
    .build()
)
```
## Validation Checklist
Before proceeding, verify ALL items:
<validation_checklist>
```
Translation Validation Checklist:
- [ ] Created interview_<lang>.py (copied from interview.py)
- [ ] No changes to form .type() or .desc()
- [ ] No changes to field definitions (field IDs)
- [ ] No changes to field .desc() (keep form's language)
- [ ] No changes to .as_*() cast names or descriptions
- [ ] No changes to Background hints (keep form's language)
- [ ] Added Alice trait: "Conducts this conversation in [USER_LANGUAGE]"
- [ ] Added Alice trait: "Translates [USER_LANGUAGE] responses into [FORM_LANGUAGE]"
- [ ] Added Alice trait: "Explains [FORM_LANGUAGE] terms in [USER_LANGUAGE]"
- [ ] Added Bob trait: "Speaks [USER_LANGUAGE] only"
- [ ] Added .as_lang("<form-lang-code>") to all text fields (e.g., "es" for Spanish)
```
</validation_checklist>
If any items fail:
1. Review the specific issue
2. Fix the interview definition
3. Re-run validation checklist
4. Proceed only when all items pass
## Re-define Form Data Model
**CRITICAL**: When translation setup is complete, the **Form Data Model** is now the language-specific version (`interview_<lang>.py`), NOT the base `interview.py`.
Use this file for all subsequent steps (CLI execution, etc.).

View File

@@ -0,0 +1,28 @@
# DO NOT add a docstring
from chatfield import chatfield
# The chatfield.cli module will import this `interview` object.
# **CRITICAL** - Replace the commented examples below with the real data definition.
interview = (chatfield()
# .type(<form id, official name, or filename>)
# .desc(<human-friendly form description>)
# Define Alice's type plus at least one trait.
# .alice()
# .type(<primary role for the AI agent>)
# .trait(<characteristic or behavior hint for the AI agent>)
# # Optional additional .trait() calls
# Define Bob's type plus at least one trait.
# .bob()
# .type(<primary role for the human user>)
# .trait(<characteristic or guidance about conversing with the user>)
# # Optional additional .trait() calls
# Define one or more fields.
# .field(<field id>)
# .desc(<human-friendly field description>)
.build()
)

View File

@@ -0,0 +1,158 @@
import json
import sys

from pypdf import PdfReader


# Extracts data for the fillable form fields in a PDF and outputs JSON that
# Claude uses to fill the fields. See forms.md.


# This matches the format used by the PdfReader `get_fields` and
# PdfWriter `update_page_form_field_values` methods.
def get_full_annotation_field_id(annotation):
    components = []
    while annotation:
        field_name = annotation.get('/T')
        if field_name:
            components.append(field_name)
        annotation = annotation.get('/Parent')
    return ".".join(reversed(components)) if components else None


def make_field_dict(field, field_id):
    field_dict = {"field_id": field_id}
    ft = field.get('/FT')
    if ft == "/Tx":
        field_dict["type"] = "text"
    elif ft == "/Btn":
        field_dict["type"] = "checkbox"  # radio groups handled separately
        states = field.get("/_States_", [])
        if len(states) == 2:
            # "/Off" seems to always be the unchecked value, as suggested by
            # https://opensource.adobe.com/dc-acrobat-sdk-docs/standards/pdfstandards/pdf/PDF32000_2008.pdf#page=448
            # It can be either first or second in the "/_States_" list.
            if "/Off" in states:
                field_dict["checked_value"] = states[0] if states[0] != "/Off" else states[1]
                field_dict["unchecked_value"] = "/Off"
            else:
                print(f"Unexpected state values for checkbox `{field_id}`. Its checked and unchecked values may not be correct; if you're trying to check it, visually verify the results.")
                field_dict["checked_value"] = states[0]
                field_dict["unchecked_value"] = states[1]
    elif ft == "/Ch":
        field_dict["type"] = "choice"
        states = field.get("/_States_", [])
        field_dict["choice_options"] = [{
            "value": state[0],
            "text": state[1],
        } for state in states]
    else:
        field_dict["type"] = f"unknown ({ft})"

    # Extract tooltip (TU = tooltip/user-facing text)
    tooltip = field.get('/TU')
    if tooltip:
        field_dict["tooltip"] = tooltip

    return field_dict


# Returns a list of fillable PDF fields:
# [
#   {
#     "field_id": "name",
#     "page": 1,
#     "type": ("text", "checkbox", "radio_group", or "choice")
#     // Per-type additional fields described in forms.md
#   },
# ]
def get_field_info(reader: PdfReader):
    fields = reader.get_fields()

    field_info_by_id = {}
    possible_radio_names = set()

    for field_id, field in fields.items():
        # Skip if this is a container field with children, except that it might be
        # a parent group for radio button options.
        if field.get("/Kids"):
            if field.get("/FT") == "/Btn":
                possible_radio_names.add(field_id)
            continue
        field_info_by_id[field_id] = make_field_dict(field, field_id)

    # Bounding rects are stored in annotations in page objects.

    # Radio button options have a separate annotation for each choice;
    # all choices have the same field name.
    # See https://westhealth.github.io/exploring-fillable-forms-with-pdfrw.html
    radio_fields_by_id = {}

    for page_index, page in enumerate(reader.pages):
        annotations = page.get('/Annots', [])
        for ann in annotations:
            field_id = get_full_annotation_field_id(ann)
            if field_id in field_info_by_id:
                field_info_by_id[field_id]["page"] = page_index + 1
                field_info_by_id[field_id]["rect"] = ann.get('/Rect')
            elif field_id in possible_radio_names:
                try:
                    # ann['/AP']['/N'] should have two items. One of them is '/Off',
                    # the other is the active value.
                    on_values = [v for v in ann["/AP"]["/N"] if v != "/Off"]
                except KeyError:
                    continue
                if len(on_values) == 1:
                    rect = ann.get("/Rect")
                    if field_id not in radio_fields_by_id:
                        radio_fields_by_id[field_id] = {
                            "field_id": field_id,
                            "type": "radio_group",
                            "page": page_index + 1,
                            "radio_options": [],
                        }
                    # Note: at least on macOS 15.7, Preview.app doesn't show selected
                    # radio buttons correctly. (It does if you remove the leading slash
                    # from the value, but that causes them not to appear correctly in
                    # Chrome/Firefox/Acrobat/etc).
                    radio_fields_by_id[field_id]["radio_options"].append({
                        "value": on_values[0],
                        "rect": rect,
                    })

    # Some PDFs have form field definitions without corresponding annotations,
    # so we can't tell where they are. Ignore these fields for now.
    fields_with_location = []
    for field_info in field_info_by_id.values():
        if "page" in field_info:
            fields_with_location.append(field_info)
        else:
            print(f"Unable to determine location for field id: {field_info.get('field_id')}, ignoring")

    # Sort by page number, then Y position (flipped in PDF coordinate system), then X.
    def sort_key(f):
        if "radio_options" in f:
            rect = f["radio_options"][0]["rect"] or [0, 0, 0, 0]
        else:
            rect = f.get("rect") or [0, 0, 0, 0]
        adjusted_position = [-rect[1], rect[0]]
        return [f.get("page"), adjusted_position]

    sorted_fields = fields_with_location + list(radio_fields_by_id.values())
    sorted_fields.sort(key=sort_key)

    return sorted_fields


def write_field_info(pdf_path: str, json_output_path: str):
    reader = PdfReader(pdf_path)
    field_info = get_field_info(reader)
    with open(json_output_path, "w") as f:
        json.dump(field_info, f, indent=2)
    print(f"Wrote {len(field_info)} fields to {json_output_path}")


if __name__ == "__main__":
    if len(sys.argv) != 3:
        print("Usage: extract_form_field_info.py [input pdf] [output json]")
        sys.exit(1)
    write_field_info(sys.argv[1], sys.argv[2])

View File

@@ -0,0 +1,114 @@
import json
import sys

from pypdf import PdfReader, PdfWriter

from extract_form_field_info import get_field_info


# Fills fillable form fields in a PDF. See forms.md.


def fill_pdf_fields(input_pdf_path: str, fields_json_path: str, output_pdf_path: str):
    with open(fields_json_path) as f:
        fields = json.load(f)
    # Group by page number.
    fields_by_page = {}
    for field in fields:
        if "value" in field:
            field_id = field["field_id"]
            page = field["page"]
            if page not in fields_by_page:
                fields_by_page[page] = {}
            fields_by_page[page][field_id] = field["value"]

    reader = PdfReader(input_pdf_path)

    has_error = False
    field_info = get_field_info(reader)
    fields_by_ids = {f["field_id"]: f for f in field_info}
    for field in fields:
        existing_field = fields_by_ids.get(field["field_id"])
        if not existing_field:
            has_error = True
            print(f"ERROR: `{field['field_id']}` is not a valid field ID")
        elif field["page"] != existing_field["page"]:
            has_error = True
            print(f"ERROR: Incorrect page number for `{field['field_id']}` (got {field['page']}, expected {existing_field['page']})")
        else:
            if "value" in field:
                err = validation_error_for_field_value(existing_field, field["value"])
                if err:
                    print(err)
                    has_error = True
    if has_error:
        sys.exit(1)

    writer = PdfWriter(clone_from=reader)
    for page, field_values in fields_by_page.items():
        writer.update_page_form_field_values(writer.pages[page - 1], field_values, auto_regenerate=False)

    # This seems to be necessary for many PDF viewers to format the form values correctly.
    # It may cause the viewer to show a "save changes" dialog even if the user doesn't make any changes.
    writer.set_need_appearances_writer(True)

    with open(output_pdf_path, "wb") as f:
        writer.write(f)


def validation_error_for_field_value(field_info, field_value):
    field_type = field_info["type"]
    field_id = field_info["field_id"]
    if field_type == "checkbox":
        checked_val = field_info["checked_value"]
        unchecked_val = field_info["unchecked_value"]
        if field_value != checked_val and field_value != unchecked_val:
            return f'ERROR: Invalid value "{field_value}" for checkbox field "{field_id}". The checked value is "{checked_val}" and the unchecked value is "{unchecked_val}"'
    elif field_type == "radio_group":
        option_values = [opt["value"] for opt in field_info["radio_options"]]
        if field_value not in option_values:
            return f'ERROR: Invalid value "{field_value}" for radio group field "{field_id}". Valid values are: {option_values}'
    elif field_type == "choice":
        choice_values = [opt["value"] for opt in field_info["choice_options"]]
        if field_value not in choice_values:
            return f'ERROR: Invalid value "{field_value}" for choice field "{field_id}". Valid values are: {choice_values}'
    return None


# pypdf (at least version 5.7.0) has a bug when setting the value for a selection list field.
# In _writer.py around line 966:
#
#     if field.get(FA.FT, "/Tx") == "/Ch" and field_flags & FA.FfBits.Combo == 0:
#         txt = "\n".join(annotation.get_inherited(FA.Opt, []))
#
# The problem is that for selection lists, `get_inherited` returns a list of two-element lists like
# [["value1", "Text 1"], ["value2", "Text 2"], ...]
# This causes `join` to throw a TypeError because it expects an iterable of strings.
# The horrible workaround is to patch `get_inherited` to return a list of the value strings.
# We call the original method and adjust the return value only if the argument to `get_inherited`
# is `FA.Opt` and if the return value is a list of two-element lists.
def monkeypatch_pydpf_method():
    from pypdf.generic import DictionaryObject
    from pypdf.constants import FieldDictionaryAttributes

    original_get_inherited = DictionaryObject.get_inherited

    def patched_get_inherited(self, key: str, default = None):
        result = original_get_inherited(self, key, default)
        if key == FieldDictionaryAttributes.Opt:
            if isinstance(result, list) and all(isinstance(v, list) and len(v) == 2 for v in result):
                result = [r[0] for r in result]
        return result

    DictionaryObject.get_inherited = patched_get_inherited


if __name__ == "__main__":
    if len(sys.argv) != 4:
        print("Usage: fill_fillable_fields.py [input pdf] [field_values.json] [output pdf]")
        sys.exit(1)
    monkeypatch_pydpf_method()
    input_pdf = sys.argv[1]
    fields_json = sys.argv[2]
    output_pdf = sys.argv[3]
    fill_pdf_fields(input_pdf, fields_json, output_pdf)

View File

@@ -0,0 +1,134 @@
#!/usr/bin/env python3
"""
Fills a non-fillable PDF by adding text annotations.

This script reads:
- .form.json (field definitions with bounding boxes in PDF coordinates)
- .values.json (field values from the interview)

And creates an annotated PDF with the values placed at the specified locations.

Usage:
    python fill_nonfillable_fields.py <input.pdf> <basename>.chatfield/<basename>.values.json <output.pdf>
"""

import json
import sys
from pathlib import Path

from pypdf import PdfReader, PdfWriter
from pypdf.annotations import FreeText


def fill_nonfillable_pdf(input_pdf_path, values_json_path, output_pdf_path):
    """
    Fill a non-fillable PDF with text annotations.

    Args:
        input_pdf_path: Path to the input PDF file
        values_json_path: Path to .values.json file with field values
        output_pdf_path: Path to write the filled PDF
    """
    # Derive .form.json path from .values.json path
    values_path = Path(values_json_path)
    if not values_path.name.endswith('.values.json'):
        raise ValueError(f"Expected .values.json file, got: {values_path.name}")

    form_json_path = values_path.parent / values_path.name.replace('.values.json', '.form.json')
    if not form_json_path.exists():
        raise FileNotFoundError(
            f"Form definition file not found: {form_json_path}\n"
            f"Expected to find .form.json alongside .values.json"
        )

    # Load field definitions (with bounding boxes in PDF coordinates)
    with open(form_json_path, 'r') as f:
        form_fields = json.load(f)

    # Load field values
    with open(values_json_path, 'r') as f:
        values_data = json.load(f)

    # Create a lookup map: field_id -> value
    values_map = {field['field_id']: field['value'] for field in values_data['fields']}

    # Open the PDF
    reader = PdfReader(input_pdf_path)
    writer = PdfWriter()

    # Copy all pages to writer
    writer.append(reader)

    # Process each form field
    annotations_added = 0
    for field_def in form_fields:
        field_id = field_def.get('field_id')

        # Get the value for this field
        if field_id not in values_map:
            # No value provided for this field, skip it
            continue

        value = values_map[field_id]

        # Skip empty values (None, False, or ""); a numeric 0 is a
        # legitimate answer and is still written to the PDF
        if value is None or value is False or value == "":
            continue

        # Get field properties
        page_num = field_def.get('page', 1)
        rect = field_def.get('rect')
        if not rect:
            print(f"Warning: Field {field_id} has no rect, skipping", file=sys.stderr)
            continue

        # Default font settings
        # Note: Font size/color may not work reliably across all PDF viewers
        # https://github.com/py-pdf/pypdf/issues/2084
        font_name = "Arial"
        font_size = "12pt"
        font_color = "000000"  # Black

        # Create the annotation
        annotation = FreeText(
            text=str(value),
            rect=rect,  # Already in PDF coordinates
            font=font_name,
            font_size=font_size,
            font_color=font_color,
            border_color=None,
            background_color=None,
        )

        # Add annotation to the appropriate page (pypdf uses 0-based indexing)
        writer.add_annotation(page_number=page_num - 1, annotation=annotation)
        annotations_added += 1

    # Save the filled PDF
    with open(output_pdf_path, 'wb') as output:
        writer.write(output)

    print(f"Successfully filled PDF and saved to {output_pdf_path}")
    print(f"Added {annotations_added} text annotations")


if __name__ == "__main__":
    if len(sys.argv) != 4:
        print("Usage: fill_nonfillable_fields.py <input.pdf> <basename>.values.json <output.pdf>")
        print()
        print("Example:")
        print("  python fill_nonfillable_fields.py form.pdf form.chatfield/form.values.json form.done.pdf")
        sys.exit(1)

    input_pdf = sys.argv[1]
    values_json = sys.argv[2]
    output_pdf = sys.argv[3]

    try:
        fill_nonfillable_pdf(input_pdf, values_json, output_pdf)
    except Exception as e:
        print(f"Error: {e}", file=sys.stderr)
        sys.exit(1)