Converting PDF Forms to Chatfield Interviews
This guide covers how to build a Chatfield interview definition from PDF form data. This is the core transformation step that converts a static PDF form into a conversational interview.
**Read complete API reference**: See ./Data-Model-API.md for all builder methods, transformations, and validation rules.
Process Overview
@startuml Converting-PDF-To-Chatfield
title Converting PDF Forms to Chatfield Interviews
start
:Prerequisites: Form extraction complete;
partition "Read Input Files" {
:Read <basename>.form.md;
:Read <basename>.form.json;
}
:Build Interview Definition;
repeat
:Validate Form Data Model
(see validation checklist);
if (All checks pass?) then (yes)
else (no)
:Fix issues identified in validation;
endif
repeat while (All checks pass?) is (no)
->yes;
:**✓ FORM DATA MODEL COMPLETE**;
:interview.py ready for next step;
stop
@enduml
The Form Data Model
The **Form Data Model** is the `interview.py` file in the `.chatfield/` working directory. This file contains the chatfield builder definition that faithfully represents the PDF form.
Critical Principle: Faithfulness to Original PDF
<critical_principle> The Form Data Model must be as accurate and faithful as possible to the source PDF.
Why? Downstream code will NOT see the PDF anymore. The interview must create the "illusion" that the AI agent has full access to the form, speaking to the user, writing information - all from the Form Data Model alone.
This means every field, every instruction, every validation rule from the PDF must be captured in the interview definition. </critical_principle>
Language Matching Rule
CRITICAL: Only pass English-language strings to the chatfield builder API for English-language forms.
The chatfield object strings should virtually always match the PDF's primary language:
- `.type()` - Use short identifier (e.g., "DHFS_FoodBusinessLicense"), not full official name. HARD LIMIT: 64 characters maximum
- `.desc()` - Use form's language
- `.trait()` - Use form's language for Background content
- `.hint()` - Use form's language
Translation happens LATER (see ./Translating.md), not during initial definition.
Key Rules
These fundamental rules apply to all Form Data Models:
- Faithfulness to PDF: The interview definition must accurately represent the source PDF form
- Short type identifiers: Top-level `.type()` should be a short "class name" identifier (e.g., "W9_TIN", "DHFS_FoodBusinessLicense"), not the full official form name. HARD LIMIT: 64 characters maximum
- Direct mapping default: Use PDF field_ids directly from `.form.json` unless using fan-out patterns
- Fan-out patterns: Use `.as_*()` casts to populate multiple PDF fields from a single collected value
- Exact field_ids: Keep field IDs from `.form.json` unchanged (use as cast names or direct field names)
- Extract knowledge: ALL form instructions go into Alice traits/hints
- Format flexibility: Never specify format in `.desc()` - Alice accepts variations
- Validation vs transformation: `.must()` for content constraints (use SPARINGLY), `.as_*()` for formatting (use LIBERALLY). Alice NEVER mentions format requirements to Bob
- Language matching: All strings (`.desc()`, `.trait()`, `.hint()`) must match the PDF's language
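The 64-character limit on `.type()` can be checked mechanically before the interview is built. The sketch below is a standalone helper, not part of the Chatfield API: the name `check_type_identifier` and the identifier pattern (ASCII letters, digits, underscores) are assumptions made for illustration.

```python
import re

MAX_TYPE_LENGTH = 64  # hard limit stated in this guide

# Assumption: a "class name" style identifier is ASCII letters, digits, and
# underscores, starting with a letter. The guide does not define this precisely.
_IDENTIFIER = re.compile(r"[A-Za-z][A-Za-z0-9_]*")


def check_type_identifier(name: str) -> list[str]:
    """Return a list of problems with a proposed .type() value (empty if none)."""
    problems = []
    if len(name) > MAX_TYPE_LENGTH:
        problems.append(f"too long: {len(name)} > {MAX_TYPE_LENGTH} characters")
    if not _IDENTIFIER.fullmatch(name):
        problems.append("not a short class-name style identifier")
    return problems
```

An empty list means the identifier passes both checks; a full official form name with spaces fails the second check even when it is under the length limit.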
Reading Input Files
Your inputs from form-extract:
- `<basename>.chatfield/<basename>.form.md` - PDF content as Markdown (use this for form knowledge)
- `<basename>.chatfield/<basename>.form.json` - Field IDs, types, and metadata
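A minimal sketch of locating and reading both inputs, assuming they sit in the `<basename>.chatfield/` directory as listed above. The helper names `input_paths` and `read_inputs` are hypothetical, and nothing is assumed about the JSON structure beyond it being valid JSON.

```python
import json
from pathlib import Path


def input_paths(basename: str, root: Path = Path(".")) -> tuple[Path, Path]:
    """Return the (.form.md, .form.json) paths for a given form basename."""
    workdir = root / f"{basename}.chatfield"
    return workdir / f"{basename}.form.md", workdir / f"{basename}.form.json"


def read_inputs(basename: str, root: Path = Path(".")):
    """Read the Markdown text and the parsed JSON produced by form extraction."""
    md_path, json_path = input_paths(basename, root)
    form_md = md_path.read_text(encoding="utf-8")
    form_json = json.loads(json_path.read_text(encoding="utf-8"))
    return form_md, form_json
```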
Extracting Form Knowledge
From .form.md, extract ONLY actionable knowledge:
- Form purpose (1-2 sentences)
- Key term definitions
- Field completion instructions
- Valid options/codes
- Decision logic ("If X then Y")
Do NOT extract:
- Decorative text
- Repeated boilerplate
- Page numbers, footers
Place extracted knowledge in interview:
- Form-level → Alice traits: `.trait("Background: [context]...")`
- Field-level → Field hints: `.hint("Background: [guidance]")`
Builder API Patterns
Direct Mapping (Default)
One PDF field_id → one question
.field("topmostSubform[0].Page1[0].f1_01[0]")
    .desc("What is your full legal name?")  # English .desc() for English form
    .hint("Background: Should match official records")
Fan-out Pattern
Collect once, populate multiple PDF fields via .as_*() casts
.field("age")
    .desc("What is your age in years?")
    .as_int("age_years", "Age as integer")
    .as_bool("over_18", "True if 18 or older")
    .as_str("age_display", "Age formatted for display")
CRITICAL: For fan-out, cast names MUST be exact PDF field_ids from .form.json
Re-representation Sub-pattern
When PDF has multiple fields for the same value in different formats (numeric vs words, date vs formatted date, etc.), collect ONCE and use casts:
.field("amount")
    .desc("What is the payment amount?")
    .as_int("amount_numeric", "Amount as number")
    .as_str("amount_in_words", "Amount spelled out in words (e.g., 'One hundred')")
.field("event_date")
    .desc("When did the event occur?")
    .as_str("date_iso", "Date in ISO format (YYYY-MM-DD)")
    .as_str("date_display", "Date formatted as 'January 15, 2025'")
Key principle: Eliminate duplicate questions about the same underlying information.
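To make the target values concrete, the sketch below computes the two date re-representations for one collected date. In an interview these values come from the casts, not from hand-written code; the function `date_casts` is hypothetical and only illustrates what a single answer should fan out to.

```python
from datetime import date

# Spelled out here so the result does not depend on the system locale.
MONTH_NAMES = [
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December",
]


def date_casts(d: date) -> dict[str, str]:
    """Return both representations of one date, keyed by the cast names used above."""
    return {
        "date_iso": d.isoformat(),
        "date_display": f"{MONTH_NAMES[d.month - 1]} {d.day}, {d.year}",
    }


print(date_casts(date(2025, 1, 15)))
# {'date_iso': '2025-01-15', 'date_display': 'January 15, 2025'}
```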
Discriminate + Split Pattern
Mutually-exclusive fields
.field("tin")
    .desc("Is your taxpayer ID an EIN or SSN, and what is the number?")
    .must("be exactly 9 digits")
    .must("indicate SSN or EIN type")
    .as_str("ssn_part1", "First 3 of SSN, or empty if N/A")
    .as_str("ssn_part2", "Middle 2 of SSN, or empty if N/A")
    .as_str("ssn_part3", "Last 4 of SSN, or empty if N/A")
    .as_str("ein_full", "Full 9-digit EIN, or empty if N/A")
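The expected outcome of this pattern can be written as a plain function: exactly one branch is populated and the other is left empty. This is an illustrative sketch of the target values, not how Chatfield computes casts; `split_tin` and its `kind` parameter are hypothetical names.

```python
import re


def split_tin(raw: str, kind: str) -> dict[str, str]:
    """Split a 9-digit TIN into the mutually exclusive SSN/EIN cast values."""
    digits = re.sub(r"[^0-9]", "", raw)  # accept hyphens, spaces, etc.
    if len(digits) != 9:
        raise ValueError("TIN must contain exactly 9 digits")
    kind = kind.upper()
    if kind not in ("SSN", "EIN"):
        raise ValueError("kind must be 'SSN' or 'EIN'")
    is_ssn = kind == "SSN"
    return {
        "ssn_part1": digits[:3] if is_ssn else "",
        "ssn_part2": digits[3:5] if is_ssn else "",
        "ssn_part3": digits[5:] if is_ssn else "",
        "ein_full": "" if is_ssn else digits,
    }
```

For example, `split_tin("123-45-6789", "SSN")` fills the three SSN parts and leaves `ein_full` empty, which mirrors the "or empty if N/A" wording in the cast descriptions.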
Expand Pattern
Multiple checkboxes from single field
.field("preferences")
    .desc("What are your communication preferences?")
    .as_bool("email_ok", "True if wants email")
    .as_bool("phone_ok", "True if wants phone calls")
    .as_bool("mail_ok", "True if wants postal mail")
.must() vs .as_*() Usage
.must() - CONTENT constraints (use SPARINGLY):
- Only when field MUST contain specific information
- Creates hard blocking constraint
- Example: `.must("match tax return exactly")`
.as_*() - TYPE/FORMAT transformations (use LIBERALLY):
- For any type casting, formatting, derived values
- Alice accepts variations, computes transformation
- Example: `.as_int()`, `.as_bool()`, `.as_str("name", "desc")`
Rule of thumb: Expect MORE .as_*() calls than .must() calls.
Field Types
- Text → `.field("id").desc("question")`
- Checkbox → `.field("id").desc("question").as_bool()`
- Radio/choice (required) → `.field("id").desc("question").as_one("opt1", "opt2")`
- Radio/choice (optional) → `.field("id").desc("question").as_nullable_one("opt1", "opt2")`
Optional Fields
.field("middle_name")
    .desc("Middle name")
    .hint("Background: Optional per form instructions")
Hint Conventions
All hints must have a prefix:
- "Background:" - Internal notes for Alice only
  - Alice uses these for formatting, conversions, and context without mentioning them to Bob
  - Example: `.hint("Background: Convert to Buddhist calendar by adding 543 years")`
- "Tooltip:" - May be shared with Bob if helpful
  - Example: `.hint("Tooltip: Your employer provides this number")`
See ./Data-Model-API.md for complete list of transformations (.as_int(), .as_bool(), etc.) and cardinality options (.as_one(), .as_multi(), etc.).
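The prefix rule is easy to lint while drafting hints. A minimal sketch, independent of the Chatfield library; the function name `classify_hint` and its return labels are hypothetical.

```python
def classify_hint(hint: str) -> str:
    """Classify a hint by its required prefix, or raise if the prefix is missing."""
    if hint.startswith("Background:"):
        return "alice_only"  # internal note, never mentioned to Bob
    if hint.startswith("Tooltip:"):
        return "shareable"  # may be shared with Bob if helpful
    raise ValueError(f"hint lacks a 'Background:' or 'Tooltip:' prefix: {hint!r}")
```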
When to Use .conclude()
Only when derived field depends on multiple previous fields OR complex logic that can't be expressed in a single field's casts.
Additional Guidance from PDF Forms
Extract Knowledge Wisely:
- Extract actionable knowledge ONLY from PDF
  - Form purpose (1-2 sentences max)
  - Key term definitions
  - Field completion instructions
  - Valid options/codes
  - Decision logic ("If X then Y")
- Do NOT extract: Decorative text, repeated boilerplate, page numbers, footers
Alice Traits for Format Flexibility:
.alice()
    .type("Form Assistant")
    .trait("Collects information content naturally, handling all formatting invisibly")
    .trait("Accepts format variations (SSN with/without hyphens)")
    .trait("Background: [extracted form knowledge goes here]")
Default to Direct Mapping:
PDF field_ids are internal - users only see .desc(). Use field IDs directly unless using fan-out patterns.
Format Flexibility:
Never specify format in .desc() - Alice accepts variations. Use .as_*() for formatting requirements.
Complete Example
from chatfield import chatfield

interview = (chatfield()
    .type("W9_TIN")
    .desc("Form to provide TIN to entities paying income")
    .alice()
        .type("Tax Form Assistant")
        .trait("Collects information content naturally, handling all formatting invisibly")
        .trait("Accepts format variations (SSN with/without hyphens)")
        .trait("Background: W-9 used to provide TIN to entities paying income")
        .trait("Background: EIN for business entities, SSN for individuals")
    .bob()
        .type("Taxpayer completing W-9 form")
        .trait("Speaks naturally and freely")
    .field("name")
        .desc("What is your full legal name as shown on your tax return?")
        .hint("Background: Must match IRS records exactly")
    .field("business_name")
        .desc("Business name or disregarded entity name, if different from above")
        .hint("Background: Optional - only if applicable")
    .field("tin")
        .desc("What is your taxpayer identification number (SSN or EIN)?")
        .must("be exactly 9 digits")
        .must("indicate whether SSN or EIN")
        .as_str("ssn_part1", "First 3 digits of SSN, or empty if using EIN")
        .as_str("ssn_part2", "Middle 2 digits of SSN, or empty if using EIN")
        .as_str("ssn_part3", "Last 4 digits of SSN, or empty if using EIN")
        .as_str("ein_part1", "First 2 digits of EIN, or empty if using SSN")
        .as_str("ein_part2", "Last 7 digits of EIN, or empty if using SSN")
    .field("address")
        .desc("What is your address (number, street, apt/suite)?")
    .field("city_state_zip")
        .desc("What is your city, state, and ZIP code?")
        .as_str("city", "City name")
        .as_str("state", "State abbreviation (2 letters)")
        .as_str("zip", "ZIP code")
    .build()
)
Validation Checklist
Before proceeding, validate the interview definition:
<validation_checklist>
Interview Validation Checklist:
- [ ] All field_ids from .form.json are mapped
- [ ] No field_ids duplicated or missing
- [ ] Re-representations (amount/amount_in_words, date/date_formatted, etc.) use single field with casts, not duplicate questions
- [ ] .desc() describes WHAT information is needed (content), never HOW it should be formatted
- [ ] .hint() provides context about content (e.g., "Optional", "Must match passport"), never formatting instructions
- [ ] All formatting requirements (dates, codes, number formats, etc.) use .as_*() transformations exclusively
- [ ] Fan-out patterns use .as_*() with PDF field_ids as cast names
- [ ] Split patterns use .as_*() with "or empty/0 if N/A" descriptions
- [ ] Discriminate + split uses .as_*() for mutually-exclusive fields
- [ ] Expand pattern uses .as_*() casts on single field
- [ ] .conclude() used only when necessary (multi-field dependencies)
- [ ] Alice traits include extracted form knowledge
- [ ] Field hints provide context from PDF instructions
- [ ] Optional fields explicitly marked with hint("Background: Optional...")
- [ ] .must() used sparingly (only true content requirements)
- [ ] Field .desc() questions are natural and user-friendly (no technical field_ids)
- [ ] ALL STRINGS match the PDF's primary language
</validation_checklist>
If any items fail:
- Review the specific issue
- Fix the interview definition
- Re-run validation checklist
- Proceed only when all items pass
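The first two checklist items (all field_ids mapped, none duplicated or missing) lend themselves to a mechanical check. The sketch below assumes you have already gathered two plain lists: the field_ids from `.form.json`, and every direct field name or cast name in `interview.py` that targets a PDF field. How those lists are gathered is left open, and `coverage_report` is a hypothetical helper, not a Chatfield function.

```python
from collections import Counter


def coverage_report(pdf_field_ids: list[str], mapped_ids: list[str]) -> dict[str, list[str]]:
    """Compare PDF field_ids against the ids an interview definition populates."""
    counts = Counter(mapped_ids)
    pdf = set(pdf_field_ids)
    return {
        # PDF fields that no field or cast in the interview populates
        "missing": sorted(pdf - set(counts)),
        # ids that appear more than once in the interview
        "duplicated": sorted(i for i, n in counts.items() if n > 1),
        # ids used in the interview that do not exist in the PDF
        "unknown": sorted(set(counts) - pdf),
    }
```

All three lists being empty corresponds to the first two checklist items passing. Collected-value names that exist only to carry casts (such as `tin` in the example above) should be left out of `mapped_ids`, otherwise they show up under `unknown`.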
The Result: Form Data Model
When validation passes, you have successfully created the Form Data Model in <basename>.chatfield/interview.py.