# Converting PDF Forms to Chatfield Interviews

This guide covers how to build a Chatfield interview definition from PDF form data. This is the core transformation step that converts a static PDF form into a conversational interview.

**Read complete API reference**: See ./Data-Model-API.md for all builder methods, transformations, and validation rules.

## Process Overview

```plantuml
@startuml Converting-PDF-To-Chatfield
title Converting PDF Forms to Chatfield Interviews
start
:Prerequisites: Form extraction complete;
partition "Read Input Files" {
  :Read .form.md;
  :Read .form.json;
}
:Build Interview Definition;
repeat
  :Validate Form Data Model (see validation checklist);
  if (All checks pass?) then (yes)
  else (no)
    :Fix issues identified in validation;
  endif
repeat while (All checks pass?) is (no)
->yes;
:**✓ FORM DATA MODEL COMPLETE**;
:interview.py ready for next step;
stop
@enduml
```

## The Form Data Model

The **Form Data Model** is the `interview.py` file in the `.chatfield/` working directory. This file contains the chatfield builder definition that faithfully represents the PDF form.

## Critical Principle: Faithfulness to Original PDF

**The Form Data Model must be as accurate and faithful as possible to the source PDF.**

**Why?** Downstream code will NOT see the PDF anymore. The interview must create the "illusion" that the AI agent has full access to the form, speaking to the user, writing information - all from the Form Data Model alone.

This means every field, every instruction, every validation rule from the PDF must be captured in the interview definition.

## Language Matching Rule

**CRITICAL: Only pass English-language strings to the chatfield builder API for English-language forms.**

The chatfield object strings should virtually always match the PDF's primary language:

- `.type()` - Use short identifier (e.g., "DHFS_FoodBusinessLicense"), not full official name.
  **HARD LIMIT: 64 characters maximum**
- `.desc()` - Use form's language
- `.trait()` - Use form's language for Background content
- `.hint()` - Use form's language

**Translation happens LATER** (see ./Translating.md), not during initial definition.

## Key Rules

These fundamental rules apply to all Form Data Models:

1. **Faithfulness to PDF**: The interview definition must accurately represent the source PDF form
2. **Short type identifiers**: Top-level `.type()` should be a short "class name" identifier (e.g., "W9_TIN", "DHFS_FoodBusinessLicense"), not the full official form name. **HARD LIMIT: 64 characters maximum**
3. **Direct mapping default**: Use PDF field_ids directly from `.form.json` unless using fan-out patterns
4. **Fan-out patterns**: Use `.as_*()` casts to populate multiple PDF fields from single collected value
5. **Exact field_ids**: Keep field IDs from `.form.json` unchanged (use as cast names or direct field names)
6. **Extract knowledge**: ALL form instructions go into Alice traits/hints
7. **Format flexibility**: Never specify format in `.desc()` - Alice accepts variations
8. **Validation vs transformation**: `.must()` for content constraints (use SPARINGLY), `.as_*()` for formatting (use LIBERALLY). Alice NEVER mentions format requirements to Bob
9. **Language matching**: All strings (`.desc()`, `.trait()`, `.hint()`) must match the PDF's language

## Reading Input Files

Your inputs from form-extract:

- **`.chatfield/.form.md`** - PDF content as Markdown (use this for form knowledge)
- **`.chatfield/.form.json`** - Field IDs, types, and metadata

## Extracting Form Knowledge

From `.form.md`, extract ONLY actionable knowledge:

- Form purpose (1-2 sentences)
- Key term definitions
- Field completion instructions
- Valid options/codes
- Decision logic ("If X then Y")

**Do NOT extract:**

- Decorative text
- Repeated boilerplate
- Page numbers, footers

Place extracted knowledge in interview:

- **Form-level** → Alice traits: `.trait("Background: [context]...")`
- **Field-level** → Field hints: `.hint("Background: [guidance]")`

## Builder API Patterns

### Direct Mapping (Default)

One PDF field_id → one question

```python
.field("topmostSubform[0].Page1[0].f1_01[0]")
    .desc("What is your full legal name?")  # English .desc() for English form
    .hint("Background: Should match official records")
```

### Fan-out Pattern

Collect once, populate multiple PDF fields via `.as_*()` casts

```python
.field("age")
    .desc("What is your age in years?")
    .as_int("age_years", "Age as integer")
    .as_bool("over_18", "True if 18 or older")
    .as_str("age_display", "Age formatted for display")
```

**CRITICAL**: For fan-out, cast names MUST be exact PDF field_ids from `.form.json`

#### Re-representation Sub-pattern

When PDF has multiple fields for the same value in different formats (numeric vs words, date vs formatted date, etc.), collect ONCE and use casts:

```python
.field("amount")
    .desc("What is the payment amount?")
    .as_int("amount_numeric", "Amount as number")
    .as_str("amount_in_words", "Amount spelled out in words (e.g., 'One hundred')")

.field("event_date")
    .desc("When did the event occur?")
    .as_str("date_iso", "Date in ISO format (YYYY-MM-DD)")
    .as_str("date_display", "Date formatted as 'January 15, 2025'")
```

**Key principle**: Eliminate
duplicate questions about the same underlying information.

### Discriminate + Split Pattern

Mutually-exclusive fields

```python
.field("tin")
    .desc("Is your taxpayer ID an EIN or SSN, and what is the number?")
    .must("be exactly 9 digits")
    .must("indicate SSN or EIN type")
    .as_str("ssn_part1", "First 3 of SSN, or empty if N/A")
    .as_str("ssn_part2", "Middle 2 of SSN, or empty if N/A")
    .as_str("ssn_part3", "Last 4 of SSN, or empty if N/A")
    .as_str("ein_full", "Full 9-digit EIN, or empty if N/A")
```

### Expand Pattern

Multiple checkboxes from single field

```python
.field("preferences")
    .desc("What are your communication preferences?")
    .as_bool("email_ok", "True if wants email")
    .as_bool("phone_ok", "True if wants phone calls")
    .as_bool("mail_ok", "True if wants postal mail")
```

## `.must()` vs `.as_*()` Usage

**`.must()`** - CONTENT constraints (use SPARINGLY):

- Only when field MUST contain specific information
- Creates hard blocking constraint
- Example: `.must("match tax return exactly")`

**`.as_*()`** - TYPE/FORMAT transformations (use LIBERALLY):

- For any type casting, formatting, derived values
- Alice accepts variations, computes transformation
- Example: `.as_int()`, `.as_bool()`, `.as_str("name", "desc")`

**Rule of thumb**: Expect MORE `.as_*()` calls than `.must()` calls.
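
The split between content constraints and format casts can be illustrated outside Chatfield. The following is a plain-Python sketch, not Chatfield's implementation: `split_tin` is a hypothetical helper that shows which values the casts in the Discriminate + Split example are meant to produce. The two checks stand in for the `.must()` content constraints, and the digit stripping stands in for format flexibility.

```python
import re


def split_tin(raw: str, kind: str) -> dict[str, str]:
    """Illustrate the values the split casts are meant to yield for one answer.

    `raw` is whatever the user typed; `kind` is "SSN" or "EIN".
    Hypothetical helper: Chatfield computes casts itself, not via this code.
    """
    # Format flexibility: hyphens, spaces, and other separators are ignored.
    digits = re.sub(r"\D", "", raw)

    # Stand-in for the content constraint .must("be exactly 9 digits")
    if len(digits) != 9:
        raise ValueError("TIN must contain exactly 9 digits")

    # Stand-in for the content constraint .must("indicate SSN or EIN type")
    if kind not in ("SSN", "EIN"):
        raise ValueError("kind must be 'SSN' or 'EIN'")

    is_ssn = kind == "SSN"
    # Mutually exclusive outputs: the unused side is left empty.
    return {
        "ssn_part1": digits[:3] if is_ssn else "",
        "ssn_part2": digits[3:5] if is_ssn else "",
        "ssn_part3": digits[5:] if is_ssn else "",
        "ein_full": "" if is_ssn else digits,
    }


ssn_values = split_tin("123-45-6789", "SSN")
ein_values = split_tin("12 3456789", "EIN")
print(ssn_values)
print(ein_values)
```

Only two conditions reject the answer, while every formatting decision is expressed as a derived output. That is the same proportion the rule of thumb asks for in an interview definition.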
## Field Types

- **Text** → `.field("id").desc("question")`
- **Checkbox** → `.field("id").desc("question").as_bool()`
- **Radio/choice (required)** → `.field("id").desc("question").as_one("opt1", "opt2")`
- **Radio/choice (optional)** → `.field("id").desc("question").as_nullable_one("opt1", "opt2")`

## Optional Fields

```python
.field("middle_name")
    .desc("Middle name")
    .hint("Background: Optional per form instructions")
```

## Hint Conventions

All hints must have a prefix:

- **"Background:"** - Internal notes for Alice only
  - Alice uses these for formatting, conversions, context without mentioning to Bob
  - Example: `.hint("Background: Convert to Buddhist calendar by adding 543 years")`
- **"Tooltip:"** - May be shared with Bob if helpful
  - Example: `.hint("Tooltip: Your employer provides this number")`

**See ./Data-Model-API.md** for complete list of transformations (`.as_int()`, `.as_bool()`, etc.) and cardinality options (`.as_one()`, `.as_multi()`, etc.).

## When to Use `.conclude()`

Only when derived field depends on multiple previous fields OR complex logic that can't be expressed in a single field's casts.

## Additional Guidance from PDF Forms

**Extract Knowledge Wisely:**

- Extract actionable knowledge ONLY from PDF
  - Form purpose (1-2 sentences max)
  - Key term definitions
  - Field completion instructions
  - Valid options/codes
  - Decision logic ("If X then Y")
- **Do NOT extract**: Decorative text, repeated boilerplate, page numbers, footers

**Alice Traits for Format Flexibility:**

```python
.alice()
    .type("Form Assistant")
    .trait("Collects information content naturally, handling all formatting invisibly")
    .trait("Accepts format variations (SSN with/without hyphens)")
    .trait("Background: [extracted form knowledge goes here]")
```

**Default to Direct Mapping:** PDF field_ids are internal - users only see `.desc()`. Use field IDs directly unless using fan-out patterns.

**Format Flexibility:** Never specify format in `.desc()` - Alice accepts variations.
Use `.as_*()` for formatting requirements.

## Complete Example

```python
from chatfield import chatfield

interview = (chatfield()
    .type("W9_TIN")
    .desc("Form to provide TIN to entities paying income")

    .alice()
        .type("Tax Form Assistant")
        .trait("Collects information content naturally, handling all formatting invisibly")
        .trait("Accepts format variations (SSN with/without hyphens)")
        .trait("Background: W-9 used to provide TIN to entities paying income")
        .trait("Background: EIN for business entities, SSN for individuals")

    .bob()
        .type("Taxpayer completing W-9 form")
        .trait("Speaks naturally and freely")

    .field("name")
        .desc("What is your full legal name as shown on your tax return?")
        .hint("Background: Must match IRS records exactly")

    .field("business_name")
        .desc("Business name or disregarded entity name, if different from above")
        .hint("Background: Optional - only if applicable")

    .field("tin")
        .desc("What is your taxpayer identification number (SSN or EIN)?")
        .must("be exactly 9 digits")
        .must("indicate whether SSN or EIN")
        .as_str("ssn_part1", "First 3 digits of SSN, or empty if using EIN")
        .as_str("ssn_part2", "Middle 2 digits of SSN, or empty if using EIN")
        .as_str("ssn_part3", "Last 4 digits of SSN, or empty if using EIN")
        .as_str("ein_part1", "First 2 digits of EIN, or empty if using SSN")
        .as_str("ein_part2", "Last 7 digits of EIN, or empty if using SSN")

    .field("address")
        .desc("What is your address (number, street, apt/suite)?")

    .field("city_state_zip")
        .desc("What is your city, state, and ZIP code?")
        .as_str("city", "City name")
        .as_str("state", "State abbreviation (2 letters)")
        .as_str("zip", "ZIP code")

    .build()
)
```

## Validation Checklist

Before proceeding, validate the interview definition:

```
Interview Validation Checklist:
- [ ] All field_ids from .form.json are mapped
- [ ] No field_ids duplicated or missing
- [ ] Re-representations (amount/amount_in_words, date/date_formatted, etc.) use single field with casts, not duplicate questions
- [ ] .desc() describes WHAT information is needed (content), never HOW it should be formatted
- [ ] .hint() provides context about content (e.g., "Optional", "Must match passport"), never formatting instructions
- [ ] All formatting requirements (dates, codes, number formats, etc.) use .as_*() transformations exclusively
- [ ] Fan-out patterns use .as_*() with PDF field_ids as cast names
- [ ] Split patterns use .as_*() with "or empty/0 if N/A" descriptions
- [ ] Discriminate + split uses .as_*() for mutually-exclusive fields
- [ ] Expand pattern uses .as_*() casts on single field
- [ ] .conclude() used only when necessary (multi-field dependencies)
- [ ] Alice traits include extracted form knowledge
- [ ] Field hints provide context from PDF instructions
- [ ] Optional fields explicitly marked with hint("Background: Optional...")
- [ ] .must() used sparingly (only true content requirements)
- [ ] Field .desc() questions are natural and user-friendly (no technical field_ids)
- [ ] ALL STRINGS match the PDF's primary language
```

If any items fail:

1. Review the specific issue
2. Fix the interview definition
3. Re-run validation checklist
4. Proceed only when all items pass

## The Result: Form Data Model

When validation passes, you have successfully created the **Form Data Model** in `.chatfield/interview.py`.
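
A few checklist items are mechanical and could be pre-checked by a script before the manual review. The following is a minimal sketch under stated assumptions: `check_field_coverage` is a hypothetical helper, not part of Chatfield, and it takes plain lists because the layout of `.form.json` is not specified in this guide. It covers only three items: unmapped field_ids, duplicated names, and the 64-character limit on `.type()`.

```python
from collections import Counter


def check_field_coverage(form_field_ids, mapped_names, type_name):
    """Return a list of problems for three of the checklist items.

    form_field_ids: field IDs taken from .form.json (loading is left out
        because the JSON layout is not specified here).
    mapped_names: every field name and cast name used in interview.py.
    type_name: the string passed to the top-level .type().
    """
    problems = []
    counts = Counter(mapped_names)

    # Checklist: all field_ids from .form.json are mapped
    missing = [fid for fid in form_field_ids if fid not in counts]
    if missing:
        problems.append(f"unmapped field_ids: {missing}")

    # Checklist: no field_ids duplicated
    duplicated = sorted(name for name, n in counts.items() if n > 1)
    if duplicated:
        problems.append(f"duplicated names: {duplicated}")

    # Key rule: .type() has a hard limit of 64 characters
    if len(type_name) > 64:
        problems.append(f".type() is {len(type_name)} characters (limit 64)")

    return problems


# Hypothetical IDs for illustration only
form_ids = ["name", "ssn_part1", "ssn_part2", "ssn_part3"]
mapped = ["name", "ssn_part1", "ssn_part2", "ssn_part2"]
problems = check_field_coverage(form_ids, mapped, "W9_TIN")
for problem in problems:
    print(problem)
```

An empty result means only that these three checks passed. The remaining items, such as language matching and keeping format instructions out of `.desc()`, still need a human read of the interview definition.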