# Converting PDF Forms to Chatfield Interviews

<purpose>
This guide covers how to build a Chatfield interview definition from PDF form data. This is the core transformation step that converts a static PDF form into a conversational interview.
</purpose>

<important>
**Read complete API reference**: See ./Data-Model-API.md for all builder methods, transformations, and validation rules.
</important>

## Process Overview

```plantuml
@startuml Converting-PDF-To-Chatfield
title Converting PDF Forms to Chatfield Interviews
start
:Prerequisites: Form extraction complete;
partition "Read Input Files" {
  :Read <basename>.form.md;
  :Read <basename>.form.json;
}
:Build Interview Definition;
repeat
  :Validate Form Data Model
  (see validation checklist);
  if (All checks pass?) then (yes)
  else (no)
    :Fix issues identified in validation;
  endif
repeat while (All checks pass?) is (no)
->yes;
:**✓ FORM DATA MODEL COMPLETE**;
:interview.py ready for next step;
stop
@enduml
```

## The Form Data Model

<definition>
The **Form Data Model** is the `interview.py` file in the `.chatfield/` working directory. This file contains the chatfield builder definition that faithfully represents the PDF form.
</definition>

## Critical Principle: Faithfulness to Original PDF

<critical_principle>
**The Form Data Model must be as accurate and faithful as possible to the source PDF.**

**Why?** Downstream code will no longer see the PDF. The interview must create the "illusion" that the AI agent has full access to the form while it speaks with the user and records information, all from the Form Data Model alone.

This means every field, every instruction, every validation rule from the PDF must be captured in the interview definition.
</critical_principle>

## Language Matching Rule

**CRITICAL: Pass strings to the chatfield builder API in the PDF's primary language. For an English-language form, pass only English-language strings.**

The chatfield object strings should virtually always match the PDF's primary language:
- `.type()` - Use short identifier (e.g., "DHFS_FoodBusinessLicense"), not full official name. **HARD LIMIT: 64 characters maximum**
- `.desc()` - Use form's language
- `.trait()` - Use form's language for Background content
- `.hint()` - Use form's language

**Translation happens LATER** (see ./Translating.md), not during initial definition.
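
The 64-character limit on `.type()` is easy to check mechanically before building the interview. The following is a minimal sketch; `check_type_identifier` is a hypothetical helper written for illustration and is not part of the chatfield API.

```python
MAX_TYPE_LENGTH = 64  # hard limit stated in this guide


def check_type_identifier(type_id: str) -> str:
    """Return type_id unchanged if it fits the limit, otherwise raise ValueError.

    Hypothetical helper: it only checks emptiness and length, nothing else.
    """
    if not type_id:
        raise ValueError("type identifier must not be empty")
    if len(type_id) > MAX_TYPE_LENGTH:
        raise ValueError(
            f"type identifier is {len(type_id)} characters; "
            f"the limit is {MAX_TYPE_LENGTH}"
        )
    return type_id
```

For example, `check_type_identifier("DHFS_FoodBusinessLicense")` passes, while a full official form name longer than 64 characters raises `ValueError`.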

## Key Rules

These fundamental rules apply to all Form Data Models:

1. **Faithfulness to PDF**: The interview definition must accurately represent the source PDF form
2. **Short type identifiers**: Top-level `.type()` should be a short "class name" identifier (e.g., "W9_TIN", "DHFS_FoodBusinessLicense"), not the full official form name. **HARD LIMIT: 64 characters maximum**
3. **Direct mapping default**: Use PDF field_ids directly from `.form.json` unless using fan-out patterns
4. **Fan-out patterns**: Use `.as_*()` casts to populate multiple PDF fields from single collected value
5. **Exact field_ids**: Keep field IDs from `.form.json` unchanged (use as cast names or direct field names)
6. **Extract knowledge**: ALL form instructions go into Alice traits/hints (Alice is the interviewer role; Bob is the person completing the form)
7. **Format flexibility**: Never specify format in `.desc()` - Alice accepts variations
8. **Validation vs transformation**: `.must()` for content constraints (use SPARINGLY), `.as_*()` for formatting (use LIBERALLY). Alice NEVER mentions format requirements to Bob
9. **Language matching**: All strings (`.desc()`, `.trait()`, `.hint()`) must match the PDF's language

## Reading Input Files

Your inputs from form-extract:
- **`<basename>.chatfield/<basename>.form.md`** - PDF content as Markdown (use this for form knowledge)
- **`<basename>.chatfield/<basename>.form.json`** - Field IDs, types, and metadata
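
Reading both inputs can be sketched as follows. This assumes that `<basename>` is the PDF filename without its extension and that the `.chatfield/` working directory sits next to the PDF; neither is confirmed by this guide. `read_form_inputs` is a hypothetical helper, and the internal layout of `.form.json` is left uninterpreted.

```python
import json
from pathlib import Path


def read_form_inputs(pdf_path):
    """Return (markdown_text, parsed_json) for a PDF's extracted form files.

    Assumption: files live at <basename>.chatfield/<basename>.form.md and
    <basename>.chatfield/<basename>.form.json, next to the PDF itself.
    """
    pdf = Path(pdf_path)
    basename = pdf.stem
    workdir = pdf.with_name(f"{basename}.chatfield")
    form_md = (workdir / f"{basename}.form.md").read_text(encoding="utf-8")
    form_json = json.loads(
        (workdir / f"{basename}.form.json").read_text(encoding="utf-8")
    )
    return form_md, form_json
```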

## Extracting Form Knowledge

From `.form.md`, extract ONLY actionable knowledge:
- Form purpose (1-2 sentences)
- Key term definitions
- Field completion instructions
- Valid options/codes
- Decision logic ("If X then Y")

**Do NOT extract:**
- Decorative text
- Repeated boilerplate
- Page numbers, footers

Place extracted knowledge in interview:
- **Form-level** → Alice traits: `.trait("Background: [context]...")`
- **Field-level** → Field hints: `.hint("Background: [guidance]")`

## Builder API Patterns

### Direct Mapping (Default)

One PDF field_id → one question

```python
.field("topmostSubform[0].Page1[0].f1_01[0]")
    .desc("What is your full legal name?")  # English .desc() for English form
    .hint("Background: Should match official records")
```

### Fan-out Pattern

Collect once, populate multiple PDF fields via `.as_*()` casts

```python
# Placeholder cast names; in a real model each must be an exact field_id from .form.json
.field("age")
    .desc("What is your age in years?")
    .as_int("age_years", "Age as integer")
    .as_bool("over_18", "True if 18 or older")
    .as_str("age_display", "Age formatted for display")
```

**CRITICAL**: For fan-out, cast names MUST be exact PDF field_ids from `.form.json`
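
A quick way to catch cast names that are not real PDF field_ids is to compare them against the IDs taken from `.form.json`. This is a minimal sketch; `unknown_cast_names` is a hypothetical helper, and how the field_ids are pulled out of `.form.json` is left to the caller because the file's layout is not specified here.

```python
def unknown_cast_names(cast_names, form_field_ids):
    """Return the cast names that do not appear among the PDF field_ids.

    Order of the input cast names is preserved; an empty list means
    every cast name matches a field_id exactly.
    """
    known = set(form_field_ids)
    return [name for name in cast_names if name not in known]
```

An empty result means every cast targets an existing PDF field; anything returned is a typo or a name that was altered from the original field_id.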

#### Re-representation Sub-pattern

When PDF has multiple fields for the same value in different formats (numeric vs words, date vs formatted date, etc.), collect ONCE and use casts:

```python
.field("amount")
    .desc("What is the payment amount?")
    .as_int("amount_numeric", "Amount as number")
    .as_str("amount_in_words", "Amount spelled out in words (e.g., 'One hundred')")

.field("event_date")
    .desc("When did the event occur?")
    .as_str("date_iso", "Date in ISO format (YYYY-MM-DD)")
    .as_str("date_display", "Date formatted as 'January 15, 2025'")
```

**Key principle**: Eliminate duplicate questions about the same underlying information.
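
To make the target values concrete, this sketch shows in plain Python the two representations that the date casts above describe. In the interview itself the casts are computed by Alice from their descriptions, not by this code; `date_representations` is a hypothetical helper used only to illustrate one value feeding two PDF fields.

```python
from datetime import date


def date_representations(d: date) -> dict:
    """Derive both cast values from a single collected date."""
    return {
        # ISO format, YYYY-MM-DD
        "date_iso": d.isoformat(),
        # Month name, day without zero padding, four-digit year
        # (month name is English under Python's default C locale)
        "date_display": f"{d:%B} {d.day}, {d.year}",
    }
```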

### Discriminate + Split Pattern

Use this for mutually exclusive PDF fields: collect one answer, fill only the fields that apply, and leave the others empty.

```python
.field("tin")
    .desc("Is your taxpayer ID an EIN or SSN, and what is the number?")
    .must("be exactly 9 digits")
    .must("indicate SSN or EIN type")
    .as_str("ssn_part1", "First 3 of SSN, or empty if N/A")
    .as_str("ssn_part2", "Middle 2 of SSN, or empty if N/A")
    .as_str("ssn_part3", "Last 4 of SSN, or empty if N/A")
    .as_str("ein_full", "Full 9-digit EIN, or empty if N/A")
```
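
The expected cast values for the `tin` field can be illustrated in plain Python. This is only a sketch of the intended outcome, since in the interview Alice computes the casts; `split_tin` is a hypothetical helper and the `"SSN"`/`"EIN"` labels are assumptions.

```python
def split_tin(kind: str, number: str) -> dict:
    """Return the four cast values for a taxpayer ID of the given kind."""
    # Accept variations such as hyphens or spaces by keeping digits only
    digits = "".join(ch for ch in number if ch.isdigit())
    if len(digits) != 9:
        raise ValueError("TIN must contain exactly 9 digits")

    # Every cast starts empty; only the applicable branch is filled in
    casts = {"ssn_part1": "", "ssn_part2": "", "ssn_part3": "", "ein_full": ""}
    if kind == "SSN":
        casts.update(
            ssn_part1=digits[:3], ssn_part2=digits[3:5], ssn_part3=digits[5:]
        )
    elif kind == "EIN":
        casts["ein_full"] = digits
    else:
        raise ValueError("kind must be 'SSN' or 'EIN'")
    return casts
```

The fields for the type that does not apply stay empty, which mirrors the "or empty if N/A" wording in the cast descriptions.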

### Expand Pattern

Use this to populate multiple PDF checkboxes from a single collected answer.

```python
.field("preferences")
    .desc("What are your communication preferences?")
    .as_bool("email_ok", "True if wants email")
    .as_bool("phone_ok", "True if wants phone calls")
    .as_bool("mail_ok", "True if wants postal mail")
```

## `.must()` vs `.as_*()` Usage

**`.must()`** - CONTENT constraints (use SPARINGLY):
- Only when field MUST contain specific information
- Creates hard blocking constraint
- Example: `.must("match tax return exactly")`

**`.as_*()`** - TYPE/FORMAT transformations (use LIBERALLY):
- For any type casting, formatting, derived values
- Alice accepts variations, computes transformation
- Example: `.as_int()`, `.as_bool()`, `.as_str("name", "desc")`

**Rule of thumb**: Expect MORE `.as_*()` calls than `.must()` calls.

## Field Types

- **Text** → `.field("id").desc("question")`
- **Checkbox** → `.field("id").desc("question").as_bool()`
- **Radio/choice (required)** → `.field("id").desc("question").as_one("opt1", "opt2")`
- **Radio/choice (optional)** → `.field("id").desc("question").as_nullable_one("opt1", "opt2")`
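
The mapping above can be sketched as a small generator that emits the builder call text for one field. `builder_snippet` is a hypothetical helper, and the kind names `"text"`, `"checkbox"`, and `"radio"` are assumptions; the type labels actually used in `.form.json` may differ.

```python
import json


def _quote(text: str) -> str:
    """Render text as a double-quoted string literal."""
    return json.dumps(text, ensure_ascii=False)


def builder_snippet(kind, field_id, question, options=(), required=True):
    """Return builder source text for one PDF field, following the list above."""
    base = f".field({_quote(field_id)}).desc({_quote(question)})"
    if kind == "text":
        return base
    if kind == "checkbox":
        return base + ".as_bool()"
    if kind == "radio":
        # Required choices use as_one; optional choices use as_nullable_one
        method = "as_one" if required else "as_nullable_one"
        args = ", ".join(_quote(option) for option in options)
        return f"{base}.{method}({args})"
    raise ValueError(f"unsupported field kind: {kind!r}")
```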

## Optional Fields

```python
.field("middle_name")
    .desc("Middle name")
    .hint("Background: Optional per form instructions")
```

## Hint Conventions

All hints must have a prefix:

- **"Background:"** - Internal notes for Alice only
  - Alice uses these for formatting, conversions, context without mentioning to Bob
  - Example: `.hint("Background: Convert to Buddhist calendar by adding 543 years")`
- **"Tooltip:"** - May be shared with Bob if helpful
  - Example: `.hint("Tooltip: Your employer provides this number")`
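
The prefix rule can be checked with a one-line test over every hint string. This is a minimal sketch; `has_valid_prefix` is a hypothetical helper, not a chatfield function.

```python
ALLOWED_HINT_PREFIXES = ("Background:", "Tooltip:")


def has_valid_prefix(hint: str) -> bool:
    """Return True if the hint starts with one of the two required prefixes."""
    return hint.startswith(ALLOWED_HINT_PREFIXES)
```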

**See ./Data-Model-API.md** for complete list of transformations (`.as_int()`, `.as_bool()`, etc.) and cardinality options (`.as_one()`, `.as_multi()`, etc.).

## When to Use `.conclude()`

Use `.conclude()` only when a derived field depends on multiple previous fields, or on complex logic that cannot be expressed in a single field's casts.

## Additional Guidance from PDF Forms

**Alice Traits for Format Flexibility:**
```python
.alice()
    .type("Form Assistant")
    .trait("Collects information content naturally, handling all formatting invisibly")
    .trait("Accepts format variations (SSN with/without hyphens)")
    .trait("Background: [extracted form knowledge goes here]")
```

**Default to Direct Mapping:**
PDF field_ids are internal - users only see `.desc()`. Use field IDs directly unless using fan-out patterns.

**Format Flexibility:**
Never specify format in `.desc()` - Alice accepts variations. Use `.as_*()` for formatting requirements.

## Complete Example

```python
from chatfield import chatfield

interview = (chatfield()
    .type("W9_TIN")
    .desc("Form to provide TIN to entities paying income")

    .alice()
        .type("Tax Form Assistant")
        .trait("Collects information content naturally, handling all formatting invisibly")
        .trait("Accepts format variations (SSN with/without hyphens)")
        .trait("Background: W-9 used to provide TIN to entities paying income")
        .trait("Background: EIN for business entities, SSN for individuals")

    .bob()
        .type("Taxpayer completing W-9 form")
        .trait("Speaks naturally and freely")

    .field("name")
        .desc("What is your full legal name as shown on your tax return?")
        .hint("Background: Must match IRS records exactly")

    .field("business_name")
        .desc("Business name or disregarded entity name, if different from above")
        .hint("Background: Optional - only if applicable")

    .field("tin")
        .desc("What is your taxpayer identification number (SSN or EIN)?")
        .must("be exactly 9 digits")
        .must("indicate whether SSN or EIN")
        .as_str("ssn_part1", "First 3 digits of SSN, or empty if using EIN")
        .as_str("ssn_part2", "Middle 2 digits of SSN, or empty if using EIN")
        .as_str("ssn_part3", "Last 4 digits of SSN, or empty if using EIN")
        .as_str("ein_part1", "First 2 digits of EIN, or empty if using SSN")
        .as_str("ein_part2", "Last 7 digits of EIN, or empty if using SSN")

    .field("address")
        .desc("What is your address (number, street, apt/suite)?")

    .field("city_state_zip")
        .desc("What is your city, state, and ZIP code?")
        .as_str("city", "City name")
        .as_str("state", "State abbreviation (2 letters)")
        .as_str("zip", "ZIP code")

    .build()
)
```

## Validation Checklist

Before proceeding, validate the interview definition:

<validation_checklist>
```
Interview Validation Checklist:
- [ ] All field_ids from .form.json are mapped
- [ ] No field_ids duplicated or missing
- [ ] Re-representations (amount/amount_in_words, date/date_formatted, etc.) use single field with casts, not duplicate questions
- [ ] .desc() describes WHAT information is needed (content), never HOW it should be formatted
- [ ] .hint() provides context about content (e.g., "Optional", "Must match passport"), never formatting instructions
- [ ] All formatting requirements (dates, codes, number formats, etc.) use .as_*() transformations exclusively
- [ ] Fan-out patterns use .as_*() with PDF field_ids as cast names
- [ ] Split patterns use .as_*() with "or empty/0 if N/A" descriptions
- [ ] Discriminate + split uses .as_*() for mutually-exclusive fields
- [ ] Expand pattern uses .as_*() casts on single field
- [ ] .conclude() used only when necessary (multi-field dependencies)
- [ ] Alice traits include extracted form knowledge
- [ ] Field hints provide context from PDF instructions
- [ ] Optional fields explicitly marked with hint("Background: Optional...")
- [ ] .must() used sparingly (only true content requirements)
- [ ] Field .desc() questions are natural and user-friendly (no technical field_ids)
- [ ] ALL STRINGS match the PDF's primary language
```
</validation_checklist>

If any items fail:
1. Review the specific issue
2. Fix the interview definition
3. Re-run validation checklist
4. Proceed only when all items pass
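
The first two checklist items (all field_ids mapped, none duplicated or missing) lend themselves to an automated check. The sketch below assumes you have already collected two lists: the field_ids from `.form.json`, and the names in `interview.py` that are meant to populate PDF fields (direct-mapped field names plus cast names, excluding fan-out collector names such as `tin`). `check_field_coverage` is a hypothetical helper.

```python
from collections import Counter


def check_field_coverage(form_field_ids, mapped_names):
    """Compare PDF field_ids against the names that populate them."""
    counts = Counter(mapped_names)
    expected = set(form_field_ids)
    return {
        # field_ids in .form.json that nothing in the interview fills
        "missing": sorted(expected - set(counts)),
        # names used more than once in the interview
        "duplicated": sorted(name for name, n in counts.items() if n > 1),
        # names that do not correspond to any PDF field_id
        "unknown": sorted(set(counts) - expected),
    }
```

All three lists being empty corresponds to the first two checklist items passing; the remaining items still need review by reading the definition.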

## The Result: Form Data Model

When validation passes, you have successfully created the **Form Data Model** in `<basename>.chatfield/interview.py`.