Initial commit
This commit is contained in:
@@ -0,0 +1,332 @@
|
||||
# Converting PDF Forms to Chatfield Interviews
|
||||
|
||||
<purpose>
|
||||
This guide covers how to build a Chatfield interview definition from PDF form data. This is the core transformation step that converts a static PDF form into a conversational interview.
|
||||
</purpose>
|
||||
|
||||
<important>
|
||||
**Read complete API reference**: See ./Data-Model-API.md for all builder methods, transformations, and validation rules.
|
||||
</important>
|
||||
|
||||
## Process Overview
|
||||
|
||||
```plantuml
|
||||
@startuml Converting-PDF-To-Chatfield
|
||||
title Converting PDF Forms to Chatfield Interviews
|
||||
start
|
||||
:Prerequisites: Form extraction complete;
|
||||
partition "Read Input Files" {
|
||||
:Read <basename>.form.md;
|
||||
:Read <basename>.form.json;
|
||||
}
|
||||
:Build Interview Definition;
|
||||
repeat
|
||||
:Validate Form Data Model
|
||||
(see validation checklist);
|
||||
if (All checks pass?) then (yes)
|
||||
else (no)
|
||||
:Fix issues identified in validation;
|
||||
endif
|
||||
repeat while (All checks pass?) is (no)
|
||||
->yes;
|
||||
:**✓ FORM DATA MODEL COMPLETE**;
|
||||
:interview.py ready for next step;
|
||||
stop
|
||||
@enduml
|
||||
```
|
||||
|
||||
## The Form Data Model
|
||||
|
||||
<definition>
|
||||
The **Form Data Model** is the `interview.py` file in the `.chatfield/` working directory. This file contains the chatfield builder definition that faithfully represents the PDF form.
|
||||
</definition>
|
||||
|
||||
## Critical Principle: Faithfulness to Original PDF
|
||||
|
||||
<critical_principle>
|
||||
**The Form Data Model must be as accurate and faithful as possible to the source PDF.**
|
||||
|
||||
**Why?** Downstream code will NOT see the PDF anymore. The interview must create the "illusion" that the AI agent has full access to the form, speaking to the user, writing information - all from the Form Data Model alone.
|
||||
|
||||
This means every field, every instruction, every validation rule from the PDF must be captured in the interview definition.
|
||||
</critical_principle>
|
||||
|
||||
## Language Matching Rule
|
||||
|
||||
**CRITICAL: Only pass English-language strings to the chatfield builder API for English-language forms.**
|
||||
|
||||
The chatfield object strings should virtually always match the PDF's primary language:
|
||||
- `.type()` - Use short identifier (e.g., "DHFS_FoodBusinessLicense"), not full official name. **HARD LIMIT: 64 characters maximum**
|
||||
- `.desc()` - Use form's language
|
||||
- `.trait()` - Use form's language for Background content
|
||||
- `.hint()` - Use form's language
|
||||
|
||||
**Translation happens LATER** (see ./Translating.md), not during initial definition.
|
||||
|
||||
## Key Rules
|
||||
|
||||
These fundamental rules apply to all Form Data Models:
|
||||
|
||||
1. **Faithfulness to PDF**: The interview definition must accurately represent the source PDF form
|
||||
2. **Short type identifiers**: Top-level `.type()` should be a short "class name" identifier (e.g., "W9_TIN", "DHFS_FoodBusinessLicense"), not the full official form name. **HARD LIMIT: 64 characters maximum**
|
||||
3. **Direct mapping default**: Use PDF field_ids directly from `.form.json` unless using fan-out patterns
|
||||
4. **Fan-out patterns**: Use `.as_*()` casts to populate multiple PDF fields from single collected value
|
||||
5. **Exact field_ids**: Keep field IDs from `.form.json` unchanged (use as cast names or direct field names)
|
||||
6. **Extract knowledge**: ALL form instructions go into Alice traits/hints
|
||||
7. **Format flexibility**: Never specify format in `.desc()` - Alice accepts variations
|
||||
8. **Validation vs transformation**: `.must()` for content constraints (use SPARINGLY), `.as_*()` for formatting (use LIBERALLY). Alice NEVER mentions format requirements to Bob
|
||||
9. **Language matching**: All strings (`.desc()`, `.trait()`, `.hint()`) must match the PDF's language
|
||||
|
||||
## Reading Input Files
|
||||
|
||||
Your inputs from form-extract:
|
||||
- **`<basename>.chatfield/<basename>.form.md`** - PDF content as Markdown (use this for form knowledge)
|
||||
- **`<basename>.chatfield/<basename>.form.json`** - Field IDs, types, and metadata
|
||||
|
||||
## Extracting Form Knowledge
|
||||
|
||||
From `.form.md`, extract ONLY actionable knowledge:
|
||||
- Form purpose (1-2 sentences)
|
||||
- Key term definitions
|
||||
- Field completion instructions
|
||||
- Valid options/codes
|
||||
- Decision logic ("If X then Y")
|
||||
|
||||
**Do NOT extract:**
|
||||
- Decorative text
|
||||
- Repeated boilerplate
|
||||
- Page numbers, footers
|
||||
|
||||
Place extracted knowledge in interview:
|
||||
- **Form-level** → Alice traits: `.trait("Background: [context]...")`
|
||||
- **Field-level** → Field hints: `.hint("Background: [guidance]")`
|
||||
|
||||
## Builder API Patterns
|
||||
|
||||
### Direct Mapping (Default)
|
||||
|
||||
One PDF field_id → one question
|
||||
|
||||
```python
|
||||
.field("topmostSubform[0].Page1[0].f1_01[0]")
|
||||
.desc("What is your full legal name?") # English .desc() for English form
|
||||
.hint("Background: Should match official records")
|
||||
```
|
||||
|
||||
### Fan-out Pattern
|
||||
|
||||
Collect once, populate multiple PDF fields via `.as_*()` casts
|
||||
|
||||
```python
|
||||
.field("age")
|
||||
.desc("What is your age in years?")
|
||||
.as_int("age_years", "Age as integer")
|
||||
.as_bool("over_18", "True if 18 or older")
|
||||
.as_str("age_display", "Age formatted for display")
|
||||
```
|
||||
|
||||
**CRITICAL**: For fan-out, cast names MUST be exact PDF field_ids from `.form.json`
|
||||
|
||||
#### Re-representation Sub-pattern
|
||||
|
||||
When PDF has multiple fields for the same value in different formats (numeric vs words, date vs formatted date, etc.), collect ONCE and use casts:
|
||||
|
||||
```python
|
||||
.field("amount")
|
||||
.desc("What is the payment amount?")
|
||||
.as_int("amount_numeric", "Amount as number")
|
||||
.as_str("amount_in_words", "Amount spelled out in words (e.g., 'One hundred')")
|
||||
|
||||
.field("event_date")
|
||||
.desc("When did the event occur?")
|
||||
.as_str("date_iso", "Date in ISO format (YYYY-MM-DD)")
|
||||
.as_str("date_display", "Date formatted as 'January 15, 2025'")
|
||||
```
|
||||
|
||||
**Key principle**: Eliminate duplicate questions about the same underlying information.
|
||||
|
||||
### Discriminate + Split Pattern
|
||||
|
||||
Mutually-exclusive fields
|
||||
|
||||
```python
|
||||
.field("tin")
|
||||
.desc("Is your taxpayer ID an EIN or SSN, and what is the number?")
|
||||
.must("be exactly 9 digits")
|
||||
.must("indicate SSN or EIN type")
|
||||
.as_str("ssn_part1", "First 3 of SSN, or empty if N/A")
|
||||
.as_str("ssn_part2", "Middle 2 of SSN, or empty if N/A")
|
||||
.as_str("ssn_part3", "Last 4 of SSN, or empty if N/A")
|
||||
.as_str("ein_full", "Full 9-digit EIN, or empty if N/A")
|
||||
```
|
||||
|
||||
### Expand Pattern
|
||||
|
||||
Multiple checkboxes from single field
|
||||
|
||||
```python
|
||||
.field("preferences")
|
||||
.desc("What are your communication preferences?")
|
||||
.as_bool("email_ok", "True if wants email")
|
||||
.as_bool("phone_ok", "True if wants phone calls")
|
||||
.as_bool("mail_ok", "True if wants postal mail")
|
||||
```
|
||||
|
||||
## `.must()` vs `.as_*()` Usage
|
||||
|
||||
**`.must()`** - CONTENT constraints (use SPARINGLY):
|
||||
- Only when field MUST contain specific information
|
||||
- Creates hard blocking constraint
|
||||
- Example: `.must("match tax return exactly")`
|
||||
|
||||
**`.as_*()`** - TYPE/FORMAT transformations (use LIBERALLY):
|
||||
- For any type casting, formatting, derived values
|
||||
- Alice accepts variations, computes transformation
|
||||
- Example: `.as_int()`, `.as_bool()`, `.as_str("name", "desc")`
|
||||
|
||||
**Rule of thumb**: Expect MORE `.as_*()` calls than `.must()` calls.
|
||||
|
||||
## Field Types
|
||||
|
||||
- **Text** → `.field("id").desc("question")`
|
||||
- **Checkbox** → `.field("id").desc("question").as_bool()`
|
||||
- **Radio/choice (required)** → `.field("id").desc("question").as_one("opt1", "opt2")`
|
||||
- **Radio/choice (optional)** → `.field("id").desc("question").as_nullable_one("opt1", "opt2")`
|
||||
|
||||
## Optional Fields
|
||||
|
||||
```python
|
||||
.field("middle_name")
|
||||
.desc("Middle name")
|
||||
.hint("Background: Optional per form instructions")
|
||||
```
|
||||
|
||||
## Hint Conventions
|
||||
|
||||
All hints must have a prefix:
|
||||
|
||||
- **"Background:"** - Internal notes for Alice only
|
||||
- Alice uses these for formatting, conversions, context without mentioning to Bob
|
||||
- Example: `.hint("Background: Convert to Buddhist calendar by adding 543 years")`
|
||||
- **"Tooltip:"** - May be shared with Bob if helpful
|
||||
- Example: `.hint("Tooltip: Your employer provides this number")`
|
||||
|
||||
**See ./Data-Model-API.md** for complete list of transformations (`.as_int()`, `.as_bool()`, etc.) and cardinality options (`.as_one()`, `.as_multi()`, etc.).
|
||||
|
||||
## When to Use `.conclude()`
|
||||
|
||||
Only when derived field depends on multiple previous fields OR complex logic that can't be expressed in a single field's casts.
|
||||
|
||||
## Additional Guidance from PDF Forms
|
||||
|
||||
**Extract Knowledge Wisely:**
|
||||
- Extract actionable knowledge ONLY from PDF
|
||||
- Form purpose (1-2 sentences max)
|
||||
- Key term definitions
|
||||
- Field completion instructions
|
||||
- Valid options/codes
|
||||
- Decision logic ("If X then Y")
|
||||
- **Do NOT extract**: Decorative text, repeated boilerplate, page numbers, footers
|
||||
|
||||
**Alice Traits for Format Flexibility:**
|
||||
```python
|
||||
.alice()
|
||||
.type("Form Assistant")
|
||||
.trait("Collects information content naturally, handling all formatting invisibly")
|
||||
.trait("Accepts format variations (SSN with/without hyphens)")
|
||||
.trait("Background: [extracted form knowledge goes here]")
|
||||
```
|
||||
|
||||
**Default to Direct Mapping:**
|
||||
PDF field_ids are internal - users only see `.desc()`. Use field IDs directly unless using fan-out patterns.
|
||||
|
||||
**Format Flexibility:**
|
||||
Never specify format in `.desc()` - Alice accepts variations. Use `.as_*()` for formatting requirements.
|
||||
|
||||
## Complete Example
|
||||
|
||||
```python
|
||||
from chatfield import chatfield
|
||||
|
||||
interview = (chatfield()
|
||||
.type("W9_TIN")
|
||||
.desc("Form to provide TIN to entities paying income")
|
||||
|
||||
.alice()
|
||||
.type("Tax Form Assistant")
|
||||
.trait("Collects information content naturally, handling all formatting invisibly")
|
||||
.trait("Accepts format variations (SSN with/without hyphens)")
|
||||
.trait("Background: W-9 used to provide TIN to entities paying income")
|
||||
.trait("Background: EIN for business entities, SSN for individuals")
|
||||
|
||||
.bob()
|
||||
.type("Taxpayer completing W-9 form")
|
||||
.trait("Speaks naturally and freely")
|
||||
|
||||
.field("name")
|
||||
.desc("What is your full legal name as shown on your tax return?")
|
||||
.hint("Background: Must match IRS records exactly")
|
||||
|
||||
.field("business_name")
|
||||
.desc("Business name or disregarded entity name, if different from above")
|
||||
.hint("Background: Optional - only if applicable")
|
||||
|
||||
.field("tin")
|
||||
.desc("What is your taxpayer identification number (SSN or EIN)?")
|
||||
.must("be exactly 9 digits")
|
||||
.must("indicate whether SSN or EIN")
|
||||
.as_str("ssn_part1", "First 3 digits of SSN, or empty if using EIN")
|
||||
.as_str("ssn_part2", "Middle 2 digits of SSN, or empty if using EIN")
|
||||
.as_str("ssn_part3", "Last 4 digits of SSN, or empty if using EIN")
|
||||
.as_str("ein_part1", "First 2 digits of EIN, or empty if using SSN")
|
||||
.as_str("ein_part2", "Last 7 digits of EIN, or empty if using SSN")
|
||||
|
||||
.field("address")
|
||||
.desc("What is your address (number, street, apt/suite)?")
|
||||
|
||||
.field("city_state_zip")
|
||||
.desc("What is your city, state, and ZIP code?")
|
||||
.as_str("city", "City name")
|
||||
.as_str("state", "State abbreviation (2 letters)")
|
||||
.as_str("zip", "ZIP code")
|
||||
|
||||
.build()
|
||||
)
|
||||
```
|
||||
|
||||
## Validation Checklist
|
||||
|
||||
Before proceeding, validate the interview definition:
|
||||
|
||||
<validation_checklist>
|
||||
```
|
||||
Interview Validation Checklist:
|
||||
- [ ] All field_ids from .form.json are mapped
|
||||
- [ ] No field_ids duplicated or missing
|
||||
- [ ] Re-representations (amount/amount_in_words, date/date_formatted, etc.) use single field with casts, not duplicate questions
|
||||
- [ ] .desc() describes WHAT information is needed (content), never HOW it should be formatted
|
||||
- [ ] .hint() provides context about content (e.g., "Optional", "Must match passport"), never formatting instructions
|
||||
- [ ] All formatting requirements (dates, codes, number formats, etc.) use .as_*() transformations exclusively
|
||||
- [ ] Fan-out patterns use .as_*() with PDF field_ids as cast names
|
||||
- [ ] Split patterns use .as_*() with "or empty/0 if N/A" descriptions
|
||||
- [ ] Discriminate + split uses .as_*() for mutually-exclusive fields
|
||||
- [ ] Expand pattern uses .as_*() casts on single field
|
||||
- [ ] .conclude() used only when necessary (multi-field dependencies)
|
||||
- [ ] Alice traits include extracted form knowledge
|
||||
- [ ] Field hints provide context from PDF instructions
|
||||
- [ ] Optional fields explicitly marked with hint("Background: Optional...")
|
||||
- [ ] .must() used sparingly (only true content requirements)
|
||||
- [ ] Field .desc() questions are natural and user-friendly (no technical field_ids)
|
||||
- [ ] ALL STRINGS match the PDF's primary language
|
||||
```
|
||||
</validation_checklist>
|
||||
|
||||
If any items fail:
|
||||
1. Review the specific issue
|
||||
2. Fix the interview definition
|
||||
3. Re-run validation checklist
|
||||
4. Proceed only when all items pass
|
||||
|
||||
## The Result: Form Data Model
|
||||
|
||||
When validation passes, you have successfully created the **Form Data Model** in `<basename>.chatfield/interview.py`.
|
||||
Reference in New Issue
Block a user