
# XML Element Extraction - Python Implementation

This reference explains how the Python-based extraction script locates and extracts XML elements, and provides troubleshooting guidance. Compared with the previous sed-based approach, the Python implementation is more portable and handles special characters more reliably.

## Core XML Processing Logic

### Python ElementTree Parsing

The Python script uses the standard library `xml.etree.ElementTree` for robust XML parsing:

```python
import xml.etree.ElementTree as ET

# Parse the XML file
tree = ET.parse(source_file)
root = tree.getroot()
```

**Key advantages:**
- Proper XML parsing that understands structure and encoding
- Handles special characters, quotes, and XML entities correctly
- Provides structured access to XML elements and attributes
- Better error handling and validation
- Cross-platform compatibility using Python standard library
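
To illustrate the error-handling point, here is a minimal sketch that parses XML from a string and reports malformed input. The helper name `parse_xml_text` is hypothetical and is not part of the extraction script.

```python
import xml.etree.ElementTree as ET


def parse_xml_text(xml_text):
    """Return the root element, or None if the text is not well-formed XML."""
    try:
        return ET.fromstring(xml_text)
    except ET.ParseError as exc:
        # ParseError.position is a (line, column) tuple pointing at the problem
        line, column = exc.position
        print(f"Malformed XML at line {line}, column {column}: {exc}")
        return None


good = parse_xml_text('<Root><Item Id="0"/></Root>')
bad = parse_xml_text('<Root><Item></Root>')  # mismatched closing tag

print(good[0].get("Id"))
print(bad)
```

A regex- or sed-based approach has no equivalent of `ParseError`, which is one reason a real parser is easier to debug.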

### Tag Matching Algorithm

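The script matches on the full opening tag. It visits every element with the requested tag name, rebuilds each opening tag from the parsed attributes, and compares the normalized result with the normalized input string. The first match is returned: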

```python
def find_element_by_tag_string(tree_root, element_tag):
    tag_name = extract_tag_name(element_tag)
    if tag_name is None:
        # Not a recognizable opening tag; iter(None) would visit every element
        return None

    # Walk all elements with the matching tag name, in document order
    for element in tree_root.iter(tag_name):
        # Reconstruct the opening tag string from the parsed attributes
        if element.attrib:
            attrs = ' '.join(f'{k}="{v}"' for k, v in sorted(element.attrib.items()))
            constructed_tag = f'<{tag_name} {attrs}>'
        else:
            constructed_tag = f'<{tag_name}>'

        # Compare normalized tags so attribute order and whitespace do not matter
        # (names and values are still compared case-sensitively)
        if normalize_tag(constructed_tag) == normalize_tag(element_tag):
            return element

    # No element matched
    return None
```
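
As a point of comparison, the same order-insensitive match can be expressed by comparing attribute dictionaries instead of reconstructed strings. This is an alternative sketch, not the script's method; `parse_tag_string`, `find_by_attributes`, and the sample XML are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Attribute names may contain '-', ':' and '.', values must be double-quoted
ATTR_RE = re.compile(r'([\w:.-]+)\s*=\s*"([^"]*)"')


def parse_tag_string(element_tag):
    """Split '<Name a="1" b="2">' into ('Name', {'a': '1', 'b': '2'})."""
    match = re.match(r'<\s*([^>\s/]+)([^>]*)>', element_tag)
    if not match:
        return None, {}
    return match.group(1), dict(ATTR_RE.findall(match.group(2)))


def find_by_attributes(root, element_tag):
    """Return the first element whose tag and attributes equal the tag string."""
    tag_name, wanted = parse_tag_string(element_tag)
    if tag_name is None:
        return None
    for element in root.iter(tag_name):
        if element.attrib == wanted:  # dict equality ignores attribute order
            return element
    return None


root = ET.fromstring(
    '<Set><Track Id="1" Kind="Midi"/>'
    '<Track Kind="Audio" Id="2"><Name/></Track></Set>'
)
found = find_by_attributes(root, '<Track Id="2" Kind="Audio">')
print(ET.tostring(found, encoding="unicode"))
```

Comparing dictionaries avoids rebuilding and re-parsing a string for every candidate, at the cost of diverging from the string-based behavior described in this document.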

### Tag Name Extraction

The script extracts tag names using regular expressions:

```python
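import re


def extract_tag_name(element_tag):
    # Capture the characters after '<' up to the first whitespace or '>'
    match = re.match(r'<\s*([^>\s]+)', element_tag)
    if match:
        return match.group(1)
    return None  # input does not start with '<'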
```

**Examples:**
- `<InstrumentVector Id="0">` → `InstrumentVector`
- `<MidiTrack>` → `MidiTrack`
- `<ComplexTag attr1="value1" attr2="value2">` → `ComplexTag`
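
One edge case is worth checking by hand: with the pattern above, a self-closing tag string such as `<MidiTrack/>` yields `MidiTrack/`, because `/` is neither whitespace nor `>`. The sketch below shows this next to a hypothetical variant, `extract_tag_name_strict`, that also stops at the slash. Whether the script needs this depends on how it is called, which this document does not specify.

```python
import re


def extract_tag_name(element_tag):
    """Pattern from this document: stops at whitespace or '>'."""
    match = re.match(r'<\s*([^>\s]+)', element_tag)
    return match.group(1) if match else None


def extract_tag_name_strict(element_tag):
    """Hypothetical variant that also stops at '/', so '<Tag/>' gives 'Tag'."""
    match = re.match(r'<\s*([^>\s/]+)', element_tag)
    return match.group(1) if match else None


samples = ('<InstrumentVector Id="0">', '<MidiTrack>', '<MidiTrack/>', 'not a tag')
for tag in samples:
    print(repr(tag), extract_tag_name(tag), extract_tag_name_strict(tag))
```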

### Tag Normalization

Both tags are normalized before comparison so that differences in attribute order and whitespace do not affect the match:

```python
def normalize_tag(tag):
    # Parse the tag to normalize it
    match = re.match(r'<\s*([^>\s]+)(.*)>', tag)
    if not match:
        return tag

    tag_name = match.group(1)
    attrs_str = match.group(2).strip()

    # Parse attributes and sort them for consistent comparison.
    # Only double-quoted values and \w+ names (no '-', ':' or '.') are recognized.
    attrs = {}
    for attr_match in re.finditer(r'(\w+)\s*=\s*"([^"]*)"', attrs_str):
        attr_name = attr_match.group(1)
        attr_value = attr_match.group(2)
        attrs[attr_name] = attr_value

    # Reconstruct with sorted attributes
    if attrs:
        sorted_attrs = ' '.join(f'{k}="{v}"' for k, v in sorted(attrs.items()))
        return f'<{tag_name} {sorted_attrs}>'
    else:
        return f'<{tag_name}>'
```
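
The effect of normalization is easiest to see on a few inputs. The block below is a compact restatement of the function above (same regular expressions) so that it runs on its own. The sample tags are made up for illustration.

```python
import re


def normalize_tag(tag):
    """Compact restatement of the normalization above, for experimentation."""
    match = re.match(r'<\s*([^>\s]+)(.*)>', tag)
    if not match:
        return tag
    attrs = dict(re.findall(r'(\w+)\s*=\s*"([^"]*)"', match.group(2)))
    if not attrs:
        return f'<{match.group(1)}>'
    joined = ' '.join(f'{k}="{v}"' for k, v in sorted(attrs.items()))
    return f'<{match.group(1)} {joined}>'


# Attribute order and spacing are ignored
print(normalize_tag('<Tag  b="2"   a="1" >') == normalize_tag('<Tag a="1" b="2">'))

# Pitfall: single-quoted attributes are not recognized, so they are dropped
print(normalize_tag("<Tag a='1'>"))

# Attribute values remain case-sensitive
print(normalize_tag('<Tag a="X">') == normalize_tag('<Tag a="x">'))
```

The single-quote case means a tag string written with `'...'` around its values normalizes to a bare `<Tag>`, which will only match elements that have no attributes.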

## Troubleshooting Guide

### Common Issues and Solutions

#### 1. No Matching Element Found

**Symptoms:**
- Empty destination file
- "No matching element found or extraction failed" error

**Possible Causes:**
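- Tag name misspelled or in the wrong case (XML names are case-sensitive)
- Whitespace differences inside attribute values
- Attribute names or values that differ from the source element, including attributes that are missing from or extra in the tag string
- Special characters in the tag that need escaping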

**Solutions:**
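- Verify the exact tag spelling and case
- Check that every attribute name and value matches the source element exactly
- Write attribute values in double quotes, and quote the whole tag string when passing it on the command line
- Confirm that the source file actually contains the expected element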
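
When a match fails, it often helps to list what the file actually contains for that tag name. The helper below is a hypothetical diagnostic, not part of the script; it uses an in-memory sample instead of a real file.

```python
import xml.etree.ElementTree as ET


def list_candidates(root, tag_name, limit=10):
    """Return up to `limit` (tag, attributes) pairs for elements named tag_name."""
    candidates = []
    for element in root.iter(tag_name):
        candidates.append((element.tag, dict(element.attrib)))
        if len(candidates) >= limit:
            break
    return candidates


root = ET.fromstring(
    '<Set><InstrumentVector Id="0"/><InstrumentVector Id="1" Type="Audio"/></Set>'
)
for tag, attrs in list_candidates(root, "InstrumentVector"):
    print(tag, attrs)

# Tag names are case-sensitive, so the wrong case finds nothing
print(list_candidates(root, "instrumentvector"))
```

Comparing the printed attribute dictionaries with the tag string you passed usually reveals the mismatch directly.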

#### 2. Multiple Elements Extracted

**Symptoms:**
- Output contains more than one element
- Unexpected content in destination file

**Possible Causes:**
- Source file has nested identical elements
- Closing tag matching is ambiguous

**Solutions:**
- The script returns only the first matching element, which should prevent this
- If the issue persists, inspect the source file structure for nested elements with the same tag
- Add more specific attributes to the opening tag
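
The nested case can be reproduced with a small sample. In this sketch (sample XML invented for illustration), two `Group` elements share the same opening tag; `iter()` walks in document order, so the outer one is the first match, and serializing it necessarily includes the inner one.

```python
import xml.etree.ElementTree as ET

xml_text = '<Root><Group Id="0"><Group Id="0"><Leaf/></Group></Group></Root>'
root = ET.fromstring(xml_text)

# iter() walks in document order, so the outer Group comes first
matches = [el for el in root.iter("Group") if el.get("Id") == "0"]
first = matches[0]

print(len(matches))
print(ET.tostring(first, encoding="unicode"))
```

Output that appears to contain "more than one element" may therefore be a single element with identically tagged descendants.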

#### 3. Special Characters in Tags

**Symptoms:**
- Shell quoting or syntax errors when invoking the script (sed syntax errors indicate the older sed-based version)
- Failed pattern matching

**Common Special Characters:**
- Quotes (single or double)
- Ampersands (&)
- Greater/less than signs within attributes
- Unicode characters

**Solutions:**
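- Quote the element tag when calling the script (single quotes around the whole tag work well when its attribute values use double quotes)
- Escape characters the shell would otherwise interpret, such as `$`, backticks, or embedded quotes
- Use the same character encoding as the source file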
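
A common source of confusion is that the parser decodes entities: `&amp;` in the file becomes `&` in the parsed attribute value. The sketch below, using a made-up `Clip` element, shows both forms and how to convert between them with `xml.sax.saxutils`.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, unescape

root = ET.fromstring('<Set><Clip Name="Bass &amp; Drums &lt;v2&gt;"/></Set>')
clip = root.find("Clip")

parsed_value = clip.get("Name")     # decoded by the parser
source_form = escape(parsed_value)  # re-escaped, as written in the file

print(parsed_value)
print(source_form)
print(unescape(source_form) == parsed_value)
```

If a tag string copied from the file contains entities, it may need unescaping before it can equal a tag reconstructed from parsed attributes.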

#### 4. File Permission Issues

**Symptoms:**
- "Source file does not exist" error
- "Source file is not readable" error

**Solutions:**
- Verify file path is correct
- Check file permissions: `ls -la source.xml`
- Ensure read permissions: `chmod +r source.xml`
- Check directory permissions if file is in subdirectory
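
The same checks can be made from Python before parsing. This is a sketch under the assumption that the script reports errors like those listed above; `check_source_file` and the "not a regular file" message are hypothetical.

```python
import os
import tempfile


def check_source_file(path):
    """Return an error message for an unusable source path, or None if usable."""
    if not os.path.exists(path):
        return f"Source file does not exist: {path}"
    if not os.path.isfile(path):
        return f"Source path is not a regular file: {path}"
    if not os.access(path, os.R_OK):
        return f"Source file is not readable: {path}"
    return None


# Demonstrate with a temporary file instead of a real source.xml
with tempfile.NamedTemporaryFile("w", suffix=".xml", delete=False) as handle:
    handle.write("<Root/>")

print(check_source_file(handle.name))
print(check_source_file(handle.name + ".missing"))
os.remove(handle.name)
```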

### Debugging Tips

#### Test XML Patterns Manually

Before using the script, test XML patterns:

```bash
# Test opening tag detection
grep -n "<InstrumentVector" source.xml

# Test element lookup with Python (inline script)
python3 -c "
import xml.etree.ElementTree as ET
tree = ET.parse('source.xml')
for elem in tree.iter('InstrumentVector'):
    if elem.get('Id') == '0':
        print('Found element with Id=0')
        break
"

# Test tag name extraction
echo '<InstrumentVector Id="0">' | python3 -c "
import sys, re
line = sys.stdin.read().strip()
match = re.match(r'<\s*([^>\s]+)', line)
if match: print(match.group(1))
"
```

#### Validate XML Structure

```bash
# Check if XML is well-formed
xmllint --noout source.xml

# Pretty-print XML to understand structure
xmllint --format source.xml | head -50
```

#### Check Element Count

```bash
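# Count lines containing the opening and closing tags
# (grep -c counts matching lines, not individual occurrences, and the first
# pattern also matches longer names that start with "InstrumentVector")
grep -c "<InstrumentVector" source.xml
grep -c "</InstrumentVector>" source.xml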
```
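
For an exact count, the parser can tally elements by tag name. A minimal sketch with an in-memory sample; `count_tags` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def count_tags(root):
    """Count elements by exact tag name, including the root itself."""
    return Counter(element.tag for element in root.iter())


root = ET.fromstring(
    '<Set><InstrumentVector Id="0"/><InstrumentVector Id="1"/>'
    '<InstrumentVectorExtra/></Set>'
)
counts = count_tags(root)
print(counts["InstrumentVector"])
print(counts.most_common())
```

Unlike a substring search, this does not count `InstrumentVectorExtra` as an `InstrumentVector`.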

## Advanced Usage

### Customizing for Specific XML Structures

For complex XML structures, you can extend the Python script or use additional Python logic:

**Nested Elements:**
```python
# Extract only direct children, not nested ones
import xml.etree.ElementTree as ET

tree = ET.parse('source.xml')
parent = tree.find('.//Parent')
# Compare with None: an Element without children evaluates as false
if parent is not None:
    for child in parent:
        if child.tag == 'Child':
            print(ET.tostring(child).decode())
```

**Multiple Attributes:**
```python
# More specific matching with multiple attributes
import xml.etree.ElementTree as ET

tree = ET.parse('source.xml')
for elem in tree.iter('InstrumentVector'):
    if elem.get('Id') == '0' and elem.get('Type') == 'Audio':
        print(ET.tostring(elem).decode())
        break
```

**Conditional Extraction:**
```python
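# Extract only if the element has a SpecificContent descendant
# (findtext returns None when it is absent, '' when present but empty)
import xml.etree.ElementTree as ET

tree = ET.parse('source.xml')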
for elem in tree.iter('InstrumentVector'):
    if elem.findtext('.//SpecificContent') is not None:
        print(ET.tostring(elem).decode())
        break
```

### Performance Considerations

For large XML files:
- Consider using XML-specific tools like `xmlstarlet`
- Process files in chunks if memory is limited
- Use more specific patterns to reduce processing time
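
Within the standard library, `xml.etree.ElementTree.iterparse` is one way to process a file incrementally. The sketch below stops at the first match and clears elements that finished before the match began. `find_first_streaming` is a hypothetical helper, not the script's implementation, and it matches on a single attribute for brevity.

```python
import io
import xml.etree.ElementTree as ET


def find_first_streaming(source, tag_name, attr_name, attr_value):
    """Stream through `source`; return the first matching element, or None."""
    target = None
    for event, element in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            # Attributes are already available when the start event fires
            if (target is None and element.tag == tag_name
                    and element.get(attr_name) == attr_value):
                target = element
        elif element is target:
            return element  # end of the match: children are fully built
        elif target is None:
            element.clear()  # finished before any match started; free memory
    return None


source = io.BytesIO(
    b'<Set><InstrumentVector Id="0"><Device Name="A"/></InstrumentVector>'
    b'<InstrumentVector Id="1"><Device Name="B"/></InstrumentVector></Set>'
)
found = find_first_streaming(source, "InstrumentVector", "Id", "1")
print(ET.tostring(found, encoding="unicode"))
```

Elements are cleared only while no match is in progress, so the children of the returned element are left intact.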

## Alternative Tools

For more complex XML processing needs:
- `xmlstarlet`: XML-specific command-line tool with XPath support
- `xmllint`: More robust XML processing and validation
- Python with `lxml`: For advanced XML manipulation and XPath
- `xpath`: Command-line XPath-based extraction
- Python `BeautifulSoup`: For HTML/XML parsing with tolerance for malformed documents