zhongwei/gh-krfantasy-alsdiff-plugins-xml-element-extractor

Zhongwei Li d6a8021492 Initial commit

2025-11-30 08:35:51 +08:00

XML Element Extraction - Python Implementation

This reference document explains the XML element extraction process used in the Python-based extraction script and provides troubleshooting guidance. The Python implementation provides better compatibility and handles special characters more reliably than the previous sed-based approach.

Core XML Processing Logic

Python ElementTree Parsing

The Python script uses the standard library xml.etree.ElementTree for robust XML parsing:

import xml.etree.ElementTree as ET

# Parse the XML file
tree = ET.parse(source_file)
root = tree.getroot()

Key advantages:

Proper XML parsing that understands structure and encoding
Handles special characters, quotes, and XML entities correctly
Provides structured access to XML elements and attributes
Better error handling and validation
Cross-platform compatibility using Python standard library
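As a small illustration of the error-handling point above, the sketch below parses XML text and reports where parsing failed. The helper name `parse_xml_text` and the sample strings are hypothetical and are not taken from the extraction script.

```python
import xml.etree.ElementTree as ET

def parse_xml_text(xml_text):
    """Return (root, None) on success, or (None, message) if the text is not well-formed."""
    try:
        return ET.fromstring(xml_text), None
    except ET.ParseError as exc:
        # ParseError carries the (line, column) position of the failure
        line, column = exc.position
        return None, f"not well-formed at line {line}, column {column}: {exc}"

# Well-formed input: the root element is returned
ok_root, ok_error = parse_xml_text('<Tracks><MidiTrack Id="1"/></Tracks>')
print(ok_root.tag, ok_error)

# Missing closing tag for MidiTrack: an error message is returned instead
bad_root, bad_error = parse_xml_text('<Tracks><MidiTrack Id="1"></Tracks>')
print(bad_root, bad_error)
```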

Tag Matching Algorithm

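The script finds an element by rebuilding each candidate's opening tag from its parsed attributes and comparing it, after normalization, with the tag string passed in. The first element whose normalized tag matches is returned: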

def find_element_by_tag_string(tree_root, element_tag):
    tag_name = extract_tag_name(element_tag)
    if tag_name is None:
        return None  # element_tag does not look like an opening tag

    # Find all elements with the matching tag name
    for element in tree_root.iter(tag_name):
        # Reconstruct the opening tag string with attributes
        if element.attrib:
            attrs = ' '.join(f'{k}="{v}"' for k, v in sorted(element.attrib.items()))
            constructed_tag = f'<{tag_name} {attrs}>'
        else:
            constructed_tag = f'<{tag_name}>'

        # Compare normalized tags so attribute order and spacing do not matter
        # (tag names, attribute names and values remain case-sensitive)
        if normalize_tag(constructed_tag) == normalize_tag(element_tag):
            return element

    return None  # no element matched

Tag Name Extraction

The script extracts tag names using regular expressions:

import re

def extract_tag_name(element_tag):
    # Capture everything after '<' up to the first whitespace or '>'
    match = re.match(r'<\s*([^>\s]+)', element_tag)
    if match:
        return match.group(1)
    return None

Examples:

<InstrumentVector Id="0"> → InstrumentVector
<MidiTrack> → MidiTrack
<ComplexTag attr1="value1" attr2="value2"> → ComplexTag
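The pattern above stops only at whitespace or `>`, so a self-closing tag written without a space keeps its trailing slash. The sketch below shows that edge case next to a stricter variant. `extract_tag_name_strict` is a hypothetical helper for illustration, not part of the script.

```python
import re

def extract_tag_name(element_tag):
    # Same pattern as in the section above
    match = re.match(r'<\s*([^>\s]+)', element_tag)
    return match.group(1) if match else None

def extract_tag_name_strict(element_tag):
    # Hypothetical variant: also stop at '/', so '<MidiTrack/>' yields 'MidiTrack'
    match = re.match(r'<\s*([^>\s/]+)', element_tag)
    return match.group(1) if match else None

# Compare both helpers on ordinary, self-closing, and malformed inputs
for tag in ['<MidiTrack>', '<MidiTrack/>', '<MidiTrack Id="3" />', 'MidiTrack']:
    print(repr(tag), extract_tag_name(tag), extract_tag_name_strict(tag))
```

If the tag string is copied from a self-closing element, passing it in the opening-tag form (`<MidiTrack>`) avoids the issue without changing the pattern.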

Tag Normalization

To handle different attribute orders and whitespace variations:

def normalize_tag(tag):
    # Parse the tag to normalize it
    match = re.match(r'<\s*([^>\s]+)(.*)>', tag)
    if not match:
        return tag

    tag_name = match.group(1)
    attrs_str = match.group(2).strip()

    # Parse attributes and sort them for consistent comparison
    attrs = {}
    for attr_match in re.finditer(r'(\w+)\s*=\s*"([^"]*)"', attrs_str):
        attr_name = attr_match.group(1)
        attr_value = attr_match.group(2)
        attrs[attr_name] = attr_value

    # Reconstruct with sorted attributes
    if attrs:
        sorted_attrs = ' '.join(f'{k}="{v}"' for k, v in sorted(attrs.items()))
        return f'<{tag_name} {sorted_attrs}>'
    else:
        return f'<{tag_name}>'
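To see how the three functions fit together, here is a self-contained sketch that repeats them in compact form and runs them against a small in-memory document. The sample XML and the variable names are assumptions made for illustration. The real script reads from a file.

```python
import re
import xml.etree.ElementTree as ET

def extract_tag_name(element_tag):
    match = re.match(r'<\s*([^>\s]+)', element_tag)
    return match.group(1) if match else None

def normalize_tag(tag):
    match = re.match(r'<\s*([^>\s]+)(.*)>', tag)
    if not match:
        return tag
    # Collect name="value" pairs, then emit them in sorted order
    attrs = dict(re.findall(r'(\w+)\s*=\s*"([^"]*)"', match.group(2)))
    if attrs:
        joined = ' '.join(f'{k}="{v}"' for k, v in sorted(attrs.items()))
        return f'<{match.group(1)} {joined}>'
    return f'<{match.group(1)}>'

def find_element_by_tag_string(tree_root, element_tag):
    tag_name = extract_tag_name(element_tag)
    if tag_name is None:
        return None
    for element in tree_root.iter(tag_name):
        attrs = ' '.join(f'{k}="{v}"' for k, v in sorted(element.attrib.items()))
        constructed = f'<{tag_name} {attrs}>' if attrs else f'<{tag_name}>'
        if normalize_tag(constructed) == normalize_tag(element_tag):
            return element
    return None

SAMPLE = '''<Devices>
  <InstrumentVector Id="0" Type="Audio"><Name>First</Name></InstrumentVector>
  <InstrumentVector Id="1" Type="Midi"><Name>Second</Name></InstrumentVector>
</Devices>'''
root = ET.fromstring(SAMPLE)

# Attribute order and spacing differ from the source, but the tag still matches
found = find_element_by_tag_string(root, '<InstrumentVector  Type="Midi" Id = "1">')
print(found.findtext('Name'))

# A subset of the attributes is not enough: the full attribute set must match
missing = find_element_by_tag_string(root, '<InstrumentVector Id="1">')
print(missing)
```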

Troubleshooting Guide

Common Issues and Solutions

1. No Matching Element Found

Symptoms:

Empty destination file
"No matching element found or extraction failed" error

Possible Causes:

Incorrect tag spelling or case sensitivity
Missing or extra whitespace in tag
Attributes don't match exactly
Tag contains special characters needing escaping

Solutions:

Verify exact tag spelling and case
Check for exact attribute matching
Quote the whole tag string when passing it to the script so that attribute quotes and special characters arrive intact
Validate source file contains the expected element
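When a tag does not match, it can help to list the opening tags that actually exist for that tag name and compare them with the string being passed in. The following is a minimal sketch. `list_opening_tags` is a hypothetical helper and the sample document is made up.

```python
import xml.etree.ElementTree as ET

def list_opening_tags(root, tag_name, limit=10):
    """Return reconstructed opening tags for the first `limit` elements named tag_name."""
    tags = []
    for element in root.iter(tag_name):
        attrs = ' '.join(f'{k}="{v}"' for k, v in sorted(element.attrib.items()))
        tags.append(f'<{tag_name} {attrs}>' if attrs else f'<{tag_name}>')
        if len(tags) >= limit:
            break
    return tags

# For a real file, use ET.parse('source.xml').getroot() instead
root = ET.fromstring('<Set><Track Id="1" Name="Bass"/><Track Id="2"/><Clip/></Set>')
candidates = list_opening_tags(root, 'Track')
for tag in candidates:
    print(tag)
```

Any of the printed strings can be passed to the script as-is, since they contain the complete attribute set of an existing element.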

2. Multiple Elements Extracted

Symptoms:

Output contains more than one element
Unexpected content in destination file

Possible Causes:

Source file has nested identical elements
Closing tag matching is ambiguous

Solutions:

The script returns only the first matching element in document order; a nested element with the same tag is still included as part of that element's content
If issue persists, check source file structure
Consider using more specific attributes in opening tag
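The nested case can be reproduced in a few lines. This sketch, using a made-up document, shows that `iter()` yields the outer element first and that serializing it necessarily includes the inner element with the same tag.

```python
import xml.etree.ElementTree as ET

# Hypothetical document with a Group nested inside an identical Group
root = ET.fromstring(
    '<Set><Group Id="0"><Group Id="0"><Clip/></Group></Group></Set>'
)

# iter() walks the tree in document order, so the outer Group comes first
matches = [e for e in root.iter('Group') if e.get('Id') == '0']
print(len(matches))

# Serializing the outer element includes the inner one as part of its content
outer = matches[0]
serialized = ET.tostring(outer, encoding='unicode')
print(serialized.count('<Group'))
```

If the inner element is the one that is wanted, an attribute that distinguishes it from its parent is needed in the tag string.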

3. Special Characters in Tags

Symptoms:

XML parse errors (or sed syntax errors if the earlier sed-based script is still in use)
Failed pattern matching

Common Special Characters:

Quotes (single or double)
Ampersands (&)
Greater/less than signs within attributes
Unicode characters

Solutions:

Properly quote the element tag when calling the script
Escape special characters if needed
Use exact character encoding from source file
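One cause of failed matches with special characters is entity decoding: ElementTree returns attribute values with entities already decoded, so a tag rebuilt from parsed attributes contains `&` where the source file contains `&amp;`. The sketch below shows both forms, assuming the script compares against decoded values as the matching code above suggests. The sample data is hypothetical.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import quoteattr

root = ET.fromstring('<Set><Track Name="Drums &amp; Bass" Id="7"/></Set>')
track = root.find('Track')

# The parser hands back the decoded value, not the raw source text
decoded = track.get('Name')
print(decoded)

# quoteattr() re-escapes and quotes a value the way it appears in the source file
raw_form = f'<Track Id="7" Name={quoteattr(decoded)}>'
print(raw_form)
```

If a tag copied verbatim from the source file does not match, trying the decoded form of the attribute value is a reasonable next step.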

4. File Permission Issues

Symptoms:

"Source file does not exist" error
"Source file is not readable" error

Solutions:

Verify file path is correct
Check file permissions: ls -la source.xml
Ensure read permissions: chmod +r source.xml
Check directory permissions if file is in subdirectory
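The same checks can be done from Python before parsing. This is a sketch only: `check_source_file` is a hypothetical helper, and its messages imitate the errors listed above rather than reproducing the script's exact output.

```python
import os

def check_source_file(path):
    """Return a list of problems that would prevent reading `path` (empty if none)."""
    problems = []
    if not os.path.exists(path):
        problems.append('Source file does not exist')
    elif not os.path.isfile(path):
        problems.append('Source path is not a regular file')
    elif not os.access(path, os.R_OK):
        problems.append('Source file is not readable')
    return problems

# An empty list means the file exists and is readable by the current user
print(check_source_file('source.xml'))
```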

Debugging Tips

Test XML Patterns Manually

Before using the script, test XML patterns:

# Test opening tag detection
grep -n "<InstrumentVector" source.xml

# Test element extraction with Python (inline test)
python3 -c "
import xml.etree.ElementTree as ET
tree = ET.parse('source.xml')
for elem in tree.iter('InstrumentVector'):
    if elem.get('Id') == '0':
        print('Found element with Id=0')
        break
"

# Test tag name extraction
echo '<InstrumentVector Id="0">' | python3 -c "
import sys, re
line = sys.stdin.read().strip()
match = re.match(r'<\s*([^>\s]+)', line)
if match: print(match.group(1))
"

Validate XML Structure

# Check if XML is well-formed
xmllint --noout source.xml

# Pretty-print XML to understand structure
xmllint --format source.xml | head -50

Check Element Count

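# Count lines containing the opening and the closing tag
# (grep -c counts matching lines, not occurrences, and "<InstrumentVector"
# also matches longer tag names that start with the same text)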
grep -c "<InstrumentVector" source.xml
grep -c "</InstrumentVector>" source.xml
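For an exact count that does not depend on line breaks or on tag names sharing a prefix, the elements can be counted after parsing. A minimal sketch, with a hypothetical helper `count_elements` and made-up sample data:

```python
import xml.etree.ElementTree as ET
from collections import Counter

def count_elements(root, tag_name, attribute=None):
    """Return (total, Counter of attribute values) for elements named tag_name."""
    elements = list(root.iter(tag_name))
    by_value = Counter(e.get(attribute) for e in elements) if attribute else Counter()
    return len(elements), by_value

# Everything is on one line here, and InstrumentVectorList shares the same prefix
root = ET.fromstring(
    '<Set><InstrumentVector Id="0"/><InstrumentVector Id="0"/><InstrumentVectorList/></Set>'
)
total, by_id = count_elements(root, 'InstrumentVector', attribute='Id')
print(total, dict(by_id))
```

A value that appears more than once in the per-attribute count points to elements that the tag string alone cannot tell apart.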

Advanced Usage

Customizing for Specific XML Structures

For complex XML structures, you can extend the Python script or use additional Python logic:

Nested Elements:

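# Extract only direct children, not nested ones
import xml.etree.ElementTree as ET

tree = ET.parse('source.xml')
parent = tree.find('.//Parent')
# Compare with None: an Element without children evaluates as false
if parent is not None:
    # findall('Child') searches direct children only
    for child in parent.findall('Child'):
        print(ET.tostring(child).decode())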

Multiple Attributes:

# More specific matching with multiple attributes
import xml.etree.ElementTree as ET

tree = ET.parse('source.xml')
for elem in tree.iter('InstrumentVector'):
    if elem.get('Id') == '0' and elem.get('Type') == 'Audio':
        print(ET.tostring(elem).decode())
        break

Conditional Extraction:

# Extract only if the element has a SpecificContent descendant
import xml.etree.ElementTree as ET

tree = ET.parse('source.xml')
for elem in tree.iter('InstrumentVector'):
    if elem.find('.//SpecificContent') is not None:
        print(ET.tostring(elem).decode())
        break

Performance Considerations

For large XML files:

Consider using XML-specific tools like xmlstarlet
Parse incrementally (for example with ET.iterparse) instead of loading the whole tree if memory is limited
Use more specific patterns to reduce processing time
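For the memory-limited case, one option in the standard library is `ET.iterparse`, which yields elements as they are read. The sketch below stops at the first match and clears elements that lie outside any candidate. `find_first_streaming` is a hypothetical helper, not part of the script, and it matches on a single attribute rather than the full tag string.

```python
import io
import xml.etree.ElementTree as ET

def find_first_streaming(source, tag_name, attr_name, attr_value):
    """Return the first matching element serialized as text, or None."""
    depth = 0  # number of open tag_name elements enclosing the current position
    for event, elem in ET.iterparse(source, events=('start', 'end')):
        if event == 'start':
            if elem.tag == tag_name:
                depth += 1
            continue
        if elem.tag == tag_name:
            depth -= 1
            if elem.get(attr_name) == attr_value:
                return ET.tostring(elem, encoding='unicode')
        if depth == 0:
            # Outside any candidate element: release content we no longer need
            elem.clear()
    return None

# A file path such as 'source.xml' works in place of the in-memory buffer
data = io.BytesIO(
    b'<Set><Track Id="1"><Name>A</Name></Track>'
    b'<Track Id="2"><Name>B</Name></Track></Set>'
)
result = find_first_streaming(data, 'Track', 'Id', '2')
print(result)
```

Children are kept while a candidate element is still open, so the returned text contains the complete element even though earlier siblings have been cleared.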

Alternative Tools

For more complex XML processing needs:

xmlstarlet: XML-specific command-line tool with XPath support
xmllint: More robust XML processing and validation
Python with lxml: For advanced XML manipulation and XPath
xpath: Command-line XPath-based extraction
Python BeautifulSoup: For HTML/XML parsing with tolerance for malformed documents
