Tuesday, September 13, 2011

lxml and removing nodes

The lxml library for Python is an effective tool for parsing and manipulating XML data. Among other things, it supports the W3C standards for Inclusive and Exclusive Canonicalization, which handle the messy details of adjusting namespaces as you extract sections of a document.

XML is an inherently difficult data structure to manipulate. Whitespace, carriage returns, and newlines make a big difference when validating signatures or digest values. If you accidentally drop a character during text manipulation or apply the wrong canonicalization, the SHA hash changes and you can no longer verify the signature of the data.

One of the idiosyncrasies of the lxml library, described best in this lxml document, is that the internal data structures are stored as Element objects with a .text and a .tail property. The .text property holds the text inside the tag, up to the first child element, while the .tail property holds the text that follows the element's closing tag, up to the next tag. This differs from the DOM model, where the text after an element is a separate text node that belongs to the parent. For example, consider this XML structure:

<a>aTEXT
  <b>bTEXT</b>bTAIL
</a>aTAIL

This can be represented with the following lxml code:

from lxml import etree

a = etree.Element('a')
a.text = "aTEXT"
a.tail = "aTAIL"

b = etree.SubElement(a, 'b')
b.text = "bTEXT"
b.tail = "bTAIL"
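Going the other way, parsing the example document exactly as written shows where the whitespace ends up. This is a small sketch, not part of the original experiment. The try/except import is an assumption made only so that the snippet also runs where lxml is not installed, since the standard library's ElementTree uses the same .text/.tail model.

```python
try:
    from lxml import etree
except ImportError:  # assumption: fall back to the stdlib API if lxml is missing
    import xml.etree.ElementTree as etree

# The example document, parsed exactly as written, indentation included.
root = etree.fromstring("<a>aTEXT\n  <b>bTEXT</b>bTAIL\n</a>")
b = root[0]

a_text = root.text   # text before the first child, including the line break
b_text = b.text      # text inside <b>
b_tail = b.tail      # text after </b>, up to </a>

print(repr(a_text))  # 'aTEXT\n  '
print(repr(b_text))  # 'bTEXT'
print(repr(b_tail))  # 'bTAIL\n'
```

The line break and indentation become part of a.text and b.tail, which is exactly the kind of detail that changes a digest value.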

What happens if you remove the 'b' node? Ideally, the text inside the 'b' tag disappears, while bTAIL moves up into the parent. The structure would then look like the following:

<a>aTEXTbTAIL</a>aTAIL

The command to remove the lxml node would be:
a.remove(b)

After making this change, however, the output in lxml v2.3 was: <a>aTEXT</a>aTAIL
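The behavior can be reproduced in a few lines. This is a minimal sketch; the import fallback to the standard library is an assumption added so it runs without lxml, and both libraries drop the tail in the same way here.

```python
try:
    from lxml import etree
except ImportError:  # assumption: fall back to the stdlib API if lxml is missing
    import xml.etree.ElementTree as etree

a = etree.Element("a")
a.text = "aTEXT"
a.tail = "aTAIL"

b = etree.SubElement(a, "b")
b.text = "bTEXT"
b.tail = "bTAIL"

a.remove(b)

serialized = etree.tostring(a)
print(serialized)  # b'<a>aTEXT</a>aTAIL'
print(b.tail)      # bTAIL (the tail left the tree together with <b>)
```

The tail text is not destroyed; it stays attached to the removed element, so it is still available if you want to put it back somewhere.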

To understand what was going on, I had to download the lxml source, install the Cython library that converts the .pyx code into C bindings, recompile, and link the new etree.so binary. If you're curious, the instructions for doing so are posted here.

Upon inspecting etree.pyx, I noticed that the code to move the tail ran after the node was unlinked. What we really want is for the tail to be moved before the node is unlinked. Otherwise, the information about the tail could be lost along with the node, which may explain why the tail was never copied.

def remove(self, _Element element not None):
-        tree.xmlUnlinkNode(c_node)
         _moveTail(c_next, c_node)
+        tree.xmlUnlinkNode(c_node)

Examining the _moveTail code also reveals something interesting. The .tail is represented internally by XML text nodes that are siblings of the current node (reached through the .next pointer). The .text is also stored as XML text nodes, but these appear to be children of the node. The loop traverses the linked list of siblings because there can be several consecutive text nodes, which could happen if multiple subelements were removed and you were left with a chain of .tail text nodes.

cdef void _moveTail(xmlNode* c_tail, xmlNode* c_target):
    cdef xmlNode* c_next
    # tail support: look for any text nodes trailing this node and
    # move them too
    c_tail = _textNodeOrSkip(c_tail)
    while c_tail is not NULL:
        c_next = _textNodeOrSkip(c_tail.next)
        tree.xmlUnlinkNode(c_tail)
        tree.xmlAddNextSibling(c_target, c_tail)
        c_target = c_tail
        c_tail = c_next
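To make the pointer juggling easier to follow, here is a toy model of that loop in plain Python. It is only a sketch under loud assumptions: Node, link, unlink, add_next_sibling, text_node_or_skip and contents are hypothetical stand-ins written for this illustration, not lxml or libxml2 code, and the real helpers may treat some node types specially. It runs the loop in both orders on a sibling list made of an element, two text nodes, and another element.

```python
class Node:
    """Toy stand-in for a libxml2 node: a kind, some content, sibling links."""

    def __init__(self, kind, content):
        self.kind = kind        # "element" or "text"
        self.content = content
        self.prev = None
        self.next = None


def link(*nodes):
    """Chain the given nodes together as siblings, left to right."""
    for left, right in zip(nodes, nodes[1:]):
        left.next, right.prev = right, left


def unlink(node):
    """Detach node from its siblings (models tree.xmlUnlinkNode)."""
    if node.prev is not None:
        node.prev.next = node.next
    if node.next is not None:
        node.next.prev = node.prev
    node.prev = node.next = None


def add_next_sibling(target, node):
    """Insert node directly after target (models tree.xmlAddNextSibling)."""
    node.prev, node.next = target, target.next
    if target.next is not None:
        target.next.prev = node
    target.next = node


def text_node_or_skip(node):
    # Simplified assumption: return the node only if it is a text node.
    return node if node is not None and node.kind == "text" else None


def move_tail(tail, target):
    """Same loop as the quoted _moveTail, run on the toy nodes."""
    tail = text_node_or_skip(tail)
    while tail is not None:
        following = text_node_or_skip(tail.next)
        unlink(tail)
        add_next_sibling(target, tail)
        target = tail
        tail = following


def contents(node):
    """List the content of node and of every sibling after it."""
    out = []
    while node is not None:
        out.append(node.content)
        node = node.next
    return out


def build():
    """Sibling list: element b, two text nodes, then element c."""
    b, t1, t2, c = (Node("element", "b"), Node("text", "bTAIL"),
                    Node("text", "more"), Node("element", "c"))
    link(b, t1, t2, c)
    return b, t1, c


# Order 1: unlink the node, then move the trailing text after it.
b, t1, c = build()
after_b = b.next
unlink(b)
move_tail(after_b, b)
removed_1, remaining_1 = contents(b), contents(c)

# Order 2: move the trailing text first, then unlink the node.
b, t1, c = build()
move_tail(b.next, b)
unlink(b)
removed_2, remaining_2 = contents(b), contents(t1)

print(removed_1, remaining_1)  # ['b', 'bTAIL', 'more'] ['c']
print(removed_2, remaining_2)  # ['b'] ['bTAIL', 'more', 'c']
```

In this toy model, unlinking first makes the text nodes travel with the removed element, while moving first leaves them in the sibling list. That mirrors the difference between dropping the tail and keeping it in the parent.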

After I fixed this code, test_xinclude_text started failing. If I reverted to the original etree.pyx and recompiled, the test passed. Another unusual aspect was the call to self.include(), which appeared to be overridden depending on whether the test exercised the native implementation of the xinclude() routine or the Python-based version in ElementInclude.py, which allows external URLs to be referenced.

def test_xinclude_text(self):
        filename = fileInTestDir('test_broken.xml')
        root = etree.XML(_bytes('''\
        <doc xmlns:xi="http://www.w3.org/2001/XInclude">
          <xi:include href="%s" parse="text"/>
        </doc>
        ''' % filename))
        old_text = root.text
        content = read_file(filename)
        old_tail = root[0].tail

        self.include( etree.ElementTree(root) )
        self.assertEquals(old_text + content + old_tail,
                          root.text)

test_xinclude_text() verifies that one can use <xi:include> directives to incorporate other files into an XML document. When such a tag is found, the contents of the referenced file are read (in this case, test_broken.xml) and the entire node is replaced with that text. The text is appended to the parent node's .text property, and the <xi:include> element is removed.

It appears that the code in ElementInclude.py masked this issue by appending the tail to the parent's text itself before removing the element:

@@ -204,7 +204,8 @@ def _include(elem, loader=None, _parent_hrefs=None, base_url=None):
                 elif parent is None:
                     return text # replaced the root node!
                 else:
-                    parent.text = (parent.text or "") + text + (e.tail or "")
+                    parent.text = (parent.text or "") + text 
                 parent.remove(e)

The entire pull request for this fix is located here:

https://github.com/lxml/lxml/pull/14/files

Update on this PR:

Note that this is a deliberate design choice. It will not change.

http://lxml.de/FAQ.html#what-about-that-trailing-text-on-serialised-elements

http://lxml.de/tutorial.html#elements-contain-text

In other words, if you remove a subelement, you have to take care of the .tail yourself and move it to the right element. The lxml library will not change this behavior, so the pull request was rejected.
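One way to take care of the tail by hand is sketched below. The helper name remove_keep_tail is hypothetical, made up for this illustration rather than taken from lxml, and the import fallback is an assumption so that the snippet also runs on the standard library's ElementTree. The idea is to append the child's tail to the previous sibling's tail or, if there is no previous sibling, to the parent's text, and only then remove the child.

```python
try:
    from lxml import etree
except ImportError:  # assumption: fall back to the stdlib API if lxml is missing
    import xml.etree.ElementTree as etree


def remove_keep_tail(parent, child):
    """Remove child from parent, but keep child's tail text in the tree."""
    tail = child.tail
    if tail:
        index = list(parent).index(child)
        if index > 0:
            # The tail follows the previous sibling once child is gone.
            previous = parent[index - 1]
            previous.tail = (previous.tail or "") + tail
        else:
            # No previous sibling: the tail joins the parent's own text.
            parent.text = (parent.text or "") + tail
        child.tail = None  # avoid carrying a duplicate copy on the removed node
    parent.remove(child)


# Case 1: the removed element is the first child.
a = etree.fromstring("<a>aTEXT<b>bTEXT</b>bTAIL</a>")
remove_keep_tail(a, a[0])
first_result = etree.tostring(a)
print(first_result)  # b'<a>aTEXTbTAIL</a>'

# Case 2: the removed element has a previous sibling.
a2 = etree.fromstring("<a><x/>xTAIL<b/>bTAIL</a>")
remove_keep_tail(a2, a2[1])
print(a2[0].tail)  # xTAILbTAIL
```

Passing the parent explicitly keeps the helper independent of lxml-only methods such as getparent(). With lxml you could look the parent up from the child instead.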

5 comments:

  1. Hi, can you tell me how to remove HTML formatting tags such as 'i', 's' or 'em', but preserve the text?

    For example, this:
    1st i 1st b1st em
    should become
    1st i 1st b 1st em

    Greetings

  2. > In other words, if you remove a subelement, you have to take care of the .tail and move it to the right tag.

    Please tell me how to do this. I've tried several times, but it doesn't do what I want.

    Sorry to ask you again,
    afix

  3. I know this is old, but for anyone looking, use the drop_tree() method, documented on this page:

    http://lxml.de/lxmlhtml.html

    1. Unfortunately, the issue is not limited to just 'removing' or 'dropping subtrees'; it's a major pain when moving nodes/subtrees around.
      IMHO, it's a real shame because this issue renders a great (fast) library rather unusable -- unless I do what Roger did in this post to set things right for myself before I use it.

  4. It doesn't appear to remove the tail text.
