Iterating over text and elements in an lxml etree

Date: 2014-06-05 22:13:58

Tags: python lxml elementtree

Suppose I have the following XML document:

<species>
    Mammals: <dog/> <cat/>
    Reptiles: <snake/> <turtle/>
    Birds: <seagull/> <owl/>
</species>

Then I get the species element like this:

import lxml.etree
doc = lxml.etree.fromstring(xml)
species = doc.xpath('/species')[0]

Now I would like to print a list of the animals grouped by species. How can I do that with the ElementTree API?

2 answers:

Answer 0: (score: 6)

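If you enumerate all of the nodes, you'll see a text node holding the class name, followed by element nodes for the species in that class: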

>>> for node in species.xpath("child::node()"):
...     print type(node), node
... 
<class 'lxml.etree._ElementStringResult'> 
    Mammals: 
<type 'lxml.etree._Element'> <Element dog at 0xe0b3c0>
<class 'lxml.etree._ElementStringResult'>  
<type 'lxml.etree._Element'> <Element cat at 0xe0b410>
<class 'lxml.etree._ElementStringResult'> 
    Reptiles: 
<type 'lxml.etree._Element'> <Element snake at 0xe0b460>
<class 'lxml.etree._ElementStringResult'>  
<type 'lxml.etree._Element'> <Element turtle at 0xe0b4b0>
<class 'lxml.etree._ElementStringResult'> 
    Birds: 
<type 'lxml.etree._Element'> <Element seagull at 0xe0b500>
<class 'lxml.etree._ElementStringResult'>  
<type 'lxml.etree._Element'> <Element owl at 0xe0b550>
<class 'lxml.etree._ElementStringResult'> 

So you can build it up from there:

my_species = {}
current_class = None
for node in species.xpath("child::node()"):
    # Text results are str subclasses; the private class name differs
    # between Python/lxml versions, so test against str instead.
    if isinstance(node, str):
        text = node.strip(' \n\t:')
        if text:
            current_class = my_species.setdefault(text, [])
    elif isinstance(node, lxml.etree._Element):
        if current_class is not None:
            current_class.append(node.tag)
print(my_species)

The result:

{'Mammals': ['dog', 'cat'], 'Reptiles': ['snake', 'turtle'], 'Birds': ['seagull', 'owl']}

This is all fragile, though: a small change in how the text nodes are arranged could break the parsing.

Answer 1: (score: 2)

Design note

@tdelaney's answer is basically right, but I want to point out one nuance of the Python element tree API. Here is a quote from the lxml tutorial:

  

Elements can contain text:

<root>TEXT</root>
     

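In many XML documents (data-centric documents), this is the only place where text can be found. It is encapsulated by a leaf tag at the very bottom of the tree hierarchy.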

     

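However, if XML is used for tagged text documents such as (X)HTML, text can also appear between different elements, right in the middle of the tree: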

<html><body>Hello<br/>World</body></html>
     

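Here, the <br/> tag is surrounded by text. This is often referred to as document-style or mixed-content XML. Elements support this through their tail property. It contains the text that directly follows the element, up to the next element in the XML tree.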

     

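The two properties .text and .tail are enough to represent any text content in an XML document. This way, the ElementTree API does not require any special text nodes in addition to the Element class, which tend to get in the way fairly often (as you might know from classical DOM APIs).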
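To make the quoted passage concrete, here is a minimal sketch that inspects .text and .tail on the tutorial's own <br/> example. The variable names (html, body, br) are only illustrative, and the import falls back to the standard library on the assumption that lxml may not be installed.

```python
try:
    from lxml import etree
except ImportError:
    # Standard-library fallback; exposes the same .text/.tail properties
    from xml.etree import ElementTree as etree

html = etree.fromstring("<html><body>Hello<br/>World</body></html>")
body = html[0]  # first child of <html>
br = body[0]    # first child of <body>

print(repr(body.text))  # 'Hello'  (text before the first child)
print(repr(br.text))    # None     (<br/> is empty)
print(repr(br.tail))    # 'World'  (text directly after <br/>)
```

Note that "World" belongs to the <br/> element's tail rather than to <body>, which is why no separate text node is needed.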

Implementation

With these properties in mind, the document text can be retrieved without forcing the tree to output text nodes.

#!/usr/bin/env python3


import itertools
from pprint import pprint

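try:
  from lxml import etree
except ImportError:
  # cElementTree was removed in Python 3.9; since 3.3 ElementTree
  # uses the C accelerator automatically when it is available.
  from xml.etree import ElementTree as etree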


def textAndElement(node):
  '''In py33+ recursive generators are easy'''

  yield node

  text = node.text.strip() if node.text else None
  if text:
    yield text

  for child in node:
    yield from textAndElement(child)

  tail = node.tail.strip() if node.tail else None
  if tail:
    yield tail


if __name__ == '__main__':
  xml = '''
    <species>
      Mammals: <dog/> <cat/>
      Reptiles: <snake/> <turtle/>
      Birds: <seagull/> <owl/>
    </species>
  '''
  doc = etree.fromstring(xml)

  pprint(list(textAndElement(doc)))
  #[<Element species at 0x7f2c538727d0>,
  #'Mammals:',
  #<Element dog at 0x7f2c538728c0>,
  #<Element cat at 0x7f2c53872910>,
  #'Reptiles:',
  #<Element snake at 0x7f2c53872960>,
  #<Element turtle at 0x7f2c538729b0>,
  #'Birds:',
  #<Element seagull at 0x7f2c53872a00>,
  #<Element owl at 0x7f2c53872a50>]

  gen = textAndElement(doc)
  next(gen) # skip root
  groups = []
  for _, g in itertools.groupby(gen, type):
    groups.append(tuple(g))

  pprint(dict(zip(*[iter(groups)] * 2)) )
  #{('Birds:',): (<Element seagull at 0x7fc37f38aaa0>,
  #               <Element owl at 0x7fc37f38a820>),
  #('Mammals:',): (<Element dog at 0x7fc37f38a960>,
  #                <Element cat at 0x7fc37f38a9b0>),
  #('Reptiles:',): (<Element snake at 0x7fc37f38aa00>,
  #                <Element turtle at 0x7fc37f38aa50>)}
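The same .text/.tail idea can also produce the tag-name dictionary the question asks for, without XPath text nodes. The following is only a sketch under assumptions: group_by_label is a hypothetical helper name, labels are assumed to be the text before each run of child elements (trimmed of whitespace and colons), and comments or processing instructions among the children are not handled.

```python
try:
    from lxml import etree
except ImportError:
    from xml.etree import ElementTree as etree

XML = """<species>
    Mammals: <dog/> <cat/>
    Reptiles: <snake/> <turtle/>
    Birds: <seagull/> <owl/>
</species>"""


def group_by_label(parent):
    """Group child tag names under the text label that precedes them."""
    groups = {}
    current = None

    # The first label, if any, lives in the parent's .text
    label = (parent.text or "").strip(" \n\t:")
    if label:
        current = groups.setdefault(label, [])

    for child in parent:
        if current is not None:
            current.append(child.tag)
        # Every later label lives in the .tail of the preceding child
        label = (child.tail or "").strip(" \n\t:")
        if label:
            current = groups.setdefault(label, [])
    return groups


species = etree.fromstring(XML)
grouped = group_by_label(species)
print(grouped)
```

Like the XPath version, this depends on the labels and elements appearing in exactly this order, so it is just as sensitive to changes in the document layout.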