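I want to retrieve the content of a specific element in an XML file. However, that element contains other XML elements, and they prevent the parent tag's content from being extracted correctly. An example: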
xml = '''<?xml version='1.0' ?><test><claim-text><b>2</b>. A protective uniform for use by a person in combat or law enforcement, said uniform comprising: <claim-text>a. an upper body garment and a separate lower body garment</claim-text> <claim-text>b. a plurality of a ballistic and non-ballistic panels for attaching to the upper body garment and the lower body garment, and;</claim-text></claim-text></test>'''
context = etree.iterparse(StringIO(xml), events=('end',), tag='claim-text')
for event, element in context:
    print(element.text)
This results in:
a. an upper body garment and a separate lower body garment
b. a plurality of a ballistic and non-ballistic panels for attaching to the upper body garment and the lower body garment, and;
None
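However, the text "A protective uniform for use ..." is missing, for example. It seems that every claim-text element that contains other inner elements is skipped. How should I change the XML parsing to get all of the claims?
Thanks
I just solved the problem using a 'plain' SAX-style parser approach: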
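# Parser target with start/data/end/close callbacks; presumably passed to the
# parser with something like etree.XMLParser(target=SimpleXMLHandler()).
class SimpleXMLHandler(object):
    def __init__(self):
        self.buffer = ''   # text collected for the current claim
        self.claim = 0     # 1 while inside a <claim-text> element

    def start(self, tag, attributes):
        if tag == 'claim-text':
            if self.claim == 0:
                self.buffer = ''
            self.claim = 1

    def data(self, data):
        if self.claim == 1:
            self.buffer += data

    def end(self, tag):
        if tag == 'claim-text':
            print(self.buffer)
            self.claim = 0

    def close(self):
        pass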
Answer 0 (score: 3)
You can use XPath to find and join all of the text nodes directly under each <claim-text> node, like this:
from StringIO import StringIO
from lxml import etree
xml = '''<?xml version='1.0' ?><test><claim-text><b>2</b>. A protective uniform for use by a person in combat or law enforcement, said uniform comprising: <claim-text>a. an upper body garment and a separate lower body garment</claim-text> <claim-text>b. a plurality of a ballistic and non-ballistic panels for attaching to the upper body garment and the lower body garment, and;</claim-text></claim-text></test>'''
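# Note: on 'start' events lxml does not guarantee that an element's text and
# children have been parsed yet; this works here because the document is small.
context = etree.iterparse(StringIO(xml), events=('start',), tag='claim-text')
for event, element in context:
    print(''.join(element.xpath('text()')))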
Output:
. A protective uniform for use by a person in combat or law enforcement, said uniform comprising:
a. an upper body garment and a separate lower body garment
b. a plurality of a ballistic and non-ballistic panels for attaching to the upper body garment and the lower body garment, and;