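I'm trying to get the text and the IDs under the Node elements; see the example file here: example.xml
However, it isn't structured like an ordinary XML file. The structure looks like this: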
<TextWithNodes><Node id="0"/>
<Node id="1"/>
<Node id="2"/>9407011<Node id="9"/>
<Node id="10"/>ACL<Node id="13"/> <Node id="14"/>1994<Node id="18"/>
<Node id="19"/> Lg.Pr.Dc <Node id="29"/>
The output I want is a list of (start_node, end_node, text_between_node). I'm not sure whether I can do this with the lxml library.
Currently, I use:
from lxml import etree
tree = etree.parse('9407011.az-scixml.xml')
nodes = tree.xpath('//TextWithNodes')[0].getchildren()
node = nodes[0]  # take one node as an example
print(node.text)  # prints None: each <Node/> is self-closing, so the text is not inside it
Answer 0 (score: 1)
Using XPath may work for you:
from lxml import etree as ET
root = ET.XML(b'''<?xml version='1.0' encoding='UTF-8'?>
<GateDocument version="3">
<TextWithNodes><Node id="0"/>
<Node id="1"/>
<Node id="2"/>9407011<Node id="9"/>
<Node id="10"/>ACL<Node id="13"/> <Node id="14"/>1994<Node id="18"/>
<Node id="19"/> Lg.Pr.Dc <Node id="29"/>
</TextWithNodes></GateDocument>''')
# Select each 'Node' element, but only if:
#   - it has an 'id' attribute,
#   - its first following sibling is a text node that isn't all whitespace, and
#   - its second following sibling is a 'Node' with an 'id'.
for r in root.xpath('''//Node[@id]
                         [following-sibling::node()
                             [1]
                             [self::text()]
                             [normalize-space() != ""]]
                         [following-sibling::node()
                             [2]
                             [self::Node[@id]]]'''):
    # Every element matched by the XPath above has a non-blank tail
    # and a next sibling 'Node', so the calls below are safe.
    print(r.get('id'), repr(r.tail), r.getnext().get('id'))
Comparing normalize-space() with the empty string eliminates nodes that have no real text after them (only whitespace).
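Since the question asks for a list rather than printed lines, here is a minimal sketch of one way to collect the triples without XPath. It relies on the text after each empty `<Node/>` being stored in that element's `.tail`. The helper name `node_spans` and the `SAMPLE` constant are made up for illustration. The sketch also assumes that `TextWithNodes` contains only `Node` elements and text, as in the excerpt from the question.

```python
try:
    from lxml import etree as ET  # the library used in this thread
except ImportError:
    # Fallback (an assumption, not from the thread) so the sketch runs without lxml;
    # the stdlib ElementTree exposes the same .tail / .get / .find API used here.
    import xml.etree.ElementTree as ET

SAMPLE = b'''<GateDocument version="3">
<TextWithNodes><Node id="0"/>
<Node id="1"/>
<Node id="2"/>9407011<Node id="9"/>
<Node id="10"/>ACL<Node id="13"/> <Node id="14"/>1994<Node id="18"/>
<Node id="19"/> Lg.Pr.Dc <Node id="29"/>
</TextWithNodes></GateDocument>'''


def node_spans(root):
    """Return (start_id, end_id, text) for each non-blank text run between two Nodes."""
    container = root.find('.//TextWithNodes')
    nodes = container.findall('Node')
    spans = []
    # Pair each Node with the one after it; the text between them is the first Node's tail.
    for start, end in zip(nodes, nodes[1:]):
        text = start.tail
        if text and text.strip():  # skip missing or whitespace-only text
            spans.append((start.get('id'), end.get('id'), text))
    return spans


spans = node_spans(ET.fromstring(SAMPLE))
for span in spans:
    print(span)
```

The text is kept exactly as it appears in the file, including the surrounding spaces in ' Lg.Pr.Dc '; call `text.strip()` when appending if you would rather have it trimmed.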