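I am parsing an XML document in which a specific tag (<question>) is nested under another tag (<Turn>), and I need to check whether there is any text after the closing </question> tag, up to the closing parent tag </Turn>. The problem is that between </question> and </Turn> there may be other tags, newlines, spaces, or all of these at once, so retrieving the tail of the question alone is not enough.
I am using lxml in Python to process the XML. Here is a sample of part of the XML file I am working with:
<root>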
<Turn speaker="spk2" startTime="5121.203" endTime="5136.265">
<question startline="8321" endline="8326">
<Sync time="5121.203"/>
some text
<Sync time="5126.531"/>
<Sync time="5127.662"/>
other text?</question><question startline="8326" endline="8326">
here are some other words?
</question>
<Sync time="5128.514"/>
THIS IS SOME TEXT I WANT TO GET <anothertag att="2"/> SOME OTHER TEXT
<annoyingtag att="blah"/>
AND THIS TOO
</Turn>
<Turn>
<question>
this is a question?
</question>
this is not, I want to get this text.
</Turn>
<Turn>
There could be a turn with no question here.
</Turn>
<Turn>
<question>
and then another with a question?
</question>
followed by
<Sync/>
other text
but also
<Event/>
other tags
<Who/>
and I want to get all this text.
</Turn>
</root>
When I want to check whether there is any text between </question> and </Turn>, I am already inside a for loop over the questions, for example:
Turns = rootnode.findall(".//Turn")
for Turn in Turns:
    questions = Turn.findall(".//question")
    for question in questions:
        if question == questions[-1]:
            # This is where I will insert the code that checks whether any text follows the question tag.
            pass
In this case I tried question.tail(), and also question.xpath("//text()")[1], to get the tail, but in neither case do I see all of the text between the last </question> and </Turn> (I get either nothing or only part of it).
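To make the two failed attempts concrete, here is a minimal sketch on a made-up one-line Turn (this document is an assumption for illustration, not taken from the real file). It shows that tail is an attribute that stops at the next element, and that an XPath starting with // is evaluated from the document root rather than from the question.

```python
from lxml import etree

# Hypothetical miniature Turn, used only to illustrate the behaviour.
turn = etree.fromstring(
    "<Turn><question>q?</question>first <Sync/> second <Event/> third</Turn>"
)
question = turn.find("question")

# .tail is an attribute, not a method, and it only holds the text
# up to the next element, so " second " and " third" are missing.
tail_only = question.tail

# A path that starts with // ignores the context node and returns every
# text node in the document, including the question's own text.
all_text = question.xpath("//text()")

print(repr(tail_only))
print(all_text)
```

In this sketch, index [1] of the //text() result happens to be the same string as question.tail, which would explain why both attempts appear to return the same partial text.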
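I also tried doing this with a regular expression on the raw file, but because so many things can appear between the two closing tags, I ended up with a regex with nested quantifiers and catastrophic backtracking problems.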
Answer 0 (score: 0)
If the Sync tag is always present, this may work:
xml = """<Turn speaker="spk2" startTime="5121.203" endTime="5136.265">
<question startline="8321" endline="8326">
<Sync time="5121.203"/>
some text
<Sync time="5126.531"/>
<Sync time="5127.662"/>
other text?</question><question startline="8326" endline="8326">
here are some other words?
</question>
<Sync time="5128.514"/>
THIS IS SOME TEXT I WANT TO GET <anothertag att="2"/> SOME OTHER TEXT
<annoyingtag att="blah"/>
AND THIS TOO
</Turn>"""
from lxml.html import fromstring
# The HTML parser lowercases tag names, which is why the XPath uses "sync" rather than "Sync".
xml = fromstring(xml)
print(xml.xpath("//question[last()]/following::sync/following::text()"))
Which gives you:
['\nTHIS IS SOME TEXT I WANT TO GET ', ' SOME OTHER TEXT\n', '\nAND THIS TOO\n']
Or:
print(xml.xpath("//question[last()]/following::text()"))
Which gives you:
['\n', '\nTHIS IS SOME TEXT I WANT TO GET ', ' SOME OTHER TEXT\n', '\nAND THIS TOO\n']
You can also use a wildcard:
print(xml.xpath("//question[last()]/following::*/following::text()"))
Which again gives you:
['\nTHIS IS SOME TEXT I WANT TO GET ', ' SOME OTHER TEXT\n', '\nAND THIS TOO\n']
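One caveat: the following:: axis is not limited to the enclosing Turn, so on a document with several Turn elements these expressions would also pick up text from later turns. A possible per-Turn variant is sketched below. It uses the XML parser (lxml.etree) instead of lxml.html, and it assumes that each question is a direct child of its Turn, as in the sample. The helper name text_after_last_question and the shortened sample are made up for illustration.

```python
from lxml import etree

# Shortened version of the sample from the question (an assumption for the demo).
SAMPLE = """<root>
<Turn>
<question>
this is a question?
</question>
this is not, I want to get this text.
</Turn>
<Turn>
There could be a turn with no question here.
</Turn>
<Turn>
<question>
and then another with a question?
</question>
followed by
<Sync/>
other text
<Event/>
other tags
</Turn>
</root>"""


def text_after_last_question(turn):
    """Return the whitespace-normalized text between the last <question>
    child of `turn` and its closing tag, or None if there is no question."""
    last = turn.xpath("question[last()]")
    if not last:
        return None
    # following-sibling:: stays inside the parent Turn; descendant-or-self::
    # also collects text nested inside any sibling elements.
    pieces = last[0].xpath("following-sibling::node()/descendant-or-self::text()")
    return " ".join("".join(pieces).split())


root = etree.fromstring(SAMPLE)
trailing = [text_after_last_question(turn) for turn in root.iter("Turn")]
print(trailing)
```

An empty string means the Turn has a question with nothing but whitespace after it, while None means the Turn has no question at all, so a simple truthiness check is enough inside the existing loop.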