Question

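I am parsing XML documents that contain a specific tag (<question>) embedded under another one (<Turn>), and I need to check whether there is any text after the closing tag </question> up to the closing parent tag </Turn>. The problem is that between </question> and </Turn> there may be other tags, newlines, spaces, or any combination of these, so just retrieving the tail of the question is not enough.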

Here is a sample of part of an XML file I am processing:

<root>
<Turn speaker="spk2" startTime="5121.203" endTime="5136.265">
<question startline="8321" endline="8326">
<Sync time="5121.203"/>
some text
<Sync time="5126.531"/>
<Sync time="5127.662"/>
other text?</question><question startline="8326" endline="8326">
here are some other words?
</question>
<Sync time="5128.514"/>
THIS IS SOME TEXT I WANT TO GET <anothertag att="2"/> SOME OTHER TEXT
<annoyingtag att="blah"/>
AND THIS TOO
</Turn>
<Turn>
<question> this is a question? </question>
this is not, I want to get this text.
</Turn>
<Turn>
There could be a turn with no question here.
</Turn>
<Turn>
<question> and then another with a question? </question>
followed by <Sync/> other text but also <Event/> other tags <Who/> and I want to get all this text.
</Turn>
</root>

I am using lxml in Python to process the XML. At the point where I want to check whether there is some text between </question> and </Turn>, I am already inside a for loop that processes the questions, like this:

Turns = rootnode.findall(".//Turn")
for Turn in Turns:
    questions = Turn.findall(".//question")
    for question in questions:
        if question == questions[-1]:
            # This is where I will insert the code trying to find
            # if there is some text following the question tag.
            pass

In this context I tried question.tail() and, as another approach, question.xpath("//text()")[1] to get the tail, but in neither case can I see everything between the last </question> and </Turn> (whether that is nothing or some content).

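I also tried doing this on the raw file with regular expressions, but because so many things can appear between the two closing tags, I ended up with regexes containing nested quantifiers and ran into catastrophic backtracking.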

Answer 1

If the Sync tag is always present, this might work:

xml = """<Turn speaker="spk2" startTime="5121.203" endTime="5136.265">
<question startline="8321" endline="8326">
<Sync time="5121.203"/>
some text
<Sync time="5126.531"/>
<Sync time="5127.662"/>
other text?</question><question startline="8326" endline="8326">
here are some other words?
</question>
<Sync time="5128.514"/>
THIS IS SOME TEXT I WANT TO GET <anothertag att="2"/> SOME OTHER TEXT
<annoyingtag att="blah"/>
AND THIS TOO
</Turn>"""

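from lxml.html import fromstring

# The HTML parser lowercases tag names, which is why the XPath
# below matches "sync" rather than "Sync".
xml = fromstring(xml)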

print(xml.xpath("//question[last()]/following::sync/following::text()"))

Which would give you:

['\nTHIS IS SOME TEXT I WANT TO GET ', ' SOME OTHER TEXT\n', '\nAND THIS TOO\n']

Or:

print(xml.xpath("//question[last()]/following::text()"))

Which gives you:

['\n', '\nTHIS IS SOME TEXT I WANT TO GET ', ' SOME OTHER TEXT\n', '\nAND THIS TOO\n']
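The first item in that list is only a newline, so checking whether the list is non-empty would report text even when nothing but whitespace follows the question. A minimal sketch of filtering that out, assuming a list of strings like the one returned above (`has_real_text` is a hypothetical helper name, not part of lxml):

```python
def has_real_text(texts):
    """Return True if any string contains something other than whitespace."""
    return any(t.strip() for t in texts)


# The list returned by the following::text() query above.
texts = ['\n', '\nTHIS IS SOME TEXT I WANT TO GET ', ' SOME OTHER TEXT\n', '\nAND THIS TOO\n']

# Keep only the non-blank pieces, with surrounding whitespace removed.
cleaned = [t.strip() for t in texts if t.strip()]

print(has_real_text(texts))          # True
print(has_real_text(['\n', '   ']))  # False
print(cleaned)
```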

You can also use a wildcard:

print(xml.xpath("//question[last()]/following::*/following::text()"))

Which again gives you:

['\nTHIS IS SOME TEXT I WANT TO GET ', ' SOME OTHER TEXT\n', '\nAND THIS TOO\n']
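The XPath expressions above start from the document root and use the following:: axis, which keeps going past the enclosing </Turn> once the file contains several turns. Inside the per-Turn loop from the question, one alternative is to walk the siblings that come after the last question and collect their tails. This is a minimal sketch, assuming question elements are direct children of Turn as in the sample; `text_after_last_question` and the shortened `SAMPLE` are illustrative names, not part of lxml. It only uses the ElementTree-compatible part of the API, so it falls back to the standard library when lxml is not installed:

```python
try:
    from lxml import etree as ET  # the library used in this thread
except ImportError:  # fallback so the sketch also runs without lxml
    import xml.etree.ElementTree as ET

# Shortened version of the sample from the question (assumption: same shape).
SAMPLE = """<root>
<Turn speaker="spk2">
<question> some text <Sync time="5126.531"/> other text?</question><question> here are some other words? </question>
<Sync time="5128.514"/>
THIS IS SOME TEXT I WANT TO GET <anothertag att="2"/> SOME OTHER TEXT
<annoyingtag att="blah"/>
AND THIS TOO
</Turn>
<Turn> <question> this is a question? </question> this is not, I want to get this text. </Turn>
<Turn> There could be a turn with no question here. </Turn>
<Turn> <question> and then another with a question? </question> followed by <Sync/> other text but also <Event/> other tags <Who/> and I want to get all this text. </Turn>
</root>"""


def text_after_last_question(turn):
    """Text between the last </question> and </Turn>; None if there is no question."""
    children = list(turn)
    positions = [i for i, child in enumerate(children) if child.tag == "question"]
    if not positions:
        return None
    last = positions[-1]
    # Start with the tail of the last question itself ...
    parts = [children[last].tail or ""]
    # ... then add every later sibling's inner text and its tail.
    for sibling in children[last + 1:]:
        parts.extend(sibling.itertext())
        parts.append(sibling.tail or "")
    # Collapse newlines and runs of spaces into single spaces.
    return " ".join(" ".join(parts).split())


root = ET.fromstring(SAMPLE)
results = [text_after_last_question(turn) for turn in root.findall(".//Turn")]
for result in results:
    print(repr(result))
```

An empty string would mean a question was found but only whitespace or empty tags follow it, while None means the turn has no question at all.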

lxml: check whether there is text following an element (not just the tail)
