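I'm using an XML parser for my project, but I can't get past one problem.
This is my XML file. I'm interested in a few elements: sentence, the sentence's certainty attribute, and ccue.
Ideally, the output would be: the certainty (certain or uncertain), the ccue (the text inside the tag), and the whole sentence (with the ccue text either included or excluded).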
What I have done so far:
import xml.etree.ElementTree as ET

with open('myfile.xml', 'rt') as f:
    tree = ET.parse(f)

for sentence in tree.iter('sentence'):
    certainty = sentence.attrib.get('certainty')
    ccue = sentence.find('ccue')
    if certainty and (ccue is not None):
        print(' %s :: %s :: %s' % (certainty, sentence.text, ccue.text))
    else:
        print(' %s ::,:: %s' % (certainty, sentence.text))
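But in this case the ccue is dropped from the sentence, so an uncertain sentence comes out incomplete. The find function seems to stop as soon as it has found a ccue. So, if the sentence is: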
<sentence certainty="uncertain" id="S1867.3">However, the <ccue>majority of Israelis</ccue> find a comprehensive right of return for Palestinian refugees to be unacceptable.</sentence>
it gives me only "However, the" as the sentence.
Can anyone help me solve this? It would also be great if you could help me save the results as CSV.
UPDATED Example of the XML:
<sentence certainty="certain" id="S1867.2">Left-wing Israelis are open to compromise on the issue, by means such as the monetary reparations and family reunification initiatives offered by Ehud Barak at the Camp David 2000 summit.</sentence>
<sentence certainty="uncertain" id="S1867.3">However, the <ccue>majority of Israelis</ccue> find a comprehensive right of return for Palestinian refugees to be unacceptable.</sentence>
<sentence certainty="certain" id="S1867.4">The HonestReporting organization listed the following grounds for this opposition: Palestinian flight from Israel was not compelled, but voluntary.</sentence>
<sentence certainty="uncertain" id="S1867.5">After seven Arab nations declared war on Israel in 1948, <ccue>many Arab leaders</ccue> encouraged Palestinians to flee, in order to make it easier to rout the Jewish state.</sentence>
<sentence certainty="certain" id="S1867.6">This point, however, is a matter of some contention.</sentence>
Answer 0 (score: 2)
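In XML, text can be broken up into many text() nodes. ElementTree has a call, itertext(), that finds all descendant text nodes so that you can glue them together. XML is ambiguous about how whitespace around text nodes should be handled (is it part of the real text, or simply decoration for "pretty printing"?). Your example has text <ccue>text</ccue> text (note the extra spaces in there), so I strip each piece and add my own spaces. You can tweak that part as needed.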
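To see where the pieces of a sentence actually live, here is a minimal sketch using one sentence from the question. The variable names (xml_snippet, head, cue, tail, full_text) are made up for illustration. It shows that sentence.text only holds the text before the first child element, while the text after </ccue> is stored on the child as its tail.

```python
import xml.etree.ElementTree as ET

# Sample sentence copied from the question.
xml_snippet = (
    '<sentence certainty="uncertain" id="S1867.3">However, the '
    '<ccue>majority of Israelis</ccue> find a comprehensive right of return '
    'for Palestinian refugees to be unacceptable.</sentence>'
)

sentence = ET.fromstring(xml_snippet)
ccue = sentence.find('ccue')

head = sentence.text  # text before the first child element only
cue = ccue.text       # text inside <ccue>
tail = ccue.tail      # text after </ccue>, stored on the child, not the parent

# itertext() yields all three pieces in document order.
full_text = ''.join(sentence.itertext())

print(repr(head))
print(repr(cue))
print(repr(tail))
print(full_text)
```

In this sample the spacing around the tag is already correct, so a plain ''.join() is enough. The loop that follows strips each piece and joins with a single space instead, which is more forgiving when the file has been pretty-printed.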
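import xml.etree.ElementTree as ET

# let ElementTree open the file and figure out the encoding
tree = ET.parse('myfile.xml')
for sentence in tree.iter('sentence'):
    certainty = sentence.attrib.get('certainty', '')
    ccue = sentence.find('ccue')
    # compare with None: truth-testing an Element checks for children, not existence
    if certainty == "uncertain" and ccue is not None:
        # glue together every text node in the sentence, including the ccue text
        text = ' '.join(node.strip() for node in sentence.itertext())
        print(' %s :: %s :: %s' % (certainty, text, ccue.text))
    else:
        print(' %s ::,:: %s' % (certainty, sentence.text))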
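The question also asked about saving the results as CSV. The following is only a sketch of one way to do that with the standard csv module, not a definitive implementation. The <doc> wrapper element, the sentence_rows helper, the column names, and the sentences.csv file name are all assumptions made for this example. It also normalizes whitespace by splitting and rejoining the concatenated text, which is a variation on the strip-and-join approach above.

```python
import csv
import io
import xml.etree.ElementTree as ET

# The question's sample is a fragment, so it is wrapped in a hypothetical <doc> root.
xml_data = """<doc>
<sentence certainty="certain" id="S1867.6">This point, however, is a matter of some contention.</sentence>
<sentence certainty="uncertain" id="S1867.3">However, the <ccue>majority of Israelis</ccue> find a comprehensive right of return for Palestinian refugees to be unacceptable.</sentence>
</doc>"""


def sentence_rows(root):
    """Yield (certainty, ccues, full sentence text) for each <sentence>."""
    for sentence in root.iter('sentence'):
        certainty = sentence.get('certainty', '')
        # findall() returns every direct <ccue> child, not just the first one
        ccues = '; '.join((c.text or '').strip() for c in sentence.findall('ccue'))
        # concatenate all text nodes, then collapse runs of whitespace
        text = ' '.join(''.join(sentence.itertext()).split())
        yield certainty, ccues, text


root = ET.fromstring(xml_data)
rows = list(sentence_rows(root))

# Written to an in-memory buffer here; for a real file use
# open('sentences.csv', 'w', newline='', encoding='utf-8') instead.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(['certainty', 'ccue', 'sentence'])
writer.writerows(rows)
print(buffer.getvalue())
```

The csv module quotes fields that contain commas, so sentences such as "However, the majority ..." stay in a single column. For a file on disk, replace ET.fromstring(xml_data) with ET.parse('myfile.xml').getroot().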