Question

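I'm using an XML parser for my project, but there is one problem I can't get past.

This is my XML file (a sample is included in the update below). I'm interested in a few things: the sentence, the sentence's certainty attribute, and ccue.

The ideal output would be: the certainty (certain or uncertain), the ccue (the text inside the tags), and the whole sentence (with the ccue text either included or excluded).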

What I did:

import xml.etree.ElementTree as ET

with open('myfile.xml', 'rt') as f:
    tree = ET.parse(f)

for sentence in tree.iter('sentence'):
    certainty = sentence.attrib.get('certainty')
    ccue = sentence.find('ccue')
    if certainty and (ccue is not None):
        print('  %s :: %s :: %s' % (certainty, sentence.text, ccue.text))
    else:
        print('  %s ::,:: %s' % (certainty, sentence.text))

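But in this case, if the sentence is uncertain, the output is incomplete: the ccue and everything after it are missing from the sentence. It looks as if the find function stops as soon as it reaches a ccue. So, if the sentence is: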

<sentence certainty="uncertain" id="S1867.3">However, the <ccue>majority of Israelis</ccue> find a comprehensive right of return for Palestinian refugees to be unacceptable.</sentence>

it gives me only "However, the" as the sentence.

Can anyone help me solve this? And if you could also help me save the results as CSV, that would be great.

Update: a sample of the XML:

<sentence certainty="certain" id="S1867.2">Left-wing Israelis are open to compromise on the issue, by means such as the monetary reparations and family reunification initiatives offered by Ehud Barak at the Camp David 2000 summit.</sentence>
<sentence certainty="uncertain" id="S1867.3">However, the <ccue>majority of Israelis</ccue> find a comprehensive right of return for Palestinian refugees to be unacceptable.</sentence>
<sentence certainty="certain" id="S1867.4">The HonestReporting organization listed the following grounds for this opposition: Palestinian flight from Israel was not compelled, but voluntary.</sentence>
<sentence certainty="uncertain" id="S1867.5">After seven Arab nations declared war on Israel in 1948, <ccue>many Arab leaders</ccue> encouraged Palestinians to flee, in order to make it easier to rout the Jewish state.</sentence>
<sentence certainty="certain" id="S1867.6">This point, however, is a matter of some contention.</sentence>

Answer 1

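In XML, the text of an element can be split into many text() nodes. ElementTree has a call, itertext(), that finds all descendant text nodes so you can glue them back together. XML is ambiguous about how to treat whitespace around text nodes (is it part of the real text, or just decoration for "pretty printing"?). Your example has text <ccue>text</ccue> text (note the extra spaces there), so I strip the fragments and add my own spaces. You can adjust that part as needed.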
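To make the text-node point concrete, here is a minimal sketch using one sentence from the question, parsed from a string rather than from myfile.xml. The variable names (before, inside, after, fragments) are only for illustration. It shows that sentence.text holds just the text before the first child tag, while the rest sits in the child's .text and .tail:

```python
import xml.etree.ElementTree as ET

# One sentence from the question, parsed from a string for illustration
xml = ('<sentence certainty="uncertain" id="S1867.3">However, the '
       '<ccue>majority of Israelis</ccue> find a comprehensive right of return '
       'for Palestinian refugees to be unacceptable.</sentence>')
sentence = ET.fromstring(xml)
ccue = sentence.find('ccue')

before = sentence.text   # text before the first child element
inside = ccue.text       # text inside <ccue>...</ccue>
after = ccue.tail        # text after </ccue>, stored on the child element
fragments = list(sentence.itertext())  # all fragments, in document order

print(repr(before))
print(repr(inside))
print(repr(after))
print(''.join(fragments))
```

So find() is not what truncates the sentence; printing sentence.text alone is.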

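import xml.etree.ElementTree as ET

# Let ElementTree open the file and work out the encoding itself
tree = ET.parse('myfile.xml')

for sentence in tree.iter('sentence'):
    certainty = sentence.attrib.get('certainty', '')
    ccue = sentence.find('ccue')
    # Compare against None: an Element with no children is falsy,
    # so a bare "and ccue" would never match here
    if certainty == "uncertain" and ccue is not None:
        # itertext() yields every text fragment in the sentence, ccue text included
        text = ' '.join(node.strip() for node in sentence.itertext())
        print('  %s :: %s :: %s' % (certainty, text, ccue.text))
    else:
        print('  %s ::,:: %s' % (certainty, sentence.text))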
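For the CSV part of the question, something along these lines might work. It is a sketch under assumptions, not a tested solution for the real file: the helper names sentence_rows and write_csv, the column names, and the <doc> wrapper element are all made up here, since the question does not show the file's root element. It also joins the fragments first and collapses whitespace afterwards, which is a slightly different choice from the strip-and-join above.

```python
import csv
import io
import xml.etree.ElementTree as ET

def sentence_rows(root):
    """Yield (id, certainty, cues, full text) for each <sentence> under root."""
    for sentence in root.iter('sentence'):
        # findall() returns every direct <ccue> child, not just the first one
        cues = [c.text or '' for c in sentence.findall('ccue')]
        # Join all text fragments, then collapse runs of whitespace
        text = ' '.join(''.join(sentence.itertext()).split())
        yield (sentence.get('id', ''), sentence.get('certainty', ''),
               '; '.join(cues), text)

def write_csv(root, out):
    """Write a header row plus one row per sentence to the file-like object out."""
    writer = csv.writer(out)
    writer.writerow(['id', 'certainty', 'ccue', 'sentence'])
    writer.writerows(sentence_rows(root))

# Hypothetical wrapper element <doc>; the real file's root is not shown
sample = '''<doc>
<sentence certainty="uncertain" id="S1867.3">However, the <ccue>majority of Israelis</ccue> find a comprehensive right of return for Palestinian refugees to be unacceptable.</sentence>
<sentence certainty="certain" id="S1867.6">This point, however, is a matter of some contention.</sentence>
</doc>'''

buffer = io.StringIO()
write_csv(ET.fromstring(sample), buffer)
print(buffer.getvalue())
```

For the real file, the same function could be pointed at a file opened with open('sentences.csv', 'w', newline='', encoding='utf-8') (the output filename is an assumption) and given ET.parse('myfile.xml').getroot() as the root.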
