XML parser in Python without removing tags

Date: 2017-05-15 14:44:59

Tags: python xml parsing text

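I am using an XML parser for my project but cannot get past one problem.

This is my XML file. I am interested in a few things: the sentence, the sentence's certainty, and the ccue.

The ideal output I would like to get: the certainty (certain or uncertain), the ccue (the text inside the tag), and the whole sentence (with the ccue text either included or excluded).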

What I did:

import xml.etree.ElementTree as ET

with open('myfile.xml', 'rt') as f:
    tree = ET.parse(f)

for sentence in tree.iter('sentence'):
    certainty = sentence.attrib.get('certainty')
    ccue = sentence.find('ccue')
    if certainty and (ccue is not None):
        print('  %s :: %s :: %s' % (certainty, sentence.text, ccue.text))
    else:
        print('  %s ::,:: %s' % (certainty, sentence.text))

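But in this case the ccues are removed from the sentence, so an uncertain sentence comes out incomplete. The find function seems to stop as soon as it has found a ccue. So if the sentence is: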

<sentence certainty="uncertain" id="S1867.3">However, the <ccue>majority of Israelis</ccue> find a comprehensive right of return for Palestinian refugees to be unacceptable.</sentence>

it gives me "However, the" as the sentence.

Can anyone help me solve this? It would also be great if you could help me save the results as CSV.

Updated: an example of the XML:

<sentence certainty="certain" id="S1867.2">Left-wing Israelis are open to compromise on the issue, by means such as the monetary reparations and family reunification initiatives offered by Ehud Barak at the Camp David 2000 summit.</sentence>
<sentence certainty="uncertain" id="S1867.3">However, the <ccue>majority of Israelis</ccue> find a comprehensive right of return for Palestinian refugees to be unacceptable.</sentence>
<sentence certainty="certain" id="S1867.4">The HonestReporting organization listed the following grounds for this opposition: Palestinian flight from Israel was not compelled, but voluntary.</sentence>
<sentence certainty="uncertain" id="S1867.5">After seven Arab nations declared war on Israel in 1948, <ccue>many Arab leaders</ccue> encouraged Palestinians to flee, in order to make it easier to rout the Jewish state.</sentence>
<sentence certainty="certain" id="S1867.6">This point, however, is a matter of some contention.</sentence>

1 answer:

Answer 0 (score: 2)

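In XML, text can be broken up into many text() nodes. ElementTree has a call, itertext(), that finds all of the descendant text nodes so that you can glue them together. XML is ambiguous about how to treat the whitespace around text nodes (is it part of the real text, or just decoration for "pretty printing"?). Your example has text <ccue>text</ccue> text (note the extra spaces around the tags), so I strip the nodes and add my own spaces. You can adjust that part as needed.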
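To make the splitting concrete, here is a minimal sketch that parses one sentence from the question out of a string (the variable names are only for illustration). It shows where ElementTree stores each piece: the text before the first child is on the parent's .text, and the text after the closing tag is on the child's .tail, which is why sentence.text alone looks truncated.

```python
import xml.etree.ElementTree as ET

# One sentence copied from the question, parsed from a string for illustration.
xml_snippet = (
    '<sentence certainty="uncertain" id="S1867.3">However, the '
    '<ccue>majority of Israelis</ccue> find a comprehensive right of return '
    'for Palestinian refugees to be unacceptable.</sentence>'
)
sentence = ET.fromstring(xml_snippet)
ccue = sentence.find('ccue')

before = sentence.text   # text up to the first child element
inside = ccue.text       # text inside <ccue>...</ccue>
after = ccue.tail        # text following </ccue>, stored on the child

# itertext() walks all three pieces in document order.
full = ''.join(sentence.itertext())

print(repr(before))
print(repr(inside))
print(repr(after))
print(full)
```

Joining with an empty string keeps the whitespace exactly as it appears in the file; stripping each node and joining with a single space is the alternative when the file's spacing is not trustworthy.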

import xml.etree.ElementTree as ET

# let ElementTree open the file and figure out the encoding
tree = ET.parse('myfile.xml')

for sentence in tree.iter('sentence'):
    certainty = sentence.attrib.get('certainty', '')
    ccue = sentence.find('ccue')
    # compare with None: an Element with no children tests as false
    if certainty == "uncertain" and ccue is not None:
        # glue all descendant text nodes together, normalizing the spaces
        text = ' '.join(node.strip() for node in sentence.itertext())
        print('  %s :: %s :: %s' % (certainty, text, ccue.text))
    else:
        print('  %s ::,:: %s' % (certainty, sentence.text))
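For the CSV part of the question, one possible approach is the standard csv module. The sketch below rests on assumptions: the helper names sentence_rows and write_csv, the column names, and the <doc> root element wrapping the sample are all made up for illustration, and multiple ccues in one sentence are joined with "; ".

```python
import csv
import io
import xml.etree.ElementTree as ET


def sentence_rows(root):
    """Yield (certainty, ccues, full sentence) for each <sentence> under root."""
    for sentence in root.iter('sentence'):
        certainty = sentence.attrib.get('certainty', '')
        # A sentence may hold several <ccue> elements; join them with '; '.
        ccues = '; '.join((c.text or '').strip() for c in sentence.iter('ccue'))
        # Join the stripped text nodes with single spaces, skipping empty ones.
        text = ' '.join(t.strip() for t in sentence.itertext() if t.strip())
        yield certainty, ccues, text


def write_csv(root, out):
    """Write a header row plus one row per sentence to the file-like object out."""
    writer = csv.writer(out)
    writer.writerow(['certainty', 'ccue', 'sentence'])
    writer.writerows(sentence_rows(root))


# Two sentences from the question, wrapped in a hypothetical <doc> root.
sample = (
    '<doc>'
    '<sentence certainty="certain" id="S1867.6">This point, however, '
    'is a matter of some contention.</sentence>'
    '<sentence certainty="uncertain" id="S1867.5">After seven Arab nations '
    'declared war on Israel in 1948, <ccue>many Arab leaders</ccue> encouraged '
    'Palestinians to flee, in order to make it easier to rout the Jewish '
    'state.</sentence>'
    '</doc>'
)
root = ET.fromstring(sample)
buf = io.StringIO()
write_csv(root, buf)
print(buf.getvalue())
```

To write a real file instead of an in-memory buffer, pass write_csv the root from ET.parse('myfile.xml').getroot() and a file opened with open('sentences.csv', 'w', newline='', encoding='utf-8'). The csv module quotes fields that contain commas, so the sentences survive intact.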