I have an XML file:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<Reviews>
<Review rid="1004293">
<sentences>
<sentence id="1004293:0">
<text>Judging from previous posts this used to be a good place, but not any longer.</text>
<Opinions>
</sentence>
<sentence id="1004293:1">
<text>We, there were four of us, arrived at noon - the place was empty - and the staff acted like we were imposing on them and they were very rude.</text>
<Opinions>
</sentence>
<sentence id="1004293:2">
<text>They never brought us complimentary noodles, ignored repeated requests for sugar, and threw our dishes on the table.</text>
<Opinions>
<Opinion target="NULL" category="SERVICE#GENERAL" polarity="negative" from="0" to="0"/>
</Opinions>
</sentence>
</sentences>
</Review>
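How can I remove the sentences that have no opinions and keep only the sentences whose text has an opinion? I would like to get something like this: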
<sentences>
<sentence id="1004293:2">
<text>They never brought us complimentary noodles, ignored repeated requests for sugar, and threw our dishes on the table.</text>
<Opinions>
<Opinion target="NULL" category="SERVICE#GENERAL" polarity="negative" from="0" to="0"/>
</Opinions>
</sentence>
</sentences>
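Answer 0 (score: 2)
I would use this module to convert the XML to a dictionary (see, for example, How to convert an xml string to a dictionary?), filter out the unwanted nodes, and then convert the result back to XML. ...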
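A minimal sketch of that round trip, using only the standard library rather than the linked module. The helper names `element_to_dict`, `dict_to_element`, `has_opinion`, and `drop_sentences_without_opinions` are hypothetical, the dictionary layout (`tag`, `attrib`, `text`, `children`) is an assumption made for illustration, and the sample XML is a shortened stand-in for the file in the question.

```python
import xml.etree.ElementTree as ET

def element_to_dict(elem):
    """Convert an Element into a nested dict (hypothetical layout)."""
    return {
        "tag": elem.tag,
        "attrib": dict(elem.attrib),
        "text": (elem.text or "").strip(),
        "children": [element_to_dict(child) for child in elem],
    }

def dict_to_element(node):
    """Rebuild an Element from the dict produced by element_to_dict."""
    elem = ET.Element(node["tag"], node["attrib"])
    elem.text = node["text"] or None
    for child in node["children"]:
        elem.append(dict_to_element(child))
    return elem

def has_opinion(sentence):
    """True if the sentence dict has an Opinions child with at least one entry."""
    return any(
        child["tag"] == "Opinions" and child["children"]
        for child in sentence["children"]
    )

def drop_sentences_without_opinions(node):
    """Recursively remove sentence dicts that carry no opinions."""
    node["children"] = [
        drop_sentences_without_opinions(child)
        for child in node["children"]
        if child["tag"] != "sentence" or has_opinion(child)
    ]
    return node

# Shortened sample input (assumed for illustration only)
source = """<sentences>
  <sentence id="a"><text>No opinion here.</text><Opinions/></sentence>
  <sentence id="b"><text>Rude staff.</text>
    <Opinions><Opinion polarity="negative"/></Opinions>
  </sentence>
</sentences>"""

tree_dict = element_to_dict(ET.fromstring(source))
filtered = dict_to_element(drop_sentences_without_opinions(tree_dict))
kept_ids = [s.get("id") for s in filtered.iter("sentence")]
print(kept_ids)
print(ET.tostring(filtered, encoding="unicode"))
```

The filtering happens entirely on plain dictionaries and lists, which is the appeal of this approach; the cost is that comments, tail text, and original whitespace are not preserved by this simplified conversion.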
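Answer 1 (score: 1)
Consider using XSLT, a special-purpose language designed for transforming XML documents. Specifically, run the identity transform to copy everything, then add an empty template that matches the sentences you want to drop.
XSLT (save as an .xsl file, which is itself an XML file)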
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" omit-xml-declaration="yes" indent="yes"/>
  <xsl:strip-space elements="*"/>

  <!-- IDENTITY TRANSFORM -->
  <xsl:template match="node()|@*">
    <xsl:copy>
      <xsl:apply-templates select="node()|@*"/>
    </xsl:copy>
  </xsl:template>

  <!-- EMPTY TEMPLATE TO DELETE NODE(S) -->
  <xsl:template match="sentence[text and not(Opinions/*)]"/>
</xsl:stylesheet>
Python (using the third-party module lxml)
import lxml.etree as et
doc = et.parse('/path/to/Input.xml')
xsl = et.parse('/path/to/Script.xsl')
# CONFIGURE TRANSFORMER
transform = et.XSLT(xsl)
# TRANSFORM SOURCE DOC
result = transform(doc)
# OUTPUT TO CONSOLE
print(result)
# SAVE TO FILE
with open('Output.xml', 'wb') as f:
    f.write(result)
Answer 2 (score: 1)
Use the built-in XML library (ElementTree).
Note: the XML you posted is not valid, so I had to fix it.
import xml.etree.ElementTree as ET
xml = '''<?xml version="1.0" encoding="UTF-8"?>
<Reviews>
<Review rid="1004293">
<sentences>
<sentence id="1004293:0">
<text>Judging from previous posts this used to be a good place, but not any longer.</text>
<Opinions />
</sentence>
<sentence id="1004293:1">
<text>We, there were four of us, arrived at noon - the place was empty - and the staff acted like we were imposing on them and they were very rude.</text>
<Opinions />
</sentence>
<sentence id="1004293:2">
<text>They never brought us complimentary noodles, ignored repeated requests for sugar, and threw our dishes on the table.</text>
<Opinions>
<Opinion target="NULL" category="SERVICE#GENERAL" polarity="negative" from="0" to="0" />
</Opinions>
</sentence>
</sentences>
</Review>
</Reviews>
'''
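root = ET.fromstring(xml)
sentences_root = root.find('.//sentences')
# Compare with None explicitly: the truth value of an Element depends on its
# child count and is deprecated, so "not s.find(...)" is unreliable.
sentences_with_no_opinions = [s for s in root.findall('.//sentence') if s.find('Opinions/Opinion') is None]
for s in sentences_with_no_opinions:
    sentences_root.remove(s)
# encoding='unicode' returns a str instead of bytes
print(ET.tostring(root, encoding='unicode'))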
Output
<?xml version="1.0" encoding="UTF-8"?>
<Reviews>
<Review rid="1004293">
<sentences>
<sentence id="1004293:2">
<text>They never brought us complimentary noodles, ignored repeated requests for sugar, and threw our dishes on the table.</text>
<Opinions>
<Opinion category="SERVICE#GENERAL" from="0" polarity="negative" target="NULL" to="0" />
</Opinions>
</sentence>
</sentences>
</Review>
</Reviews>
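The snippet in this answer removes sentences from the first `<sentences>` element only, which is enough for the single review shown. If the file holds several `<Review>` elements, a variation along these lines may be closer to what is needed. This is a sketch, not part of the original answer: the function name `remove_sentences_without_opinions` is hypothetical and the two-review sample is made up for illustration.

```python
import xml.etree.ElementTree as ET

def remove_sentences_without_opinions(root):
    """Remove, in place, every <sentence> lacking an <Opinion>; return the count."""
    removed = 0
    # ElementTree elements do not know their parent, so visit each
    # <sentences> container and remove children from it directly.
    # list(...) takes a snapshot so the tree can be modified safely.
    for container in list(root.iter("sentences")):
        for sentence in list(container.findall("sentence")):
            if sentence.find("Opinions/Opinion") is None:
                container.remove(sentence)
                removed += 1
    return removed

# Made-up sample with two reviews (assumed for illustration only)
sample = """<Reviews>
  <Review rid="1">
    <sentences>
      <sentence id="1:0"><text>Fine.</text><Opinions/></sentence>
      <sentence id="1:1"><text>Rude.</text>
        <Opinions><Opinion polarity="negative"/></Opinions>
      </sentence>
    </sentences>
  </Review>
  <Review rid="2">
    <sentences>
      <sentence id="2:0"><text>Nothing to say.</text></sentence>
    </sentences>
  </Review>
</Reviews>"""

root = ET.fromstring(sample)
removed_count = remove_sentences_without_opinions(root)
remaining_ids = [s.get("id") for s in root.iter("sentence")]
print(removed_count, remaining_ids)
```

Removing each sentence from its own container avoids the `ValueError` that `Element.remove` raises when the element is not a direct child of the element it is called on.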