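I'm trying to parse an XML file and strip unnecessary tags from its lines. My loop gets stuck and never applies the second if statement to the tags, and I'm not sure why... I've been staring at this for over an hour and testing new approaches, but I keep getting the error ParseError: mismatched tag:
. From debugging, I can tell it never even enters the second if statement, although by my logic it should. I know I'm missing something small here, but I can't figure it out... any ideas?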
The loop
with open('test.xml') as inXML, open(outputFilename, 'w') as outXML:
    outXML.write('<root>\n')
    for line in inXML.readlines():
        if (line.find("<sub>")):
            newline = line.replace("<sub>", "")
            newLine = newline.replace("</sub", "")
        elif (line.find("<sup>")):
            newline = line.replace("<sup>", "")
            newLine = newline.replace("</sup", "")
        outXML.write(re.sub('&[a-zA-Z]+;',anglicise,newLine))
    outXML.write('\n</root>')
XML to test with
<pub>
<ID>5010</ID>
<title>Model-Checking for L<sub>2</sub</title>
<year>1997</year>
<booktitle>Universität Trier, Mathematik/Informatik, Forschungsbericht</booktitle>
<pages></pages>
<authors>
<author>Helmut Seidl</author>
</authors>
</pub>
<pub>
<ID>71035</ID>
<title>S_2p \subseteq ZPP<sup>NP</sup</title>
<year>2001</year>
<booktitle>Electronic Colloquium on Computational Complexity (ECCC)</booktitle>
<pages></pages>
<authors>
<author>Jin-yi Cai</author>
</authors>
</pub>
Answer 0 (score: 0)
Thanks @juanpa.arrivillaga & @BrenBarn. The solution was to chain the .replace() calls on each line of the iteration, like so:
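import re

# outputFilename and anglicise are defined elsewhere in the script
with open('test.xml') as inXML, open(outputFilename, 'w') as outXML:
    outXML.write('<root>\n')
    for line in inXML.readlines():
        # Strip both tag pairs unconditionally; replace() leaves the line unchanged when a tag is absent
        line = line.replace("<sub>", "").replace("</sub", "").replace("<sup>", "").replace("</sup", "")
        outXML.write(re.sub('&[a-zA-Z]+;', anglicise, line))
    outXML.write('\n</root>')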
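For anyone wondering why the original loop never reached the elif: str.find() returns an index, not a boolean. It gives -1 when the substring is missing, and -1 is truthy, so the first branch ran for almost every line and the <sup> tags were never touched. Here is a minimal sketch of that behaviour, using one line from the sample XML as posted. The helper name strip_inline_tags is my own, purely for illustration, and it matches the closing marker without '>' only because that is how the sample data appears in the question.

```python
line = "<title>S_2p \\subseteq ZPP<sup>NP</sup</title>"

# str.find returns the index of the match, or -1 when there is none.
print(line.find("<sub>"))    # -1, which is truthy, so the first branch runs
print(line.find("<title>"))  # 0, which is falsy
print("<sup>" in line)       # True: `in` is the membership test the loop wanted


def strip_inline_tags(text, tags=("sub", "sup")):
    """Hypothetical helper: remove <sub>/<sup> markers, like the chained replace() calls."""
    for tag in tags:
        # Closing marker is matched without '>' to mirror the sample data as posted (assumption).
        text = text.replace(f"<{tag}>", "").replace(f"</{tag}", "")
    return text


print(strip_inline_tags(line))
```

If you do want a conditional, `if "<sub>" in line:` is the test to use, but since replace() does nothing when the substring is absent, the unconditional version is simpler.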