In line

Time: 2017-04-26 18:34:10

Tags: python xml find

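I am trying to parse an XML file and remove unnecessary tags from its lines. My loop gets stuck on the first condition and never applies the second if statement to the tags, and I am not sure why. I have been staring at this for over an hour and trying new approaches, but I keep getting the error ParseError: mismatched tag:. From debugging I can tell that it never even enters the second if statement, although my logic says it should. I know I am missing something small here but cannot figure it out. Any ideas?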

Loop

with open('test.xml') as inXML, open(outputFilename, 'w') as outXML:
    outXML.write('<root>\n')
    for line in inXML.readlines():
        if (line.find("<sub>")):
            newline = line.replace("<sub>", "")
            newLine = newline.replace("</sub", "")
        elif (line.find("<sup>")):
            newline = line.replace("<sup>", "")
            newLine = newline.replace("</sup", "")

        outXML.write(re.sub('&[a-zA-Z]+;',anglicise,newLine))
    outXML.write('\n</root>')

XML to test

<pub>
    <ID>5010</ID>
    <title>Model-Checking for L<sub>2</sub</title>
    <year>1997</year>
    <booktitle>Universit&auml;t Trier, Mathematik/Informatik, Forschungsbericht</booktitle>
    <pages></pages>
    <authors>
        <author>Helmut Seidl</author>
    </authors>
</pub>
<pub>
    <ID>71035</ID>
    <title>S_2p \subseteq ZPP<sup>NP</sup</title>
    <year>2001</year>
    <booktitle>Electronic Colloquium on Computational Complexity (ECCC)</booktitle>
    <pages></pages>
    <authors>
        <author>Jin-yi Cai</author>
    </authors>
</pub>

1 answer:

Answer 0 (score: 0)

Thanks to @juanpa.arrivillaga & @BrenBarn, the solution was to chain the .replace() calls on each line of the iteration, as follows:

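import re

with open('test.xml') as inXML, open(outputFilename, 'w') as outXML:
    outXML.write('<root>\n')
    for line in inXML:  # iterate over the file directly instead of loading it all with readlines()
        # Strip both tag pairs unconditionally; replace() leaves the line unchanged when nothing matches.
        line = line.replace("<sub>", "").replace("</sub", "").replace("<sup>", "").replace("</sup", "")
        # anglicise and outputFilename are defined elsewhere (not shown here).
        outXML.write(re.sub('&[a-zA-Z]+;', anglicise, line))
    outXML.write('\n</root>')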
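
The original `if`/`elif` chain never reached its second branch because `str.find` returns an index rather than a boolean: it gives -1 (which is truthy) when the substring is absent and 0 (which is falsy) when the match is at the very start of the line. The sketch below illustrates this on lines modelled on the test XML and shows a membership test with `in` as one possible alternative; `strip_tags` and the sample constants are hypothetical names used only for illustration.

```python
# Hypothetical sample lines modelled on the test XML; the closing tags are
# spelled "</sub" and "</sup" because that is how they appear in the data.
SUB_LINE = "    <title>Model-Checking for L<sub>2</sub</title>"
SUP_LINE = "    <title>S_2p \\subseteq ZPP<sup>NP</sup</title>"

# str.find returns the index of the match, or -1 when there is no match.
# -1 is truthy, so `if line.find("<sub>"):` succeeds on lines without <sub>,
# and the `elif` for <sup> is never evaluated.
missing = SUP_LINE.find("<sub>")     # no match
at_start = "<sub>x".find("<sub>")    # match at index 0, which is falsy


def strip_tags(line, tags=("sub", "sup")):
    """Remove each opening tag and its (truncated) closing tag from one line."""
    for tag in tags:
        opening = "<" + tag + ">"
        if opening in line:          # `in` is a real boolean membership test
            line = line.replace(opening, "").replace("</" + tag, "")
    return line


if __name__ == "__main__":
    print(missing, at_start)
    print(strip_tags(SUB_LINE))
    print(strip_tags(SUP_LINE))
```

Because `replace()` is already a no-op when the substring is missing, the `in` check is optional; it is kept here only to show the correct form of the test the original loop was attempting.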