How to make lxml write its output file as UTF-8 in Python

Date: 2018-12-01 08:23:21

Tags: python lxml

data.xml

<?xml version="1.0" encoding="UTF-8"?>
<ArticleSet>
    <Article>            
        <LastName>Bojarski</LastName>
        <ForeName>-</ForeName>
        <Affiliation>-</Affiliation>            
    </Article>
    <Article>            
        <LastName>Genç</LastName>
        <ForeName>Yasemin</ForeName>
        <Affiliation>fgjfgnfgn</Affiliation>            
    </Article>
</ArticleSet>

Sample code

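from lxml import etree

# Parse the input file and get its root element (<ArticleSet>)
dom = etree.parse('data.xml')
root = dom.getroot()

# Remove every <Article> whose <Affiliation> text is exactly "-"
for article in dom.xpath('Article[Affiliation="-"]'):
    root.remove(article)

# Write the remaining tree back to disk
dom.write('output.xml')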

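This code removes every article whose affiliation equals "-", i.e. whose affiliation tag looks like <Affiliation>-</Affiliation>. When I save the remaining output to output.xml, the non-ASCII character in Genç is escaped and the name is written as Gen&#231;. I want to store it as is.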

Code output

<ArticleSet>
    <Article>            
        <LastName>Gen&#231;</LastName>
        <ForeName>Yasemin</ForeName>
        <Affiliation>fgjfgnfgn</Affiliation>            
    </Article>
</ArticleSet>

Required output

<ArticleSet>
    <Article>            
        <LastName>Genç</LastName>
        <ForeName>Yasemin</ForeName>
        <Affiliation>fgjfgnfgn</Affiliation>            
    </Article>
</ArticleSet>

1 answer:

Answer 0: (score: 0)

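The write method of the tree (the object returned by etree.parse) takes an encoding parameter. Without it, lxml serializes as ASCII, which is why ç is written as the character reference &#231;. You can also pass xml_declaration=True so that the output document declares its encoding.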

dom.write('output.xml', encoding='utf-8', xml_declaration=True)

See the lxml documentation.
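
To see the difference without touching any files, here is a minimal sketch that serializes the same tree twice with etree.tostring, which accepts the same encoding and xml_declaration arguments. The inline XML string is a shortened copy of data.xml, and the names XML, ascii_bytes and utf8_bytes are made up for this illustration only.

```python
from lxml import etree

# Shortened, inline stand-in for data.xml (assumption: same structure as the question's file)
XML = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    '<ArticleSet>'
    '<Article><LastName>Bojarski</LastName><Affiliation>-</Affiliation></Article>'
    '<Article><LastName>Genç</LastName><Affiliation>fgjfgnfgn</Affiliation></Article>'
    '</ArticleSet>'
).encode('utf-8')

root = etree.fromstring(XML)

# Same filter as in the question: drop articles whose affiliation is "-"
for article in root.xpath('Article[Affiliation="-"]'):
    root.remove(article)

# No encoding given: lxml falls back to ASCII and escapes non-ASCII characters
ascii_bytes = etree.tostring(root)

# Explicit UTF-8: the character is written as-is, with an XML declaration on top
utf8_bytes = etree.tostring(root, encoding='utf-8', xml_declaration=True)

print(ascii_bytes.decode('ascii'))
print(utf8_bytes.decode('utf-8'))
```

The first print should show the escaped form from the question, and the second the name written directly. The same two keyword arguments on dom.write produce the corresponding file on disk.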