Converting accented characters to their Latin equivalents without breaking ElementTree

Asked: 2018-05-16 12:52:07

Tags: python regex ascii encode elementtree

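I'm trying to figure out how to replace all accented characters (å, é, í ...) with their Latin counterparts (a, e, i respectively). I have tried several approaches, but they all do something beyond my understanding that makes it impossible for ElementTree to parse the result with .fromstring() afterwards.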

I also have to escape ampersands, but I've already figured that part out.
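For readers who want to see the ampersand step in isolation, here is a minimal sketch of it. The pattern is the one used in the code of this question; the helper name `escape_ampersands` and the sample strings are made up for illustration.

```python
import re

# Match every "&" that does not already start "&amp;" or a three-digit
# numeric character reference such as "&#233;".
BARE_AMPERSAND = re.compile(r"&(?!#\d{3};|amp;)")


def escape_ampersands(text):
    """Replace bare ampersands with the XML entity &amp;."""
    return BARE_AMPERSAND.sub("&amp;", text)


print(escape_ampersands("<a>Tom & Jerry &amp; co &#233;</a>"))
# <a>Tom &amp; Jerry &amp; co &#233;</a>
```

Note that this pattern only recognises `&amp;` and three-digit numeric references, so other entities such as `&lt;` would be escaped a second time (`&amp;lt;`).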

Relevant code:

# -*- coding: utf-8 -*-

import xml.etree.ElementTree as ET
import os
import re

path = "C:\\Users\\SuperUser\\Desktop\\audit\\audit\\saved\\audit"

root = ET.Element("root")

for filename in os.listdir(path):
    with open(path + "\\" + filename) as myfile:
        lines = myfile.readlines()

    for line in lines:
        line = re.sub(r"&(?!#\d{3};|amp;)", "&amp;", line)
        xmlVal = ET.fromstring(line)

The error occurs on the last line: UnicodeEncodeError: 'ascii' codec can't encode character u'\xc4' in position 161: ordinal not in range(128). The other solutions I tried fail with the same or a similar error.
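The `u'\xc4'` in the message suggests Python 2, where handing ElementTree a unicode string that contains non-ASCII characters can trigger an implicit ASCII encode. One possible workaround that does not involve stripping accents at all is to pass the parser UTF-8 bytes. This is only a sketch under that assumption; the sample line is made up.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample line containing U+00C4 (the character from the error).
line = u"<entry>\u00c4pfel</entry>"

# Encode explicitly so the parser receives bytes instead of text.
element = ET.fromstring(line.encode("utf-8"))
print(element.text == u"\u00c4pfel")
```

With this approach the accented characters survive in the parsed tree, which may or may not be what is wanted here.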

1 Answer:

Answer 0: (score: 1)

Try using the unidecode module.

Example:

import xml.etree.ElementTree as ET
import os
import re
import unidecode


path = "C:\\Users\\SuperUser\\Desktop\\audit\\audit\\saved\\audit"

root = ET.Element("root")

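for filename in os.listdir(path):
    # Read every line of the current audit file
    with open(os.path.join(path, filename)) as myfile:
        lines = myfile.readlines()

    for line in lines:
        # Transliterate non-ASCII characters to their closest ASCII equivalents
        line = unidecode.unidecode(line)
        xmlVal = ET.fromstring(line)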
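If installing a third-party module is not an option, a rough standard-library alternative is to decompose each character with `unicodedata` and drop the combining marks. This is a different technique from unidecode, not a reimplementation of it; the helper name `strip_accents` and the sample line are made up for illustration, and the sketch assumes Python 3 text strings.

```python
import unicodedata
import xml.etree.ElementTree as ET


def strip_accents(text):
    """Decompose characters (NFKD) and drop the combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# Hypothetical sample line with accented characters in an attribute and in text.
line = '<entry name="Ångström café">Íñigo</entry>'
element = ET.fromstring(strip_accents(line))
print(element.get("name"), element.text)
# Angstrom cafe Inigo
```

Characters that have no decomposition, such as ø or ß, pass through unchanged with this approach, whereas unidecode is designed to transliterate those as well.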