Question

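I want to replace some UTF-8 characters with a different set of UTF-8 characters, but everything I try ends in an error.

I'm a Python noob, so please be patient.

What I'm trying to achieve is to convert characters by their Unicode values or by HTML entities (more readable, for maintenance).

Attempts (with examples):

1. First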

#!/usr/bin/env python
# -*- coding: utf-8 -*-

#Found this function
def multiple_replace(dic, text): 
    pattern = "|".join(map(re.escape, dic.keys()))
    return re.sub(pattern, lambda m: dic[m.group()], text)

text="Larry Wall is ùm© some text"
replace_table = {
    u'\x97' : u'\x82' # ù -> é
}
text2=multiple_replace(dic,text)
print text #Expected:Larry Wall is ém© some text
           #Got: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal

2. HTML entities

dic = {
    "&uacute;" : "&eacute;" # ù -> é
} 

some_text="Larry Wall is ùm© some text"
some_text2=some_text.encode('ascii', 'xmlcharrefreplace')
some_text2=multiple_replace(dic,some_text2)
print some_text2
    #Got:UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 14: ordinal not in range(128)

Any ideas are welcome.

Answer 1

Your problem is that your input string is a non-Unicode byte string (<type 'str'> rather than <type 'unicode'>). You have to define the input string with the u"..." syntax:

text=u"Larry Wall is ùm© some text"
#    ^
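
To see why the prefix matters, here is a small sketch that compares the text with its UTF-8 bytes. It is only an illustration: the variable names are made up, and it uses \u escapes so it does not depend on the source file's encoding. It runs on Python 2 and Python 3.

```python
# -*- coding: utf-8 -*-
# Text: a sequence of Unicode code points (what u"..." gives you).
text = u"Larry Wall is \u00f9m\u00a9 some text"

# Bytes: what a plain "..." literal holds in a UTF-8 source file on Python 2.
raw = text.encode("utf-8")

print(len(text))  # 27 characters
print(len(raw))   # 29 bytes: U+00F9 and U+00A9 take two bytes each

# Index of the first byte outside the ASCII range.
first_non_ascii = [i for i, b in enumerate(bytearray(raw)) if b > 127][0]
print(first_non_ascii)                      # 14
print(hex(bytearray(raw)[first_non_ascii])) # 0xc3
```

Byte 0xc3 at position 14 is exactly what the UnicodeDecodeError in the second attempt complains about: Python 2 tried to decode the byte string with the ascii codec and stopped at the first byte of ù.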

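(Besides that, you will have to fix the last statement in your first example: it currently prints the input string (text), whereas I'm pretty sure you meant to see the result (text2).)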
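
Putting both fixes together, the first attempt could look roughly like the sketch below. Two things in it go beyond what is said above and should be read as assumptions. First, u'\x97' and u'\x82' are the code points U+0097 and U+0082 (control characters), not ù and é; those two values look like code-page byte values, while the Unicode code points are U+00F9 and U+00E9. Second, entity() is a hypothetical helper, added only to show one way of keeping the table readable with HTML entity names (note that ù is ugrave, while uacute is ú). The try/except imports are there so the sketch runs on Python 2 and Python 3.

```python
# -*- coding: utf-8 -*-
import re

try:
    from html.entities import name2codepoint   # Python 3
except ImportError:
    from htmlentitydefs import name2codepoint  # Python 2

try:
    unichr          # exists on Python 2 only
except NameError:
    unichr = chr    # on Python 3, chr() already returns text


def multiple_replace(dic, text):
    # Build one alternation of all keys and replace them in a single pass.
    pattern = u"|".join(map(re.escape, dic.keys()))
    return re.sub(pattern, lambda m: dic[m.group()], text)


def entity(name):
    # Hypothetical helper: HTML entity name -> the character it stands for.
    return unichr(name2codepoint[name])


# Keys and values are real Unicode characters, written by entity name.
replace_table = {
    entity("ugrave"): entity("eacute"),  # U+00F9 -> U+00E9
}

text = u"Larry Wall is \u00f9m\u00a9 some text"
text2 = multiple_replace(replace_table, text)
print(text2)  # print the result, not the input
```

Because the input, the keys, and the values are all Unicode strings, the comparison inside re.sub never has to mix byte strings with text, which is what triggered the UnicodeWarning in the original attempt.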
