Question

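I need to replace Unicode characters according to a custom set of substitutions. The substitutions are defined by someone else's API, and I basically just have to deal with it. So far I have extracted all the required substitutions into a CSV file. Here is a sample: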

\u0020, 
\u0021,!
\u0023,#
\u0024,$
\u0025,%
\u0026,&
\u0028,(
\u0029,)
\u002a,*
\u002b,+
\u002c,","
\u002d,-
\u002e,.
\u002f,/
\u03ba,kappa
...

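I generated this in MS Excel by hacking up the Java program that the API owners themselves use when they need to do conversions (and no... they won't run the converter when the API receives input...). There are about 1,500 substitutions defined.

When I generate output (from my Django application) to send as input to their API, I want to handle the substitutions. Here is what I have been trying: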

class UTF8Converter(object):
    def __init__(self):
        #create replacement mapper
        full_file_path = os.path.join(os.path.dirname(__file__),
                                      CONVERSION_FILE)
        with open(full_file_path) as csvfile:
            reader = csv.reader(csvfile)
            mapping = []
            for row in reader:
                #remove escape-y slash
                mapping.append( (row[0], row[1]) ) # here's the problem
        self.mapping = mapping

    def replace_UTF8(self, string):
        for old, new in self.mapping:
            print new
            string.replace(old, new)
        return string

The problem is that the Unicode codes from the CSV file come through as, for example, self.mapping[example][0] = '\\u00e0'. OK, that's wrong, so let's try:

mapping.append( (row[0].decode("string_escape"), row[1]) )

No change. How about:

mapping.append( (row[0].decode("unicode_escape"), row[1]) )

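OK, now self.mapping[example][0] = u'\xe0'. So yes, that's the character I need to replace... but the string I need to call replace_UTF8() on looks like u'\u00e0'.

I have also tried row[0].decode("utf-8"), row[0].encode("utf-8"), and unicode(row[0], "utf-8").

I also tried this, but I don't have Unicode characters in the CSV file; I have Unicode code points (not sure whether that is the correct terminology).

So, how do I convert the string I read in from the CSV file into a Unicode string that I can use with mythingthatneedsconverted.replace(...)?

Or... do I need to do something else with the CSV file so that I can use a more sensible approach?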

Answer 1

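I don't think your problem actually exists:

OK, now self.mapping[example][0] = u'\xe0'. So yes, that's the character I need to replace... but the string I need to call replace_UTF8() on looks like u'\u00e0'.

These are just different representations of the exact same string. You can test it yourself: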

>>> u'\xe0' == u'\u00e0'
True
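
For comparison, here is a minimal Python 3 sketch of the same check. The code in this thread is Python 2; in Python 3 every str is already Unicode and the u prefix is optional. The variable names are illustrative only, and the raw string stands in for a cell as it would be read from the CSV file.

```python
# Python 3 sketch: the escaped text from the CSV is six characters long,
# while the decoded value is one character, however it is spelled.
raw_cell = r"\u00e0"  # stand-in for a CSV cell: backslash, 'u', four hex digits

# Round-trip through ASCII bytes so unicode_escape can interpret the escape.
decoded = raw_cell.encode("ascii").decode("unicode_escape")

# '\xe0' and '\u00e0' are two spellings of the same one-character string.
same_character = (decoded == "\xe0" == "\u00e0")
print(len(raw_cell), len(decoded), same_character)
```

The interpreter simply prefers the shorter \xe0 spelling when it prints the value; the string itself is the same.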

The actual problem is that you're not doing any replacing. In this code:

def replace_UTF8(self, string):
    for old, new in self.mapping:
        print new
        string.replace(old, new)
    return string

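you're just calling string.replace over and over, which returns a new string each time but does nothing to string itself. (It can't do anything to string itself; strings are immutable.) What you want is: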

def replace_UTF8(self, string):
    for old, new in self.mapping:
        print new
        string = string.replace(old, new)
    return string
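
The difference between the two versions can be seen in a few lines. This is a Python 3 sketch with a made-up two-entry mapping, not the asker's real table, and the function names are hypothetical:

```python
# Python 3 sketch: str.replace returns a new string and leaves the original alone.
mapping = [("\u00e0", "a"), ("\u03ba", "kappa")]  # hypothetical sample pairs

def replace_discarding(text):
    for old, new in mapping:
        text.replace(old, new)  # the result is thrown away
    return text

def replace_rebinding(text):
    for old, new in mapping:
        text = text.replace(old, new)  # keep the new string each time
    return text

sample = "voil\u00e0 \u03ba"
print(replace_discarding(sample) == sample)  # the input comes back unchanged
print(replace_rebinding(sample))
```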

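However, if string really is a UTF-8-encoded str, as the function name implies, this still won't work. When you encode u'\u00e0' as UTF-8, what you get is '\xc3\xa0'. There is no \u00e0 in there to be replaced. So what you really need to do is decode it, do the replacement, and then re-encode. Like this: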

def replace_UTF8(self, string):
    u = string.decode('utf-8')
    for old, new in self.mapping:
        print new
        u = u.replace(old, new)
    return u.encode('utf-8')
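
To see why the encoded data never contains the character being searched for, here is a Python 3 sketch, where encoded data is bytes rather than str. The function name and the single-entry mapping are hypothetical.

```python
# Python 3 sketch: do the replacement on the decoded text, not on the UTF-8 bytes.
mapping = [("\u00e0", "a")]  # hypothetical single replacement

def replace_in_utf8(data: bytes) -> bytes:
    text = data.decode("utf-8")      # bytes -> str
    for old, new in mapping:
        text = text.replace(old, new)
    return text.encode("utf-8")      # str -> bytes

encoded = "voil\u00e0".encode("utf-8")
print(encoded)                   # U+00E0 is stored as the two bytes c3 a0
print(replace_in_utf8(encoded))
```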

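Or, even better, keep everything as unicode rather than encoded str throughout your program, except at the very edges, so you don't have to worry about this stuff.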

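Finally, this is a very slow and complicated way to do the replacing, when strings (both str and unicode) have a built-in translate method that does exactly what you want. Note that only unicode.translate accepts a dict; str.translate in Python 2 takes a 256-character table instead.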

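Instead of building the table as a list of pairs of Unicode strings, build it as a dict mapping ordinals to replacements. The values passed to unicode.translate can be ordinals, Unicode strings, or None, and some of your replacements (such as kappa) are longer than one character, so use strings:

mapping = {}
for row in reader:
    # key: the code point of the character to replace
    # value: the replacement as a unicode string (assumes the CSV file is
    # ASCII or UTF-8; ord(row[1]) would fail on entries such as "kappa")
    mapping[ord(row[0].decode("unicode_escape"))] = row[1].decode("utf-8")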

Now, even with your encoding mess, the whole thing is just a one-liner:

def replace_UTF8(self, string):
    return string.decode('utf-8').translate(self.mapping).encode('utf-8')
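
Putting the pieces together, the translate approach might look like the following in Python 3. This is only a sketch under a few assumptions: the rows are copied from the sample in the question, an in-memory string stands in for the real CSV file, and load_table is a hypothetical helper name.

```python
import csv
import io

# Stand-in for the real CSV file; rows copied from the sample in the question.
# In this ordinary string literal, "\\u0021" is a backslash followed by u0021.
SAMPLE_CSV = '\\u0021,!\n\\u002c,","\n\\u03ba,kappa\n'

def load_table(csvfile):
    """Build a {code point: replacement string} dict for str.translate."""
    table = {}
    for row in csv.reader(csvfile):
        if len(row) < 2:
            continue  # skip blank or malformed lines
        # Turn the escaped text (e.g. \u03ba) into the actual character.
        char = row[0].encode("ascii").decode("unicode_escape")
        table[ord(char)] = row[1]
    return table

table = load_table(io.StringIO(SAMPLE_CSV))
print("\u03ba, ok!".translate(table))
```

Because translate makes a single pass over the string, a replacement's output is never rescanned by a later rule, which is not guaranteed with a chain of replace calls.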
