Question

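Following a question and answer about the Gmail API reference, I can use the binascii package to decode UTF-8 strings whose bytes are written as hex pairs prefixed with '_'.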

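import binascii

def toUtf(r):
    try:
        # Drop the '_' separators, then turn the hex digits into bytes
        rhexonly = r.replace('_', '')
        rbytes = binascii.unhexlify(rhexonly)
        rtext = rbytes.decode('utf-8')
    except (TypeError, ValueError):
        # Bad hex raises TypeError on Python 2 and binascii.Error
        # (a ValueError subclass) on Python 3; return the input unchanged
        rtext = r
    return rtext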

This code works only when the whole string is hex-encoded UTF-8:

r = '_ed_8e_b8'
print(toUtf(r))
>> 편

However, this code does not work when the string also contains plain ASCII characters. The ASCII can appear anywhere in the string.

r = '_2f119_ed_8e_b8'
print(toUtf(r))
>> doesn't work - _2f119_ed_8e_b8
>> this should be '/119편'

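Perhaps I could use a regular expression to extract the UTF-8 parts and the ASCII parts and reassemble them after conversion, but I wonder whether there is a simpler way to do the conversion. Is there a good way to do this?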

Answer 1

Quite straightforward with re.sub:

import re

# One or more _XX pairs; only hex digits qualify (re.I below covers A-F)
bytegroup = r'(_[0-9a-f]{2})+'

def replacer(match):
    return toUtf(match.group())

rtext = re.sub(bytegroup, replacer, r, flags=re.I)

Answer 2

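That is some truly terrible input you're getting. It is still fixable, though. First, split the string so that the hex-encoded runs are separated from the non-"encoded" ASCII: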

import re

r = '_2f119_ed_8e_b8'

# Split so you have even entries in the list as ASCII, odd as hex encodings
rsplit = re.split(r'((?:_[0-9a-fA-F]{2})+)', r)   # ['', '_2f', '119', '_ed_8e_b8', '']

# Process the hex encoded UTF-8 with your existing function, leaving
# ASCII untouched
rsplit[1::2] = map(toUtf, rsplit[1::2])  # ['', '/', '119', '편', '']

rtext = ''.join(rsplit)  # '/119편'

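The above is a verbose version that shows the individual steps, but as chthonicdaemon's answer points out, it can be shortened dramatically. You use the same regex with re.sub instead of re.split, and pass a function that performs the replacement instead of a replacement pattern string: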

# One-liner equivalent to the above with no intermediate lists
rtext = re.sub(r'(?:_[0-9a-f]{2})+', lambda m: toUtf(m.group()), r, flags=re.I)

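You can wrap this in a function of its own, so that you have one function that handles purely hex-encoded UTF-8, and a second, general function that uses the first as part of handling mixed non-encoded ASCII and hex-encoded UTF-8 data.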
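A minimal sketch of that two-function split, assuming Python 3. The names `HEX_RUN`, `decode_hex_utf8`, and `decode_mixed` are made up for illustration and do not appear in the question; the pattern is the same one used with re.sub above.

```python
import binascii
import re

# Hypothetical names; the pattern matches one or more _XX hex pairs.
HEX_RUN = re.compile(r'(?:_[0-9a-f]{2})+', flags=re.I)

def decode_hex_utf8(run):
    """Decode a pure hex run such as '_ed_8e_b8'; return it unchanged on failure."""
    try:
        return binascii.unhexlify(run.replace('_', '')).decode('utf-8')
    except ValueError:
        # In Python 3, binascii.Error (bad hex) and UnicodeDecodeError
        # (bad UTF-8) are both ValueError subclasses.
        return run

def decode_mixed(text):
    """Decode every hex-encoded run in text and leave everything else alone."""
    return HEX_RUN.sub(lambda m: decode_hex_utf8(m.group()), text)

print(decode_mixed('_2f119_ed_8e_b8'))  # /119편
```

Compiling the pattern once keeps the general function to a single line and makes the heuristic easy to adjust in one place.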

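Note that none of this works correctly if the non-encoded ASCII can legitimately contain _. The regex tries to be as targeted as possible, but the problem is inherent: no matter how precisely you tune the heuristic, some ASCII data will be mistaken for encoded UTF-8 data.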
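To make the caveat concrete, here is a small sketch, assuming Python 3. The helper name `decode_runs` and the sample strings are invented for illustration. It applies the same regex heuristic to plain ASCII that happens to contain an underscore followed by two hex digits.

```python
import binascii
import re

def decode_runs(text):
    # Same heuristic as above: any run of _XX hex pairs is treated as UTF-8 bytes.
    def replace(match):
        raw = binascii.unhexlify(match.group().replace('_', ''))
        try:
            return raw.decode('utf-8')
        except UnicodeDecodeError:
            return match.group()
    return re.sub(r'(?:_[0-9a-f]{2})+', replace, text, flags=re.I)

print(decode_runs('report_2019'))  # report 19  (the literal "_20" became a space)
print(decode_runs('user_id'))      # user_id    ("id" is not a hex pair, so no match)
```

Whether this matters depends on the data: if literal underscores never precede two hex digits in the input, the heuristic is safe.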

UTF-8 decoding with ASCII codes in Python
