I need to convert strings like this one (where Unicode characters are stored in a special way):
Ce correspondant a cherch=C3=A9 =C3=A0 vous joindre
...to a valid UTF-8 string, like this:
Ce correspondant a cherché à vous joindre
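I wrote code that extracts the numeric UTF-8 sequence using this simple syntax (=XX=XX, where each X is a hexadecimal digit), but I am stuck when I try to convert this sequence to a printable character: it is a UTF-8 sequence, not a Unicode code point, so the chr() built-in is of no use here (or at least, not on its own).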
I need to convert this example value:
utf8_sequence = 0xC3A9
to this string:
return_value = 'é'
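The Unicode code point of this letter is U+00E9, but I don't know how to get from a given UTF-8 sequence to the Unicode code point that can be used with chr().
Here is my code, with comments showing where I am stuck: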
#!/usr/bin/python3
# coding: utf-8
import re

test_string = 'Ce correspondant a cherch=C3=A9 =C3=A0 vous joindre'

# SHOULD convert a string like '=C3=A9' to the equivalent Unicode
# char, in this example 'é'.
def vmg_to_unicode(in_string):
    whole_sequence = 0  # Stores the numerical utf-8 sequence
    in_length = len(in_string)
    num_bytes = in_length // 3  # Number of bytes
    bit_weight = num_bytes << 3  # Weight of char in bits (big-endian)
    for i in range(0, in_length, 3):  # For each '=XX' group:
        bit_weight -= 8
        # Extract the hex number inside '=XX':
        hex_number = in_string[i+1:][:2]
        # Build the utf-8 sequence:
        whole_sequence += int(hex_number, 16) << bit_weight
    # At this point, whole_sequence contains for example 0xC3A9
    # The following doesn't work, chr() expects a Unicode code point:
    # return chr(whole_sequence)
    # HOW CAN I RETURN A STRING LIKE 'é' THERE?
    # Only for debug:
    return '[0x{:X}]'.format(whole_sequence)

# In a whole string, convert all occurrences of patterns like '=C3=A9'
# to their equivalent Unicode chars.
def vmg_transform(in_string):
    # Get all occurrences:
    results = ( m for m in re.finditer('(=[0-9A-Fa-f]{2})+', in_string) )
    index, out = (0, '')
    for result in results:
        # Concat the unchanged text:
        out += in_string[index:result.start()]
        # Concat the replacement of the matched pattern:
        out += vmg_to_unicode(result.group(0))
        index = result.end()
    # Concat the end of the unchanged string:
    out += in_string[index:]
    return out

if __name__ == '__main__':
    print('In : "{}"'.format(test_string))
    print('Out : "{}"'.format(vmg_transform(test_string)))
Current output:
In : "Ce correspondant a cherch=C3=A9 =C3=A0 vous joindre"
Out : "Ce correspondant a cherch[0xC3A9] [0xC3A0] vous joindre"
Expected output:
In : "Ce correspondant a cherch=C3=A9 =C3=A0 vous joindre"
Out : "Ce correspondant a cherché à vous joindre"
Answer 0 (score: 2)
Collect the bytes in a bytearray, convert it to bytes, and decode it as UTF-8. Here is the part of the code to adapt:
s = bytearray()  # Accumulates the raw utf-8 bytes
for i in range(0, in_length, 3):  # For each '=XX' group:
    # Extract the hex number inside '=XX':
    hex_number = in_string[i+1:][:2]
    # Append the byte to the utf-8 sequence:
    s.append(int(hex_number, 16))
# At this point, s contains for example the two bytes 0xC3 0xA9.
# Decoding them as utf-8 gives the string, for example 'é':
return bytes(s).decode("utf-8")
Result:
In : "Ce correspondant a cherch=C3=A9 =C3=A0 vous joindre"
Out : "Ce correspondant a cherché à vous joindre"
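For reference, the same idea can be condensed into a short, self-contained sketch. It is not the asker's original code: the helper name utf8_int_to_str is hypothetical and only illustrates how an integer such as 0xC3A9 could be turned back into bytes, while vmg_transform is rewritten here on top of re.sub and bytes.fromhex, on the assumption that every run of =XX groups forms a complete, valid UTF-8 sequence.

```python
import re


def utf8_int_to_str(sequence):
    """Decode an integer such as 0xC3A9 that packs UTF-8 bytes, most
    significant byte first (hypothetical helper, for illustration only)."""
    # Number of bytes needed to hold the integer (at least one).
    length = (sequence.bit_length() + 7) // 8 or 1
    return sequence.to_bytes(length, 'big').decode('utf-8')


def vmg_transform(in_string):
    """Replace each run of '=XX' groups with the text it encodes."""
    def decode_run(match):
        # '=C3=A9' -> 'C3A9' -> the bytes 0xC3 0xA9 -> 'é'
        hex_digits = match.group(0).replace('=', '')
        return bytes.fromhex(hex_digits).decode('utf-8')

    return re.sub(r'(?:=[0-9A-Fa-f]{2})+', decode_run, in_string)


if __name__ == '__main__':
    print(utf8_int_to_str(0xC3A9))
    print(vmg_transform('Ce correspondant a cherch=C3=A9 =C3=A0 vous joindre'))
```

Both functions raise UnicodeDecodeError if the bytes are not valid UTF-8, so input with truncated or malformed sequences would need extra handling. The =XX notation also looks like the quoted-printable encoding; if that is indeed what the input uses, the standard library's quopri.decodestring (applied to bytes, followed by .decode('utf-8')) may be worth trying instead of a hand-written parser.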