The question is how to recover a string whose bytes are written out as text inside another string. Here is what I mean:
>>> s1 = '\\xd0\\xb1' # But this is NOT bytes of s1! s1 should be 'б'!
'\\xd0\\xb1'
>>> s1[0]
'\\'
>>> len(s1) # The problem is here: I thought I would see (2), but:
8
>>> type(s1)
<class 'str'>
>>> type(s1[0])
<class 'str'>
>>> s1[0] == '\\'
True
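So how do I convert s1 to 'б' (the Cyrillic letter that '\xd0\xb1' really represents)? I have already asked a similar question here, but my mistake was misunderstanding the real representation of s1 (I thought each '\' was '\', not '\\').
Answer 0 (score: 3)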
>>> s1 = b'\xd0\xb1'
>>> s1.decode("utf8")
'б'
>>> len(s1)
2
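Answer 1 (score: 3)
Try the following code. Warning: it is only a proof of concept. When the text also contains characters that are not written as escape sequences, the replacement must be done in a more complex way (I will show it later if needed). See the comments below.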
import binascii
s1 = '\\xd0\\xb1'
print('s1 =', repr(s1), '=', list(s1)) # list() to emphasize what are the characters
s2 = s1.replace('\\x', '')
print('s2 =', repr(s2))
b = binascii.unhexlify(s2)
print('b =', repr(b), '=', list(b))
s3 = b.decode('utf8')
print('s3 =', ascii(s3))
with open('output.txt', 'w', encoding='utf-8') as f:
    f.write(s3)
It prints the following on the console:
c:\__Python\user\so20210201>py a.py
s1 = '\\xd0\\xb1' = ['\\', 'x', 'd', '0', '\\', 'x', 'b', '1']
s2 = 'd0b1'
b = b'\xd0\xb1' = [208, 177]
s3 = '\u0431'
It also writes the character to the output.txt file.
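The catch is that the question mixes Unicode escapes with escaped binary values. In other words, a Unicode string can contain a sequence that represents a binary value in some way; however, you cannot directly coerce that binary value into a Unicode string, because any Unicode character is really an abstract integer, and that integer can be represented in many ways (as a sequence of bytes).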
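To make the "one abstract integer, many byte representations" point concrete, here is a small sketch. The three encodings were picked only for illustration; the question itself involves only UTF-8.

```python
ch = 'б'

# The abstract integer (code point) behind the character.
code_point = ord(ch)

# Three different byte sequences that all stand for the same code point.
as_utf8 = ch.encode('utf-8')        # two bytes
as_utf16 = ch.encode('utf-16-le')   # two different bytes
as_cp1251 = ch.encode('cp1251')     # a single byte

print(hex(code_point), as_utf8, as_utf16, as_cp1251)
```

This is why the escaped bytes alone are not enough: you also have to know that they were meant as UTF-8.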
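If the Unicode string contains escape sequences such as \\n, you can use the 'unicode_escape' codec with bytes.decode(). In this case, however, you need to decode from the ASCII escape sequences first and then decode from UTF-8.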
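A minimal sketch of that two-step decoding, assuming the input is pure ASCII and contains only \xNN style escapes. The helper name decode_escaped_utf8 is made up for this example. The intermediate latin-1 step is one way to turn the code points 0..255 produced by 'unicode_escape' back into the same byte values.

```python
def decode_escaped_utf8(s):
    """Turn text like '\\xd0\\xb1' into the UTF-8 character it spells out."""
    # Step 1: the text is ASCII, so get its bytes and let the
    # 'unicode_escape' codec interpret the backslash sequences.
    # Each \xNN becomes one character with code point 0xNN.
    unescaped = s.encode('ascii').decode('unicode_escape')
    # Step 2: map those code points (0..255) back to raw bytes via
    # latin-1, then decode the bytes as UTF-8.
    return unescaped.encode('latin-1').decode('utf-8')


print(ascii(decode_escaped_utf8('\\xd0\\xb1whatever')))  # '\u0431whatever'
```

Note that 'unicode_escape' also interprets other escapes (\n, \uXXXX and so on), so this is not a drop-in solution if the text can contain backslashes with another meaning.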
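Update: here is a function that converts your string when it also contains other ASCII characters (that is, not only escape sequences). The function uses a finite automaton, which may look too complex at first (it is really just verbose).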
def userDecode(s):
    status = 0
    lst = []    # result as list of bytes as ints
    xx = None   # variable for one byte escape conversion
    for c in s: # unicode character
        print(status, ' c ==', c)   ## just for debugging
        if status == 0:
            if c == '\\':
                status = 1  # escape sequence for a byte starts
            else:
                lst.append(ord(c))  # convert to integer
        elif status == 1:   # x expected
            assert(c == 'x')
            status = 2
        elif status == 2:   # first nibble expected
            xx = c
            status = 3
        elif status == 3:   # second nibble expected
            xx += c
            lst.append(int(xx, 16)) # this is a hex representation of int
            status = 0

    # Construct the bytes from the ordinal values in the list, and decode
    # it as UTF-8 string.
    return bytes(lst).decode('utf-8')

if __name__ == '__main__':
    s = userDecode('\\xd0\\xb1whatever')
    print(ascii(s))     # cannot be displayed on console that does not support unicode
    with open('output.txt', 'w', encoding='utf-8') as f:
        f.write(s)
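Also look inside the generated file. The debug print can be removed once you are done; with it in place, the console shows the following: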
c:\__Python\user\so20210201>b.py
0 c == \
1 c == x
2 c == d
3 c == 0
0 c == \
1 c == x
2 c == b
3 c == 1
0 c == w
0 c == h
0 c == a
0 c == t
0 c == e
0 c == v
0 c == e
0 c == r
'\u0431whatever'
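For comparison, the same conversion can probably be written more compactly with a regular expression instead of the hand-written automaton. This is only a sketch under the same assumptions as above (ASCII input, only \xNN escapes); decode_with_regex and _HEX_ESCAPE are names invented for this example.

```python
import re

# Matches a backslash, an 'x', and exactly two hex digits (on bytes).
_HEX_ESCAPE = re.compile(rb'\\x([0-9a-fA-F]{2})')


def decode_with_regex(s):
    """Replace every \\xNN escape by the byte 0xNN, then decode as UTF-8."""
    raw = s.encode('ascii')  # the escaped text itself is plain ASCII
    # Each match is replaced by the single byte its hex digits denote.
    unescaped = _HEX_ESCAPE.sub(
        lambda m: bytes([int(m.group(1), 16)]), raw)
    return unescaped.decode('utf-8')


print(ascii(decode_with_regex('\\xd0\\xb1whatever')))  # '\u0431whatever'
```

Unlike the automaton, which asserts that every backslash is followed by 'x', this version silently leaves any other backslash sequence untouched.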