Question

I have the following error message generated by Apache:

\xed\xe5 \xff\xb\xff\xe5\xf2\xf1\xff \xef\xf0\xe8\xeb\xee\xe6\xe5\xed\xe8\xe5\xec

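I found out that the \x escape sequence means the next two characters are a UTF hexadecimal code. For example, the word HELLO can be encoded as \x48\x45\x4C\x4C\x4F. But I can't figure out how to decode the string I have. I searched UTF encoding tables but found no characters that match my codes. I don't even know whether I should be looking for a one-byte or a two-byte encoding.

I'm on a PC with a Russian locale, if that helps.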

Answer 1

It looks like your string is encoded in cp1251 (Windows-1251). In Python 2:

s.decode('string_escape').decode('cp1251')

prints something that looks meaningful (except that \xb is not a valid escape, perhaps a copy-paste error?):

s = r'\xed\xe5 \xff?xb\xff\xe5\xf2\xf1\xff \xef\xf0\xe8\xeb\xee\xe6\xe5\xed\xe8\xe5\xec'

s = s.decode('string_escape').decode('cp1251')
#не я?xbяется приложением
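
The snippets above are Python 2; `str.decode('string_escape')` no longer exists in Python 3. A minimal Python 3 sketch of the same two steps follows, assuming the text contains only ASCII characters and well-formed escapes. The helper name `decode_escaped` is made up for illustration.

```python
def decode_escaped(text, encoding="cp1251"):
    """Turn a literal string like r'\\xed\\xe5' into readable text."""
    # Step 1: interpret the backslash escapes; each \xNN becomes code point U+00NN.
    unescaped = text.encode("ascii").decode("unicode_escape")
    # Step 2: Latin-1 maps U+0000..U+00FF one-to-one onto bytes 0x00..0xFF,
    # which recovers the raw bytes that Apache logged.
    raw = unescaped.encode("latin-1")
    # Step 3: decode those bytes with the suspected single-byte code page.
    return raw.decode(encoding)


s = r'\xed\xe5 \xff?xb\xff\xe5\xf2\xf1\xff \xef\xf0\xe8\xeb\xee\xe6\xe5\xed\xe8\xe5\xec'
print(decode_escaped(s))  # не я?xbяется приложением
```

As in the answer's version, the broken `\xb` has been replaced with `?xb` so that the escape decoder does not reject the input.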

"I don't even know whether I should be looking for a one-byte or a two-byte encoding."

This is where chardet comes to the rescue:

import chardet

s = r'\xed\xe5 \xff?xb\xff\xe5\xf2\xf1\xff \xef\xf0\xe8\xeb\xee\xe6\xe5\xed\xe8\xe5\xec'

print chardet.detect(s.decode('string_escape'))
# {'confidence': 0.99, 'encoding': 'windows-1251'}
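
chardet is a third-party package. If it is not installed, a cruder check that needs only the standard library is to try a handful of candidate codecs and see which ones give readable text. This is a different technique from chardet's statistical detection, and both the candidate list and the helper name `try_codecs` are assumptions made for this sketch.

```python
def try_codecs(raw, candidates=("utf-8", "cp1251", "koi8-r", "cp866")):
    """Decode raw bytes with each candidate codec; None marks a failed decode."""
    results = {}
    for name in candidates:
        try:
            results[name] = raw.decode(name)
        except UnicodeDecodeError:
            results[name] = None
    return results


# The logged bytes, with the broken \xb escape replaced by the literal text ?xb.
raw = b'\xed\xe5 \xff?xb\xff\xe5\xf2\xf1\xff \xef\xf0\xe8\xeb\xee\xe6\xe5\xed\xe8\xe5\xec'

for name, text in try_codecs(raw).items():
    print(name, '->', text)
```

Codecs that fail outright (here UTF-8) can be ruled out immediately; among the rest, the correct one is whichever yields real words, which still has to be judged by eye.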

If you don't know Python, you can also use JavaScript, for example: http://jsfiddle.net/L3Z4b/

Answer 2

Use the unicode-escape or string-escape codec:

>>> r'\x48\x45\x4C\x4C\x4F'.decode('unicode-escape')
u'HELLO'
>>> r'\x48\x45\x4C\x4C\x4F'.decode('string-escape')
'HELLO'
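
The string-escape codec is Python 2 only. As a rough Python 3 stand-in, the sketch below replaces each `\xNN` with the single byte it names using a regular expression. The function name `unescape_hex` is hypothetical, and the sketch handles only `\xNN` escapes, not `\n`, `\t` and the like. One side effect is that a malformed escape such as the `\xb` in the question is left untouched instead of raising an error.

```python
import re

# Matches a backslash, 'x', and exactly two hex digits.
HEX_ESCAPE = re.compile(rb'\\x([0-9a-fA-F]{2})')


def unescape_hex(text):
    """Convert literal \\xNN sequences in an ASCII string to raw bytes."""
    data = text.encode('ascii')
    # Each match becomes one byte whose value is the two hex digits.
    return HEX_ESCAPE.sub(lambda m: bytes([int(m.group(1), 16)]), data)


print(unescape_hex(r'\x48\x45\x4C\x4C\x4F'))  # b'HELLO'
print(unescape_hex(r'\xed\xe5 \xff\xb\xff').decode('cp1251'))
```

Each escape yields exactly one byte, so the result still has to be decoded with the right code page (cp1251 in this thread) to become text.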

Decoding UTF-8 characters given in hexadecimal
