Question

I have a file that is mostly UTF-8, but some Windows-1252 characters have found their way into it.

I created a table to map the Windows-1252 (cp1252) characters to their Unicode counterparts, and would like to use it to fix the mis-encoded characters, e.g.

cp1252_to_unicode = {
    "\x85": u'\u2026', # …
    "\x91": u'\u2018', # ‘
    "\x92": u'\u2019', # ’
    "\x93": u'\u201c', # “
    "\x94": u'\u201d', # ”
    "\x97": u'\u2014'  # —
}

for l in open('file.txt'):
    for c, u in cp1252_to_unicode.items():
        l = l.replace(c, u)

But attempting the replacement this way raises a UnicodeDecodeError, e.g.:

"\x85".replace("\x85", u'\u2026')
UnicodeDecodeError: 'ascii' codec can't decode byte 0x85 in position 0: ordinal not in range(128)

Any ideas on how to deal with this?

Answer 1

If you try to decode this string as utf-8, as you already know, you will get a "UnicodeDecode" error, because these spurious cp1252 characters are invalid utf-8.

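However, Python codecs allow you to register a callback to handle encoding/decoding errors, using the codecs.register_error function. The callback gets the UnicodeDecodeError as its parameter, so you can write a handler that decodes the offending data as "cp1252" and then continues decoding the rest of the string as utf-8.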
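To make the handler contract concrete, here is a minimal Python 3 sketch. The handler name `question_mark` and the `seen` list are made up for illustration. As far as I know, a handler registered with codecs.register_error receives the exception, whose `object`, `start` and `end` attributes describe the undecodable bytes, and returns a tuple of the replacement text and the position at which decoding should resume.

```python
import codecs

seen = []  # (start, end) of every span the decoder reported; illustration only

def question_mark(error):
    # Hypothetical handler: substitutes "?" for each undecodable span.
    if not isinstance(error, UnicodeDecodeError):
        raise error  # this sketch only deals with decoding errors
    seen.append((error.start, error.end))
    # Return the replacement text and the index where decoding resumes.
    return "?", error.end

codecs.register_error("question_mark", question_mark)

data = "maçã ".encode("utf-8") + "maçã".encode("cp1252")
text = data.decode("utf-8", "question_mark")
print(text)  # maçã ma??
print(seen)  # one entry per call to the handler
```

Swapping the "?" for a cp1252 decode of the reported bytes turns this into the kind of mixed decoder described above.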

In my utf-8 terminal, I can build a mixed, incorrectly encoded string like this:

>>> a = u"maçã ".encode("utf-8") + u"maçã ".encode("cp1252")
>>> print a
maçã ma�� 
>>> a.decode("utf-8")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.6/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 9-11: invalid data

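I wrote the said callback function here, and found a catch: even if you advance the position from which to resume decoding by 1, so that it starts on the next character, if that next character is also not utf-8 and out of range(128), the error is raised again at the first out of range(128) character. In other words, the decoding "walks back" when consecutive non-ascii, non-utf-8 characters are found.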

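The workaround is to keep a state variable in the error handler that detects this "walking back" and resumes decoding from where the last call left off. In this short example I implemented it as a global variable (it has to be manually reset to "-1" before each call to the decoder):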

import codecs

last_position = -1

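def mixed_decoder(unicode_error):
    global last_position
    # The byte string being decoded (same value as unicode_error[1]).
    string = unicode_error.object
    position = unicode_error.start
    # If the decoder "walked back", resume just after the last byte we handled.
    if position <= last_position:
        position = last_position + 1
    last_position = position
    # Decode the single offending byte as cp1252 instead.
    new_char = string[position].decode("cp1252")
    #new_char = u"_"
    return new_char, position + 1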

codecs.register_error("mixed", mixed_decoder)

And on the console:

>>> a = u"maçã ".encode("utf-8") + u"maçã ".encode("cp1252")
>>> last_position = -1
>>> print a.decode("utf-8", "mixed")
maçã maçã

Answer 2

Thanks to jsbueno, plus a lot of Google searching and trial and error, I solved it this way.

#The following works very well but it does not allow for any attempts to FIX the data.
xmlText = unicode(xmlText, errors='replace').replace(u"\uFFFD", "?")

This version allows a limited opportunity to repair invalid characters. Unknown characters are replaced with a safe value.

import codecs    
replacement = {
   '85' : '...',           # u'\u2026' ... character.
   '96' : '-',             # u'\u2013' en-dash
   '97' : '-',             # u'\u2014' em-dash
   '91' : "'",             # u'\u2018' left single quote
   '92' : "'",             # u'\u2019' right single quote
   '93' : '"',             # u'\u201C' left double quote
   '94' : '"',             # u'\u201D' right double quote
   '95' : "*"              # u'\u2022' bullet
}

#This is more complex but allows for the data to be fixed.
def mixed_decoder(unicodeError):
    errStr = unicodeError[1]
    errLen = unicodeError.end - unicodeError.start
    nextPosition = unicodeError.start + errLen
    errHex = errStr[unicodeError.start:unicodeError.end].encode('hex')
    if errHex in replacement:
        return u'%s' % replacement[errHex], nextPosition
    return u'%s' % errHex, nextPosition   # Comment this line out to get a question mark
    return u'?', nextPosition

codecs.register_error("mixed", mixed_decoder)

xmlText = xmlText.decode("utf-8", "mixed")

Basically I attempt to decode it as utf8. Any bytes that fail are converted to hex, so I can display them or look them up in a table of my own.

This is not pretty, but it does allow me to make sense of messed-up data.
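The code above is Python 2 (`unicode()`, `.encode('hex')`, indexing the exception). A rough Python 3 equivalent might look like the sketch below. The names `mixed_decoder_py3`, `mixed_py3` and the sample bytes are invented for this example, and the table is the same one as above written more compactly.

```python
import codecs

# cp1252 byte (as two hex digits) -> plain ASCII stand-in, as in the table above.
replacement = {
    "85": "...", "96": "-", "97": "-",
    "91": "'", "92": "'", "93": '"', "94": '"', "95": "*",
}

def mixed_decoder_py3(error):
    # Hypothetical Python 3 port; the name is made up for this sketch.
    if not isinstance(error, UnicodeDecodeError):
        raise error
    bad = error.object[error.start:error.end]
    err_hex = bad.hex()  # bytes.hex() stands in for .encode('hex')
    # Known bytes get their stand-in; anything else is shown as hex digits.
    return replacement.get(err_hex, err_hex), error.end

codecs.register_error("mixed_py3", mixed_decoder_py3)

xml_bytes = b"caf\xc3\xa9 \x93quoted\x94 \x85 \x81"
result = xml_bytes.decode("utf-8", "mixed_py3")
print(result)  # café "quoted" ... 81
```

Returning the hex digits for unknown bytes keeps them visible in the output, which matches the "look it up in a table of my own" workflow described above.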

Answer 3

Good solution by @jsbueno, but there is no need for the global variable last_position, see:

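def mixed_decoder(error: UnicodeError) -> (str, int):
    # Decode the bytes that are invalid UTF-8 as cp1252 instead,
    # then resume right after them so no byte is decoded twice.
    bs: bytes = error.object[error.start: error.end]
    return bs.decode("cp1252"), error.end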

import codecs
codecs.register_error("mixed", mixed_decoder)

a = "maçã".encode("utf-8") + "maçã".encode("cp1252")
# a = b"ma\xc3\xa7\xc3\xa3ma\xe7\xe3"

s = a.decode("utf-8", "mixed")
# s = "maçãmaçã"
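To tie this back to the original question about a file, one possible way to apply such a handler is to read the file as bytes and decode it explicitly. The sketch below is self-contained, so it defines its own handler under the made-up name `cp1252_fallback` and writes a small temporary file standing in for file.txt.

```python
import codecs
import os
import tempfile

def cp1252_fallback(error):
    # Hypothetical name; decodes the offending bytes as cp1252 instead.
    if not isinstance(error, UnicodeDecodeError):
        raise error
    bad = error.object[error.start:error.end]
    return bad.decode("cp1252"), error.end

codecs.register_error("cp1252_fallback", cp1252_fallback)

# Build a small mixed-encoding file so the sketch runs on its own.
with tempfile.NamedTemporaryFile(delete=False, suffix=".txt") as f:
    f.write("UTF-8: \u201cok\u201d\n".encode("utf-8"))
    f.write("cp1252: \u201cok\u201d \u2026\n".encode("cp1252"))
    path = f.name

try:
    with open(path, "rb") as f:  # read raw bytes, then decode explicitly
        lines = f.read().decode("utf-8", "cp1252_fallback").splitlines()
finally:
    os.remove(path)

print(lines)
```

One limitation to keep in mind: cp1252 bytes that happen to form a valid UTF-8 sequence never reach the handler and are decoded as UTF-8. The five byte values that cp1252 leaves undefined (0x81, 0x8D, 0x8F, 0x90, 0x9D) would also make `bad.decode("cp1252")` raise.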

Answer 4

This is commonly called Mojibake.

There's a nice Python library called ftfy that can fix these problems for you.

Example:

>>> from ftfy import fix_text
>>> fix_text("Ð¨ÐµÐ¿Ð¾Ñ (Ð½Ð°Ð¿Ð¾Ð¼Ð¸Ð½Ð°Ð»ÐºÐ¸)")
'Шепот (напоминалки)'
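For context, garbled text like the input above typically arises when UTF-8 bytes are decoded with the wrong codec. The standard-library sketch below reproduces that by hand and then reverses it. It only illustrates the phenomenon and is not a description of how ftfy works internally.

```python
original = "Шепот"

# Encode as UTF-8, then decode the resulting bytes with the wrong codec.
garbled = original.encode("utf-8").decode("cp1252")
print(garbled)

# Reversing the two steps recovers the text, provided every byte survived.
restored = garbled.encode("cp1252").decode("utf-8")
print(restored)
```

The manual round trip only works when you already know which wrong codec was used. A library like ftfy is useful when you don't know that in advance.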

Answer 5

Just ran into this today, so here is my problem and my own solution:

# Raw string (assumed): the input holds literal "\xNN" escape sequences.
original_string = r'Notifica\xe7\xe3o de Emiss\xe3o de Nota Fiscal Eletr\xf4nica.'

def mixed_decoding(s):
    # Replace each literal "\xNN" escape sequence with the character it names.
    output = ''
    ii = 0
    while ii < len(s):
        if s[ii:ii+2] == '\\x':
            # Let the unicode-escape codec interpret the 4-character escape.
            output += s[ii:ii+4].encode('ascii').decode('unicode-escape')
            ii += 4
        else:
            output += s[ii]
            ii += 1
    print(output)
    return output

decoded_string = mixed_decoding(original_string)

Now prints:
>>> Notificação de Emissão de Nota Fiscal Eletrônica.
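The same transformation can probably be written more compactly with a regular expression. This is only a sketch: the helper name `decode_hex_escapes` is made up, and it assumes the input really contains literal backslash-x escapes, with each "\xNN" meaning the Unicode code point NN (which is also how the unicode-escape codec reads it).

```python
import re

def decode_hex_escapes(s):
    # Hypothetical helper: turns each literal "\xNN" into the code point NN.
    return re.sub(r"\\x([0-9A-Fa-f]{2})",
                  lambda m: chr(int(m.group(1), 16)), s)

raw = r"Notifica\xe7\xe3o de Emiss\xe3o de Nota Fiscal Eletr\xf4nica."
decoded = decode_hex_escapes(raw)
print(decoded)  # Notificação de Emissão de Nota Fiscal Eletrônica.
```

Text without any such escapes passes through unchanged.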

Python - Handling mixed-encoding files
