How to handle an unknown encoding

Date: 2016-09-13 17:00:24

Tags: python python-2.7 encoding

I have run into a problem with some Python scripts that need to open files with different encodings.

I usually use this:

with open(path_to_file, 'r') as f:
    first_line = f.readline()

This works well when the file is properly encoded.

But sometimes it does not work. For example, with this file, I get this:

In [22]: with codecs.open(filename, 'r') as f:
    ...:    a = f.readline()
    ...:    print(a)
    ...:    print(repr(a))
    ...:     
��Test for StackOverlow

'\xff\xfeT\x00e\x00s\x00t\x00 \x00f\x00o\x00r\x00 \x00S\x00t\x00a\x00c\x00k\x00O\x00v\x00e\x00r\x00l\x00o\x00w\x00\r\x00\n'

I would like to search for some content in these lines. Unfortunately, I cannot do it this way:

In [24]: "Test" in a
Out[24]: False

I found many questions that deal with the same kind of problem:

  1. Unicode (UTF-8) reading and writing to files in Python
  2. UnicodeDecodeError: 'utf8' codec can't decode byte 0xa5 in position 0: invalid start byte
  3. https://softwareengineering.stackexchange.com/questions/187169/how-to-detect-the-encoding-of-a-file
  4. how can i escape '\xff\xfe' to a readable string

But I could not manage to decode the file correctly...

    Using codecs.open():

    In [17]: with codecs.open(filename, 'r', "utf-8") as f:
        a = f.readline()
        print(a)
       ....:     
    ---------------------------------------------------------------------------
    UnicodeDecodeError                        Traceback (most recent call last)
    <ipython-input-17-0e72208eaac2> in <module>()
          1 with codecs.open(filename, 'r', "utf-8") as f:
    ----> 2     a = f.readline()
          3     print(a)
          4 
    
    /usr/lib/python2.7/codecs.pyc in readline(self, size)
        688     def readline(self, size=None):
        689 
    --> 690         return self.reader.readline(size)
        691 
        692     def readlines(self, sizehint=None):
    
    /usr/lib/python2.7/codecs.pyc in readline(self, size, keepends)
        543         # If size is given, we call read() only once
        544         while True:
    --> 545             data = self.read(readsize, firstline=True)
        546             if data:
        547                 # If we're at a "\r" read one extra character (which might
    
    /usr/lib/python2.7/codecs.pyc in read(self, size, chars, firstline)
        490             data = self.bytebuffer + newdata
        491             try:
    --> 492                 newchars, decodedbytes = self.decode(data, self.errors)
        493             except UnicodeDecodeError, exc:
        494                 if firstline:
    
    UnicodeDecodeError: 'utf8' codec can't decode byte 0xff in position 0: invalid start byte
    
    With encode('utf-8'):

    In [18]: with codecs.open(filename, 'r') as f:
        a = f.readline()
        print(a)
       ....:     a.encode('utf-8')
       ....:     print(a)
       ....:     
    ��Test for StackOverlow
    
    ---------------------------------------------------------------------------
    UnicodeDecodeError                        Traceback (most recent call last)
    <ipython-input-18-7facc05b9cb1> in <module>()
          2     a = f.readline()
          3     print(a)
    ----> 4     a.encode('utf-8')
          5     print(a)
          6 
    
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xff in position 0: ordinal not in range(128)
    

    I found a way to change the file encoding automatically using Vim:

    system("vim '+set fileencoding=utf-8' '+wq' %s" % path_to_file)
    

    But I would like to do this without using Vim...

    Any help would be appreciated.

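2 answers:

Answer 0 (score: 5)

It looks like this is utf-16-le (UTF-16 little-endian), but you are missing the final \x00. Reading in byte mode splits the line at the \n byte, so the second byte of the two-byte UTF-16 newline is left over for the next line.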

>>> s = '\xff\xfeT\x00e\x00s\x00t\x00 \x00f\x00o\x00r\x00 \x00S\x00t\x00a\x00c\x00k\x00O\x00v\x00e\x00r\x00l\x00o\x00w\x00\r\x00\n'
>>> s.decode('utf-16-le') # creates error
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python26\lib\encodings\utf_16_le.py", line 16, in decode
    return codecs.utf_16_le_decode(input, errors, True)
UnicodeDecodeError: 'utf16' codec can't decode byte 0x0a in position 46: truncated data
>>> (s+"\x00").decode("utf-16-le") # TADA!!!!
u'\ufeffTest for StackOverlow\r\n'
>>>
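
Rather than patching the missing byte by hand, the decoding can be left to the file object, so that lines are split after decoding instead of before. The following is only a minimal sketch, assuming the file really is UTF-16 with a BOM (as the leading \xff\xfe suggests). The helper name read_first_line and the temporary demo file are made up for illustration. io.open is used because it exists in both Python 2.7 and Python 3.

```python
import codecs
import io
import os
import tempfile


def read_first_line(path, encoding="utf-16"):
    """Return the first line of a text file, decoded to unicode.

    The generic "utf-16" codec reads the BOM to pick the byte order
    and strips it from the decoded text.
    """
    with io.open(path, "r", encoding=encoding) as f:
        return f.readline()


# Build a small UTF-16-LE file with a BOM, similar to the one in the question.
sample = u"Test for StackOverlow\r\nsecond line\r\n"
fd, demo_path = tempfile.mkstemp()
os.close(fd)
with io.open(demo_path, "wb") as f:
    f.write(codecs.BOM_UTF16_LE + sample.encode("utf-16-le"))

first_line = read_first_line(demo_path)
os.remove(demo_path)

print(repr(first_line))
print(u"Test" in first_line)
```

Because the text is decoded before it is split into lines, no stray \x00 is left behind and the substring test from the question succeeds.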

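Answer 1 (score: 4)

It looks like you need to detect the encoding of the input file. The chardet library mentioned in the answers to this question may help (but note that fully reliable encoding detection is not possible).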
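
For files that start with a byte order mark, such as the one in the question, a BOM check with the standard library may already be enough. This is a hedged sketch, not a general detector: guess_encoding_from_bom is a hypothetical helper, the utf-8 fallback is an assumption, and files without a BOM still need a statistical guess (for example from chardet) or outside knowledge.

```python
import codecs

# Check the longer BOMs first: the UTF-32-LE BOM begins with the
# same two bytes as the UTF-16-LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def guess_encoding_from_bom(raw, default="utf-8"):
    """Return a codec name based on a leading BOM, else the default.

    The default is an assumption, not a detection result.
    """
    for bom, name in _BOMS:
        if raw.startswith(bom):
            return name
    return default


# First bytes of the line shown in the question.
raw = b"\xff\xfeT\x00e\x00s\x00t\x00"
encoding = guess_encoding_from_bom(raw)
text = raw.decode(encoding)

print(encoding)
print(repr(text))
```

The returned names are the BOM-aware codecs ("utf-16", "utf-32", "utf-8-sig"), so the BOM is dropped during decoding instead of showing up as u'\ufeff' at the start of the text.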

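You can then write the file out in a known encoding. When working with Unicode, remember that it must be encoded into a suitable byte stream before it is communicated outside the process. Decode on input, then encode on output.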
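
The decode-on-input, encode-on-output rule can be sketched as a small conversion function that does what the Vim command in the question does. This is an illustration under assumptions: convert_to_utf8 is a made-up name, the source encoding must already be known or guessed, and the whole file is read into memory, which is fine only for small files.

```python
import codecs
import io
import os
import shutil
import tempfile


def convert_to_utf8(src_path, dst_path, src_encoding):
    """Decode src_path with src_encoding and write it to dst_path as UTF-8.

    newline="" disables newline translation, so the original line
    endings are kept unchanged.
    """
    with io.open(src_path, "r", encoding=src_encoding, newline="") as src:
        text = src.read()  # decode on input
    with io.open(dst_path, "w", encoding="utf-8", newline="") as dst:
        dst.write(text)  # encode on output


# Demo with a temporary UTF-16-LE file that starts with a BOM.
work_dir = tempfile.mkdtemp()
src_path = os.path.join(work_dir, "input_utf16.txt")
dst_path = os.path.join(work_dir, "output_utf8.txt")
with io.open(src_path, "wb") as f:
    f.write(codecs.BOM_UTF16_LE + u"Test for StackOverlow\r\n".encode("utf-16-le"))

convert_to_utf8(src_path, dst_path, "utf-16")

with io.open(dst_path, "rb") as f:
    converted = f.read()
shutil.rmtree(work_dir)

print(repr(converted))
```

Writing to a separate destination path keeps the original file intact if the assumed source encoding turns out to be wrong.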