UnicodeDecodeError ('utf-8' codec) when reading a dta file in Pandas

Asked: 2016-11-21 23:49:15

Tags: python pandas utf-8 stata

I am trying to open a dta file with Pandas, but I get a UnicodeDecodeError:

>>> import pandas as pd
>>> pd.read_stata('/some/stata/file.dta',encoding='utf8') # I've tried 'utf8', "ISO-8859-1", 'latin1', 'cp1252' and not putting in anything, same error.

Traceback (most recent call last):
  File "<pyshell#123>", line 1, in <interactive>
    pd.read_stata(path,encoding='cp1252')
  File "/usr/local/lib/python2.7/dist-packages/pandas/io/stata.py", line 161, in read_stata
    chunksize=chunksize, encoding=encoding)
  File "/usr/local/lib/python2.7/dist-packages/pandas/io/stata.py", line 960, in __init__
    self._read_header()
  File "/usr/local/lib/python2.7/dist-packages/pandas/io/stata.py", line 980, in _read_header
    self._read_new_header(first_char)
  File "/usr/local/lib/python2.7/dist-packages/pandas/io/stata.py", line 1056, in _read_new_header
    self.vlblist = self._get_vlblist()
  File "/usr/local/lib/python2.7/dist-packages/pandas/io/stata.py", line 1127, in _get_vlblist
    for i in range(self.nvar)]
  File "/usr/local/lib/python2.7/dist-packages/pandas/io/stata.py", line 1269, in _decode
    return s.decode('utf-8')
  File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x93 in position 18: invalid start byte

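The file contains non-ASCII characters and was saved by someone else (possibly on Windows or a Mac). R can open the file and save it as csv, which I can then read without problems, but it would be nice to be able to do everything in Python.

For the encoding argument, following other threads here, I tried 'utf8', "ISO-8859-1", 'latin1', 'cp1252', and leaving it out entirely. I always get exactly the same error.

Any idea what is going on and what I can do about it?

I am using Python 2.7 on Ubuntu 14.04, in case that matters.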

1 answer:

Answer 0: (score: 1)

A fix has been committed to master on GitHub and should be released in version 0.25.

See the details on this issue here.

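As a temporary fix, change line 1334 of pandas.io.stata (the line number varies between pandas versions; in the traceback above the same statement is at line 1269) from

return s.decode('utf-8')

to

return s.decode('latin-1')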
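
To see why this swap avoids the exception, here is a minimal standalone sketch that does not touch pandas. The sample bytes and the helper name try_decode are made up for illustration; only the byte 0x93 comes from the traceback. It assumes the offending byte is a Windows-style curly quote, which is a guess about how the file was written.

```python
# Hypothetical sample: a label containing the byte 0x93 from the traceback,
# plus 0x94, its usual closing partner in Windows-1252 text.
raw = b"label \x93quoted\x94"


def try_decode(data, encoding):
    """Return the decoded text, or None if `data` is not valid in `encoding`."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None


# 0x93 is a UTF-8 continuation byte with no lead byte before it, so UTF-8 fails.
as_utf8 = try_decode(raw, "utf-8")

# latin-1 maps every byte to a code point, so it never raises; here 0x93 and
# 0x94 become the control characters U+0093 and U+0094.
as_latin1 = try_decode(raw, "latin-1")

# cp1252 maps the same bytes to the curly quotes U+201C and U+201D.
as_cp1252 = try_decode(raw, "cp1252")

print(repr(as_utf8), repr(as_latin1), repr(as_cp1252))
```

So the latin-1 change stops the error, but if the text was really written as cp1252, curly quotes come back as control characters and may need cleaning up afterwards.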

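Unfortunately, in some cases Stata or other software lets some non-UTF-8 characters through. Presumably you are using a format 118 dta file, and because such files are supposed to be pure UTF-8, Pandas ignores the encoding argument, which is why every encoding you tried gave the same error.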
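
One way to check the guess about the file format is to look at the file header directly. The sketch below is not part of pandas; dta_release is a hypothetical helper, and it assumes the XML-like header layout used by dta releases 117 and later, where the file begins with a release tag. For any other header it simply returns None.

```python
import re

# Assumed header prefix of dta files in release 117 and later.
_NEW_HEADER = re.compile(rb"<stata_dta><header><release>(\d+)</release>")


def dta_release(path):
    """Return the release number from a 117+ style dta header, else None."""
    with open(path, "rb") as fh:
        head = fh.read(64)  # the release tag sits at the very start of the file
    match = _NEW_HEADER.match(head)
    return int(match.group(1)) if match else None
```

If dta_release('/some/stata/file.dta') returns 118, that is consistent with the explanation above: the strings are expected to be UTF-8, so the encoding passed to read_stata has no effect.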