Question

I'm trying to open a dta file with Pandas, but I get a UnicodeDecodeError:

>>> import pandas as pd
>>> pd.read_stata('/some/stata/file.dta',encoding='utf8') # I've tried 'utf8', "ISO-8859-1", 'latin1', 'cp1252' and not putting in anything, same error.

Traceback (most recent call last):
  File "<pyshell#123>", line 1, in <interactive>
    pd.read_stata(path,encoding='cp1252')
  File "/usr/local/lib/python2.7/dist-packages/pandas/io/stata.py", line 161, in read_stata
    chunksize=chunksize, encoding=encoding)
  File "/usr/local/lib/python2.7/dist-packages/pandas/io/stata.py", line 960, in __init__
    self._read_header()
  File "/usr/local/lib/python2.7/dist-packages/pandas/io/stata.py", line 980, in _read_header
    self._read_new_header(first_char)
  File "/usr/local/lib/python2.7/dist-packages/pandas/io/stata.py", line 1056, in _read_new_header
    self.vlblist = self._get_vlblist()
  File "/usr/local/lib/python2.7/dist-packages/pandas/io/stata.py", line 1127, in _get_vlblist
    for i in range(self.nvar)]
  File "/usr/local/lib/python2.7/dist-packages/pandas/io/stata.py", line 1269, in _decode
    return s.decode('utf-8')
  File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x93 in position 18: invalid start byte

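The file contains non-ASCII characters and was saved by someone else (probably on Windows or Mac). R can open the file and save it as a csv, which I can then read without problems, but it would be nice to be able to do everything in Python.

For the encoding parameter, following other threads here, I tried 'utf8', "ISO-8859-1", 'latin1', 'cp1252', and passing nothing at all. I always get exactly the same error.

Any idea what is going on and what I can do about it?

I'm using Python 2.7 on Ubuntu 14.04, in case it matters.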

Answer 1

A fix has been committed to master on GitHub and should be released in version 0.25.

More details are in the GitHub issue for this problem.

As a temporary fix, change line 1334 of pandas.io.stata from

return s.decode('utf-8')

to

return s.decode('latin-1')

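Unfortunately, Stata or other software will in some cases allow non-UTF-8 characters. Presumably you are using a version 118 dta file, and since those are supposed to be pure UTF-8, Pandas ignores the encoding argument for them.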
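To illustrate why swapping the codec makes the exception go away, here is a minimal sketch. The byte string `raw` and the helper `try_decode` are made up for illustration and are not taken from the asker's file or from pandas; only the 0x93 byte comes from the traceback.

```python
# -*- coding: utf-8 -*-
# Hypothetical sample: a label wrapped in Windows-style "smart quotes".
# 0x93 is the byte reported in the traceback; 0x94 is its closing partner.
raw = b"\x93income\x94"


def try_decode(data, codec):
    """Return the decoded text, or None if the bytes are invalid for codec."""
    try:
        return data.decode(codec)
    except UnicodeDecodeError:
        return None


# 0x93 is a UTF-8 continuation byte, so it cannot start a sequence.
as_utf8 = try_decode(raw, "utf-8")

# latin-1 maps every byte 0x00-0xFF to a code point, so it never raises.
as_latin1 = try_decode(raw, "latin-1")

# cp1252 maps 0x93 and 0x94 to the curly quotes U+201C and U+201D.
as_cp1252 = try_decode(raw, "cp1252")

print(repr(as_utf8))
print(repr(as_latin1))
print(repr(as_cp1252))
```

Note that latin-1 only avoids the exception: it turns 0x93 into a control character rather than a quotation mark. If the text really came from a Windows machine, something like `as_latin1.encode("latin-1").decode("cp1252")` should recover the intended characters afterwards, assuming the file was in fact written as cp1252.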
