Pandas ignores the encoding set in read_csv?

Time: 2020-02-14 07:51:49

Tags: python python-3.x pandas parsing

Using Linux, Pandas 1.0.1 and Python 3.6, I am getting a strange error in production:


During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/app-root/lib/python3.6/site-packages/luigi/worker.py", line 199, in run
    new_deps = self._run_get_new_deps()
  File "/opt/app-root/lib/python3.6/site-packages/luigi/worker.py", line 141, in _run_get_new_deps
    task_gen = self.task.run()
  File "/opt/app-root/src/import_validation/validate_csv.py", line 275, in run
    validate(temp_csv, self.query_id)
  File "/opt/app-root/src/import_validation/validate_csv.py", line 263, in validate
    pandas.read_csv(path, encoding='latin1', sep=sep)
  File "/opt/app-root/lib/python3.6/site-packages/pandas/io/parsers.py", line 676, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/opt/app-root/lib/python3.6/site-packages/pandas/io/parsers.py", line 454, in _read
    data = parser.read(nrows)
  File "/opt/app-root/lib/python3.6/site-packages/pandas/io/parsers.py", line 1133, in read
    ret = self._engine.read(nrows)
  File "/opt/app-root/lib/python3.6/site-packages/pandas/io/parsers.py", line 2037, in read
    data = self._reader.read(nrows)
  File "pandas/_libs/parsers.pyx", line 859, in pandas._libs.parsers.TextReader.read
  File "pandas/_libs/parsers.pyx", line 874, in pandas._libs.parsers.TextReader._read_low_memory
  File "pandas/_libs/parsers.pyx", line 951, in pandas._libs.parsers.TextReader._read_rows
  File "pandas/_libs/parsers.pyx", line 1083, in pandas._libs.parsers.TextReader._convert_column_data
  File "pandas/_libs/parsers.pyx", line 1136, in pandas._libs.parsers.TextReader._convert_tokens
  File "pandas/_libs/parsers.pyx", line 1253, in pandas._libs.parsers.TextReader._convert_with_dtype
  File "pandas/_libs/parsers.pyx", line 1268, in pandas._libs.parsers.TextReader._string_convert
  File "pandas/_libs/parsers.pyx", line 1458, in pandas._libs.parsers._string_box_utf8
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe5 in position 12: invalid continuation byte

As you can see in the traceback, I have already set the encoding to latin1:

pandas.read_csv(path, encoding='latin1', sep=sep)

Why does pandas try to decode the data as UTF-8 when I have specified latin1 as the encoding? I tried other aliases for latin1, and the result was the same.

Do you know why pandas seems to ignore my encoding setting?

Edit: Removed the remark about Windows behaving differently. The same error occurs there; I had simply cheated and not passed the file in the same way.

1 answer:

Answer 0: (score: 1)

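The problem was too many layers of abstraction. I had a wrapper around this call that tried to decompress the file if its name ended in 'gz'. As a result, I was giving pandas a temporary file instead of a path. That file of course already had its encoding set, and in that case the encoding setting passed to pandas is ignored. The solution is either to pass the encoding to the temporary file or, as I did, to pass the original path to pandas, since it can handle compressed files automatically.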
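The two fixes described above can be sketched roughly as follows. This is only an illustration under assumptions: `open_text`, the sample file `example.csv.gz` and its contents are hypothetical stand-ins, not the wrapper from the question. The first part uses only the standard library to show that a text handle decodes with whatever encoding it was opened with; the pandas call is left as a comment so the snippet runs without pandas installed.

```python
import gzip
import os
import tempfile


def open_text(path, encoding):
    """Hypothetical wrapper: open a plain or gzip-compressed file as text.

    The encoding is passed to the handle itself, because the handle is
    what decodes the bytes.
    """
    if path.endswith("gz"):
        return gzip.open(path, mode="rt", encoding=encoding)
    return open(path, mode="r", encoding=encoding)


# Build a small gzip-compressed Latin-1 CSV to experiment with.
tmp_dir = tempfile.mkdtemp()
csv_path = os.path.join(tmp_dir, "example.csv.gz")
with gzip.open(csv_path, "wb") as fh:
    fh.write("id;name\n1;Håkan\n".encode("latin1"))

# A handle opened as UTF-8 fails on the Latin-1 byte 0xe5 ("å").
try:
    with open_text(csv_path, encoding="utf-8") as handle:
        handle.read()
    utf8_failed = False
except UnicodeDecodeError:
    utf8_failed = True

# Option 1: give the handle the correct encoding.
with open_text(csv_path, encoding="latin1") as handle:
    text = handle.read()

# Option 2 (the one chosen in the answer): skip the wrapper and pass the
# original path; pandas infers gzip compression from the ".gz" extension.
# import pandas
# df = pandas.read_csv(csv_path, encoding="latin1", sep=";")
```

Option 2 is usually the simpler choice, since it removes a layer that pandas already covers and leaves `encoding` as the only place where the decoding is decided.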