I am hitting a strange error in production, running on Linux with Pandas 1.0.1 and Python 3.6:
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/opt/app-root/lib/python3.6/site-packages/luigi/worker.py", line 199, in run
    new_deps = self._run_get_new_deps()
  File "/opt/app-root/lib/python3.6/site-packages/luigi/worker.py", line 141, in _run_get_new_deps
    task_gen = self.task.run()
  File "/opt/app-root/src/import_validation/validate_csv.py", line 275, in run
    validate(temp_csv, self.query_id)
  File "/opt/app-root/src/import_validation/validate_csv.py", line 263, in validate
    pandas.read_csv(path, encoding='latin1', sep=sep)
  File "/opt/app-root/lib/python3.6/site-packages/pandas/io/parsers.py", line 676, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/opt/app-root/lib/python3.6/site-packages/pandas/io/parsers.py", line 454, in _read
    data = parser.read(nrows)
  File "/opt/app-root/lib/python3.6/site-packages/pandas/io/parsers.py", line 1133, in read
    ret = self._engine.read(nrows)
  File "/opt/app-root/lib/python3.6/site-packages/pandas/io/parsers.py", line 2037, in read
    data = self._reader.read(nrows)
  File "pandas/_libs/parsers.pyx", line 859, in pandas._libs.parsers.TextReader.read
  File "pandas/_libs/parsers.pyx", line 874, in pandas._libs.parsers.TextReader._read_low_memory
  File "pandas/_libs/parsers.pyx", line 951, in pandas._libs.parsers.TextReader._read_rows
  File "pandas/_libs/parsers.pyx", line 1083, in pandas._libs.parsers.TextReader._convert_column_data
  File "pandas/_libs/parsers.pyx", line 1136, in pandas._libs.parsers.TextReader._convert_tokens
  File "pandas/_libs/parsers.pyx", line 1253, in pandas._libs.parsers.TextReader._convert_with_dtype
  File "pandas/_libs/parsers.pyx", line 1268, in pandas._libs.parsers.TextReader._string_convert
  File "pandas/_libs/parsers.pyx", line 1458, in pandas._libs.parsers._string_box_utf8
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe5 in position 12: invalid continuation byte
As you can see in the traceback, I have already set the encoding to latin1:
pandas.read_csv(path, encoding='latin1', sep=sep)
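Why does pandas try to decode as UTF-8 when I have specified latin1 as the encoding? I tried other aliases for latin1 and got the same result.
Do you know why pandas seems to ignore my encoding setting?
Edit: I removed my remark about Windows. The same error occurs there; I had just cheated when passing the file and had not passed it in the same way.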
Answer 0 (score: 1)
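The problem was too many layers of abstraction. I had a wrapper around the call that tried to decompress the file if its name ended in 'gz'. As a result I was handing pandas an already opened temporary file rather than a path. That file object of course already had its own encoding set, so the encoding argument given to pandas was ignored. The solution is either to pass the encoding to the temporary file, or, as I did, to pass the original path to pandas, because it can handle compressed files automatically.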
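Both fixes from this answer can be sketched roughly as follows. This is a minimal illustration rather than the original wrapper code: the sample data, the file name `sample.csv.gz`, and the helper names `write_sample_gz`, `read_by_path` and `read_by_handle` are all made up for the example, and it assumes a pandas version that infers gzip compression from the `.gz` extension (the default `compression='infer'` behaviour).

```python
import gzip
import os
import tempfile

import pandas as pd

# Hypothetical sample data. In latin1, 'å' is the single byte 0xE5,
# the same byte the traceback in the question complains about.
SAMPLE = "name;city\nLars;Århus\nKåre;Malmö\n"


def write_sample_gz(directory):
    """Write SAMPLE as a latin1-encoded, gzip-compressed CSV and return its path."""
    path = os.path.join(directory, "sample.csv.gz")
    with gzip.open(path, "wb") as fh:
        fh.write(SAMPLE.encode("latin1"))
    return path


def read_by_path(path, sep=";"):
    """Option 1: give pandas the original path and let it decompress.

    Compression is inferred from the '.gz' suffix, and the encoding
    argument is applied by pandas itself.
    """
    return pd.read_csv(path, encoding="latin1", sep=sep)


def read_by_handle(path, sep=";"):
    """Option 2: decompress yourself, but set the encoding on the handle.

    The text-mode handle decodes latin1 bytes to str before pandas
    ever sees them, so pandas needs no encoding argument here.
    """
    with gzip.open(path, "rt", encoding="latin1") as fh:
        return pd.read_csv(fh, sep=sep)


with tempfile.TemporaryDirectory() as tmp:
    gz_path = write_sample_gz(tmp)
    df_path = read_by_path(gz_path)
    df_handle = read_by_handle(gz_path)

# Both approaches should yield the same frame.
print(df_path.shape)
print(df_path.equals(df_handle))
```

Option 1 is usually the simpler choice, since it removes the extra wrapper layer entirely. Option 2 only makes sense when the decompression step has to stay outside pandas for some other reason.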