Pandas read_csv always crashes on a small file

Time: 2014-08-18 23:19:26

Tags: python csv import pandas crash

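I am trying to import a fairly small (217 rows, 87 columns, 15k) csv file for analysis in Python using Pandas. The file is rather poorly structured, but I would still like to import it as is, because it is the raw data and I do not want to manipulate it manually outside of Python (e.g. with Excel). Unfortunately, the import always ends in a crash: "The kernel appears to have died. It will restart automatically."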

https://www.wakari.io/sharing/bundle/uniquely/ReadCSV

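Some searching suggests that read_csv can crash, but the reports are always about very large files, so I do not understand what is going wrong here. The crash happens both with a local installation (Anaconda 64-bit, IPython (Py 2.7) Notebook) and on Wakari.

Can anybody help me? It would be really appreciated. Thanks a lot!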

Code:

# I have a somewhat ugly, illustrative csv file, but it is not too big: 217 rows, 87 columns.
# File can be downloaded at http://www.win2day.at/download/lo_1986.csv

# In[1]:

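file_csv = 'lo_1986.csv'
# Print every line with its index to inspect the raw file contents.
# The with-block closes the file even if printing fails.
with open(file_csv, mode="r") as f:
    for x, line in enumerate(f):
        print("%d: %s" % (x, line))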


# Now I'd like to import this csv into Python using Pandas, but this always leads to a crash:
# "The kernel appears to have died. It will restart automatically."

# In[ ]:

import pandas as pd
pd.read_csv(file_csv, delimiter=';')

# What am I doing wrong?

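2 answers:

Answer 0: (score: 7)

This is caused by invalid characters in the file (e.g. the byte 0xe0).

If you add the encoding argument to the read_csv() call, you will see this stack trace instead of a segfault: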

>>> df = pandas.read_csv("/tmp/lo_1986.csv", delimiter=";", encoding="utf-8")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/antkong/anaconda/lib/python2.7/site-packages/pandas/io/parsers.py", line 400, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/Users/antkong/anaconda/lib/python2.7/site-packages/pandas/io/parsers.py", line 205, in _read
    return parser.read()
  File "/Users/antkong/anaconda/lib/python2.7/site-packages/pandas/io/parsers.py", line 608, in read
    ret = self._engine.read(nrows)
  File "/Users/antkong/anaconda/lib/python2.7/site-packages/pandas/io/parsers.py", line 1028, in read
    data = self._reader.read(nrows)
  File "parser.pyx", line 706, in pandas.parser.TextReader.read (pandas/parser.c:6745)
  File "parser.pyx", line 728, in pandas.parser.TextReader._read_low_memory (pandas/parser.c:6964)
  File "parser.pyx", line 804, in pandas.parser.TextReader._read_rows (pandas/parser.c:7780)
  File "parser.pyx", line 890, in pandas.parser.TextReader._convert_column_data (pandas/parser.c:8793)
  File "parser.pyx", line 950, in pandas.parser.TextReader._convert_tokens (pandas/parser.c:9484)
  File "parser.pyx", line 1026, in pandas.parser.TextReader._convert_with_dtype (pandas/parser.c:10642)
  File "parser.pyx", line 1051, in pandas.parser.TextReader._string_convert (pandas/parser.c:10905)
  File "parser.pyx", line 1278, in pandas.parser._string_box_utf8 (pandas/parser.c:15657)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xe0 in position 0: unexpected end of data
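
The last line of the traceback can be reproduced without pandas. This is only a minimal sketch, assuming the failing field consisted of nothing but the single byte 0xe0 (the traceback does not show the actual field contents, and `field` is a made-up name). In UTF-8, 0xe0 announces a three-byte sequence, so data that ends right after it cannot be decoded:

```python
# Hypothetical field content: a lone 0xe0 byte, as a non-UTF-8 file might contain.
field = b"\xe0"

try:
    field.decode("utf-8")
    error = None
except UnicodeDecodeError as exc:
    error = exc

# The exception records where decoding failed, which byte it was, and why.
print(error.start, hex(bytearray(field)[error.start]), error.reason)
```

The same attributes (`start`, `reason`) are available on the exception pandas raises, which can help when hunting for the offending bytes.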

You can do some preprocessing to remove these characters before asking pandas to read the file.

[Image attached to the original answer: the invalid characters in the file, highlighted]
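
One possible way to do the preprocessing mentioned above is sketched here with the standard library only. The function names `find_invalid_utf8` and `strip_invalid_utf8` and the sample bytes are made up for illustration; they are not part of pandas or of the original answer. The first function lists the offending bytes so you can inspect them, the second simply drops them:

```python
def find_invalid_utf8(data):
    """Return (offset, byte_value) pairs for bytes that break UTF-8 decoding."""
    raw = bytearray(data)  # indexing a bytearray yields integers
    bad = []
    pos = 0
    while pos < len(data):
        try:
            data[pos:].decode("utf-8")
            break  # the rest of the data is valid UTF-8
        except UnicodeDecodeError as err:
            offset = pos + err.start  # err.start is relative to the slice
            bad.append((offset, raw[offset]))
            pos = offset + 1  # skip the bad byte and keep scanning
    return bad


def strip_invalid_utf8(data):
    """Drop every byte that is not part of a valid UTF-8 sequence."""
    return data.decode("utf-8", "ignore").encode("utf-8")


# Hypothetical sample: one semicolon-separated row containing a stray 0xe0 byte.
sample = b"Datum;Zahl\n1.1.;\xe0 5\n"
print(find_invalid_utf8(sample))
print(strip_invalid_utf8(sample))
```

To apply this to the real file, read it in binary mode (`open(file_csv, "rb").read()`), write the cleaned bytes to a new file, and pass that file to read_csv. Note that stripping loses information: if the bytes are really accented letters in another encoding, decoding with that encoding keeps them intact.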

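Answer 1: (score: 4)

Thank you very much for your remarks. I couldn't agree more with the comment: this is indeed a very messy csv. Unfortunately, this is how the Austrian state lottery shares its information, including the drawn numbers and the payout quotas.

I kept playing around and also looked at the special characters. In the end, at least for me, the solution was surprisingly simple: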

pd.read_csv(file_csv, delimiter=';', encoding='latin-1', engine='python')

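The added encoding helps to display the special characters correctly, but the game changer was the engine parameter. To be honest, I do not understand why, but now it works.

Thanks again!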
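
For readers wondering why encoding='latin-1' does not raise: Latin-1 assigns a character to each of the 256 possible byte values, so decoding can never fail, and 0xe0 simply becomes "à". The sketch below illustrates this with the standard library only. The sample bytes are made up in the style of the lottery file, and whether the real file is Latin-1 (rather than, say, cp1252) is an assumption. It does not explain why engine='python' made the difference.

```python
import csv
import io

# Hypothetical two-line sample: semicolon-separated, with a Latin-1 "à" (byte 0xe0).
raw = b"Datum;Quote\n05.01.;\xe0 100\n"

# Every byte value has a Latin-1 character, so this decode cannot raise.
text = raw.decode("latin-1")

# Parse the decoded text with the same delimiter used in the read_csv call.
rows = list(csv.reader(io.StringIO(text), delimiter=";"))
for row in rows:
    print(row)
```

Because the mapping is one-to-one, `text.encode("latin-1")` gives back the original bytes, so nothing is lost even if the guess about the encoding turns out to be wrong.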