CParserError when reading a csv file into Python Spyder

Date: 2017-02-04 00:41:25

Tags: python-3.x csv pandas

I am trying to read a large csv file (about 17 GB) into Python Spyder using the pandas module. This is my code:

data = pd.read_csv('newsall.csv', encoding = 'ISO-8859-1')

But I keep getting a CParserError error message:

Traceback (most recent call last):
  File "<ipython-input-3-3993cadd40d6>", line 1, in <module>
    data =pd.read_csv('newsall.csv', encoding = 'ISO-8859-1')
  File "I:\Program Files\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 562, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "I:\Program Files\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 325, in _read
    return parser.read()
  File "I:\Program Files\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 815, in read
    ret = self._engine.read(nrows)
  File "I:\Program Files\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 1314, in read
    data = self._reader.read(nrows)
  File "pandas\parser.pyx", line 805, in pandas.parser.TextReader.read (pandas\parser.c:8748)
  File "pandas\parser.pyx", line 827, in pandas.parser.TextReader._read_low_memory (pandas\parser.c:9003)
  File "pandas\parser.pyx", line 881, in pandas.parser.TextReader._read_rows (pandas\parser.c:9731)
  File "pandas\parser.pyx", line 868, in pandas.parser.TextReader._tokenize_rows (pandas\parser.c:9602)
  File "pandas\parser.pyx", line 1865, in pandas.parser.raise_parser_error (pandas\parser.c:23325)
CParserError: Error tokenizing data. C error: out of memory

I know there are some discussions about this issue, but the problem seems very specific and varies from case to case. Can anyone help me with it?

I am using Python 3 on a Windows system. Thanks in advance.

EDIT

Following ResMar's suggestion, I tried the following code:

data = pd.DataFrame()
reader = pd.read_csv('newsall.csv', encoding = 'ISO-8859-1', chunksize = 10000)
for chunk in reader:
    data.append(chunk, ignore_index=True)

But it did not return anything:

data.shape
Out[12]: (0, 0)

Then, since DataFrame.append returns a new DataFrame rather than modifying it in place, I tried the following code:

data = pd.DataFrame()
reader = pd.read_csv('newsall.csv', encoding = 'ISO-8859-1', chunksize = 10000)
for chunk in reader:
    data = data.append(chunk, ignore_index=True)

It shows the out of memory error again; here is the traceback:

Traceback (most recent call last):
  File "<ipython-input-23-ee9021fcc9b4>", line 3, in <module>
    for chunk in reader:
  File "I:\Program Files\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 795, in __next__
    return self.get_chunk()
  File "I:\Program Files\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 836, in get_chunk
    return self.read(nrows=size)
  File "I:\Program Files\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 815, in read
    ret = self._engine.read(nrows)
  File "I:\Program Files\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 1314, in read
    data = self._reader.read(nrows)
  File "pandas\parser.pyx", line 805, in pandas.parser.TextReader.read (pandas\parser.c:8748)
  File "pandas\parser.pyx", line 839, in pandas.parser.TextReader._read_low_memory (pandas\parser.c:9208)
  File "pandas\parser.pyx", line 881, in pandas.parser.TextReader._read_rows (pandas\parser.c:9731)
  File "pandas\parser.pyx", line 868, in pandas.parser.TextReader._tokenize_rows (pandas\parser.c:9602)
  File "pandas\parser.pyx", line 1865, in pandas.parser.raise_parser_error (pandas\parser.c:23325)
CParserError: Error tokenizing data. C error: out of memory

1 Answer:

Answer 0 (score: 0)

It seems quite clear to me what your error is: your computer is running out of memory. The file itself is 17 GB, and as a rule of thumb pandas takes up roughly twice that much space when it reads a file in. So you would need around 34 GB of RAM to read this data directly.

Most computers these days have 4, 8, or 16 GB of RAM; a few have 32. Your computer runs out of memory, and C kills your process.
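As a quick sanity check before attempting a direct read, you can compare the file size against the available RAM. A minimal sketch, assuming the rule of thumb above and using psutil (which is not part of this thread, just a common way to query free memory):

import os
import psutil

file_bytes = os.path.getsize('newsall.csv')
available_bytes = psutil.virtual_memory().available
# Rule of thumb: pandas needs roughly twice the file size to parse it.
if 2 * file_bytes > available_bytes:
    print("A direct read_csv will likely run out of memory.")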

You can get around this problem by reading the data in chunks and performing whatever operation you want on each segment in turn. See the chunksize parameter to pd.read_csv for the details, but what you basically need looks like this:

for chunk in pd.read_csv("...", chunksize=10000):
    do_something()
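To make that pattern concrete, here is a minimal runnable sketch, assuming the goal is to keep only a small per-chunk summary; the row count is a hypothetical stand-in for your real per-chunk work, while the filename, encoding, and chunk size are taken from the question:

import pandas as pd

total_rows = 0
reader = pd.read_csv('newsall.csv', encoding='ISO-8859-1', chunksize=10000)
for chunk in reader:
    # Do the real work here (filter, aggregate, write results out to disk
    # or a database), then let the chunk be garbage-collected. Appending
    # every chunk to one growing DataFrame, as in the edit above, rebuilds
    # the full 17 GB in memory and reproduces the original error.
    total_rows += len(chunk)

print(total_rows)

This keeps at most one 10,000-row chunk in memory at a time, which is the whole point of chunked reading.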