我正在尝试使用pandas模块将一个大的csv文件(大约17GB)读入python Spyder。这是我的代码
data =pd.read_csv('example.csv', encoding = 'ISO-8859-1')
但我一直收到CParserError错误消息
Traceback (most recent call last):
File "<ipython-input-3-3993cadd40d6>", line 1, in <module>
data =pd.read_csv('newsall.csv', encoding = 'ISO-8859-1')
File "I:\Program Files\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 562, in parser_f
return _read(filepath_or_buffer, kwds)
File "I:\Program Files\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 325, in _read
return parser.read()
File "I:\Program Files\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 815, in read
ret = self._engine.read(nrows)
File "I:\Program Files\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 1314, in read
data = self._reader.read(nrows)
File "pandas\parser.pyx", line 805, in pandas.parser.TextReader.read (pandas\parser.c:8748)
File "pandas\parser.pyx", line 827, in pandas.parser.TextReader._read_low_memory (pandas\parser.c:9003)
File "pandas\parser.pyx", line 881, in pandas.parser.TextReader._read_rows (pandas\parser.c:9731)
File "pandas\parser.pyx", line 868, in pandas.parser.TextReader._tokenize_rows (pandas\parser.c:9602)
File "pandas\parser.pyx", line 1865, in pandas.parser.raise_parser_error (pandas\parser.c:23325)
CParserError: Error tokenizing data. C error: out of memory
我知道有一些关于这个问题的讨论,但它似乎非常具体,因情况而异。有没有人可以帮助我?
我在Windows系统上使用python 3。提前谢谢。
修改
根据ResMar的建议,我尝试了以下代码
data = pd.DataFrame()
reader = pd.read_csv('newsall.csv', encoding = 'ISO-8859-1', chunksize = 10000)
for chunk in reader:
data.append(chunk, ignore_index=True)
但它没有返回
data.shape
Out[12]: (0, 0)
然后我尝试了以下代码
data = pd.DataFrame()
reader = pd.read_csv('newsall.csv', encoding = 'ISO-8859-1', chunksize = 10000)
for chunk in reader:
data = data.append(chunk, ignore_index=True)
它再次显示内存不足错误,这里是引用
Traceback (most recent call last):
File "<ipython-input-23-ee9021fcc9b4>", line 3, in <module>
for chunk in reader:
File "I:\Program Files\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 795, in __next__
return self.get_chunk()
File "I:\Program Files\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 836, in get_chunk
return self.read(nrows=size)
File "I:\Program Files\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 815, in read
ret = self._engine.read(nrows)
File "I:\Program Files\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 1314, in read
data = self._reader.read(nrows)
File "pandas\parser.pyx", line 805, in pandas.parser.TextReader.read (pandas\parser.c:8748)
File "pandas\parser.pyx", line 839, in pandas.parser.TextReader._read_low_memory (pandas\parser.c:9208)
File "pandas\parser.pyx", line 881, in pandas.parser.TextReader._read_rows (pandas\parser.c:9731)
File "pandas\parser.pyx", line 868, in pandas.parser.TextReader._tokenize_rows (pandas\parser.c:9602)
File "pandas\parser.pyx", line 1865, in pandas.parser.raise_parser_error (pandas\parser.c:23325)
CParserError: Error tokenizing data. C error: out of memory
答案 0 :(得分:0)
在我看来,很明显你的错误是什么:计算机内存不足。文件本身是17GB,并且根据经验pandas
在读取文件时将占用大约两倍的空间。所以你需要 34GB的RAM来直接读取这些数据。
现在大多数计算机都有4,8或16 GB;一些人有32个。你的计算机内存不足,C就会杀死你的过程。
您可以通过以块的形式读取数据来解决这个问题,依次对每个段执行您想要执行的任何操作。有关详细信息,请参阅pd.read_csv
参数for chunk in pd.read_csv("...", chunksize=10000):
do_something()
,但您基本上需要的内容如下:
{{1}}