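I'm trying to read a ~2 GB CSV file with pandas, and I set up a for loop to read the file in smaller chunks. However, I still get a MemoryError when I try this.
My initial idea was to append each chunk to a list and concatenate the list into a single DataFrame at the end. However, when I run the loop I get a MemoryError, which I thought chunking would avoid. Is it not possible to append inside the loop?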
# import pandas library
import pandas as pd
# path to the gct file to parse
datafile = "test_data_1.gct"
# create list to store chunks
df_list = []
# read the file in chunks of 500 rows and keep every chunk
for chunk in pd.read_csv(datafile, skiprows=2, sep='\t', chunksize=500):
    df_list.append(chunk)
When I run this, I get the following:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\samiv\AppData\Local\Programs\Python\Python37-32\lib\site-packages\pandas\io\parsers.py", line 1007, in __next__
    return self.get_chunk()
  File "C:\Users\samiv\AppData\Local\Programs\Python\Python37-32\lib\site-packages\pandas\io\parsers.py", line 1070, in get_chunk
    return self.read(nrows=size)
  File "C:\Users\samiv\AppData\Local\Programs\Python\Python37-32\lib\site-packages\pandas\io\parsers.py", line 1051, in read
    df = DataFrame(col_dict, columns=columns, index=index)
  File "C:\Users\samiv\AppData\Local\Programs\Python\Python37-32\lib\site-packages\pandas\core\frame.py", line 348, in __init__
    mgr = self._init_dict(data, index, columns, dtype=dtype)
  File "C:\Users\samiv\AppData\Local\Programs\Python\Python37-32\lib\site-packages\pandas\core\frame.py", line 459, in _init_dict
    return _arrays_to_mgr(arrays, data_names, index, columns, dtype=dtype)
  File "C:\Users\samiv\AppData\Local\Programs\Python\Python37-32\lib\site-packages\pandas\core\frame.py", line 7364, in _arrays_to_mgr
    return create_block_manager_from_arrays(arrays, arr_names, axes)
  File "C:\Users\samiv\AppData\Local\Programs\Python\Python37-32\lib\site-packages\pandas\core\internals.py", line 4872, in create_block_manager_from_arrays
    blocks = form_blocks(arrays, names, axes)
  File "C:\Users\samiv\AppData\Local\Programs\Python\Python37-32\lib\site-packages\pandas\core\internals.py", line 4918, in form_blocks
    int_blocks = _multi_blockify(items_dict['IntBlock'])
  File "C:\Users\samiv\AppData\Local\Programs\Python\Python37-32\lib\site-packages\pandas\core\internals.py", line 4995, in _multi_blockify
    values, placement = _stack_arrays(list(tup_block), dtype)
  File "C:\Users\samiv\AppData\Local\Programs\Python\Python37-32\lib\site-packages\pandas\core\internals.py", line 5037, in _stack_arrays
    stacked = np.empty(shape, dtype=dtype)
MemoryError
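
Two observations on the setup above. First, appending every chunk to a list still keeps all of the parsed data in memory, so chunking only lowers peak usage if each chunk is reduced (filtered, aggregated, or written out) before the next one is read. Second, the `Python37-32` folder in the traceback suggests a 32-bit interpreter, whose address space is limited to a few GB at most regardless of installed RAM. Below is a minimal sketch of both points, not a definitive fix: the helper `column_sums_in_chunks`, the constant `INTERPRETER_BITS`, and the sample data are hypothetical names and contents made up for illustration, and the column sum stands in for whatever per-chunk reduction is actually needed.

```python
import io
import struct

import pandas as pd

# Pointer size in bits: 32 indicates a 32-bit interpreter, 64 a 64-bit one.
INTERPRETER_BITS = struct.calcsize("P") * 8


def column_sums_in_chunks(source, chunksize=500):
    """Sum the numeric columns chunk by chunk without storing the chunks."""
    totals = None
    for chunk in pd.read_csv(source, skiprows=2, sep="\t", chunksize=chunksize):
        # Reduce the chunk to a small result before reading the next one.
        partial = chunk.select_dtypes(include="number").sum()
        totals = partial if totals is None else totals.add(partial, fill_value=0)
    return totals


# Tiny stand-in for a .gct file (hypothetical contents): two header lines
# that skiprows=2 discards, then a tab-separated table.
sample = io.StringIO(
    "#1.2\n"
    "3\t2\n"
    "Name\tDescription\ts1\ts2\n"
    "g1\tgeneA\t1\t10\n"
    "g2\tgeneB\t2\t20\n"
    "g3\tgeneC\t3\t30\n"
)

totals = column_sums_in_chunks(sample, chunksize=2)
print(INTERPRETER_BITS, totals.to_dict())
```

If the full table really is needed in memory at once, this pattern does not help on its own; in that case the interpreter's bitness and the column dtypes are the things to look at.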