I have a 40 GB CSV file which I have to write out again as CSVs with different column subsets, with a check that there are no NaNs in the data. I opted to use Pandas, and a minimal example of my implementation (inside a function output_different_formats) looks like this:
# column_names is a huge list containing the column union of all the output
# column subsets
scen_iter = pd.read_csv('mybigcsv.csv', header=0, index_col=False,
                        iterator=True, na_filter=False,
                        usecols=column_names)
CHUNKSIZE = 630100
scen_cnt = 0
output_names = ['formatA', 'formatB', 'formatC', 'formatD', 'formatE']
# column_mappings is a dictionary mapping the output names to their
# respective column subsets.
while scen_cnt < 10000:
    scenario = scen_iter.get_chunk(CHUNKSIZE)
    if scenario.isnull().values.any():
        pass  # some error handling (has yet to ever occur)
    for item in output_names:
        scenario.to_csv(item, float_format='%.8f',
                        columns=column_mappings[item],
                        mode='a', header=True, index=False, compression='gzip')
    scen_cnt += 100
I thought this was safe memory-wise, as I am iterating over the file in chunks with .get_chunk() and never placing the whole CSV in a DataFrame at once, just appending the next chunk to the end of each respective file. However, about 3.5 GB into the output generation, my program crashed with a MemoryError in the .to_csv line, with a long traceback ending as follows:
File "D:\AppData\A\MRM\Eric\Anaconda\lib\site-packages\pandas\core\common.py", line 838, in take_nd
out = np.empty(out_shape, dtype=dtype)
MemoryError
Why am I getting a MemoryError here? Do I have a memory leak somewhere in my program, or am I misunderstanding something? Or could the program simply fail at random when writing that particular chunk to CSV, and should I consider reducing the chunk size?
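One variant I am considering, in case intermediate copies from the columns= path are what accumulates, is to take each column subset explicitly and drop the reference right after writing. This is only a hypothetical sketch of the inner loop, not what currently runs:

for item in output_names:
    # select the subset up front instead of passing columns= to to_csv,
    # and release it explicitly once the chunk has been appended
    subset = scenario[column_mappings[item]]
    subset.to_csv(item, float_format='%.8f', mode='a',
                  header=True, index=False, compression='gzip')
    del subset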
Full Traceback:
Traceback (most recent call last):
  File "D:/AppData/A/MRM/Eric/output_formats.py", line 128, in <module>
    output_different_formats(real_world=False)
  File "D:/AppData/A/MRM/Eric/output_formats.py", line 50, in clocked
    result = func(*args, **kwargs)
  File "D:/AppData/A/MRM/Eric/output_formats.py", line 116, in output_different_formats
    mode='a', header=True, index=False, compression='gzip')
  File "D:\AppData\A\MRM\Eric\Anaconda\lib\site-packages\pandas\core\frame.py", line 1188, in to_csv
    decimal=decimal)
  File "D:\AppData\A\MRM\Eric\Anaconda\lib\site-packages\pandas\core\format.py", line 1293, in __init__
    self.obj = self.obj.loc[:, cols]
  File "D:\AppData\A\MRM\Eric\Anaconda\lib\site-packages\pandas\core\indexing.py", line 1187, in __getitem__
    return self._getitem_tuple(key)
  File "D:\AppData\A\MRM\Eric\Anaconda\lib\site-packages\pandas\core\indexing.py", line 720, in _getitem_tuple
    retval = getattr(retval, self.name)._getitem_axis(key, axis=i)
  File "D:\AppData\A\MRM\Eric\Anaconda\lib\site-packages\pandas\core\indexing.py", line 1323, in _getitem_axis
    return self._getitem_iterable(key, axis=axis)
  File "D:\AppData\A\MRM\Eric\Anaconda\lib\site-packages\pandas\core\indexing.py", line 966, in _getitem_iterable
    result = self.obj.reindex_axis(keyarr, axis=axis, level=level)
  File "D:\AppData\A\MRM\Eric\Anaconda\lib\site-packages\pandas\core\frame.py", line 2519, in reindex_axis
    fill_value=fill_value)
  File "D:\AppData\A\MRM\Eric\Anaconda\lib\site-packages\pandas\core\generic.py", line 1852, in reindex_axis
    {axis: [new_index, indexer]}, fill_value=fill_value, copy=copy)
  File "D:\AppData\A\MRM\Eric\Anaconda\lib\site-packages\pandas\core\generic.py", line 1876, in _reindex_with_indexers
    copy=copy)
  File "D:\AppData\A\MRM\Eric\Anaconda\lib\site-packages\pandas\core\internals.py", line 3157, in reindex_indexer
    indexer, fill_tuple=(fill_value,))
  File "D:\AppData\A\MRM\Eric\Anaconda\lib\site-packages\pandas\core\internals.py", line 3238, in _slice_take_blocks_ax0
    new_mgr_locs=mgr_locs, fill_tuple=None))
  File "D:\AppData\A\MRM\Eric\Anaconda\lib\site-packages\pandas\core\internals.py", line 853, in take_nd
    allow_fill=False)
  File "D:\AppData\A\MRM\Eric\Anaconda\lib\site-packages\pandas\core\common.py", line 838, in take_nd
    out = np.empty(out_shape, dtype=dtype)
MemoryError
Answer (score: 2)
The solution for now is to use gc.collect():
import gc

while scen_cnt < 10000:
    scenario = scen_iter.get_chunk(CHUNKSIZE)
    if scenario.isnull().values.any():
        pass  # some error handling (has yet to ever occur)
    for item in output_names:
        scenario.to_csv(item, float_format='%.8f',
                        columns=column_mappings[item],
                        mode='a', header=True, index=False, compression='gzip')
        gc.collect()
    gc.collect()
    scen_cnt += 100
After adding these lines, memory consumption stays stable, but why this approach ran into memory problems in the first place is still unclear to me.
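To confirm that memory really does stay flat from chunk to chunk, a small monitoring snippet along these lines can be dropped into the loop. This is just a sketch assuming psutil is available; none of it is part of the original loop:

import os
import psutil

process = psutil.Process(os.getpid())

def log_memory(label):
    # print the current resident set size so growth between chunks is visible
    rss_mb = process.memory_info().rss / (1024 * 1024)
    print('%s: %.1f MB resident' % (label, rss_mb))

# inside the while loop, after the gc.collect() calls:
# log_memory('after chunk %d' % scen_cnt)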