I have a 40 GB CSV file which I have to write out again as CSVs with different column subsets, with a check that there are no NaNs in the data. I opted to use Pandas, and a minimal example of my implementation (inside a function output_different_formats) looks like this:
# column_names is a huge list containing the column union of all the output
# column subsets
scen_iter = pd.read_csv('mybigcsv.csv', header=0, index_col=False,
                        iterator=True, na_filter=False,
                        usecols=column_names)
CHUNKSIZE = 630100
scen_cnt = 0
output_names = ['formatA', 'formatB', 'formatC', 'formatD', 'formatE']
# column_mappings is a dictionary mapping the output names to their
# respective column subsets.
while scen_cnt < 10000:
    scenario = scen_iter.get_chunk(CHUNKSIZE)
    if scenario.isnull().values.any():
        pass  # some error handling (has yet to ever occur)
    for item in output_names:
        scenario.to_csv(item, float_format='%.8f',
                        columns=column_mappings[item],
                        mode='a', header=True, index=False, compression='gzip')
    scen_cnt += 100
I thought this was safe memory-wise, as I am iterating over the file in chunks with .get_chunk() and never placing the whole CSV in a DataFrame at once, just appending the next chunk to the end of each respective file. However, about 3.5 GB into the output generation, my program crashed with a MemoryError in the .to_csv line, with a long traceback ending as follows:
File "D:\AppData\A\MRM\Eric\Anaconda\lib\site-packages\pandas\core\common.py", line 838, in take_nd
out = np.empty(out_shape, dtype=dtype)
MemoryError
Why am I getting a MemoryError here? Do I have a memory leak somewhere in my program, or am I misunderstanding something? Or could the program simply fail at random when writing that particular chunk to CSV, and should I consider reducing the chunk size?
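One variant I am considering, in case intermediate copies from the columns= path are what accumulates, is to take each column subset explicitly and drop the reference right after writing. This is only a hypothetical sketch of the inner loop, not what currently runs:

for item in output_names:
    # select the subset up front instead of passing columns= to to_csv,
    # and release it explicitly once the chunk has been appended
    subset = scenario[column_mappings[item]]
    subset.to_csv(item, float_format='%.8f', mode='a',
                  header=True, index=False, compression='gzip')
    del subset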
Full Traceback:
Traceback (most recent call last):
  File "D:/AppData/A/MRM/Eric/output_formats.py", line 128, in <module>
    output_different_formats(real_world=False)
  File "D:/AppData/A/MRM/Eric/output_formats.py", line 50, in clocked
    result = func(*args, **kwargs)
  File "D:/AppData/A/MRM/Eric/output_formats.py", line 116, in output_different_formats
    mode='a', header=True, index=False, compression='gzip')
  File "D:\AppData\A\MRM\Eric\Anaconda\lib\site-packages\pandas\core\frame.py", line 1188, in to_csv
    decimal=decimal)
  File "D:\AppData\A\MRM\Eric\Anaconda\lib\site-packages\pandas\core\format.py", line 1293, in __init__
    self.obj = self.obj.loc[:, cols]
  File "D:\AppData\A\MRM\Eric\Anaconda\lib\site-packages\pandas\core\indexing.py", line 1187, in __getitem__
    return self._getitem_tuple(key)
  File "D:\AppData\A\MRM\Eric\Anaconda\lib\site-packages\pandas\core\indexing.py", line 720, in _getitem_tuple
    retval = getattr(retval, self.name)._getitem_axis(key, axis=i)
  File "D:\AppData\A\MRM\Eric\Anaconda\lib\site-packages\pandas\core\indexing.py", line 1323, in _getitem_axis
    return self._getitem_iterable(key, axis=axis)
  File "D:\AppData\A\MRM\Eric\Anaconda\lib\site-packages\pandas\core\indexing.py", line 966, in _getitem_iterable
    result = self.obj.reindex_axis(keyarr, axis=axis, level=level)
  File "D:\AppData\A\MRM\Eric\Anaconda\lib\site-packages\pandas\core\frame.py", line 2519, in reindex_axis
    fill_value=fill_value)
  File "D:\AppData\A\MRM\Eric\Anaconda\lib\site-packages\pandas\core\generic.py", line 1852, in reindex_axis
    {axis: [new_index, indexer]}, fill_value=fill_value, copy=copy)
  File "D:\AppData\A\MRM\Eric\Anaconda\lib\site-packages\pandas\core\generic.py", line 1876, in _reindex_with_indexers
    copy=copy)
  File "D:\AppData\A\MRM\Eric\Anaconda\lib\site-packages\pandas\core\internals.py", line 3157, in reindex_indexer
    indexer, fill_tuple=(fill_value,))
  File "D:\AppData\A\MRM\Eric\Anaconda\lib\site-packages\pandas\core\internals.py", line 3238, in _slice_take_blocks_ax0
    new_mgr_locs=mgr_locs, fill_tuple=None))
  File "D:\AppData\A\MRM\Eric\Anaconda\lib\site-packages\pandas\core\internals.py", line 853, in take_nd
    allow_fill=False)
  File "D:\AppData\A\MRM\Eric\Anaconda\lib\site-packages\pandas\core\common.py", line 838, in take_nd
    out = np.empty(out_shape, dtype=dtype)
MemoryError
Answer (score: 2)
The solution for now is to use gc.collect():
import gc

while scen_cnt < 10000:
    scenario = scen_iter.get_chunk(CHUNKSIZE)
    if scenario.isnull().values.any():
        pass  # some error handling (has yet to ever occur)
    for item in output_names:
        scenario.to_csv(item, float_format='%.8f',
                        columns=column_mappings[item],
                        mode='a', header=True, index=False, compression='gzip')
        gc.collect()
    gc.collect()
    scen_cnt += 100
After adding these lines, memory consumption stays stable, but why this approach ran into memory problems in the first place is still unclear to me.
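To confirm that memory really does stay flat from chunk to chunk, a small monitoring snippet along these lines can be dropped into the loop. This is just a sketch assuming psutil is available; none of it is part of the original loop:

import os
import psutil

process = psutil.Process(os.getpid())

def log_memory(label):
    # print the current resident set size so growth between chunks is visible
    rss_mb = process.memory_info().rss / (1024 * 1024)
    print('%s: %.1f MB resident' % (label, rss_mb))

# inside the while loop, after the gc.collect() calls:
# log_memory('after chunk %d' % scen_cnt)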