I'm trying to quickly compute a correlation matrix for a large dataset (745 rows x 18,048 columns) in Python 3. The data was originally read from a 50 MB netCDF file and reached this size after some manipulation; everything is stored as float32. I therefore estimate the final correlation matrix at roughly 1.2 GB, which should fit comfortably in my 8 GB of RAM. Using a pandas DataFrame and its methods, the full correlation matrix computes in about 100 minutes, so the computation itself is feasible.
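For reference, here is the back-of-the-envelope arithmetic behind that 1.2 GB figure (my numbers; note that pandas' corr() typically returns float64, which would roughly double the estimate):

```python
# Size of an 18,048 x 18,048 correlation matrix.
n_cols = 18_048
bytes_f32 = n_cols * n_cols * 4  # if stored as float32
bytes_f64 = n_cols * n_cols * 8  # if pandas upcasts to float64, as corr() usually does

print(bytes_f32 / 1e9)  # ~1.30 GB
print(bytes_f64 / 1e9)  # ~2.61 GB
```

Either way, the result alone should fit in 8 GB of RAM.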
I read about the dask module and decided to try it. However, when I attempt the same computation with it, it hits a MemoryError almost immediately, even though the result should fit in memory. After some fiddling, I realized it fails even on a relatively small 1000 x 1000 dataset. Is something else under the hood causing this error? My code is below:
import dask.dataframe as ddf

# prepare data in a pandas DataFrame called df, then wrap it in dask
daskdf = ddf.from_pandas(df, chunksize=1000)
correlation = daskdf.corr()
correlation.compute()
Here is the error traceback:
Traceback (most recent call last):
File "C:/Users/mughi/Documents/College Stuff/Project DIVA/Preprocessing Data/DaskCorr.py", line 36, in <module>
correlation.compute()
File "C:\Users\mughi\AppData\Local\Programs\Python\Python36\lib\site-packages\dask\base.py", line 94, in compute
(result,) = compute(self, traverse=False, **kwargs)
File "C:\Users\mughi\AppData\Local\Programs\Python\Python36\lib\site-packages\dask\base.py", line 201, in compute
results = get(dsk, keys, **kwargs)
File "C:\Users\mughi\AppData\Local\Programs\Python\Python36\lib\site-packages\dask\threaded.py", line 76, in get
**kwargs)
File "C:\Users\mughi\AppData\Local\Programs\Python\Python36\lib\site-packages\dask\async.py", line 500, in get_async
raise(remote_exception(res, tb))
dask.async.MemoryError:
Traceback
---------
File "C:\Users\mughi\AppData\Local\Programs\Python\Python36\lib\site-packages\dask\async.py", line 266, in execute_task
result = _execute_task(task, data)
File "C:\Users\mughi\AppData\Local\Programs\Python\Python36\lib\site-packages\dask\async.py", line 247, in _execute_task
return func(*args2)
File "C:\Users\mughi\AppData\Local\Programs\Python\Python36\lib\site-packages\dask\dataframe\core.py", line 3283, in cov_corr_chunk
keep = np.bitwise_and(mask[:, None, :], mask[:, :, None])
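My own rough estimate of what that failing line allocates, assuming mask is a per-chunk NaN mask with one row per chunk row and one column per DataFrame column (so the broadcast produces a (rows, cols, cols) boolean cube at one byte per element):

```python
# Hypothetical memory cost of np.bitwise_and(mask[:, None, :], mask[:, :, None])
# for one chunk of my real dataset.
chunk_rows = 1000
n_cols = 18_048
cube_bytes = chunk_rows * n_cols * n_cols  # boolean -> 1 byte per element

print(cube_bytes / 1e9)  # ~325.7 GB for a single chunk
```

If this reading is right, even my small 1000 x 1000 test would need a 1000 x 1000 x 1000 cube (~1 GB) per chunk, plus intermediates, which might explain why it fails there too.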
Thanks!