Question

I need to hold a very large vector in memory, about 10**8 elements, and I need fast random access to it. I tried to use numpy.memmap, but encountered the following error:

RuntimeWarning: overflow encountered in int_scalars
bytes = long(offset + size*_dbytes)

fid.seek(bytes - 1, 0): [Errno 22] Invalid argument

It seems that memmap is using a long and my vector is too big for it.

Is there a way to overcome this and still use memmap? Or is there a good alternative?

Thanks

Answer 1

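Simple solution

It sounds like you're using a 32-bit version of Python (I'm also assuming you're running on Windows). From the numpy.memmap docs:

Memory-mapped files cannot be larger than 2GB on 32-bit systems.

So the simple solution to your problem is to upgrade your Python install to 64-bit.

If your CPU was manufactured sometime in the last decade, it should be possible to upgrade to 64-bit Python.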
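
If you're not sure which build you're running, here is a minimal sketch for checking it from inside the interpreter. The helper names `python_bitness` and `is_64bit_python` are made up for illustration; `struct.calcsize` and `sys.maxsize` are standard library.

```python
import struct
import sys


def python_bitness():
    """Return the pointer width of the running interpreter in bits."""
    # "P" is the struct format code for a C pointer (void *).
    return struct.calcsize("P") * 8


def is_64bit_python():
    """Return True if the running interpreter is a 64-bit build."""
    # sys.maxsize is 2**31 - 1 on 32-bit builds and 2**63 - 1 on 64-bit builds.
    return sys.maxsize > 2**32


print("Python is %d-bit" % python_bitness())
```

Note that this reports the bitness of the Python build, not of the operating system: a 32-bit Python on 64-bit Windows will still report 32.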

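Alternatives

So long as your Python is 32-bit, working with arrays larger than 2 GB is never going to be easy or straightforward. Your only real option is to split the array into pieces no larger than 2 GB at the time you originally create it or write it out to disk. You would then operate on each piece independently.

Also, you'd still have to use numpy.memmap with each piece, since Python itself would run out of memory otherwise.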
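
A rough sketch of what the splitting could look like, assuming a 1-D vector stored as one file per piece. `ChunkedMemmap`, its parameters, and the `chunk_N.dat` file naming are all hypothetical, not part of numpy; only `np.memmap` itself is the real API. The sizes in the demo are tiny so it runs anywhere; in practice you would pick `chunk_len` so that `chunk_len * itemsize` stays under 2 GB.

```python
import os
import tempfile

import numpy as np


class ChunkedMemmap:
    """Hypothetical helper: a 1-D vector stored as several memory-mapped files."""

    def __init__(self, directory, length, chunk_len, dtype=np.float64, mode="w+"):
        self.length = length
        self.chunk_len = chunk_len
        self.chunks = []
        for start in range(0, length, chunk_len):
            # The last piece may be shorter than chunk_len.
            n = min(chunk_len, length - start)
            path = os.path.join(directory, "chunk_%d.dat" % (start // chunk_len))
            self.chunks.append(np.memmap(path, dtype=dtype, mode=mode, shape=(n,)))

    def _locate(self, i):
        """Map a global index to (chunk number, offset inside that chunk)."""
        if not 0 <= i < self.length:
            raise IndexError(i)
        return divmod(i, self.chunk_len)

    def __getitem__(self, i):
        c, off = self._locate(i)
        return self.chunks[c][off]

    def __setitem__(self, i, value):
        c, off = self._locate(i)
        self.chunks[c][off] = value

    def flush(self):
        """Write pending changes in every piece back to disk."""
        for chunk in self.chunks:
            chunk.flush()


# Tiny demo: 10 elements split into pieces of at most 4 elements.
tmpdir = tempfile.mkdtemp()
vec = ChunkedMemmap(tmpdir, length=10, chunk_len=4)
vec[9] = 3.5
vec.flush()
print(len(vec.chunks), vec[9])
```

For simplicity this opens every piece up front. On a 32-bit interpreter the process address space is limited as well, so mapping all pieces at once may still fail; you may need to open pieces only when they are accessed and drop them afterwards.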

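Heavy duty alternatives

If handling these kinds of large arrays is something you have to do on a regular basis, you should consider switching your code/workflow over to one of the big data frameworks. There's a whole bunch of them available for Python now. I've used Pyspark extensively before, and it's pretty easy to use (though it requires a fair amount of setup). In the comments, B. M. mentions Dask, another such big data framework.

Though if this is just a one-off task, it's probably not worth the trouble to spin up one of these frameworks.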

numpy.memmap not able to handle very big data
