Question

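I have roughly 500-1500 pandas DataFrames held in memory for computation, because we operate on them.

We do not want to use HDF5, because that means writing to disk (we want to stay in memory). What is the most efficient way to store DataFrames in Python?

1) A list, but it looks memory-heavy beyond 500 DataFrames
2) A NumPy object array?
3) A tuple?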

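Solution:

Since nobody has given a real answer to this (people just downvote without leaving a comment...), I am posting the best solution I have found so far:

The blist package appears to perform better than the built-in list for list operations. I tested it and it works well so far.

Here are some use cases where blist asymptotically outperforms the built-in list: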

Use Case                                       blist            list
Insertion into or removal from a list          O(log n)         O(n)
Taking slices of lists                         O(log n)         O(n)
Making shallow copies of lists                 O(1)             O(n)
Changing slices of lists                       O(log n + log k) O(n+k)
Multiplying a list to make a sparse list       O(log k)         O(kn)
Maintaining a sorted list with bisect.insort   O(log**2 n)      O(n)

Reference: https://pypi.python.org/pypi/blist/
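As a rough illustration of how blist could stand in for a list of DataFrames, here is a minimal sketch. It assumes blist is installed (it is a third-party package and may not build on recent Python versions), so the sketch falls back to the built-in list when the import fails. The names `frame_list`, `frames` and `subset` are made up for this example.

```python
import pandas as pd

# blist is a third-party drop-in replacement for list; fall back to the
# built-in list if it is not available (assumption made for this sketch).
try:
    from blist import blist as frame_list
except ImportError:
    frame_list = list

# Hypothetical data: five tiny frames stored in the container.
frames = frame_list(pd.DataFrame({"x": [i, i + 1]}) for i in range(5))

# Insertion and removal: O(log n) with blist, O(n) with the built-in list.
frames.insert(0, pd.DataFrame({"x": [-1, 0]}))
del frames[2]

# Slicing: O(log n) with blist, O(n) with the built-in list.
subset = frames[1:3]
```

Either container only holds references to the DataFrames, so the complexity figures in the table apply to operations on the container, not to the memory taken by the frames themselves.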

Answer 1

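Since your DataFrames are already stored in memory, it does not matter which container you put them in: the total will always be at least the amount of RAM required by all of the DataFrames themselves.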
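To make that point concrete, here is a minimal sketch comparing the memory held by the frames with the overhead of three candidate containers. The frame shapes and counts are made up for illustration, and the exact byte counts will vary by platform and library version.

```python
import sys

import numpy as np
import pandas as pd

# Hypothetical workload: 200 frames of 100 x 4 float64 values.
frames = [pd.DataFrame(np.zeros((100, 4)), columns=list("abcd")) for _ in range(200)]

# Memory held by the frames themselves.
data_bytes = sum(df.memory_usage(deep=True).sum() for df in frames)

# A list, a tuple and a NumPy object array each store one reference per frame.
as_list = frames
as_tuple = tuple(frames)
as_array = np.empty(len(frames), dtype=object)
for i, df in enumerate(frames):
    as_array[i] = df  # element-wise assignment keeps each DataFrame intact

list_bytes = sys.getsizeof(as_list)
tuple_bytes = sys.getsizeof(as_tuple)
array_bytes = as_array.nbytes

print(data_bytes, list_bytes, tuple_bytes, array_bytes)
```

In this sketch the container overhead is a few kilobytes while the frames take hundreds of kilobytes, so the choice between list, tuple and object array should have little effect on total memory.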

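A useful DataFrame method is memory_usage, which tells you how much RAM a given DataFrame occupies. It may make you question whether you really want to store double-precision floats when you do not need that precision.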
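A minimal sketch of that check, assuming a frame of random float64 values (the shape and column names are made up). It measures the frame with memory_usage, downcasts to float32 and measures again.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: NumPy produces float64 values by default.
df = pd.DataFrame(np.random.rand(10_000, 4), columns=list("abcd"))

# Total bytes used by the frame, including its index.
before = df.memory_usage(deep=True).sum()

# Downcast to single precision. This is only appropriate when roughly
# 7 significant decimal digits are enough for the computation.
df32 = df.astype("float32")
after = df32.memory_usage(deep=True).sum()

print(f"float64: {before} bytes, float32: {after} bytes")
```

The column data should shrink by about half, which, across 500-1500 frames, is likely to matter far more than the container used to hold them.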

Efficient way to store a list of pandas DataFrames

2 answers: