Question

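I'm looking for a (Python interface to an) iterable data structure that can hold a large number of items. Ideally, the memory used by all items in the list would be larger than the available RAM: objects are transparently swapped in and out of some disk file as they are accessed, and only a small, configurable number of them is loaded in RAM at any given time. In other words, I'd like to see something like C++'s STXXL library, but I only need a list-like container.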

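In addition, the data structure needs to allow: storing arbitrary Python objects, adding/removing elements (by position or by value), iterating over all elements, in / __contains__ checks, and (possibly) a fast way of selecting elements that satisfy a simple attribute equality predicate (e.g., x.foo == 'bar').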

Here is an example of the API I would like to see::

   # persist list data to `foo.dat`, keep 100 items in memory
   l = FileBackedList('foo.dat', 100)

   # normal Python list operations work as expected
   l.append('x'); len(l) == 1
   l.extend([1, 2, 3])
   l.remove('x'); len(l) == 3
   l.pop(0);      len(l) == 2

   2 in l  # => True

   # there should be at least one way of doing the following
   k = [item for item in l if item > 2]
   k = filter(lambda item: item > 2, l)

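It is acceptable for the implementation not to be particularly fast or efficient; the ability to handle a large number of objects in constrained memory is what matters most.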

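Before I start rolling my own implementation, is there an existing library that I can already plug into my application? Or at least some code to take inspiration from?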

Answer 1

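I believe you're looking for something like numpy's memmap arrays. If you want a more full-featured tabular data structure, SFrame from graphlab works well, but note that the library is for non-commercial use only. You can use numpy for anything.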
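As a rough illustration of the memmap suggestion, here is a minimal sketch that assumes numpy is installed. The file name, array size, and threshold are made up for the example. Note that a memmap array holds fixed-size numeric records rather than arbitrary Python objects, so it covers only part of what the question asks for.

```python
import os
import tempfile

import numpy as np

N = 1_000_000  # hypothetical element count, chosen only for the demo

with tempfile.TemporaryDirectory() as tempdir:
    path = os.path.join(tempdir, "foo.dat")  # hypothetical file name

    # Create a file-backed array; the data lives in the file, and the OS
    # pages pieces of it into RAM as they are touched.
    arr = np.memmap(path, dtype="float64", mode="w+", shape=(N,))
    arr[:] = np.arange(N)
    arr.flush()  # push pending changes to disk
    del arr      # release the mapping

    # Reopen read-only and select elements with a simple predicate.
    ro = np.memmap(path, dtype="float64", mode="r", shape=(N,))
    result = ro[ro > N - 3].tolist()
    del ro       # release the mapping before the directory is removed
```

The boolean mask `ro > N - 3` is itself an in-memory array of N flags, so for truly huge files the filtering would have to be done in chunks.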

Answer 2

@Adam: SFrame is open source. It's exactly what you need (https://github.com/dato-code/SFrame)

Answer 3

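shelve is in the Python standard library and is about half of what you need. Unfortunately, it only creates a dictionary-like object. So I wrote a list-like interface for dictionary-like objects. Here is what it looks like in practice: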

import shelve
import tempfile
import os.path

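from listlike import ListLike  # the answer author's module, not part of the standard library


class A:
    """Stand-in for a user-defined class; the snippet below uses A without defining it."""

    def __init__(self, first, second):
        self.first = first
        self.second = second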

with tempfile.TemporaryDirectory() as tempdir:
    with shelve.open(os.path.join(tempdir,'foo.dat')) as sf:
        q = ListLike(sf)

        # normal Python list operations work as expected
        q.append('x')
        assert len(q) == 1
        q.extend([1, 2, 3])
        q.remove('x')
        assert len(q) == 3
        q.pop(0)
        assert len(q) == 2
        q.append('Hello!')
        assert q[2] == 'Hello!'

        # technically you can use the list() to create an actual list
        assert list(q) == [2, 3, 'Hello!']
        # but if your sf is super large, this obviously is bad.
        # I use it here for illustration purposes.

        # still, you can use all the normal list operations
        # (that I remembered to implement/check)
        assert 2 in q  # => True
        del q[2]
        assert list(q[1:2]) == [3]
        assert list(q[-1:]) == [3]
        # except addition, 'cause we don't want to copy large data
        try:
            q + [10]
        except TypeError:
            pass
        # but, iadd works fine
        q += [10] # same as q.extend([10])


        # normal index range rules
        try:
            q[100]
        except IndexError:
            pass

        q.extend([0, 1, 2, 3, 4, 5])
        # both of the following work as intended
        k1 = [item for item in q if item > 2]
        k2 = list(filter(lambda item: item > 2, q))
        assert list(q) == [2, 3, 10, 0, 1, 2, 3, 4, 5]
        assert k1 == [3, 10, 3, 4, 5]
        assert k2 == [3, 10, 3, 4, 5]
        assert k1 == k2

        # the values get pickled, so they can be any arbitrary object
        q.append(A(1,2))
        q.append(A('Hello',' there'))
        q.append(A('Hello',2))

        # reminder: pickle stores each instance's attributes plus a reference
        # to its class by name, not the class code itself. The class must be
        # importable when the data is loaded, and instances that are already
        # persisted keep the attributes they were saved with.

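Caveats

In this implementation, any operation that changes element indexes (pop, insert, remove, del, etc.) is quite slow (it pickles/unpickles everything to change the indexes). For basic .append() / .extend() and get operations, it is as fast as indexing the shelve.
shelve uses dbm underneath. There are several different dbm implementations.
I think that if your number of elements is so large that the list of keys() does not fit in memory, then shelve will run out of memory when trying to load the file.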
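
To make the first caveat concrete, here is a minimal sketch of one way a list-like view over a dict-like store could work. It is not the ListLike implementation from this answer; the class name `MiniShelfList`, the reserved length key, and the string-index key scheme are all assumptions made for illustration. It supports integer indexes only.

```python
import os
import shelve
import tempfile


class MiniShelfList:
    """Hypothetical list-like view over a dict-like store (not the answer's ListLike)."""

    _LEN_KEY = "__len__"  # reserved key that holds the element count

    def __init__(self, store):
        self.store = store
        if self._LEN_KEY not in store:
            store[self._LEN_KEY] = 0

    def __len__(self):
        return self.store[self._LEN_KEY]

    def __getitem__(self, index):
        if index < 0:
            index += len(self)
        if not 0 <= index < len(self):
            raise IndexError("index out of range")
        return self.store[str(index)]

    def __iter__(self):
        # Loads one element at a time instead of building the whole list.
        for i in range(len(self)):
            yield self.store[str(i)]

    def append(self, value):
        n = len(self)
        self.store[str(n)] = value  # a single write, independent of list size
        self.store[self._LEN_KEY] = n + 1

    def pop(self, index=-1):
        value = self[index]  # raises IndexError for an empty list or bad index
        n = len(self)
        if index < 0:
            index += n
        # Every later element is unpickled and re-pickled under a smaller key.
        for i in range(index, n - 1):
            self.store[str(i)] = self.store[str(i + 1)]
        del self.store[str(n - 1)]
        self.store[self._LEN_KEY] = n - 1
        return value


with tempfile.TemporaryDirectory() as tempdir:
    with shelve.open(os.path.join(tempdir, "mini")) as sf:
        q = MiniShelfList(sf)
        for item in ["x", 1, 2, 3]:
            q.append(item)
        popped = q.pop(0)    # shifts the three remaining elements
        remaining = list(q)
```

In this scheme `append` touches two keys regardless of size, while `pop(0)` rewrites every remaining element, which matches the slow index-changing operations described above.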

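Caching

Using a shelve.Shelf object, I found that repeatedly reading 1000 arbitrary positions was no slower in a 1.5GB file with 13 million entries than in a 15MB file with 100,000 entries.

I suspect this is because the OS/filesystem does a lot of internal file caching to speed up reads of commonly accessed sectors. Or perhaps gdbm already does some caching of its own?

In any case, if you really need to, you can extend shelve.Shelf or my ListLike and use functools.lru_cache on the get function.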
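
A minimal sketch of that idea, using a small wrapper class instead of subclassing shelve.Shelf. The name `CachedReader`, its `reads` counter, and the `maxsize` value are hypothetical and exist only to show the cache at work.

```python
import functools
import os
import shelve
import tempfile


class CachedReader:
    """Hypothetical read-through LRU cache in front of a dict-like store."""

    def __init__(self, store, maxsize=100):
        self.store = store
        self.reads = 0  # counts lookups that actually reached the store
        # Wrap the bound method so each instance gets its own cache.
        self.get = functools.lru_cache(maxsize=maxsize)(self._load)

    def _load(self, key):
        self.reads += 1
        return self.store[key]


with tempfile.TemporaryDirectory() as tempdir:
    with shelve.open(os.path.join(tempdir, "cache_demo")) as sf:
        sf["0"] = "Hello!"
        reader = CachedReader(sf, maxsize=100)
        first = reader.get("0")   # read from the shelf and unpickled
        second = reader.get("0")  # served from the LRU cache
        reads = reader.reads
        info = reader.get.cache_info()
```

Cached values go stale if the underlying store is modified, so a read/write version would need to call `reader.get.cache_clear()` (or invalidate more selectively) after each write.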

Answer 4

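You might consider simply creating a swap file large enough to hold all of your data in the application at once.

A swap file (or page file) reserves part of the disk for use when applications need more RAM than is available. That way the operating system itself handles caching in RAM and moving things out to disk, with nothing special required in Python.

That said, setting it up requires administrator privileges, the procedure differs between operating systems, and you need to allocate enough disk space to hold the entire dataset in the application at once.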

File-backed lists in Python
