Question

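I'm currently using a 4-gigabyte file as an open-addressing hash table. To read each offset, I call file.seek() and fetch 1 byte (a char) of data. I want to reduce the size of the file by using buckets (to save space at offsets that hold no data). To do that well, I'd like to know how many bytes get cached in memory when I use file.seek(). That way I can tune the buckets so that the file takes up less space without increasing the number of disk I/O reads.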

Answer 1

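The file.seek() approach will work, but it will also be very slow, since every lookup costs a seek plus a 1-byte read. You will probably want to align everything to page boundaries, so I suggest that nothing crosses a 4 KiB boundary.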
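As a rough illustration of what happens on the Python side: seek() only moves the file position, and it is the following read() that fetches data. For a buffered binary file, open() chooses the buffer size heuristically (often from the device's block size, falling back to io.DEFAULT_BUFFER_SIZE), so a 1-byte read typically pulls in a larger chunk, and the OS page cache works in whole pages underneath that. The helper name seek_read_byte and the file name table.bin below are hypothetical, made up for this sketch.

```python
import io
import mmap
import os
import tempfile


def seek_read_byte(path, offset, buffering=-1):
    """Return the one byte at `offset` using seek() followed by read(1).

    buffering=-1 lets open() pick its default buffer size;
    buffering=0 disables Python-level buffering (binary mode only).
    """
    with open(path, "rb", buffering=buffering) as f:
        f.seek(offset)    # moves the file position; no data is read here
        return f.read(1)  # a buffered file may fetch a larger chunk to serve this


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "table.bin")  # hypothetical stand-in file
        with open(path, "wb") as f:
            f.write(bytes(range(256)) * 64)    # 16 KiB of sample data

        print("Python fallback buffer size:", io.DEFAULT_BUFFER_SIZE)
        print("OS page size:", mmap.PAGESIZE)
        print(seek_read_byte(path, 5000))               # buffered read
        print(seek_read_byte(path, 5000, buffering=0))  # unbuffered read
```

Both calls return the same byte; the difference is only in how much data Python requests from the operating system to serve it.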

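If you are on a 64-bit processor, use mmap to map the whole file into memory instead of using file.seek(). You can then rely on the rule of thumb that the page size is usually 4 KiB and align everything on 4 KiB boundaries. This will almost certainly be faster than naively calling file.seek(). It may end up using more memory, but the operating system can tune its caching to your access pattern.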
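One way to keep buckets from straddling a page boundary is to pack a whole number of buckets into each page and leave the remainder as padding. The following is a minimal sketch of that idea, not a definitive layout: the function name bucket_offset and the 48-byte bucket size are assumptions made for illustration, and mmap.PAGESIZE is used only as the default page size.

```python
import mmap


def bucket_offset(index, bucket_size, page_size=mmap.PAGESIZE):
    """Return the file offset of bucket `index` so that no bucket crosses a page.

    Each page holds page_size // bucket_size buckets; any leftover bytes
    at the end of the page are unused padding.
    """
    if not 0 < bucket_size <= page_size:
        raise ValueError("bucket_size must be between 1 and page_size")
    per_page = page_size // bucket_size      # whole buckets that fit in one page
    page, slot = divmod(index, per_page)     # which page, and which slot in it
    return page * page_size + slot * bucket_size


if __name__ == "__main__":
    # With 48-byte buckets and 4 KiB pages, 85 buckets fit per page.
    print(bucket_offset(84, 48, 4096))  # last bucket of the first page
    print(bucket_offset(85, 48, 4096))  # first bucket of the second page
```

With these example numbers, each page wastes 16 bytes of padding in exchange for every bucket being readable from a single page.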

On Python 3, you would use mmap as follows:

import mmap

# provided that your hashtable is in this file
# and its size is 4 GiB
with open("hashtable", "r+b") as f:
    # memory-map the file, size 0 means whole file
    mm = mmap.mmap(f.fileno(), 0)

    # here mm behaves like 4 billion element bytearray
    # that you can read from and write to. changes
    # are flushed to the underlying file.

    # set 1 byte in the file
    mm[123456789] = 42

    # ensure that changes are written to disk
    mm.flush()

    # close the mapping
    mm.close()
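For the lookup side, a read-only mapping is enough. The sketch below assumes hypothetical helper names (mapped_byte, mapped_bucket) and a small temporary file standing in for the real table; it shows that indexing a mapping yields an int while slicing yields a bytes copy.

```python
import mmap
import os
import tempfile


def mapped_byte(path, offset):
    """Return the byte at `offset` as an int (0..255) via a read-only mapping."""
    with open(path, "rb") as f:
        # length 0 maps the whole file; ACCESS_READ makes the mapping read-only
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            return mm[offset]


def mapped_bucket(path, offset, size):
    """Return `size` bytes starting at `offset` as a bytes object."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            return mm[offset:offset + size]  # slicing copies the data out


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "hashtable.bin")  # hypothetical stand-in file
        data = bytearray(8192)
        data[5000] = 42
        with open(path, "wb") as f:
            f.write(data)

        print(mapped_byte(path, 5000))
        print(mapped_bucket(path, 4096, 16))
```

In a real table you would keep one mapping open for the lifetime of the program instead of remapping on every lookup; the per-call mapping here only keeps the example self-contained.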

How many bytes are typically loaded into memory when using file.seek() in Python?
