Question

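I'm writing a program that uses dynamic programming to solve a difficult problem. The DP solution requires storing a large table. The full table occupies about 300 GB. Physically, it is stored in 40 files of roughly 7 GB each. I mark unused table entries with the byte \xFF. I'd like to allocate space for this table quickly. The program has to run under both Windows and Linux.

In short, I'd like to efficiently create large files filled with a specific byte, in a cross-platform manner.

Here is the code I'm currently using: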

def reset_storage(self, path):
    fill = b'\xFF'

    with open(path, 'wb') as f:
        for _ in range(3715948544 * 2):
            f.write(fill)

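Creating a single 7 GB file takes about 40 minutes. How do I speed it up?

I've looked at other questions, but none of them seem relevant:

Allocate a file of particular size in Linux with python - not answered
create file of particular size in python - the file is filled with \0, or the solution is Windows-only
How to create a file with a given size in Linux? - all solutions are Linux-specific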

Answer 1

Write blocks, not bytes, and avoid iterating over a huge range for no reason.

import itertools

def reset_storage(self, path):
    total = 3715948544 * 2
    block_size = 4096  # Tune this if needed, just make sure it's a factor of the total
    fill = b'\xFF' * block_size

    with open(path, 'wb') as f:
        f.writelines(itertools.repeat(fill, total // block_size))
        # If you want to handle initialization of arbitrary totals without
        # needing to be careful that block_size evenly divides total, add
        # a single:
        # f.write(fill[:total % block_size])
        # here to write out the incomplete block.

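The ideal block size differs from system to system. One reasonable choice is io.DEFAULT_BUFFER_SIZE, which automatically matches each write to a buffer flush while still keeping memory usage low.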
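As a minimal sketch of that suggestion, the block size can default to io.DEFAULT_BUFFER_SIZE and the final partial block can be written unconditionally, so the total no longer has to be a multiple of the block size. The function name fill_file and its parameters are hypothetical, chosen for illustration, and fill_byte is assumed to be a single byte.

```python
import io
import itertools


def fill_file(path, total, fill_byte=b'\xFF', block_size=io.DEFAULT_BUFFER_SIZE):
    """Create `path` holding `total` copies of `fill_byte` (hypothetical helper).

    Assumes `fill_byte` is a bytes object of length 1.
    """
    block = fill_byte * block_size
    full_blocks, remainder = divmod(total, block_size)

    with open(path, 'wb') as f:
        # Write every complete block; repeat() hands out the same bytes
        # object each time, so no extra copies of the block are built.
        f.writelines(itertools.repeat(block, full_blocks))
        # Write the trailing partial block (an empty write if remainder == 0).
        f.write(block[:remainder])
```

Under these assumptions, fill_file(path, 3715948544 * 2) would produce the same file as reset_storage above.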

Answer 2

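Your problem is that you call a Python method far too often (once for every byte!). What I suggest is certainly not perfect, but it will be many, many times faster: build a large block of \xFF bytes once and write that block repeatedly, rather than writing a single byte per call.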
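A minimal sketch of the chunked-write idea, offered as an illustration rather than as this answer's own code. The 1 MiB chunk size and the name reset_storage_chunked are assumptions, and the default total is taken from the question.

```python
CHUNK_SIZE = 1024 * 1024  # 1 MiB per write call; an assumed value, tune as needed


def reset_storage_chunked(path, total=3715948544 * 2, chunk_size=CHUNK_SIZE):
    """Fill `path` with `total` bytes of 0xFF, one chunk per write call."""
    chunk = b'\xFF' * chunk_size  # built once, reused for every write
    remaining = total

    with open(path, 'wb') as f:
        while remaining > 0:
            n = min(remaining, chunk_size)
            # Only the last write can be shorter than a full chunk.
            f.write(chunk if n == chunk_size else chunk[:n])
            remaining -= n
```

With these values, the 7 GB file from the question needs roughly 7,000 write calls instead of about 7.4 billion.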

How to efficiently allocate a file of predefined size and fill it with a non-zero value using Python?
