Question

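Problem:

I run the same Python script many times.
Sometimes the input data changes (it is extracted from some files).
I want to record whether the data has changed.
To do this, I hash the data and save the hash code.
If the hash differs, I know the data has changed.
Elsewhere, I save the mapping from file -> hash code.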

I wrote this function to track changes to the data each time the script runs.

    def track_data_change_hash(self, data):
        try:
            import hashlib
            data_hash = hashlib.md5(str(data).encode('utf-8')).hexdigest()
            self.track("the_hash", data_hash[:12])
        except:
            print('failed to create dataset hash')

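My problem is that the input data can sometimes be very large (100 GB), and then this fails.

How should I handle this? Any good ideas? (I am considering taking the first X MB of the file/input data and hashing that.)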

Answer 1

You need to read the file in chunks of a suitable size:

import hashlib


def md5_for_file(your_data, block_size=2048):
    # your_data is a file object opened in binary mode ("rb").
    # The file is read block_size bytes at a time, so only one block is
    # held in memory at once; adjust block_size to suit your data.
    md5 = hashlib.md5()

    while True:
        data = your_data.read(block_size)
        if not data:
            break
        md5.update(data)
    return md5.digest()
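
As a usage sketch for the chunked approach above, the version below opens the file by path and returns a hex digest, so the result can be compared with a hash saved from an earlier run. The names `file_md5_hex` and `data_changed`, and the 1 MiB default block size, are assumptions made for illustration and are not part of the original answer.

```python
import hashlib


def file_md5_hex(path, block_size=1024 * 1024):
    """Return the MD5 hex digest of the file at `path`, read in chunks."""
    md5 = hashlib.md5()
    with open(path, "rb") as f:  # binary mode so read() returns bytes
        # iter() with a sentinel keeps calling f.read(block_size)
        # until it returns b"" at end of file.
        for chunk in iter(lambda: f.read(block_size), b""):
            md5.update(chunk)
    return md5.hexdigest()


def data_changed(path, previous_hash):
    """Hypothetical helper: True if the file's hash differs from the stored one."""
    return file_md5_hex(path) != previous_hash
```

Because only one block is in memory at a time, memory use stays roughly constant regardless of file size, although the whole file still has to be read from disk.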

You can also use the seek() and read() functions to read only a few bytes, which keeps RAM usage low:

with open("1.txt", "rb") as raw_data:
    raw_data.seek(0) 
    output_data = raw_data.read(12)

# Only the first 12 bytes of the file are read here. You can hash just this
# part of your data and compare it with the value stored in your DB.
# Note that changes outside these bytes will not be detected.
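
If hashing only part of the file is acceptable, one possible sketch is to combine the file size with a hash of the first few megabytes. The function name `quick_fingerprint` and the 10 MiB sample size are assumptions for illustration. This is a heuristic rather than a full check: a change beyond the sampled region that leaves the file size unchanged will go unnoticed.

```python
import hashlib
import os


def quick_fingerprint(path, sample_bytes=10 * 1024 * 1024):
    """Hash the file size plus the first `sample_bytes` bytes of the file."""
    md5 = hashlib.md5()
    # Including the size catches appends and truncations outside the sample.
    md5.update(str(os.path.getsize(path)).encode("utf-8"))
    with open(path, "rb") as f:
        md5.update(f.read(sample_bytes))
    # Truncated to 12 hex characters, as in the question's code.
    return md5.hexdigest()[:12]
```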

Good luck.

Tracking data changes in Python with hashlib.md5
