The idea is to compare the timings of two functions that calculate hashes: the custom calculate_hash function and the computehash method provided by the py.path module.
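calculate_hash itself lives in the duplicates module and is not shown in this snippet; for context, here is a minimal sketch of that kind of function, assuming it reads the file in fixed-size chunks and feeds them to hashlib.md5 (the chunk_size parameter and its default are illustrative, not the actual implementation):

import hashlib

def calculate_hash(path, chunk_size=4096):
    """Illustrative sketch only: md5 hex digest of a file, read in chunks."""
    md5 = hashlib.md5()
    with open(str(path), 'rb') as handle:
        # Read chunk by chunk so large files need not fit into memory at once.
        for chunk in iter(lambda: handle.read(chunk_size), b''):
            md5.update(chunk)
    return md5.hexdigest()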
The helper function construct_objects creates dummy file objects whose hashes are computed.
I think there is something wrong with my timing code, because the timing results are very large (even for the computehash method).
import random
import timeit

import py

from duplicates import calculate_hash


def construct_objects(directory, test=False):
    """Creates file objects and writes binary data into each of them.

    directory: py.path.local object (an object-oriented interface to os.path)
    if test=True - asserts equality between the hashes calculated
    by the calculate_hash function and the path.computehash method
    returns an iterator over the paths of the created files, as strings
    """
    paths = []
    for ind in range(10):
        path = directory.join(str(ind))
        data = bytearray(random.choice([0, 1])
                         for dummy_ind in range(100000))
        path.write_binary(data)
        paths.append(path)
    if test:
        for path in paths:
            assert path.computehash() == calculate_hash(path)
    paths = map(str, paths)
    return paths

if __name__ == '__main__':
    tmp = py.path.local('/tmp')
    tmp = tmp.mkdtemp()
    paths = list(construct_objects(tmp, test=True))

    setup1 = 'from duplicates import calculate_hash; from __main__ import paths'
    stmt1 = 'for path in paths: calculate_hash(path)'
    timer1 = timeit.Timer(stmt=stmt1, setup=setup1)
    print(timer1.repeat())

    setup2 = 'import py.path; from __main__ import paths'
    stmt2 = 'for path in paths: py.path.local(path).computehash()'
    timer2 = timeit.Timer(stmt=stmt2, setup=setup2)
    print(timer2.repeat())
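Note that timer1.repeat() and timer2.repeat() above use timeit's defaults; repeat() runs the statement number=1000000 times per repetition unless told otherwise. The figures in the update below were presumably obtained with the counts passed explicitly, along these lines (values taken from the update, not the exact calls used):

print(timer1.repeat(repeat=3, number=1000))
print(timer2.repeat(repeat=3, number=1000))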
Update:

Following @Peilonrayz's advice, I tested the calculate_hash function with chunks of different sizes. All tests use the same input (10 files of 100 kB each), with number = 1000 and repeat = 3.
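The per-chunk-size runs were driven roughly like this (a sketch: it assumes calculate_hash accepts a chunk_size argument, as in the sketch above, and reuses paths and setup1 from the script):

for chunk_size in (524288, 4098, 64):
    stmt = ('for path in paths: '
            'calculate_hash(path, chunk_size={})'.format(chunk_size))
    timer = timeit.Timer(stmt=stmt, setup=setup1)
    print(timer.repeat(repeat=3, number=1000))

The results: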
Function `calculate_hash` with chunk_size = 524288:
[3.166437058000156, 3.1616951290016004, 3.1614671890001773]
Function `calculate_hash` with chunk_size = 4098:
[3.173143865000384, 3.1707858160007163, 3.1684196309997787]
Function `calculate_hash` with chunk_size = md5.block_size (64):
[3.17174992100081, 3.1530033399994863, 3.152337475999957]
Method `computehash`:
[4.248858199000097, 4.240910821001307, 4.2475174219998735]