The idea is to compare the timings of two functions that calculate hashes: the custom calculate_hash function and the computehash method provided by the py.path module.
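calculate_hash itself lives in the duplicates module and is not shown in this snippet; for context, here is a minimal sketch of that kind of function, assuming it reads the file in fixed-size chunks and feeds them to hashlib.md5 (the chunk_size parameter and its default are illustrative, not the actual implementation):

import hashlib

def calculate_hash(path, chunk_size=4096):
    """Illustrative sketch only: md5 hex digest of a file, read in chunks."""
    md5 = hashlib.md5()
    with open(str(path), 'rb') as handle:
        # Read chunk by chunk so large files need not fit into memory at once.
        for chunk in iter(lambda: handle.read(chunk_size), b''):
            md5.update(chunk)
    return md5.hexdigest()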
The helper function construct_objects creates dummy file objects whose hashes are computed.
I think there is something wrong with my timing code, because the timing results are very large (even for the computehash method).
import random
import timeit

import py

from duplicates import calculate_hash


def construct_objects(directory, test=False):
    """Creates file objects and writes binary data into each of them.

    directory: py.path.local object (an object-oriented interface to os.path)
    if test=True - asserts equality between the hashes calculated
    by the calculate_hash function and the path.computehash method
    returns an iterator over the paths of the created files, as strings
    """
    paths = []
    for ind in range(10):
        path = directory.join(str(ind))
        data = bytearray(random.choice([0, 1])
                         for dummy_ind in range(100000))
        path.write_binary(data)
        paths.append(path)
    if test:
        for path in paths:
            assert path.computehash() == calculate_hash(path)
    paths = map(str, paths)
    return paths

if __name__ == '__main__':
    tmp = py.path.local('/tmp')
    tmp = tmp.mkdtemp()
    paths = list(construct_objects(tmp, test=True))

    setup1 = 'from duplicates import calculate_hash; from __main__ import paths'
    stmt1 = 'for path in paths: calculate_hash(path)'
    timer1 = timeit.Timer(stmt=stmt1, setup=setup1)
    print(timer1.repeat())

    setup2 = 'import py.path; from __main__ import paths'
    stmt2 = 'for path in paths: py.path.local(path).computehash()'
    timer2 = timeit.Timer(stmt=stmt2, setup=setup2)
    print(timer2.repeat())
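Note that timer1.repeat() and timer2.repeat() above use timeit's defaults; repeat() runs the statement number=1000000 times per repetition unless told otherwise. The figures in the update below were presumably obtained with the counts passed explicitly, along these lines (values taken from the update, not the exact calls used):

print(timer1.repeat(repeat=3, number=1000))
print(timer2.repeat(repeat=3, number=1000))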
Update:

Following @Peilonrayz's advice, I tested the calculate_hash function with chunks of different sizes. All tests use the same input (10 files of 100 kB each), with number = 1000 and repeat = 3.
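The per-chunk-size runs were driven roughly like this (a sketch: it assumes calculate_hash accepts a chunk_size argument, as in the sketch above, and reuses paths and setup1 from the script):

for chunk_size in (524288, 4098, 64):
    stmt = ('for path in paths: '
            'calculate_hash(path, chunk_size={})'.format(chunk_size))
    timer = timeit.Timer(stmt=stmt, setup=setup1)
    print(timer.repeat(repeat=3, number=1000))

The results: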
Function `calculate_hash` with chunk_size = 524288:
[3.166437058000156, 3.1616951290016004, 3.1614671890001773]
Function `calculate_hash` with chunk_size = 4098:
[3.173143865000384, 3.1707858160007163, 3.1684196309997787]
Function `calculate_hash` with chunk_size = md5.block_size (64):
[3.17174992100081, 3.1530033399994863, 3.152337475999957]
Method `computehash`:
[4.248858199000097, 4.240910821001307, 4.2475174219998735]