Question

How can I find the complexity of this piece of code as the size of my substring increases?

if size > 160:
    sub = (hashlib.sha1(sub.encode('utf-8')).hexdigest())

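I got curious when I noticed that my program was running as if the hash function executed in constant time. In my program, if 'size' is 165, the worst case is that the code above executes 165 times. A test I just ran shows that the execution time of sha1 has an erratic relationship with the input length.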

Length (millions of chars)   Time (s)
0                            0
1                            0.015000105
2                            0.016000032
3                            0.046000004
4                            0.046999931
5                            0.062000036
6                            0.078000069
7                            0.078000069
8                            0.07799983
9                            0.108999968

Test code:

import string
import random
import hashlib
import time

def randomly(size=6, chars=string.ascii_uppercase + string.digits):
    return ''.join(random.choice(chars) for _ in range(size))

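# Time one SHA-1 hash for each input length 1, 1000001, ..., 9000001
for i in range(1, 10000001, 1000000):
    random_str = randomly(i)
    start = time.time()
    str_hash = hashlib.sha1(random_str.encode('utf-8')).hexdigest()
    print(time.time() - start)  # elapsed seconds (includes the encode step)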

Answer 1

I disagree with DarthGizka. Here is more of the description from the same Wikipedia article:

Pre-processing:
append the bit '1' to the message i.e. by adding 0x80 if characters are 8 bits. 
append 0 ≤ k < 512 bits '0', thus the resulting message length (in bits)
   is congruent to 448 (mod 512)
append ml, in a 64-bit big-endian integer. So now the message length is a multiple of 512 bits.

Process the message in successive 512-bit chunks:
break message into 512-bit chunks
for each chunk
    break chunk into sixteen 32-bit big-endian words w[i], 0 ≤ i ≤ 15

    Extend the sixteen 32-bit words into eighty 32-bit words:
    for i from 16 to 79
        w[i] = (w[i-3] xor w[i-8] xor w[i-14] xor w[i-16]) leftrotate 1

        ......

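Padding is only a pre-processing step. Far more work is done inside the 'for each chunk' loop. Since mattkaeo's data sizes exceed 1,000,000 characters (except for the first sample), that loop should consume most of the time, and the cost of the padding is negligible.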
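To put rough numbers on this, here is a minimal sketch that counts how many 512-bit chunks the main loop processes for a given message length, following the padding rules quoted above. The names sha1_chunk_count and padding_fraction are made up for this illustration and are not part of hashlib; the sketch assumes the message is a whole number of bytes.

```python
def sha1_chunk_count(message_len: int) -> int:
    """Number of 64-byte (512-bit) chunks SHA-1 processes for message_len bytes."""
    # 1 byte for the 0x80 marker plus 8 bytes for the 64-bit length field,
    # then round up to the next multiple of 64 bytes.
    return (message_len + 9 + 63) // 64


def padding_fraction(message_len: int) -> float:
    """Share of the padded message that is padding rather than input."""
    padded_len = 64 * sha1_chunk_count(message_len)
    return (padded_len - message_len) / padded_len


# Same input lengths as the test code in the question
for n in range(1, 10000001, 1000000):
    chunks = sha1_chunk_count(n)
    # Each chunk goes through the 80-step main loop once
    print(n, chunks, 80 * chunks, padding_fraction(n))
```

The chunk count grows in proportion to the message length, while the padding is never more than 72 bytes, so for inputs of this size the per-chunk loop should dominate.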

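I believe mattkaeo's results are not very linear because he ran each sample only once, so system noise (e.g. the OS and other processes sharing CPU time) is significant. I ran each sample 200 times: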

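import timeit

# randomly() is the helper defined in the question
for i in range(1, 10000001, 1000000):
    random_str = randomly(i)
    # Encode once in setup so only the hashing is timed; prints the total for 200 runs
    print(timeit.timeit('hashlib.sha1(random_str).hexdigest()',
                        setup='import hashlib; random_str="%s".encode("utf-8")' % random_str,
                        number=200))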

The results are much more linear:

0.000172138214111
0.303541898727
0.620085954666
0.932041883469
1.29230999947
1.57217502594
1.93531990051
2.24045419693
2.56945014
2.95437908173
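One way to check how linear these numbers are is a least-squares fit of time against input length. The sketch below is an illustration only: least_squares is a hypothetical helper, the lengths are assumed to be those produced by the benchmark loop, and the timings are copied from the list above.

```python
# Input lengths produced by range(1, 10000001, 1000000)
lengths = list(range(1, 10000001, 1000000))

# Timings listed above: seconds for 200 hashes at each length
times = [0.000172138214111, 0.303541898727, 0.620085954666, 0.932041883469,
         1.29230999947, 1.57217502594, 1.93531990051, 2.24045419693,
         2.56945014, 2.95437908173]


def least_squares(xs, ys):
    """Return (slope, intercept, r_squared) of the line y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    syy = sum((y - mean_y) ** 2 for y in ys)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = (sxy * sxy) / (sxx * syy)
    return slope, intercept, r_squared


slope, intercept, r_squared = least_squares(lengths, times)
print("seconds per character per hash:", slope / 200)
print("R^2:", r_squared)
```

An R^2 close to 1 would be consistent with a running time that is linear in the input length, i.e. O(n).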

Answer 2

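The SHA-1 algorithm pads the input (the 'message') to a multiple of 512 bits after appending a '1' bit and the size of the input. From the algorithm description on Wikipedia: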

append the bit '1' to the message i.e. by adding 0x80 if characters are 8 bits
append 0 ≤ k < 512 bits '0'
append ml (message length), in a 64-bit big-endian integer.

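This is why the running time is a step function of the input size: it stays constant for a while and then jumps.

However, this effect diminishes as the message size grows relative to the 64-byte block size (the step size).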
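The padding rule and the resulting steps can be sketched as follows. This is an illustration only: sha1_pad is a hypothetical helper, not a hashlib function, and it assumes the message is a whole number of bytes.

```python
def sha1_pad(message: bytes) -> bytes:
    """Pad a message as described in the SHA-1 pre-processing step."""
    ml = len(message) * 8                                # original length in bits
    padded = message + b"\x80"                           # the '1' bit plus seven '0' bits
    padded += b"\x00" * ((56 - len(padded) % 64) % 64)   # fill to 448 bits (mod 512)
    padded += ml.to_bytes(8, "big")                      # 64-bit big-endian length
    return padded


# The padded size stays flat, then jumps by one 64-byte block
for n in (0, 55, 56, 119, 120):
    print(n, len(sha1_pad(b"a" * n)))   # 64, 64, 128, 128, 192

# For large inputs the padding is a tiny share of the total
for n in (100, 10000, 1000000):
    padded_len = len(sha1_pad(b"a" * n))
    print(n, padded_len, (padded_len - n) / n)
```

The first loop shows the steps at 64-byte boundaries; the second shows the relative overhead shrinking as the message grows.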

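Other noticeable changes occur when the message size approaches and exceeds the various memory cache sizes: typically 32 KB for the level 1 cache (L1), 256 KB for L2, and 4, 8, or even 20 MB for the L3 cache, ordered from fastest to slowest. Uncached memory access is slower still.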

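In mattkaeo's case, the hashing found the caches warm (i.e. much of the data was still present in the cache) as long as the data did not significantly exceed the cache sizes. The difference between a warm cache and uncached memory tends to be more pronounced than with lukewarm or cold caches, where the hit rate is lower.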

What is the complexity of the sha1 hash function?
