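Fastest way to update a dictionary and check for keys

Asked: 2011-04-13 18:46:28

Tags: python performance dictionary append

I am building a dictionary over a very long string (~1 GB), where each key is a fixed-length k-mer and the value is the list of all positions where that k-mer occurs. When k is large (> 9), pre-building the k-mer dictionary makes no sense, because not all possible k-mers will occur and it bloats the table.

Currently I do it like this: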

def hash_string(st, mersize):

    stsize = len(st)
    hash = {}
    r = stsize-mersize+1

    for i in range(0, r):
        mer = st[i:i+mersize]
        if mer in hash:
            hash[mer].append(i)
        else:
            hash[mer] = [i]

    return hash

# test for function hash_string above
mer3 = hash_string("ABCDABBBBBAAACCCCABCDDDD", 3)

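The most time-consuming step (according to cProfile) is the lookup that decides, as we move along the string, whether each key is new or already present. What is the fastest way to do this?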

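(I am currently testing a two-pass strategy that avoids this step and is much faster for large sequences. In the first pass I build the list of keys by simply overwriting duplicates. Then I no longer have to check whether a key exists: I seed my dictionary with those keys, and in the second pass I just append as I encounter them.)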
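A rough sketch of that two-pass idea, for illustration only. The function name hash_string_two_pass is hypothetical, and the dict comprehension is just one possible way to seed the keys; it has not been profiled here against the single-pass version.

```python
def hash_string_two_pass(st, mersize):
    """Two-pass k-mer index: seed all keys first, then append positions."""
    r = len(st) - mersize + 1

    # Pass 1: seed every distinct k-mer with its own empty list.
    # Duplicate k-mers simply overwrite the earlier entry.
    positions = {st[i:i + mersize]: [] for i in range(r)}

    # Pass 2: every key already exists, so no membership test is needed.
    for i in range(r):
        positions[st[i:i + mersize]].append(i)

    return positions


mer3 = hash_string_two_pass("ABCDABBBBBAAACCCCABCDDDD", 3)
```

Note that this slices the string twice, so it trades the membership test for extra slicing work.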

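Still, I would like to know, in general, the fastest way to look up a dict key in Python, because this is a common pattern:

If the key exists, append the new entry; otherwise, create the key and add the first element.

What is the fastest way to implement this pattern?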

4 answers:

Answer 0: (score: 8)

I would use collections.defaultdict:

import collections
...
hash = collections.defaultdict(list)
r = stsize-mersize+1

for i in range(0, r):
    mer = st[i:i+mersize]
    hash[mer].append(i)

I have never profiled it against the if ... else version, though.
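One behaviour worth knowing when querying the result afterwards: indexing a defaultdict with a missing key inserts that key. A small sketch (the sample k-mers and the names positions and frozen are made up for illustration):

```python
from collections import defaultdict

positions = defaultdict(list)
for i, mer in enumerate(["ABC", "BCD", "ABC"]):
    positions[mer].append(i)

# Indexing a missing key inserts it with an empty list as a side effect.
missing = positions["ZZZ"]
inserted = "ZZZ" in positions

# Membership tests and .get() do not insert anything.
got = positions.get("QQQ")
still_absent = "QQQ" not in positions

# Converting to a plain dict restores the usual KeyError for missing keys.
frozen = dict(positions)
```

If the table is only read after it has been built, converting it to a plain dict avoids growing it by accident.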

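Answer 1: (score: 4)

In general, which method to use depends on your data. I built some simple tests that use different kinds of data to show how the timings can change.

Strings used: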

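  1. The string from the question.
  2. A larger pseudorandom string (presumably with more distinct mers/keys in the hash).
  3. A string with very few distinct mers/keys in the hash.

Here is some quick code that tests the various methods (I have upvoted the defaultdict answer, since it appears to be the fastest):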

    import random
    from timeit import Timer
    from collections import defaultdict
    
    def test_check_first(st, mersize):
        """ Look for the existence of the mer in the dict.
        """
        mer_hash = {}
        r = len(st)-mersize+1
    
        for i in range(0, r):
            mer = st[i:i+mersize]
            if mer in mer_hash:
                mer_hash[mer].append(i)
            else:
                mer_hash[mer] = [i]
    
        return mer_hash
    
    def test_throw_exception(st, mersize):
        """ Catch the KeyError thrown if a mer doesn't exist in the dict.
        """
        mer_hash = {}
        r = len(st)-mersize+1
    
        for i in range(0, r):
            mer = st[i:i+mersize]
            try:
                mer_hash[mer].append(i)
            except KeyError:
                mer_hash[mer] = [i]
    
        return mer_hash
    
    def test_defaultdict(st, mersize):
        """ Use a defaultdict.
        """
        mer_hash = defaultdict(list)
        r = len(st)-mersize+1
    
        for i in range(0, r):
            mer = st[i:i+mersize]
            mer_hash[mer].append(i)
    
        return mer_hash
    
    def test_dict_setdefault(st, mersize):
        """ Use dict's setdefault method
        """
        mer_hash = {}
        r = len(st)-mersize+1
    
        for i in range(0, r):
            mer = st[i:i+mersize]
            mer_hash.setdefault(mer, []).append(i)
    
        return mer_hash
    
    def gen_pseudorandom_string(size):
        """ Generate a larger, more "random" string of data.
        """
        # only use four letters
        letters = ('A', 'B', 'C', 'D')
        return ''.join(random.choice(letters) for i in range(size))
    
    if __name__ == '__main__':
        # test functions
        test_strings = ('ABCDABBBBBAAACCCCABCDDDD', gen_pseudorandom_string(1000), 'A'*100 + 'B'*100 + 'C'*100 + 'D'*100)
        mer_size = 3
        passes = 10000
    
        for string in test_strings:
            display_string = string if len(string) <= 30 else string[:30] + '...'
            # Single-argument print(...) calls run under both Python 2 and Python 3.
            print('Testing with string: "' + display_string + '" and mer size: ' + str(mer_size) + ' and number of passes: ' + str(passes))
    
            # Each Timer imports the function under test and its arguments from __main__.
            t1 = Timer("test_check_first(string, mer_size)", "from __main__ import test_check_first, string, mer_size")
            print('\tResults for test_check_first:  %.2f usec/pass' % (1000000 * t1.timeit(number=passes)/passes))
    
            t2 = Timer("test_throw_exception(string, mer_size)", "from __main__ import test_throw_exception, string, mer_size")
            print('\tResults for test_throw_exception:  %.2f usec/pass' % (1000000 * t2.timeit(number=passes)/passes))
    
            t3 = Timer("test_defaultdict(string, mer_size)", "from __main__ import test_defaultdict, string, mer_size")
            print('\tResults for test_defaultdict:  %.2f usec/pass' % (1000000 * t3.timeit(number=passes)/passes))
    
            t4 = Timer("test_dict_setdefault(string, mer_size)", "from __main__ import test_dict_setdefault, string, mer_size")
            print('\tResults for test_dict_setdefault:  %.2f usec/pass' % (1000000 * t4.timeit(number=passes)/passes))
    

Here are the results I got when running it on my machine:

    Testing with string: "ABCDABBBBBAAACCCCABCDDDD" and mer size: 3 and number of passes: 10000
        Results for test_check_first:  8.70 usec/pass
        Results for test_throw_exception:  22.78 usec/pass
        Results for test_defaultdict:  10.61 usec/pass
        Results for test_dict_setdefault:  8.88 usec/pass
    Testing with string: "BACDDDADAAABADBDADDBBBCAAABBBC..." and mer size: 3 and number of passes: 10000
        Results for test_check_first:  305.19 usec/pass
        Results for test_throw_exception:  320.62 usec/pass
        Results for test_defaultdict:  254.56 usec/pass
        Results for test_dict_setdefault:  342.55 usec/pass
    Testing with string: "AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA..." and mer size: 3 and number of passes: 10000
        Results for test_check_first:  114.23 usec/pass
        Results for test_throw_exception:  107.96 usec/pass
        Results for test_defaultdict:  94.11 usec/pass
        Results for test_dict_setdefault:  125.72 usec/pass
    

Answer 2: (score: 3)

Dictionaries have a setdefault method that does what you need, though I am not sure how much faster it will be.

So your new pattern could be:

hash.setdefault(mer, []).append(i)
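One possible reason it may not be faster: the `[]` argument is evaluated before setdefault is called, so a fresh list is created on every iteration even when the key already exists. A small sketch of that behaviour (the sample k-mers and the names positions, spare and stored are made up for illustration):

```python
positions = {}
for i, mer in enumerate(["ABC", "BCD", "ABC"]):
    # The [] argument is built before the call on every iteration and is
    # thrown away whenever the key is already present.
    positions.setdefault(mer, []).append(i)

# When the key exists, setdefault returns the stored list and ignores
# the default that was passed in.
spare = []
stored = positions.setdefault("ABC", spare)
```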

Answer 3: (score: 1)

You can also try an exception-based approach:

# Python2-3 compatibility
try: xrange
except NameError: xrange= range

for i in xrange(r):
    mer = st[i:i+mersize]
    try: hash[mer].append(i)
    except KeyError: # not there
        hash[mer]= [i]

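Note that this will be slower than your approach if mer is not found most of the time, but faster if it is found most of the time. You know your data and can make the choice. Also, it is better not to shadow built-ins such as hash.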
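To get a feel for which case applies, one option is to estimate how often a lookup would hit an existing key. A minimal sketch, assuming it is run on a sample of the sequence rather than the full string, since it holds every distinct k-mer in memory (the helper name hit_ratio is hypothetical):

```python
def hit_ratio(st, mersize):
    """Fraction of k-mer lookups that would find an already existing key."""
    total = len(st) - mersize + 1
    if total <= 0:
        return 0.0
    # Each distinct k-mer misses exactly once (on its first occurrence).
    distinct = len(set(st[i:i + mersize] for i in range(total)))
    return float(total - distinct) / total


ratio = hit_ratio("ABCDABBBBBAAACCCCABCDDDD", 3)
```

A ratio close to 1 means most lookups find an existing key, which is the case where the try/except version tends to do well; a ratio close to 0 means most lookups raise KeyError.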