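I am building a dictionary over a very long string (~1 GB), where each key is a fixed-length k-mer and the value is the list of all positions where it occurs. When k is large (> 9), pre-building the dictionary of all possible k-mers makes no sense, because not all of them will occur and it bloats the table.
Currently I do the job like this: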
def hash_string(st, mersize):
    stsize = len(st)
    hash = {}
    r = stsize-mersize+1
    for i in range(0, r):
        mer = st[i:i+mersize]
        if mer in hash:
            hash[mer].append(i)
        else:
            hash[mer] = [i]
    return hash
# test for function hash_string above
mer3 = hash_string("ABCDABBBBBAAACCCCABCDDDD", 3)
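The most time-consuming step (I ran cProfile) is looking up each key as we move along the string to see whether it is new or already present. What is the fastest way to do that?
(I'm currently testing a two-pass strategy that avoids this step and is much faster for large sequences: I first build the list of keys by simply overwriting duplicates. Then I don't have to check whether a key exists: I seed my dictionary with those keys, and on the second pass I just append as I encounter them.)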
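For illustration, the two-pass idea described above could be sketched roughly as follows. This is only one reading of that description, not the code actually being tested: the function name hash_string_two_pass and the use of None as a placeholder value in the first pass are assumptions.

```python
def hash_string_two_pass(st, mersize):
    """Hypothetical sketch of the two-pass strategy (names are assumptions)."""
    r = len(st) - mersize + 1

    # Pass 1: record every k-mer as a key; duplicates simply overwrite
    # the same entry, so no existence check is needed.
    positions = {}
    for i in range(r):
        positions[st[i:i+mersize]] = None

    # Seed every known key with an empty list (values change, keys do not,
    # so modifying the dict while iterating over it is safe here).
    for mer in positions:
        positions[mer] = []

    # Pass 2: every key is guaranteed to exist, so just append.
    for i in range(r):
        positions[st[i:i+mersize]].append(i)
    return positions


mer3 = hash_string_two_pass("ABCDABBBBBAAACCCCABCDDDD", 3)
```

Note that this slices the string twice, so whether it really beats a single pass depends on the data and should be measured.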
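Still, I would like to know, in summary, the fastest way to do a dict key lookup in Python, since this is a common pattern:
if the key exists, append the new entry; otherwise, create the key and add the first element.
What is the fastest way to implement this pattern?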
Answer 0 (score: 8)
import collections
...
hash = collections.defaultdict(list)
r = stsize-mersize+1
for i in range(0, r):
    mer = st[i:i+mersize]
    hash[mer].append(i)
I have never profiled it against the if ... else version, though.
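One thing to keep in mind with defaultdict, shown here as a small sketch rather than as part of the answer above: plain indexing of a missing key inserts that key, so later lookups on the finished table are safer with in or .get(). The variable names below are made up for the example.

```python
from collections import defaultdict

positions = defaultdict(list)
positions["ABC"].append(0)

# Membership tests and .get() leave the table untouched.
found = "XYZ" in positions
value = positions.get("XYZ")
size_after_safe_lookups = len(positions)

# Plain indexing inserts the missing key with an empty list.
empty = positions["XYZ"]
size_after_indexing = len(positions)

# Converting to a plain dict once building is done restores KeyError on misses.
frozen = dict(positions)
```

For a table built from a ~1 GB string, accidentally growing it during lookups could matter, which is why converting to a regular dict after the build may be worth considering.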
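Answer 1 (score: 4)
In general, which method to use depends on your data. I put together some simple tests that use different kinds of data to show how the timings can change.
Strings used: the sample string from the question, a pseudorandom string of 1000 characters drawn from A, B, C and D, and 'A'*100 + 'B'*100 + 'C'*100 + 'D'*100.
Here is some quick code that tests the various approaches (I have upvoted the defaultdict answer, since it appears to be the fastest).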
import random
from timeit import Timer
from collections import defaultdict
def test_check_first(st, mersize):
    """ Look for the existence of the mer in the dict.
    """
    mer_hash = {}
    r = len(st)-mersize+1
    for i in range(0, r):
        mer = st[i:i+mersize]
        if mer in mer_hash:
            mer_hash[mer].append(i)
        else:
            mer_hash[mer] = [i]
    return mer_hash
def test_throw_exception(st, mersize):
    """ Catch the KeyError thrown if a mer doesn't exist in the dict.
    """
    mer_hash = {}
    r = len(st)-mersize+1
    for i in range(0, r):
        mer = st[i:i+mersize]
        try:
            mer_hash[mer].append(i)
        except KeyError:
            mer_hash[mer] = [i]
    return mer_hash
def test_defaultdict(st, mersize):
    """ Use a defaultdict.
    """
    mer_hash = defaultdict(list)
    r = len(st)-mersize+1
    for i in range(0, r):
        mer = st[i:i+mersize]
        mer_hash[mer].append(i)
    return mer_hash
def test_dict_setdefault(st, mersize):
    """ Use dict's setdefault method
    """
    mer_hash = {}
    r = len(st)-mersize+1
    for i in range(0, r):
        mer = st[i:i+mersize]
        mer_hash.setdefault(mer, []).append(i)
    return mer_hash
def gen_pseudorandom_string(size):
    """ Generate a larger, more "random" string of data.
    """
    # only use four letters
    letters = ('A', 'B', 'C', 'D')
    return ''.join(random.choice(letters) for i in range(size))
if __name__ == '__main__':
    # test functions
    test_strings = ('ABCDABBBBBAAACCCCABCDDDD', gen_pseudorandom_string(1000), 'A'*100 + 'B'*100 + 'C'*100 + 'D'*100)
    mer_size = 3
    passes = 10000
    for string in test_strings:
        display_string = string if len(string) <= 30 else string[:30] + '...'
        # print() with a single argument runs under both Python 2 and 3
        print('Testing with string: "' + display_string + '" and mer size: ' + str(mer_size) + ' and number of passes: ' + str(passes))
        t1 = Timer("test_check_first(string, mer_size)", "from __main__ import test_check_first, string, mer_size")
        print('\tResults for test_check_first: %.2f usec/pass' % (1000000 * t1.timeit(number=passes)/passes))
        t2 = Timer("test_throw_exception(string, mer_size)", "from __main__ import test_throw_exception, string, mer_size")
        print('\tResults for test_throw_exception: %.2f usec/pass' % (1000000 * t2.timeit(number=passes)/passes))
        t3 = Timer("test_defaultdict(string, mer_size)", "from __main__ import test_defaultdict, string, mer_size")
        print('\tResults for test_defaultdict: %.2f usec/pass' % (1000000 * t3.timeit(number=passes)/passes))
        t4 = Timer("test_dict_setdefault(string, mer_size)", "from __main__ import test_dict_setdefault, string, mer_size")
        print('\tResults for test_dict_setdefault: %.2f usec/pass' % (1000000 * t4.timeit(number=passes)/passes))
Here are the results I got when running it on my machine:
Testing with string: "ABCDABBBBBAAACCCCABCDDDD" and mer size: 3 and number of passes: 10000
Results for test_check_first: 8.70 usec/pass
Results for test_throw_exception: 22.78 usec/pass
Results for test_defaultdict: 10.61 usec/pass
Results for test_dict_setdefault: 8.88 usec/pass
Testing with string: "BACDDDADAAABADBDADDBBBCAAABBBC..." and mer size: 3 and number of passes: 10000
Results for test_check_first: 305.19 usec/pass
Results for test_throw_exception: 320.62 usec/pass
Results for test_defaultdict: 254.56 usec/pass
Results for test_dict_setdefault: 342.55 usec/pass
Testing with string: "AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA..." and mer size: 3 and number of passes: 10000
Results for test_check_first: 114.23 usec/pass
Results for test_throw_exception: 107.96 usec/pass
Results for test_defaultdict: 94.11 usec/pass
Results for test_dict_setdefault: 125.72 usec/pass
Answer 2 (score: 3)
Dictionaries have a setdefault method that does what you need, though I am not sure how much faster it will be.
So your new pattern could be:
hash.setdefault(mer, []).append(i)
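One cost of setdefault worth knowing about: its default argument is evaluated on every call, so a fresh empty list is built even when the key already exists, whereas defaultdict only calls its factory for missing keys. The sketch below makes that visible; counting_list is a hypothetical helper written just for this demonstration, and whether this accounts for any timing difference on your data is an assumption you would need to measure.

```python
from collections import defaultdict

calls = {"setdefault": 0, "defaultdict": 0}

def counting_list(label):
    """Hypothetical helper: return a new list and count how often one is built."""
    calls[label] += 1
    return []

mers = ["ABC", "ABC", "ABC", "BCD"]

# setdefault: the default list is constructed on every iteration.
plain = {}
for i, mer in enumerate(mers):
    plain.setdefault(mer, counting_list("setdefault")).append(i)

# defaultdict: the factory runs only when a key is missing.
auto = defaultdict(lambda: counting_list("defaultdict"))
for i, mer in enumerate(mers):
    auto[mer].append(i)
```

Both loops build the same table; only the number of throwaway lists differs.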
Answer 3 (score: 1)
You could also try an exception-based approach:
# Python2-3 compatibility
try: xrange
except NameError: xrange = range
for i in xrange(r):
    mer = st[i:i+mersize]
    try: hash[mer].append(i)
    except KeyError: # not there
        hash[mer] = [i]
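Note that this will be slower than your approach if mer is not found most of the time, but faster if it is found most of the time. You know your data and can make the choice.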
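If you are unsure which case your data falls into, one rough way to check is to estimate, on a sample of the sequence, what fraction of lookups hit a key that already exists. The sketch below assumes Python 3 division; key_hit_rate is a made-up name, not something from the question.

```python
def key_hit_rate(st, mersize):
    """Fraction of k-mer lookups that find an already existing key.

    Hypothetical helper: every distinct k-mer misses exactly once (on its
    first occurrence), so hits = total windows - distinct k-mers.
    """
    total = len(st) - mersize + 1
    if total <= 0:
        return 0.0
    distinct = len({st[i:i+mersize] for i in range(total)})
    return (total - distinct) / total


# A highly repetitive string: almost every lookup is a hit.
repetitive_rate = key_hit_rate('A'*100 + 'B'*100 + 'C'*100 + 'D'*100, 3)
# A string with no repeated 3-mers: every lookup is a miss.
unique_rate = key_hit_rate("ABCD", 3)
```

A rate close to 1 favours the try/except version, while a rate close to 0 means most lookups raise KeyError, which is the expensive path.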
Also, it is better not to shadow built-in functions such as hash.