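I have a list of sequences l (many thousands of sequences): l = [ABCD, AABA, ...]. I also have a file f containing many 4-letter sequences (about a million). For each sequence in f, I want to find the closest string in l, up to a Hamming distance of 2, and update a counter good_count. I wrote the following code for this, but it is very slow. I would like to know whether it can be done faster.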
def hamming(s1, s2):
    if len(s1) != len(s2):
        raise ValueError("Undefined for sequences of unequal length")
    return sum(ch1 != ch2 for ch1, ch2 in zip(s1, s2))

f = open("input.txt", "r")
l = ["ABCD", "AABA", ...]
good_count = 0
for s in f:
    x = f.readline()
    dist_array = []
    for ll in l:
        dist = hamming(x, ll)
        dist_array.append(dist)
    min_dist = min(dist_array)
    if min_dist <= 2:
        good_count += 1
print(good_count)
This runs quickly when l and f are small, but takes far too long for large l and f. Please suggest a faster solution.
Answer 0 (score: 2)
Use an existing library, such as jellyfish:
from jellyfish import hamming_distance
This gives you a C implementation of the Hamming distance.
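A minimal sketch of how the swap might look, assuming the jellyfish package is installed. The pure-Python fallback and the helper name min_distance are illustrative additions for this sketch, not part of the answer above.

```python
try:
    # Native implementation, if the jellyfish package is available
    from jellyfish import hamming_distance
except ImportError:
    # Fallback (assumption for this sketch): plain Python, equal lengths only
    def hamming_distance(s1, s2):
        if len(s1) != len(s2):
            raise ValueError("Undefined for sequences of unequal length")
        return sum(ch1 != ch2 for ch1, ch2 in zip(s1, s2))


def min_distance(seq, candidates):
    """Smallest Hamming distance from seq to any candidate (hypothetical helper)."""
    return min(hamming_distance(seq, c) for c in candidates)


l = ["ABCD", "AABA"]  # stand-in for the real list
print(min_distance("ABCA", l))
```

Only the distance function changes here; the surrounding loop over l still does the same amount of work per line of f.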
Answer 1 (score: 2)
Oh, you are only counting HOW MANY sequences have a match within a Hamming distance of 2? That can be done much faster.
total_count = 0
for line in f:
    # skip the extra f.readline(): `line` already is the current line in this loop
    line = line.strip()  # just in case
    for ll in l:
        if hamming(line, ll) <= 2:
            total_count += 1
            break  # skip the rest of the ll in l loop
    # and then you don't need any processing afterwards either.
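The same early-exit idea can be written with any(), which stops at the first reference that is close enough. This is a sketch only: the function name count_close, the max_dist parameter, and the sample data are made up for illustration.

```python
def hamming(s1, s2):
    if len(s1) != len(s2):
        raise ValueError("Undefined for sequences of unequal length")
    return sum(ch1 != ch2 for ch1, ch2 in zip(s1, s2))


def count_close(sequences, references, max_dist=2):
    """Count sequences that have at least one reference within max_dist."""
    count = 0
    for seq in sequences:
        seq = seq.strip()  # drop the trailing newline from file lines
        # any() short-circuits, so the scan of references ends at the first hit
        if any(hamming(seq, ref) <= max_dist for ref in references):
            count += 1
    return count


l = ["ABCD", "AABA"]                     # stand-in for the real list
lines = ["ABCA\n", "ZZZZ\n", "AAAA\n"]   # stand-in for the lines of f
print(count_close(lines, l))
```

The early exit helps most when matches are common; a line with no close match still scans all of l.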
Note that most of your code's running time will be spent on this line:
if hamming(line, ll) <= 2:
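So anything that speeds up that check can greatly improve the overall speed of the script. Boud's answer extols the virtues of jellyfish's hamming_distance function, but without any personal experience with it I can't recommend it myself. His suggestion to use a faster Hamming distance implementation is sound, however!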
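One way to check where the time goes is to time the inner scan for a single line with timeit. This is a rough sketch under assumed data: the 1000-entry list is made up, and the number you get depends entirely on your machine.

```python
import timeit


def hamming(s1, s2):
    if len(s1) != len(s2):
        raise ValueError("Undefined for sequences of unequal length")
    return sum(ch1 != ch2 for ch1, ch2 in zip(s1, s2))


l = ["ABCD", "AABA"] * 500  # 1000 made-up references
line = "ABCA"


def scan_all():
    # The full inner loop for one line of f: one hamming() call per reference
    return min(hamming(line, ll) for ll in l)


elapsed = timeit.timeit(scan_all, number=100)
print("100 scans of %d references: %.4f s" % (len(l), elapsed))
```

Multiplying the per-line cost by the number of lines in f gives a rough estimate of the total running time of the brute-force approach.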
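Peter DeGlopper suggests breaking the l list up into six sets of "Hamming distance two or less" matches, that is, a group of sets containing every pair of letters that could produce a Hamming distance of two or less. Two 4-letter sequences are within distance 2 exactly when they agree in at least two positions, and there are six ways to pick two positions out of four. This might look like: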
# hamming_sets is [ {AB??}, {A?C?}, {A??D}, {?BC?}, {?B?D}, {??CD} ]
hamming_sets = [ set(), set(), set(), set(), set(), set() ]
for ll in l:
    # this should take the lion's share of time in your program
    hamming_sets[0].add(ll[0] + ll[1])
    hamming_sets[1].add(ll[0] + ll[2])
    hamming_sets[2].add(ll[0] + ll[3])
    hamming_sets[3].add(ll[1] + ll[2])
    hamming_sets[4].add(ll[1] + ll[3])
    hamming_sets[5].add(ll[2] + ll[3])
total_count = 0
for line in f:
    # and this should be fast, even if `f` is large
    line = line.strip()
    if line[0]+line[1] in hamming_sets[0] or \
       line[0]+line[2] in hamming_sets[1] or \
       line[0]+line[3] in hamming_sets[2] or \
       line[1]+line[2] in hamming_sets[3] or \
       line[1]+line[3] in hamming_sets[4] or \
       line[2]+line[3] in hamming_sets[5]:
        total_count += 1
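To convince yourself that the six-set lookup gives the same answer as the brute-force minimum distance, a small cross-check could look like this. It is a sketch for 4-letter sequences only; build_pair_sets, has_close_match, and the sample list are hypothetical names and data.

```python
from itertools import combinations, product

# The six position pairs: (0,1), (0,2), (0,3), (1,2), (1,3), (2,3)
POSITION_PAIRS = list(combinations(range(4), 2))


def hamming(s1, s2):
    return sum(ch1 != ch2 for ch1, ch2 in zip(s1, s2))


def build_pair_sets(references):
    """One set per position pair, holding the two letters each reference has there."""
    pair_sets = [set() for _ in POSITION_PAIRS]
    for ref in references:
        for pair_set, (i, j) in zip(pair_sets, POSITION_PAIRS):
            pair_set.add(ref[i] + ref[j])
    return pair_sets


def has_close_match(seq, pair_sets):
    """True if some reference agrees with seq in at least two positions."""
    return any(seq[i] + seq[j] in pair_set
               for pair_set, (i, j) in zip(pair_sets, POSITION_PAIRS))


l = ["ABCD", "AABA"]  # stand-in for the real list
pair_sets = build_pair_sets(l)

# Compare with brute force over every 4-letter string on a small alphabet
for letters in product("ABCD", repeat=4):
    seq = "".join(letters)
    brute = min(hamming(seq, ref) for ref in l) <= 2
    assert has_close_match(seq, pair_sets) == brute
```

Each line of f then costs at most six set lookups, regardless of how long l is.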
You could gain some readability by making hamming_sets a dictionary of transform_function: set_of_results key-value pairs.
hamming_sets = {lambda s: s[0]+s[1]: set(),
                lambda s: s[0]+s[2]: set(),
                lambda s: s[0]+s[3]: set(),
                lambda s: s[1]+s[2]: set(),
                lambda s: s[1]+s[3]: set(),
                lambda s: s[2]+s[3]: set()}
for func, set_ in hamming_sets.items():
    for ll in l:
        set_.add(func(ll))
total_count = 0
for line in f:
    line = line.strip()
    if any(func(line) in set_ for func, set_ in hamming_sets.items()):
        total_count += 1
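A possible variation keys the dictionary by the position pair itself rather than by a lambda, which makes the sets easier to inspect while debugging. This is a sketch under stated assumptions: io.StringIO stands in for the real input file, and the sample sequences are made up.

```python
import io
from itertools import combinations

l = ["ABCD", "AABA"]                    # stand-in for the real list
f = io.StringIO("ABCA\nZZZZ\nAAAA\n")   # stand-in for open("input.txt")

# Keys are position pairs such as (0, 1) instead of lambdas
hamming_sets = {pair: set() for pair in combinations(range(4), 2)}
for (i, j), set_ in hamming_sets.items():
    for ll in l:
        set_.add(ll[i] + ll[j])

total_count = 0
for line in f:
    line = line.strip()
    # A hit in any of the six sets means some entry of l is within distance 2
    if any(line[i] + line[j] in set_ for (i, j), set_ in hamming_sets.items()):
        total_count += 1

print(total_count)
```

With tuple keys, hamming_sets[(0, 1)] shows directly which letters appear in the first two positions of l.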