I'm working on a simple bioinformatics problem. I have a working solution, but it's very inefficient. How can I improve its efficiency?
The problem:
Find the most frequent k-mers (patterns of length k) in a string g, allowing up to d mismatches.
The strings and patterns are all genomic, so our possible character set is {A, T, C, G}.
I'll call the function FrequentWordsMismatch(g, k, d).
So, here's a helpful example:
FrequentWordsMismatch('AAAAAAAAAA', 2, 1) → ['AA', 'CA', 'GA', 'TA', 'AC', 'AG', 'AT']
Here's a longer example, if you implement this and want to test it:
FrequentWordsMisMatch('CACAGTAGGCGCCGGCACACACAGCCCCGGGCCCCGGGCCGCCCCGGGCCGGCGGCCGCCGGCGCCGGCACACCGGCACAGCCGTACCGGCACAGTAGTACCGGCCGGCCGGCACACCGGCACACCGGGTACACACCGGGGCGCACACACAGGCGGGCGCCGGGCCCCGGGCCGTACCGGGCCGCCGGCGGCCCACAGGCGCCGGCACAGTACCGGCACACACAGTAGCCCACACACAGGCGGGCGGTAGCCGGCGCACACACACACAGTAGGCGCACAGCCGCCCACACACACCGGCCGGCCGGCACAGGCGGGCGGGCGCACACACACCGGCACAGTAGTAGGCGGCCGGCGCACAGCC', 10, 2)
With my naive solution, that second example can easily take ~60 seconds, though the first one is quite fast.
The naive solution:
My idea is, for each k-length segment of g, find every possible "neighbor" (i.e., every other k-length string with at most d mismatches) and add those neighbors as keys of a dictionary. I then count the number of times each of those neighbor k-mers occurs in the string g and record that in the dictionary.
Obviously this is a somewhat terrible way to go about it, since the number of neighbors scales like crazy with k and d, and having to scan the string for each of those neighbors makes this implementation terribly slow. But that's why I'm asking for help.
I'll put my code below. There are sure to be plenty of newbie mistakes packed in, so thank you for your time and attention.
Expected output of that longer example:
['GCACACAGAC', 'GCGCACACAC']
def FrequentWordsMismatch(g, k, d):
    '''
    Finds the most frequent k-mer patterns in the string g, given that those
    patterns can mismatch amongst themselves up to d times

    g (String): Collection of {A, T, C, G} characters
    k (int): Length of desired pattern
    d (int): Number of allowed mismatches
    '''
    counts = {}
    answer = []
    for i in range(len(g) - k + 1):
        kmer = g[i:i+k]
        for neighborkmer in Neighbors(kmer, d):
            counts[neighborkmer] = Count(neighborkmer, g, d)
    maxVal = max(counts.values())
    for key in counts.keys():
        if counts[key] == maxVal:
            answer.append(key)
    return answer


def Neighbors(pattern, d):
    '''
    Find all strings with at most d mismatches to the given pattern

    pattern (String): Original pattern of characters
    d (int): Number of allowed mismatches
    '''
    if d == 0:
        return [pattern]
    if len(pattern) == 1:
        return ['A', 'C', 'G', 'T']
    answer = []
    suffixNeighbors = Neighbors(pattern[1:], d)
    for text in suffixNeighbors:
        if HammingDistance(pattern[1:], text) < d:
            for n in ['A', 'C', 'G', 'T']:
                answer.append(n + text)
        else:
            answer.append(pattern[0] + text)
    return answer


def HammingDistance(p, q):
    '''
    Find the hamming distance between two strings

    p (String): String to be compared to q
    q (String): String to be compared to p
    '''
    ham = 0 + abs(len(p) - len(q))
    for i in range(min(len(p), len(q))):
        if p[i] != q[i]:
            ham += 1
    return ham


def Count(pattern, g, d):
    '''
    Count the number of times that the pattern occurs in the string g,
    allowing for up to d mismatches

    pattern (String): Pattern of characters
    g (String): String in which we're looking for pattern
    d (int): Number of allowed mismatches
    '''
    return len(MatchWithMismatch(pattern, g, d))


def MatchWithMismatch(pattern, g, d):
    '''
    Find the indices at which the pattern occurs in the string g,
    allowing for up to d mismatches

    pattern (String): Pattern of characters
    g (String): String in which we're looking for pattern
    d (int): Number of allowed mismatches
    '''
    answer = []
    for i in range(len(g) - len(pattern) + 1):
        if HammingDistance(g[i:i+len(pattern)], pattern) <= d:
            answer.append(i)
    return answer
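To put a number on how badly this neighbor enumeration scales: the size of the d-neighborhood of a single k-mer can be computed directly. This helper is illustrative only, not part of the question's code:

```python
from math import comb

def neighborhood_size(k, d):
    # Distinct strings over {A, C, G, T} within Hamming distance d of a
    # fixed k-mer: choose i positions to change (comb(k, i) ways), then
    # pick one of the 3 alternative letters at each changed position.
    return sum(comb(k, i) * 3**i for i in range(d + 1))

print(neighborhood_size(2, 1))   # 7   (exactly the short example's answer list)
print(neighborhood_size(10, 2))  # 436 neighbors per window in the long example
```

Every window of the long example spawns hundreds of neighbors, and each neighbor triggers a full rescan of g via Count(), which is where the ~60 seconds go.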
Answer 0 (score: 2)
Going by your problem description only, and not your examples (for the reason I explained in the comments), one approach would be:
s = "CACAGTAGGCGCCGGCACACACAGCCCCGGGCCCCGGGCCGCCCCGGGCCGGCGGCCGCCGGCGCCGGCACACCGGCACAGC"\
    "CGTACCGGCACAGTAGTACCGGCCGGCCGGCACACCGGCACACCGGGTACACACCGGGGCGCACACACAGGCGGGCGCCGGG"\
    "CCCCGGGCCGTACCGGGCCGCCGGCGGCCCACAGGCGCCGGCACAGTACCGGCACACACAGTAGCCCACACACAGGCGGGCG"\
    "GTAGCCGGCGCACACACACACAGTAGGCGCACAGCCGCCCACACACACCGGCCGGCCGGCACAGGCGGGCGGGCGCACACAC"\
    "ACCGGCACAGTAGTAGGCGGCCGGCGCACAGCC"

def frequent_words_mismatch(g, k, d):
    def num_misspellings(x, y):
        return sum(xx != yy for (xx, yy) in zip(x, y))

    seen = set()
    for i in range(len(g) - k + 1):
        seen.add(g[i:i+k])

    # For each unique sequence, add a (key, bin) pair to the bins dictionary
    # (The bin is initialized to a list containing only the sequence, for now)
    bins = {seq: [seq,] for seq in seen}

    # Loop again through the unique sequences...
    for seq in seen:
        # Try to fit it in *all* already-existing bins (based on bin key)
        for bk in bins:
            # Don't re-add seq to its own bin
            if bk == seq: continue
            # Test bin keys, try to find all appropriate bins
            if num_misspellings(seq, bk) <= d:
                bins[bk].append(seq)

    # Get a list of the bin keys (one for each unique sequence) sorted in order of the
    # number of elements in the corresponding bins
    sorted_keys = sorted(bins, key=lambda k: len(bins[k]), reverse=True)

    # largest_bin_key will be the key of the largest bin (there may be ties, so in fact
    # this is *a* key of *one of the bins with the largest length*). That is, it'll
    # be the sequence (found in the string) that the most other sequences (also found
    # in the string) are at most d-distance from.
    largest_bin_key = sorted_keys[0]

    # You can return this bin, as your question description (but not examples) indicates:
    return bins[largest_bin_key]

largest_bin = frequent_words_mismatch(s, 10, 2)
print(len(largest_bin))  # 13
print(largest_bin)
The largest bin (this one, anyway) contains:
['CGGCCGCCGG', 'GGGCCGGCGG', 'CGGCCGGCGC', 'AGGCGGCCGG', 'CAGGCGCCGG', 'CGGCCGGCCG', 'CGGTAGCCGG', 'CGGCGGCCGC', 'CGGGCGCCGG', 'CCGGCGCCGG', 'CGGGCCCCGG', 'CCGCCGGCGG', 'GGGCCGCCGG']
It's O(n**2), where n is the number of unique sequences, and it completes in about 0.1 seconds on my computer.
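A minimal standalone sketch of the binning idea above, on a toy string and parameters of my own choosing (not from the answer):

```python
def num_misspellings(x, y):
    # Positionwise mismatch count (Hamming distance for equal-length strings)
    return sum(xx != yy for xx, yy in zip(x, y))

g, k, d = "AAAATAAAAC", 4, 1
seen = {g[i:i+k] for i in range(len(g) - k + 1)}  # unique windows
bins = {seq: [seq] for seq in seen}               # each window seeds its own bin
for seq in seen:
    for bk in bins:
        if bk != seq and num_misspellings(seq, bk) <= d:
            bins[bk].append(seq)

largest = max(bins.values(), key=len)
print(largest[0], len(largest))  # AAAA 6 -- every unique window is within 1 of AAAA
```

Note that the bins are keyed only by windows that actually occur in g, which is why this answers the problem description but not the question's examples, where the winning patterns need not occur verbatim.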
Answer 1 (score: 2)
The problem description is ambiguous in several ways, so I'm going by your examples. You seem to want all k-length strings over the alphabet {A, C, G, T} such that the number of matches against contiguous substrings of g is maximal - where a "match" means character-by-character equality with at most d character inequalities.
I'm ignoring that your HammingDistance() function makes something up even when the inputs have different lengths, mostly because it doesn't make much sense to me ;-), but partly because it isn't needed to get the results you want in any of the examples you gave.
The code below produces the results you want in all the examples, in the sense of producing permutations of the output lists you gave. If you want canonical outputs, I'd suggest sorting an output list before returning it.
The algorithm is pretty simple, but relies on itertools for the heavy combinatorial lifting "at C speed". All the examples run in well under a second total.
For each length-k contiguous substring of g, consider all combinations(k, d) sets of d distinct index positions. There are 4**d ways to fill those index positions with letters from {A, C, G, T}, and each such way is a "pattern" that matches the substring with at most d discrepancies. Duplicates are weeded out by remembering the patterns already generated; this is faster than making heroic efforts to generate only unique patterns to begin with.
All in all, the time requirement is O(len(g) * k**d * 4**d) = O(len(g) * (4*k)**d), where k**d is, for reasonably small values of k and d, an overstatement of the binomial coefficient combinations(k, d). The important thing to note is that - unsurprisingly - it's exponential in d.
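The generation scheme just described can be exercised in isolation on a single window (toy k-mer and d of my own choosing; the count matches the binomial analysis above):

```python
from itertools import combinations, product

kmer, d = "ACGT", 2
k = len(kmer)
patterns = set()
for ixs in combinations(range(k), d):       # d index positions to overwrite
    for subs in product("ACGT", repeat=d):  # 4**d ways to fill them
        cand = list(kmer)
        for i, ch in zip(ixs, subs):
            cand[i] = ch
        patterns.add("".join(cand))         # the set weeds out duplicates

# 1 + C(4,1)*3 + C(4,2)*9 = 67 distinct patterns within distance 2
print(len(patterns))  # 67
```

Because each substituted letter may equal the original, the scheme covers every distance from 0 up to d, at the cost of regenerating closer patterns many times; hence the duplicate-weeding.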
def fwm(g, k, d):
    from itertools import product, combinations
    from collections import defaultdict

    all_subs = list(product("ACGT", repeat=d))
    all_ixs = list(combinations(range(k), d))
    patcount = defaultdict(int)
    for starti in range(len(g)):
        base = g[starti : starti + k]
        if len(base) < k:
            break
        patcount[base] += 1
        seen = set([base])
        basea = list(base)
        for ixs in all_ixs:
            saved = [basea[i] for i in ixs]
            for newchars in all_subs:
                for i, newchar in zip(ixs, newchars):
                    basea[i] = newchar
                candidate = "".join(basea)
                if candidate not in seen:
                    seen.add(candidate)
                    patcount[candidate] += 1
            for i, ch in zip(ixs, saved):
                basea[i] = ch
    maxcount = max(patcount.values())
    return [p for p, c in patcount.items() if c == maxcount]
Rather than weeding out duplicates by keeping a set of those seen so far, it's straightforward to prevent generating duplicates to begin with. In fact, the code below is shorter and simpler, although somewhat subtler. In return for a small amount of redundant work avoided, there are layers of recursive calls to the inner() function. Which way is faster appears to depend on the specific inputs.
def fwm(g, k, d):
    from collections import defaultdict

    patcount = defaultdict(int)
    alphabet = "ACGT"
    allbut = {ch: tuple(c for c in alphabet if c != ch)
              for ch in alphabet}

    def inner(i, rd):
        if not rd or i == k:
            patcount["".join(base)] += 1
            return
        inner(i+1, rd)
        orig = base[i]
        for base[i] in allbut[orig]:
            inner(i+1, rd-1)
        base[i] = orig

    for i in range(len(g) - k + 1):
        base = list(g[i : i + k])
        inner(0, d)

    maxcount = max(patcount.values())
    return [p for p, c in patcount.items() if c == maxcount]
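The generate-uniquely recursion can likewise be tried on a single k-mer in isolation. This standalone toy version mirrors the structure of inner() above but collects the patterns into a list instead of a shared counter:

```python
def patterns_within(kmer, d):
    # Walk positions left to right; each position is either left alone or
    # replaced by one of the other three letters, so no pattern is
    # produced twice and no seen-set is needed.
    alphabet = "ACGT"
    allbut = {ch: tuple(c for c in alphabet if c != ch) for ch in alphabet}
    base = list(kmer)
    out = []

    def inner(i, rd):
        if not rd or i == len(base):
            out.append("".join(base))
            return
        inner(i + 1, rd)              # leave position i unchanged
        orig = base[i]
        for base[i] in allbut[orig]:  # substitute each of the other letters
            inner(i + 1, rd - 1)
        base[i] = orig                # restore before backtracking

    inner(0, d)
    return out

pats = patterns_within("ACGT", 2)
print(len(pats), len(set(pats)))  # 67 67 -- same count as before, no duplicates
```

The early exit on `not rd` is just an optimization: once the mismatch budget is spent, the untouched suffix is the only completion, so it can be emitted immediately.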