Suppose you have a DNA sequence like this:
AATCRVTAA
where R and V are ambiguous DNA nucleotide codes: R stands for A or G, and V stands for A, C, or G.
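Is there a Biopython method that generates all the distinct sequences the ambiguous sequence above can represent?
For example, the output would be:
AATCAATAA
AATCACTAA
AATCAGTAA
AATCGATAA
AATCGCTAA
AATCGGTAA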
Answer 0 (score: 3)
I ended up writing my own function:
from Bio.Data import IUPACData
from itertools import product

def extend_ambiguous_dna(seq):
    """Return a list of all possible sequences given an ambiguous DNA input."""
    # Maps each IUPAC code to the bases it can stand for, e.g. "R" -> "AG".
    d = IUPACData.ambiguous_dna_values
    r = []
    for i in product(*[d[j] for j in seq]):
        r.append("".join(i))
    return r
In [1]: extend_ambiguous_dna("AV")
Out[1]: ['AA', 'AC', 'AG']
It also lets you generate every pattern of a given size by passing a run of N:
In [2]: extend_ambiguous_dna("NN")
Out[2]: ['GG', 'GA', 'GT', 'GC',
'AG', 'AA', 'AT', 'AC',
'TG', 'TA', 'TT', 'TC',
'CG', 'CA', 'CT', 'CC']
Hope this saves someone some time!
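The number of expansions is the product of the option counts at each position, so the returned list grows exponentially with the number of ambiguous bases. Below is a minimal sketch of a lazy variant that yields one sequence at a time and can count the expansions up front. It assumes a hand-written subset of the IUPAC table so that it runs without Biopython; the names `IUPAC_SUBSET`, `iter_expansions`, and `count_expansions` are hypothetical and not part of Biopython.

```python
from itertools import product
from math import prod  # Python 3.8+

# Assumption: a hand-written subset of the IUPAC ambiguity codes, used here
# so the sketch runs without Biopython installed.
IUPAC_SUBSET = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "V": "ACG", "N": "ACGT",
}

def iter_expansions(seq, table=IUPAC_SUBSET):
    """Lazily yield each unambiguous sequence that seq can represent."""
    for combo in product(*(table[base] for base in seq)):
        yield "".join(combo)

def count_expansions(seq, table=IUPAC_SUBSET):
    """Return how many sequences seq expands to, without generating them."""
    return prod(len(table[base]) for base in seq)

print(count_expansions("AATCRVTAA"))  # 2 options for R times 3 for V
for s in iter_expansions("AATCRVTAA"):
    print(s)
```

Checking `count_expansions` first makes it possible to refuse or stream inputs whose expansion would not fit in memory.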
Answer 1 (score: 3)
Here is a possibly shorter and faster way, since this function will be used on very large data anyway:
from Bio.Data import IUPACData
from itertools import product

def extend_ambiguous_dna(seq):
    """Return a list of all possible sequences given an ambiguous DNA input."""
    d = IUPACData.ambiguous_dna_values
    # map() drives both the lookup and the join without a Python-level loop.
    return list(map("".join, product(*map(d.get, seq))))
Using map lets the loop run in C rather than in Python, which should be faster than a plain loop or even a list comprehension.
A quick benchmark, using a simple dictionary d instead of the one provided by ambiguous_dna_values:
from itertools import product
import time

# Benchmark-only table: four options per code (not the real IUPAC values).
d = { "N": ["A", "G", "T", "C"], "R": ["C", "A", "T", "G"] }
seq = "RNRN"

# using list comprehensions
lst_start = time.perf_counter()
[ "".join(i) for i in product(*[ d[j] for j in seq ]) ]
lst_end = time.perf_counter()

# using map
map_start = time.perf_counter()
list(map("".join, product(*map(d.get, seq))))
map_end = time.perf_counter()

lst_delay = (lst_end - lst_start) * 1000
map_delay = (map_end - map_start) * 1000
print("List delay: {} ms".format(round(lst_delay, 2)))
print("Map delay: {} ms".format(round(map_delay, 2)))
Output:
# len(seq) = 2:
List delay: 0.02 ms
Map delay: 0.01 ms
# len(seq) = 3:
List delay: 0.04 ms
Map delay: 0.02 ms
# len(seq) = 4
List delay: 0.08 ms
Map delay: 0.06 ms
# len(seq) = 5
List delay: 0.43 ms
Map delay: 0.17 ms
# len(seq) = 10
List delay: 126.68 ms
Map delay: 77.15 ms
# len(seq) = 12
List delay: 1887.53 ms
Map delay: 1320.49 ms
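map is clearly faster, but only by a factor of roughly 1.4 to 2.5 in these runs, and the gap narrows as the sequence gets longer. It can probably be optimized further.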
Answer 2 (score: 0)
I'm not sure whether there is a Biopython way to do this, but here is one using itertools:
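import itertools

s = "AATCRVTAA"
ambig = {"R": ["A", "G"], "V":["A", "C", "G"]}

# Group consecutive characters by whether they are unambiguous.
groups = itertools.groupby(s, lambda char:char not in ambig)
splits = []
for b,group in groups:
    if b:
        # Unambiguous bases contribute a single option each.
        splits.extend([[g] for g in group])
    else:
        # Ambiguous codes contribute every base they can stand for.
        for nuc in group:
            splits.append(ambig[nuc])
answer = [''.join(p) for p in itertools.product(*splits)]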
Output:
In [189]: answer
Out[189]: ['AATCAATAA', 'AATCACTAA', 'AATCAGTAA', 'AATCGATAA', 'AATCGCTAA', 'AATCGGTAA']
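The grouping step is not strictly required: each character can be looked up on its own, falling back to the character itself when it is not ambiguous. A minimal sketch of that simplification, assuming the same two-code table (the function name `expand` is hypothetical):

```python
from itertools import product

# Only the ambiguous codes need entries; anything else maps to itself.
ambig = {"R": "AG", "V": "ACG"}

def expand(seq, table=ambig):
    """Return every sequence that seq can represent under table."""
    # dict.get(base, base) yields the options for an ambiguous code,
    # or a one-character string for an unambiguous base.
    choices = [table.get(base, base) for base in seq]
    return ["".join(p) for p in product(*choices)]

print(expand("AATCRVTAA"))
# ['AATCAATAA', 'AATCACTAA', 'AATCAGTAA', 'AATCGATAA', 'AATCGCTAA', 'AATCGGTAA']
```

Because strings are iterable, `product` treats "AG" as the two options "A" and "G", so the values do not need to be lists.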
Answer 3 (score: 0)
Another itertools solution:
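from itertools import product
import re

lu = {'R':'AG', 'V':'ACG'}

def get_seqs(seq):
    seqs = []
    # Split around each ambiguous code, keeping the codes themselves.
    sp_seq = [a for a in re.split(r'(R|V)', seq) if a]
    # One option string per ambiguous code, in order of appearance.
    pr_terms = [lu[a] for a in sp_seq if a in lu]
    # Turn the sequence into a format template, one %s per ambiguous code.
    template = ''.join(sp_seq).replace('R', '%s').replace('V', '%s')
    for cmb in product(*pr_terms):
        seqs.append(template % cmb)
    return seqs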
seq = 'AATCRVTAA'
print('seq:', seq)
print('\n'.join(get_seqs(seq)))
seq1 = 'RAATCRVTAAR'
print('seq:', seq1)
print('\n'.join(get_seqs(seq1)))
Output:
seq: AATCRVTAA
AATCAATAA
AATCACTAA
AATCAGTAA
AATCGATAA
AATCGCTAA
AATCGGTAA
seq: RAATCRVTAAR
AAATCAATAAA
AAATCAATAAG
AAATCACTAAA
AAATCACTAAG
AAATCAGTAAA
AAATCAGTAAG
AAATCGATAAA
AAATCGATAAG
AAATCGCTAAA
AAATCGCTAAG
AAATCGGTAAA
AAATCGGTAAG
GAATCAATAAA
GAATCAATAAG
GAATCACTAAA
GAATCACTAAG
GAATCAGTAAA
GAATCAGTAAG
GAATCGATAAA
GAATCGATAAG
GAATCGCTAAA
GAATCGCTAAG
GAATCGGTAAA
GAATCGGTAAG