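I am trying to count the insertions and gaps in a set of sequences relative to the reference they were aligned against; because of the alignment, all sequences now have the same length.
For example: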
>reference
AGCAGGCAAGGCAA--GGAA-CCA
>sequence1
AAAA---AAAGCAATTGGAA-CCA
>sequence2
AGCAGGCAAAACAA--GGAAACCA
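In this example, sequence1 has two insertions (the two Ts) and three gaps. The last gap should not be counted, because it appears in both the reference and sequence1. sequence2 has one insertion (the A before the last triplet) and no gaps. (Again, the gap shared with the reference should not be counted.) There are also 3 polymorphisms in sequence1 and 2 in sequence2.
My current script can estimate the differences, but it cannot count the "relevant" gaps and insertions as described above. For example: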
from Bio import SeqIO

records = list(SeqIO.parse("sequences.fasta", "fasta"))
reference = records[0]  # reference is the first sequence in the file
del records[0]
for record in records:
    gaps = record.seq.count("-") - reference.seq.count("-")
    basesinreference = reference.seq.count("A") + reference.seq.count("C") + reference.seq.count("G") + reference.seq.count("T")
    basesinsequence = record.seq.count("A") + record.seq.count("C") + record.seq.count("G") + record.seq.count("T")
    print(record.id)
    print(gaps)
    print(basesinsequence - basesinreference)
#Gives
sequence1
1 #Which means sequence 1 has one more Gap than the reference
-1 #Which means sequence 1 has one base less than the reference
sequence2
-1 #Which means sequence 2 has one Gap less than the reference
1 #Which means sequence 2 has one more base than the reference
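I am a Python newbie and still learning the tools of the language. Is there a way to achieve this? I was thinking of splitting the sequences and comparing them one position at a time, counting the differences as I go, but I am not sure whether that is possible in Python (not to mention that it would be terribly slow).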
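Answer 0 (score: 1)
This is a job for the zip function. We iterate over the reference and the test sequence in parallel and check whether either one contains a - at the current position. We use the result of that test to update the counts of insertions, deletions, and unchanged positions in a dictionary.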
def kind(u, v):
    if u == '-':
        if v != '-':
            return 'I'  # insertion
    else:
        if v == '-':
            return 'D'  # deletion
    return 'U'  # unchanged

reference = 'AGCAGGCAAGGCAA--GGAA-CCA'
sequences = [
    'AGCA---AAGGCAATTGGAA-CCA',
    'AGCAGGCAAGGCAA--GGAAACCA',
]
print('Reference')
print(reference)
for seq in sequences:
    print(seq)
    counts = dict.fromkeys('DIU', 0)
    for u, v in zip(reference, seq):
        counts[kind(u, v)] += 1
    print(counts)
Output
Reference
AGCAGGCAAGGCAA--GGAA-CCA
AGCA---AAGGCAATTGGAA-CCA
{'I': 2, 'D': 3, 'U': 19}
AGCAGGCAAGGCAA--GGAAACCA
{'I': 1, 'D': 0, 'U': 23}
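As a side note, the same tally can be written with collections.Counter, and doing so also shows why the script in the question printed 1 and -1: subtracting raw gap counts only gives the net difference D - I, because gaps shared with the reference cancel out and insertions offset deletions. This is a minimal sketch that reuses the kind function above; the helper name tally is made up for illustration.

```python
from collections import Counter


def kind(u, v):
    """Classify one alignment column: u is the reference, v the test sequence."""
    if u == '-':
        if v != '-':
            return 'I'  # insertion
    else:
        if v == '-':
            return 'D'  # deletion
    return 'U'  # unchanged (this includes gaps shared with the reference)


def tally(reference, seq):
    """Hypothetical helper: count the column kinds for one aligned sequence."""
    return Counter(kind(u, v) for u, v in zip(reference, seq))


reference = 'AGCAGGCAAGGCAA--GGAA-CCA'
seq = 'AGCA---AAGGCAATTGGAA-CCA'

counts = tally(reference, seq)
# What the question's script computes: a plain difference of gap counts.
net_gaps = seq.count('-') - reference.count('-')

print(counts['D'], counts['I'])  # 3 2
print(net_gaps)                  # 1, which equals D - I
```

A Counter returns 0 for missing keys, so there is no need to pre-fill it the way dict.fromkeys does.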
Here is an updated version that also checks for polymorphisms.
def kind(u, v):
    if u == '-':
        if v != '-':
            return 'I'  # insertion
    else:
        if v == '-':
            return 'D'  # deletion
        elif v != u:
            return 'P'  # polymorphism
    return 'U'  # unchanged

reference = 'AGCAGGCAAGGCAA--GGAA-CCA'
sequences = [
    'AAAA---AAAGCAATTGGAA-CCA',
    'AGCAGGCAAAACAA--GGAAACCA',
]
print('Reference')
print(reference)
for seq in sequences:
    print(seq)
    counts = dict.fromkeys('DIPU', 0)
    for u, v in zip(reference, seq):
        counts[kind(u, v)] += 1
    print(counts)
Output
Reference
AGCAGGCAAGGCAA--GGAA-CCA
AAAA---AAAGCAATTGGAA-CCA
{'D': 3, 'P': 3, 'I': 2, 'U': 16}
AGCAGGCAAAACAA--GGAAACCA
{'D': 0, 'P': 2, 'I': 1, 'U': 21}
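To run this on FASTA input rather than hard-coded strings, the records have to be read first, with the first record taken as the reference. The sketch below uses a small stand-in parser (read_fasta is a made-up helper, not a Biopython function) and an in-memory copy of the example, so it runs without extra dependencies. With Biopython installed, SeqIO.parse as used in the question should serve the same purpose.

```python
import io


def kind(u, v):
    """Classify one alignment column: u is the reference, v the test sequence."""
    if u == '-':
        if v != '-':
            return 'I'  # insertion
    else:
        if v == '-':
            return 'D'  # deletion
        elif v != u:
            return 'P'  # polymorphism
    return 'U'  # unchanged


def read_fasta(handle):
    """Hypothetical minimal FASTA reader: yields (name, sequence) pairs."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith('>'):
            if name is not None:
                yield name, ''.join(chunks)
            name, chunks = line[1:], []
        else:
            chunks.append(line)
    if name is not None:
        yield name, ''.join(chunks)


# In-memory stand-in for sequences.fasta (the example from the question).
fasta_text = """\
>reference
AGCAGGCAAGGCAA--GGAA-CCA
>sequence1
AAAA---AAAGCAATTGGAA-CCA
>sequence2
AGCAGGCAAAACAA--GGAAACCA
"""

records = list(read_fasta(io.StringIO(fasta_text)))
reference = records[0][1]  # the first record is the reference

results = {}
for name, seq in records[1:]:
    counts = dict.fromkeys('DIPU', 0)
    for u, v in zip(reference, seq):
        counts[kind(u, v)] += 1
    results[name] = counts
    print(name, counts)
```

Each sequence is scanned once, so the cost grows linearly with the alignment length and the number of sequences.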
Answer 1 (score: 1)
Using Biopython and numpy:
from Bio import AlignIO
from collections import Counter
import numpy as np

alignment = AlignIO.read("alignment.fasta", "fasta")
events = []
for i in range(alignment.get_alignment_length()):
    this_column = alignment[:, i]
    # Mark insertions, polymorphism and deletions following PM 2Ring notation
    events.append(["U" if b == this_column[0] else
                   "I" if this_column[0] == "-" else
                   "P" if b != "-" else
                   "D" for b in this_column])
# Apply a Counter over the columns (axis 0) of the array
print(np.apply_along_axis(Counter, 0, np.array(events)))
This should output an array of Counters, in the same order as the sequences in the alignment:
[[Counter({'U': 24})
Counter({'U': 16, 'P': 3, 'D': 3, 'I': 2})
Counter({'U': 21, 'P': 2, 'I': 1})]]
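If Biopython or numpy is not available, the column-wise idea can be sketched in plain Python: zip(*rows) transposes the aligned strings so that each item is one alignment column, playing the role of alignment[:, i]. This is an illustrative alternative under that assumption, not the code above; the names rows and classify are made up for the example.

```python
from collections import Counter

# Aligned sequences from the question; the first row is the reference.
rows = [
    'AGCAGGCAAGGCAA--GGAA-CCA',
    'AAAA---AAAGCAATTGGAA-CCA',
    'AGCAGGCAAAACAA--GGAAACCA',
]


def classify(ref_base, base):
    """Same rules as the conditional expression in the answer."""
    if base == ref_base:
        return 'U'  # unchanged, including a gap shared with the reference
    if ref_base == '-':
        return 'I'  # insertion
    return 'P' if base != '-' else 'D'  # polymorphism or deletion


counters = [Counter() for _ in rows]
for column in zip(*rows):  # one tuple of characters per alignment column
    for counter, base in zip(counters, column):
        counter[classify(column[0], base)] += 1

for counter in counters:
    print(dict(counter))
```

Since every column contributes exactly one label per sequence, the counts in each Counter add up to the alignment length, which makes a quick sanity check on the result.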