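I'm trying to do a pairwise alignment with the Needleman-Wunsch algorithm. I've been able to build the score_matrix and the traceback_matrix, but I can't get the traceback function that walks the traceback matrix to work. My main problem is that I don't understand why I get this error, even though it looks simple: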
--------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-378-758f4e7985a2> in <module>()
68
69
---> 70 traceback(traceback_matrix, score_matrix, seq1, seq2)
<ipython-input-378-758f4e7985a2> in traceback(traceback_matrix, score_matrix, seq1, seq2, start_row, start_col)
30 for aligned_seq, input_seq in zip(aligned_seqs1, seq1):
31 print("aligned_seq:", aligned_seq, "input_seq:", input_seq)
---> 32 aligned_seq.append(str(input_seq[current_col-1]))
33 print("aligned_seq_new:", aligned_seq)
34 for aligned_seq, input_seq in zip(aligned_seqs2, seq2):
IndexError: string index out of range
Here is the function:
def traceback(traceback_matrix, score_matrix, seq1, seq2, start_row=start_row, start_col=start_col):
    """Traverse traceback_matrix to get optimal alignments
    """
    _traceback_encoding = {'match': 1, 'vertical-gap': 2, 'horizontal-gap': 3,
                           'uninitialized': -1, 'alignment-end': 0}
    aend = _traceback_encoding['alignment-end']
    match = _traceback_encoding['match']
    vgap = _traceback_encoding['vertical-gap']
    hgap = _traceback_encoding['horizontal-gap']
    uninitialized = _traceback_encoding['uninitialized']
    gap_character = '-'
    aligned_seqs1 = [[] for e in range(len(seq1))]
    aligned_seqs2 = [[] for e in range(len(seq2))]
    current_row = start_row
    #print("current_row:", current_row)
    current_col = start_col
    best_score = score_matrix[current_row, current_col]
    print("best_score:", best_score)
    current_value = None
    while current_value != aend:
        current_value = traceback_matrix[current_row, current_col]
        print("current_value:", current_value, "current_row:", current_row, "current_col:", current_col)
        if current_value == match:
            for aligned_seq, input_seq in zip(aligned_seqs1, seq1):
                print("aligned_seq:", aligned_seq, "input_seq:", input_seq)
                aligned_seq.append(str(input_seq[current_col-1]))
                print("aligned_seq_new:", aligned_seq)
            for aligned_seq, input_seq in zip(aligned_seqs2, seq2):
                #print("aligned_seq2:", aligned_seq, "input_seq2:", input_seq)
                aligned_seq.append(str(input_seq[current_row-1]))
            current_row -= 1
            current_col -= 1
        elif current_value == vgap:
            for aligned_seq in aligned_seqs1:
                aligned_seq.append(gap_character)
            for aligned_seq, input_seq in zip(aligned_seqs2, seq2):
                aligned_seq.append(str(input_seq[current_row-1]))
            current_row -= 1
        elif current_value == hgap:
            for aligned_seq, input_seq in zip(aligned_seqs1, seq1):
                aligned_seq.append(str(input_seq[current_col-1]))
            for aligned_seq in aligned_seqs2:
                aligned_seq.append(gap_character)
            current_col -= 1
        elif current_value == aend:
            continue
        else:
            raise ValueError(
                "Invalid value in traceback matrix: %s" % current_value)
    for i in range(len(seq1)):
        aligned_seq = ''.join(aligned_seqs1[i][::-1])
        constructor = str
        aligned_seqs1[i] = constructor(aligned_seq)
    for i in range(len(seq2)):
        aligned_seq = ''.join(aligned_seqs2[i][::-1])
        constructor = str
        aligned_seqs2[i] = constructor(aligned_seq)
    return aligned_seqs1, aligned_seqs2, best_score, current_col, current_row
Inputs: score matrix, traceback matrix, sequence1, sequence2, start_row, start_column
score_matrix:
(array([[ 0., -2., -4., -6., -8., -10., -12., -14., -16., -18., -20.],
[ -2., -2., -3., -5., -7., -9., -11., -13., -15., -17., -19.],
[ -4., -4., -3., 1., -1., -3., -5., -7., -9., -11., -13.],
[ -6., -6., -5., -1., -1., -3., 8., 6., 4., 2., 0.],
[ -8., 2., 0., -2., -3., -3., 6., 6., 14., 12., 10.],
[-10., 0., 7., 5., 3., 1., 4., 4., 12., 19., 17.],
[-12., -2., 5., 11., 9., 7., 5., 4., 10., 17., 18.],
[-14., -4., 3., 9., 9., 8., 6., 4., 8., 15., 22.]])
traceback_matrix:
array([[0, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3],
[2, 1, 1, 3, 3, 3, 3, 3, 3, 3, 3],
[2, 1, 1, 1, 3, 3, 3, 3, 3, 3, 3],
[2, 1, 2, 2, 1, 3, 1, 3, 3, 3, 3],
[2, 1, 3, 3, 1, 1, 2, 1, 1, 3, 3],
[2, 2, 1, 3, 3, 3, 2, 1, 2, 1, 3],
[2, 2, 2, 1, 3, 3, 3, 1, 2, 2, 1],
[2, 2, 1, 2, 1, 1, 3, 3, 2, 1, 1]]))
seq1 and seq2:
seq1 = "HEAGAWGHEE"
seq2 = "PAWHEAE"
start_row and start_col:
start_row = score_matrix.shape[0]-1
start_col = score_matrix.shape[1]-1
The output should be the following aligned sequences and their score:
HEAGAWGHE-E
-PA--W-HEAE
1.0
Ideally, at the point where the error occurs, I want to append the character at position current_col-1 of input_seq to aligned_seqs1.
Thanks in advance.
Answer 0 (score: 0)
If you run the code you provided, you can see this line in the output:
aligned_seq: [] input_seq: H
It is produced by the print call just above the line that raises the exception.
Let's take a closer look:
for aligned_seq, input_seq in zip(aligned_seqs1, seq1):
    print("aligned_seq:", aligned_seq, "input_seq:", input_seq)
    aligned_seq.append(str(input_seq[current_col-1]))  # <--- Error
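I can only guess that you expected zip to pair each element of aligned_seqs1 with the whole of seq1. In fact, zip iterates over seq1 and yields it one character at a time, so input_seq is always exactly one character long. That is why indexing anything other than its first character raises the error.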
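To make the zip behaviour concrete, here is a minimal sketch that reuses seq1 from the question and reproduces the failing index outside the traceback function. The names pairs, first_list, first_char and error_message are introduced only for this illustration.

```python
seq1 = "HEAGAWGHEE"
aligned_seqs1 = [[] for _ in range(len(seq1))]

# zip pairs each empty list with ONE character of seq1,
# not with the whole string.
pairs = list(zip(aligned_seqs1, seq1))
first_list, first_char = pairs[0]
print(repr(first_char), len(first_char))

# The traceback starts in the last column, so the first lookup
# asks a one-character string for index len(seq1) - 1.
current_col = len(seq1)
error_message = None
try:
    first_char[current_col - 1]
except IndexError as exc:
    error_message = str(exc)
print("IndexError:", error_message)
```

The message printed at the end is the same one shown in the traceback in the question.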
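If you simply comment out the line that raises the error, that loop runs to completion, and its print output shows input_seq holding each character of the original seq1 in turn.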
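One possible way forward, sketched under the assumption that you only ever align a single pair of plain strings, is to drop the zip loops and index seq1 and seq2 directly. The function name traceback_pair and the constants MATCH, VGAP, HGAP and END are hypothetical names chosen for this example. The matrix is copied from the question as a nested list so the snippet runs without NumPy; with a NumPy array, traceback_matrix[row][col] works the same way.

```python
# Move codes, matching the encoding used in the question.
MATCH, VGAP, HGAP, END = 1, 2, 3, 0


def traceback_pair(traceback_matrix, seq1, seq2, gap="-"):
    """Walk from the bottom-right cell back to the origin.

    Columns index seq1 and rows index seq2, as in the question.
    """
    row, col = len(seq2), len(seq1)
    aligned1, aligned2 = [], []
    while True:
        move = traceback_matrix[row][col]
        if move == END:
            break
        if move == MATCH:
            aligned1.append(seq1[col - 1])
            aligned2.append(seq2[row - 1])
            row -= 1
            col -= 1
        elif move == VGAP:
            aligned1.append(gap)
            aligned2.append(seq2[row - 1])
            row -= 1
        elif move == HGAP:
            aligned1.append(seq1[col - 1])
            aligned2.append(gap)
            col -= 1
        else:
            raise ValueError("Invalid value in traceback matrix: %s" % move)
    # Characters were collected back to front, so reverse them.
    return "".join(reversed(aligned1)), "".join(reversed(aligned2))


# Traceback matrix copied from the question.
traceback_matrix = [
    [0, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3],
    [2, 1, 1, 3, 3, 3, 3, 3, 3, 3, 3],
    [2, 1, 1, 1, 3, 3, 3, 3, 3, 3, 3],
    [2, 1, 2, 2, 1, 3, 1, 3, 3, 3, 3],
    [2, 1, 3, 3, 1, 1, 2, 1, 1, 3, 3],
    [2, 2, 1, 3, 3, 3, 2, 1, 2, 1, 3],
    [2, 2, 2, 1, 3, 3, 3, 1, 2, 2, 1],
    [2, 2, 1, 2, 1, 1, 3, 3, 2, 1, 1],
]

aligned1, aligned2 = traceback_pair(traceback_matrix, "HEAGAWGHEE", "PAWHEAE")
print(aligned1)  # HEAGAWGHE-E
print(aligned2)  # -PA--W-HEAE
```

If you would rather keep your function unchanged, passing each sequence wrapped in a list, as in traceback(traceback_matrix, score_matrix, [seq1], [seq2]), should also make zip hand over the whole string, since the loops then iterate over a one-element list instead of over the characters.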
I hope this helps you make further progress. Let me know if you need any more help.