Question

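I am trying to do pairwise alignment with the Needleman-Wunsch algorithm. I have been able to build the score_matrix and the traceback_matrix, but I can't work out the traceback function that walks through the traceback matrix. My main problem is that I don't understand why I am getting this error, even though it looks simple: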

--------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-378-758f4e7985a2> in <module>()
     68 
     69 
---> 70 traceback(traceback_matrix, score_matrix, seq1, seq2)

<ipython-input-378-758f4e7985a2> in traceback(traceback_matrix, score_matrix, seq1, seq2, start_row, start_col)
     30             for aligned_seq, input_seq in zip(aligned_seqs1, seq1):
     31                 print("aligned_seq:", aligned_seq, "input_seq:", input_seq)
---> 32                 aligned_seq.append(str(input_seq[current_col-1]))
     33                 print("aligned_seq_new:", aligned_seq)
     34             for aligned_seq, input_seq in zip(aligned_seqs2, seq2):

IndexError: string index out of range

Here is the function:

def traceback(traceback_matrix, score_matrix, seq1, seq2, start_row=start_row, start_col=start_col):
    """Traverse traceback_matrix to get optimal alignments
    """
    _traceback_encoding = {'match': 1, 'vertical-gap': 2, 'horizontal-gap': 3,
                       'uninitialized': -1, 'alignment-end': 0}

    aend = _traceback_encoding['alignment-end']
    match = _traceback_encoding['match']
    vgap = _traceback_encoding['vertical-gap']
    hgap = _traceback_encoding['horizontal-gap']
    uninitialized = _traceback_encoding['uninitialized']
    gap_character = '-'

    aligned_seqs1 = [[] for e in range(len(seq1))]
    aligned_seqs2 = [[] for e in range(len(seq2))]

    current_row = start_row
    #print("current_row:", current_row)
    current_col = start_col

    best_score = score_matrix[current_row, current_col]
    print("best_score:", best_score)
    current_value = None

    while current_value != aend:
        current_value = traceback_matrix[current_row, current_col]
        print("current_value:", current_value, "current_row:", current_row, "current_col:", current_col)

        if current_value == match:
            for aligned_seq, input_seq in zip(aligned_seqs1, seq1):
                print("aligned_seq:", aligned_seq, "input_seq:", input_seq)
                aligned_seq.append(str(input_seq[current_col-1]))
                print("aligned_seq_new:", aligned_seq)
            for aligned_seq, input_seq in zip(aligned_seqs2, seq2):
                #print("aligned_seq2:", aligned_seq, "input_seq2:", input_seq)
                aligned_seq.append(str(input_seq[current_row-1]))
            current_row -= 1
            current_col -= 1
        elif current_value == vgap:
            for aligned_seq in aligned_seqs1:
                aligned_seq.append(gap_character)
            for aligned_seq, input_seq in zip(aligned_seqs2, seq2):
                aligned_seq.append(str(input_seq[current_row-1]))
            current_row -= 1
        elif current_value == hgap:
            for aligned_seq, input_seq in zip(aligned_seqs1, seq1):
                aligned_seq.append(str(input_seq[current_col-1]))
            for aligned_seq in aligned_seqs2:
                aligned_seq.append(gap_character)
            current_col -= 1
        elif current_value == aend:
            continue
        else:
            raise ValueError(
                "Invalid value in traceback matrix: %s" % current_value)

    for i in range(len(seq1)):
        aligned_seq = ''.join(aligned_seqs1[i][::-1])
        constructor = str
        aligned_seqs1[i] = constructor(aligned_seq)

    for i in range(len(seq2)):
        aligned_seq = ''.join(aligned_seqs2[i][::-1])
        constructor = str
        aligned_seqs2[i] = constructor(aligned_seq)

    return aligned_seqs1, aligned_seqs2, best_score, current_col, current_row

Inputs: the score matrix, the traceback matrix, sequence1, sequence2, start_row, and start_col.

score_matrix:

(array([[  0.,  -2.,  -4.,  -6.,  -8., -10., -12., -14., -16., -18., -20.],
        [ -2.,  -2.,  -3.,  -5.,  -7.,  -9., -11., -13., -15., -17., -19.],
        [ -4.,  -4.,  -3.,   1.,  -1.,  -3.,  -5.,  -7.,  -9., -11., -13.],
        [ -6.,  -6.,  -5.,  -1.,  -1.,  -3.,   8.,   6.,   4.,   2.,   0.],
        [ -8.,   2.,   0.,  -2.,  -3.,  -3.,   6.,   6.,  14.,  12.,  10.],
        [-10.,   0.,   7.,   5.,   3.,   1.,   4.,   4.,  12.,  19.,  17.],
        [-12.,  -2.,   5.,  11.,   9.,   7.,   5.,   4.,  10.,  17.,  18.],
        [-14.,  -4.,   3.,   9.,   9.,   8.,   6.,   4.,   8.,  15.,  22.]])

traceback_matrix:

array([[0, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3],
        [2, 1, 1, 3, 3, 3, 3, 3, 3, 3, 3],
        [2, 1, 1, 1, 3, 3, 3, 3, 3, 3, 3],
        [2, 1, 2, 2, 1, 3, 1, 3, 3, 3, 3],
        [2, 1, 3, 3, 1, 1, 2, 1, 1, 3, 3],
        [2, 2, 1, 3, 3, 3, 2, 1, 2, 1, 3],
        [2, 2, 2, 1, 3, 3, 3, 1, 2, 2, 1],
        [2, 2, 1, 2, 1, 1, 3, 3, 2, 1, 1]]))

seq1 and seq2:

seq1 = "HEAGAWGHEE"
seq2 = "PAWHEAE"

start_row and start_col:

start_row = score_matrix.shape[0]-1
start_col = score_matrix.shape[1]-1

The output should be the following aligned sequences and their score:

HEAGAWGHE-E
-PA--W-HEAE
1.0

Ideally, at the line where the error occurs, we want to append the character at position current_col-1 of input_seq to aligned_seqs1.

Thanks in advance.

Answer 1

If you run the code as provided, you will see this line in the output:

aligned_seq: [] input_seq: H

It is produced by the print just above the line that raises the exception.

Let's take a closer look:

for aligned_seq, input_seq in zip(aligned_seqs1, seq1):
    print("aligned_seq:", aligned_seq, "input_seq:", input_seq)
    aligned_seq.append(str(input_seq[current_col-1]))   # <--- Error

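I can only guess that you expected zip to pair each element of aligned_seqs1 with the whole of seq1. In fact, zip iterates over seq1 character by character, so input_seq is always exactly one character long. That is why indexing it at any position other than the first raises the error.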
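To make that concrete, here is a small sketch that reproduces the behaviour outside the traceback function. It only uses seq1 and the aligned_seqs1 construction from the question; the names pairs, first_aligned and first_input are made up for the illustration.

```python
seq1 = "HEAGAWGHEE"

# Same construction as in the question: one empty list per character of seq1.
aligned_seqs1 = [[] for _ in range(len(seq1))]

# zip walks over seq1 one character at a time, so every pair is
# (empty list, single character), never (empty list, whole sequence).
pairs = list(zip(aligned_seqs1, seq1))
first_aligned, first_input = pairs[0]

print(first_input)       # H
print(len(first_input))  # 1

# On the first pass current_col is 10, so the question's code indexes
# position 9 of a one-character string.
try:
    first_input[9]
except IndexError as exc:
    print("IndexError:", exc)  # IndexError: string index out of range
```

The only indices that are valid for a one-character string are 0 and -1, which is why the very first match step already fails.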

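If you simply comment out the line that raises the error, you will see the complete output, and you will see that input_seq holds each character of the original seq1, one at a time.

I hope this helps you make further progress. Let me know if you need any more help.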
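If seq1 and seq2 are each a single string, as in your example, one possible way forward is to drop the zip loops and index the strings directly. The following is only a sketch under that assumption, not a drop-in replacement for your function: the name traceback_single is hypothetical, the score lookup is left out, and the matrix is copied from your question as a plain nested list (indexing with [row][col] should also work on a NumPy array).

```python
# Encoding taken from the question's _traceback_encoding dictionary.
END, MATCH, VGAP, HGAP = 0, 1, 2, 3


def traceback_single(traceback_matrix, seq1, seq2, gap_character="-"):
    """Walk the traceback matrix for ONE pair of sequences (hypothetical helper).

    Columns correspond to seq1 and rows to seq2, as in the question.
    """
    row, col = len(seq2), len(seq1)  # start in the bottom-right cell
    aligned1, aligned2 = [], []

    while True:
        move = traceback_matrix[row][col]
        if move == END:
            break
        if move == MATCH:            # diagonal step: consume one character of each
            aligned1.append(seq1[col - 1])
            aligned2.append(seq2[row - 1])
            row -= 1
            col -= 1
        elif move == VGAP:           # step up: gap in seq1
            aligned1.append(gap_character)
            aligned2.append(seq2[row - 1])
            row -= 1
        elif move == HGAP:           # step left: gap in seq2
            aligned1.append(seq1[col - 1])
            aligned2.append(gap_character)
            col -= 1
        else:
            raise ValueError("Invalid value in traceback matrix: %s" % move)

    # The characters were collected from the end backwards, so reverse them.
    return "".join(reversed(aligned1)), "".join(reversed(aligned2))


traceback_matrix = [
    [0, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3],
    [2, 1, 1, 3, 3, 3, 3, 3, 3, 3, 3],
    [2, 1, 1, 1, 3, 3, 3, 3, 3, 3, 3],
    [2, 1, 2, 2, 1, 3, 1, 3, 3, 3, 3],
    [2, 1, 3, 3, 1, 1, 2, 1, 1, 3, 3],
    [2, 2, 1, 3, 3, 3, 2, 1, 2, 1, 3],
    [2, 2, 2, 1, 3, 3, 3, 1, 2, 2, 1],
    [2, 2, 1, 2, 1, 1, 3, 3, 2, 1, 1],
]

aligned1, aligned2 = traceback_single(traceback_matrix, "HEAGAWGHEE", "PAWHEAE")
print(aligned1)  # HEAGAWGHE-E
print(aligned2)  # -PA--W-HEAE
```

With the traceback matrix from your question this gives the two aligned strings you listed as the expected output. If you do need to align several sequences per side, another option would be to keep the zip loops but pass lists of strings, for example [seq1] and [seq2], so that input_seq is a whole sequence rather than one character.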
