Question

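I'm trying to use BioPython to edit an MSA (multiple sequence alignment) file generated by ClustalW, so that every sequence is trimmed to start where the consensus sequence begins. The xxx stands for other bases that are not relevant here.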

Here is a sample of the input and the output I want:

Input

ITS_primer_fw               --------------------------------CGCGTCCACTMTCCAGTT
RBL67ITS_full_sequence      CCACCCCAACAAGGGCGGCCACGCGGTCCGCTCGCGTCCACTCTCCAGTTxxxxxxxxxxxxxxxxxxxxxxx
PRL2010                     ACACCCCCGAAAGGGCGTCC------CCTGCTCGCGTCCACTATCCAGTTxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
BBF32_3                     ACACACCCACAAGGGCGAGCAGGCG----GCTCGCGTCCACTATCCAGTTxxxxxxxxxxxxxx
BBFCG32                     CAACACCACACCGGGCGAGCGGG-------CTCGCGTCCACTGTCGAGTTxxxxxxxxxxxxxxxxxxxxxx

Expected output

ITS_primer_fw               CGCGTCCACTMTCCAGTT
RBL67ITS_full_sequence      CGCGTCCACTCTCCAGTTxxxxxxxxxxxxxxxxxxxx
PRL2010                     CGCGTCCACTATCCAGTTxxxxxxxxxxxxxxxxxxxxx
BBF32_3                     CGCGTCCACTATCCAGTTxxxxxxxxxxxxxxxxxxx
BBFCG32                     CGCGTCCACTGTCGAGTTxxxxxxxxxxxxxxxxxxxx

The AlignIO documentation only describes extracting part of an alignment by treating it as an array. In this example

from Bio import AlignIO

align = AlignIO.read(input_file, "clustal")
sub_alignment = align[:, 20:]  # all rows, columns from index 20 onward

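I can extract a sub-alignment containing all sequences (:) from column 20 onward (0-based index). I'm looking for a way to replace the 20 in the example with the position of the first nucleotide of the consensus sequence.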

Answer 1

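Found the answer, thanks to a Biostars user.

The trick is to scan the columns for the starting point, which is the first column after the last leading '-'. By default, the first row of the alignment is the shortest sequence, and it is padded with '-' up to the point where all sequences align.

So here is the code: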

from Bio import AlignIO

aln = AlignIO.read(input_file, "clustal")
position = None
for col in range(aln.get_alignment_length()):  # scan columns left to right
    if '-' not in aln[:, col]:                 # aln[:, col] is the column as a string
        position = col                         # first column with no gaps
        print('First full column is {}'.format(col))
        break
if position is not None:
    print(aln[:, position:])                   # print the alignment from that column onward
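
For reuse, the same column scan could be wrapped in a small helper. The sketch below is only an illustration: the function name first_gap_free_column and the toy rows are made up for this example, and it works on plain strings rather than on a Biopython alignment object so that it runs without any input file.

```python
def first_gap_free_column(rows, gap="-"):
    """Return the index of the first column with no gap in any row, or None."""
    rows = list(rows)
    if not rows:
        return None
    # Only scan columns that exist in every row.
    width = min(len(row) for row in rows)
    for col in range(width):
        if all(row[col] != gap for row in rows):
            return col
    return None


# Toy rows loosely modelled on the question's layout (hypothetical data).
rows = [
    "----CGCGTCC",
    "CCGCCGCGTCC",
    "C--TCGCGTCC",
]
start = first_gap_free_column(rows)
# Leave the rows untouched if every column contains a gap.
trimmed = [row[start:] for row in rows] if start is not None else rows
print(start)
print(trimmed)
```

With a Biopython alignment, the rows could presumably be built as [str(record.seq) for record in aln], and the returned index used as aln[:, start:]. Returning None when every column contains a gap avoids slicing with an undefined position.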
