Question

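I am trying to create a list of lists, one list for each DNA nucleotide (A, G, C, T) seen in a motif of a given length in a file, where each nucleotide's list counts how often that nucleotide occurs at each position.

Example: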

>this line of the file should be ignored, will always start with >
AGTCCCGCCCGGAG

>this is start of next seq
GGTCAGTCAAAAGTGAGCC

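I want my code to start at the first 'GT' it encounters in a sequence (here, positions 1,2) and begin building lists for A, G, T and C. For each position up to a given length (entered by the user), the value for that nucleotide in its list should increase by 1. I am a student and I want to learn how to use lists of lists in Python. For the sequences above, with a user-entered length of 6, I want the code to return: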

alist = [0 0 0 1 0 0]
glist = [2 0 0 0 1 1]
clist = [0 0 2 1 1 0]
tlist = [0 2 0 0 0 1]

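I need the code to look at all the sequences in the file (a new sequence is indicated by a line starting with >) and update these lists, keeping the positions aligned across sequences and incrementing the appropriate position for each nucleotide. This is what I have so far, but it is quite an eyesore, and I am running into str versus list type problems...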

def position_frequency_matrix(filename, length):
        glist,tlist,clist,alist = [],[],[],[]
        gcount, acount, tcount, ccount = 0,0,0,0
        pos = 0
        with open(filename, "r") as f:
             for line in f:
                if not line.startswith('>'):
                    if 'G' and 'T' in line:
                        pos = line.index('GT')
                        for nuc in range(len(line[line.index('GT'):length])):
                            line = list(line)
                            pos += 1
                            if nuc == 'G':
                                gcount += 1
                                glist.append(int(gcount))
                            if nuc == 'T':
                                tcount += 1
                                tlist.append(int(tcount))
                            else:
                                tlist.append(0)
                            if nuc == 'C':
                                ccount += 1
                                clist.append(int(ccount))
                            else:
                                clist.append(0)
                            if nuc == 'A':
                                acount += 1
                                alist.append(int(acount))
                            else:
                                alist.append(0)
            return(alist,glist,clist,tlist)

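Please post answers in an easy-to-read format; condensed Pythonic code may solve the problem, but if I cannot unpack the code and write it in my own way, it does not help a student like me learn Python. Thanks!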

Answer 1

Your code has a few problems. I don't have a full example dataset to run it on, but here are some pointers. First:

for nuc in range(len(line[line.index('GT'):length])):

Here you are making a list of integers that is as long as the slice running from where 'GT' is found up to index length. So nuc is an integer, just like i in

for i in range(5):

What you are (probably) trying to do is:

for nuc in line[line.index('GT'):length]:

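Now nuc is a character from the string you are slicing. This is not very robust: what if 'GT' is not in the line? It is a start, though, and you can easily improve on it. You are checking: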

if 'G' and 'T' in line:

But that is also true when 'G' and 'T' are not next to each other:

'GaccaTcc'

So the line.index('GT') call would still fail with a ValueError. You can either handle that exception and drop the if 'G' and 'T' in line: check, or simply check if 'GT' in line: instead.
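The pointers above can be checked in an interpreter. The snippet below is only a small sketch using the first example sequence and the 'GaccaTcc' string; the variable names (indices, chars, window, loose_check, strict_check) are made up for illustration. It also shows one detail not spelled out above: the end of a slice is an absolute index, so a window of length characters starting at the motif would need start + length.

```python
line = "AGTCCCGCCCGGAG"
length = 6  # window length chosen by the user

# Iterating over range(...) yields integers, not nucleotides.
indices = [i for i in range(len(line[line.index("GT"):length]))]

# Iterating over the slice itself yields the characters.
chars = [nuc for nuc in line[line.index("GT"):length]]

# str.find returns -1 when the motif is missing instead of raising ValueError.
start = line.find("GT")
window = line[start:start + length] if start != -1 else ""

other = "GaccaTcc"

# `'G' and 'T' in other` parses as `'G' and ('T' in other)`. The non-empty
# string 'G' is always truthy, so only the presence of 'T' is really tested.
loose_check = bool('G' and 'T' in other)
strict_check = 'GT' in other

# Alternative to the membership test: catch the exception from index().
try:
    pos = other.index('GT')
except ValueError:
    pos = None
```

Here indices holds integers while chars holds single-character strings, which is why comparing nuc to 'G' never matched in the original loop.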

希望这有帮助。
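Putting these pointers together, one possible way to write the function is sketched below. It is not the only approach, and it rests on a few assumptions that go beyond the question: each sequence sits on a single line, nucleotides are upper case, and the NUCLEOTIDES constant (a name introduced here) fixes the row order as A, G, C, T. It returns one list of lists rather than four separate lists.

```python
NUCLEOTIDES = "AGCT"  # assumed row order of the matrix: A, G, C, T


def position_frequency_matrix(filename, length):
    # One row per nucleotide, one column per position in the window.
    matrix = [[0] * length for _ in NUCLEOTIDES]

    with open(filename, "r") as f:
        for line in f:
            line = line.strip()

            # Skip blank lines and header lines that start with '>'.
            if not line or line.startswith(">"):
                continue

            # find() returns -1 when 'GT' does not occur in this sequence.
            start = line.find("GT")
            if start == -1:
                continue

            # Take `length` characters starting at the first 'GT'.
            window = line[start:start + length]

            for pos, nuc in enumerate(window):
                if nuc in NUCLEOTIDES:
                    row = NUCLEOTIDES.index(nuc)
                    matrix[row][pos] += 1

    return matrix
```

For a file containing the two example sequences from the question and a length of 6, this returns [[0, 0, 0, 1, 0, 0], [2, 0, 0, 0, 1, 1], [0, 0, 2, 1, 1, 0], [0, 2, 0, 0, 0, 1]], which matches the alist, glist, clist and tlist values asked for. Each row can be unpacked into its own name with alist, glist, clist, tlist = position_frequency_matrix(filename, 6).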
