Question

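I am trying to create one list per nucleotide (A, G, C, T) for a set of sequences, where each list index corresponds to a position in the sequence and the value is how often that nucleotide occurs at that position across all sequences. Here are 4 sequences as an example:

GTAGGGCGA
GTATACAGC
GTTTCTCTT
GTAATCAAA

The code I wrote:

def function(filename, length):
    g,t,c,a = [],[],[],[]
    with open(filename, "r") as f:
        for line in f:
            if line.startswith('GT'):
                gcount, acount, tcount, ccount = 0, 0, 0, 0
                g = [gcount + 1 if nuc == 'G' else gcount for nuc in line[:length]]
                return g

Right now this code only looks at the G nucleotide, and I get one list per sequence rather than a single list holding the value for each index:

[1, 0, 0, 1, 1, 1, 0, 1, 0]
[1, 0, 0, 0, 0, 0, 0, 1, 0]
[1, 0, 0, 0, 0, 0, 0, 0, 0]
[1, 0, 0, 0, 0, 0, 0, 0, 0]

What I want as the output for g is just:

[4, 0, 0, 1, 1, 1, 0, 2, 0]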

Answer 1

You can use numpy. Just convert your lists to numpy arrays and add them.

import numpy as np

list1 = np.array([1, 0, 0, 1, 1, 1, 0, 1, 0])
list2 = np.array([1, 0, 0, 0, 0, 0, 0, 1, 0])
list3 = np.array([1, 0, 0, 0, 0, 0, 0, 0, 0])
list4 = np.array([1, 0, 0, 0, 0, 0, 0, 0, 0])

list1 + list2 + list3 + list4  # desired result!
# array([4, 0, 0, 1, 1, 1, 0, 2, 0])
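Adding arrays by name gets unwieldy once there are more than a few sequences. A minimal sketch of the same idea, assuming the per-sequence lists have been collected in a list called `rows` (a name introduced here for illustration): stack them into one 2-D array and sum down the columns.

```python
import numpy as np

# One row per sequence, one column per position (values from the question).
rows = [
    [1, 0, 0, 1, 1, 1, 0, 1, 0],
    [1, 0, 0, 0, 0, 0, 0, 1, 0],
    [1, 0, 0, 0, 0, 0, 0, 0, 0],
    [1, 0, 0, 0, 0, 0, 0, 0, 0],
]

# axis=0 sums down each column, giving one total per position.
totals = np.array(rows).sum(axis=0)
print(totals.tolist())  # [4, 0, 0, 1, 1, 1, 0, 2, 0]
```

This requires every row to have the same length; ragged rows would not form a regular 2-D array.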

Here is how to modify your current function to support this:

def function(filename, length):
    # running per-position total of G counts, starting at all zeros
    base = np.zeros(length, dtype=int)  # 1-D array with `length` entries
    with open(filename, "r") as f:
        for line in f:
            if line.startswith('GT'):
                # 1 where this sequence has a G, 0 elsewhere
                # (assumes each matching line has at least `length` characters)
                g = np.array([1 if nuc == 'G' else 0 for nuc in line[:length]])
                base = base + g  # add this new numpy array
    return base  # return the summed result

See the numpy installation instructions if you do not have it installed.

Answer 2

You can do it like this:

rows = [[1, 0, 0, 1, 1, 1, 0, 1, 0],
        [1, 0, 0, 0, 0, 0, 0, 1, 0],
        [1, 0, 0, 0, 0, 0, 0, 0, 0],
        [1, 0, 0, 0, 0, 0, 0, 0, 0]]
g = [sum(column) for column in zip(*rows)]
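If the rows arrive one at a time (for example, one per line of a file), the totals can also be built up by incrementing a single list in place rather than collecting every row first. A minimal sketch in plain Python; `add_row` is a hypothetical helper name, not something from the question or the answers.

```python
def add_row(totals, row):
    """Add each value of `row` to the matching index of `totals`, in place."""
    for i, value in enumerate(row):
        totals[i] += value
    return totals

rows = [
    [1, 0, 0, 1, 1, 1, 0, 1, 0],
    [1, 0, 0, 0, 0, 0, 0, 1, 0],
    [1, 0, 0, 0, 0, 0, 0, 0, 0],
    [1, 0, 0, 0, 0, 0, 0, 0, 0],
]

totals = [0] * 9  # one slot per position, all starting at zero
for row in rows:
    add_row(totals, row)

print(totals)  # [4, 0, 0, 1, 1, 1, 0, 2, 0]
```

Only the running totals are kept in memory, so the number of rows does not matter.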

Answer 3

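This may be easier to understand. Note that it will eventually fail if the input file is too large, because zip(*...) needs every line in memory at once: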

import collections
import io

test_data = """>ignore this
GTAGGGCGA
>ignore this
GTATACAGC
>ignore this
GTTTCTCTT
>ignore this
GTAATCAAA
"""

testfh = io.StringIO(test_data)

counts = [collections.Counter(fil) for fil in zip(*(x.strip() for x in testfh if not x.startswith('>')))]

for key in 'ACGT':
    key_counts = [cnt[key] for cnt in counts]
    print("{}: {}".format(key, key_counts))

The output is:

A: [0, 0, 3, 1, 1, 0, 2, 1, 2]
C: [0, 0, 0, 0, 1, 2, 2, 0, 1]
G: [4, 0, 0, 1, 1, 1, 0, 2, 0]
T: [0, 4, 1, 2, 1, 1, 0, 1, 1]
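The one-liner above unpacks every sequence into zip at once. A minimal sketch of a line-at-a-time alternative that keeps only one Counter per position in memory, assuming the same '>' header convention as the test data; `count_by_position` is a hypothetical helper name, not part of the original answer.

```python
import collections
import io

def count_by_position(lines):
    """Return one Counter per position, reading one line at a time."""
    counts = []
    for line in lines:
        if line.startswith('>'):
            continue  # skip header lines
        seq = line.strip()
        # Grow the list if this sequence is longer than any seen so far.
        while len(counts) < len(seq):
            counts.append(collections.Counter())
        for position, nuc in enumerate(seq):
            counts[position][nuc] += 1
    return counts

test_data = """>ignore this
GTAGGGCGA
>ignore this
GTATACAGC
>ignore this
GTTTCTCTT
>ignore this
GTAATCAAA
"""

counts = count_by_position(io.StringIO(test_data))
g_counts = [cnt['G'] for cnt in counts]
a_counts = [cnt['A'] for cnt in counts]
print(g_counts)  # [4, 0, 0, 1, 1, 1, 0, 2, 0]
```

Unlike zip, this version also tolerates sequences of different lengths.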

Edit

Without the comprehensions, this line:

counts = [collections.Counter(fil) for fil in zip(*(x.strip() for x in testfh if not x.startswith('>')))]

becomes this:

clean_lines = []
for x in testfh:
    if not x.startswith('>'):
        clean_lines.append(x.strip())

At this point, clean_lines contains only the good parts, with no newlines:

GTAGGGCGA
GTATACAGC
GTTTCTCTT
GTAATCAAA

Next, I turn them sideways so that the vertical stripes can be fed to Counter:

file_and_rank = zip(*clean_lines)

In that line, *clean_lines takes each line (which is a string) and flattens them into one big argument list, as if I had called:

file_and_rank = zip('GTAGGGCGA', 'GTATACAGC', 'GTTTCTCTT', 'GTAATCAAA', ...)

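The zip operation groups iterables together. It walks through all of them at the same time, taking one value from each iterable. It then puts those values together into a tuple and returns it. So it turns the GTAGGGCGA, ... strings into tuples of the first, second, third, etc. characters of each string:

(GGGG)
(TTTT)
(AATA)
(GTTA)
...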
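The transposition can be checked directly in a small standalone sketch. The name `columns` is introduced here for illustration only.

```python
lines = ['GTAGGGCGA', 'GTATACAGC', 'GTTTCTCTT', 'GTAATCAAA']

# zip(*lines) yields one tuple per position, holding that position's
# character from every sequence.
columns = list(zip(*lines))

print(columns[0])           # ('G', 'G', 'G', 'G')
print(''.join(columns[2]))  # AATA
print(len(columns))         # 9
```

Note that zip stops at the shortest input, so if the sequences had different lengths the extra trailing characters would be silently dropped.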

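Next, I need to build a count of how many of each nucleotide occur at each position. So I use collections.Counter (which wants an iterable!), but I need a separate one for each position. So I make a list of them:

counts = []
for fil in file_and_rank:
    counts.append(collections.Counter(fil))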

I am just taking a tuple, like (AATA), and passing it to the constructor of collections.Counter. That produces a counter of {A:3, T:1} for that position.
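A small standalone sketch of that last step, using the third column from the example data. It also shows why looking up a nucleotide that never occurs at a position is safe: a Counter returns 0 for missing keys instead of raising KeyError.

```python
from collections import Counter

# The third "vertical stripe" of the four example sequences.
cnt = Counter(('A', 'A', 'T', 'A'))

print(cnt['A'])  # 3
print(cnt['T'])  # 1
print(cnt['G'])  # 0, because G does not occur at this position
```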

Answer 4

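Like Austin, I suggest using Counter from the collections module. It was built for this kind of task. Here is my variation: keep one Counter per nucleotide, and feed each counter the positions at which that nucleotide occurs. It processes one line at a time, like the original code.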

from collections import Counter

def function(filename, length):
    # a counter for each nucleotide
    count = {'G':Counter(),
             'T':Counter(),
             'C':Counter(),
             'A':Counter()
             }

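    max_length = 0

    with open(filename, "r") as f:
        for line in f:
            if line.startswith('GT'):
                # drop the trailing newline and honor the `length` argument
                seq = line.strip()[:length]
                for position, nuc in enumerate(seq):
                    # update position count for the nucleotides in this line
                    if nuc in count:  # skip anything that is not G, T, C or A
                        count[nuc].update([position])

                # keep track of longest line
                max_length = max(max_length, len(seq))

    # one list per nucleotide, in G, T, C, A order
    g = [[count[nuc][i] for i in range(max_length)] for nuc in 'GTCA']

    return g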

Increment the values of a list at specific indices in Python

4 answers: