Question

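I was given a FASTA-formatted file (for example from this website: http://www.uniprot.org/proteomes/) that contains the protein-coding sequences of a particular bacterium. I have been asked to report the full count and relative percentage of each single-letter amino acid code in the file, and to return results like: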

L: 139002 (10.7%) 

A: 123885 (9.6%) 

G: 95475 (7.4%) 

V: 91683 (7.1%) 

I: 77836 (6.0%)

What I have so far:

 #!/usr/bin/python
ecoli = open("/home/file_pathway").read()
counts = dict()
for line in ecoli:
    words = line.split()
    for word in words:
        if word in ["A", "C", "D", "E", "F", "G", "H", "I", "K", "L", "M", "N", "P", "Q", "R", "S", "T", "V", "W", "Y"]:
            if word not in counts:
                counts[word] = 1
            else:
                counts[word] += 1

for key in counts:
    print key, counts[key]

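I believe this retrieves every instance of the capital letters, not just those inside the protein amino acid strings. How can I limit it to the coding sequences only? I am also having trouble working out how to express each single-letter count as a share of the total.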

Answer 1

The only lines that do not contain what you want start with >, so just ignore those:

from collections import defaultdict

# the 20 standard single-letter amino acid codes
AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")

counts = defaultdict(int)
with open("input.fasta") as ecoli:  # will close your file automatically
    for line in ecoli:  # iterate over file object, no need to read all contents into memory
        if line.startswith(">"):  # skip lines that start with >
            continue
        for char in line:  # just iterate over the characters in the line
            if char in AMINO_ACIDS:
                counts[char] += 1
total = float(sum(counts.values()))
for key, val in counts.items():
    print("{}: {}, ({:.1%})".format(key, val, val / total))

You can also use a collections.Counter dict, since the remaining lines only contain what you are interested in:

from collections import Counter

counts = Counter()
with open("input.fasta") as ecoli:  # will close your file automatically
    for line in ecoli:  # iterate over file object, no need to read all contents into memory
        if line.startswith(">"):  # skip lines that start with >
            continue
        counts.update(line.rstrip())  # count every character left on the line
total = float(sum(counts.values()))
for key, val in counts.items():
    print("{}: {}, ({:.1%})".format(key, val, val / total))
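The sample output in the question is ordered from most to least frequent, which a plain dict loop does not guarantee. A minimal sketch of one way to get that ordering is Counter.most_common(). The helper names count_residues and format_report and the in-memory demo data are made up for illustration, and the sketch assumes Python 3 division.

```python
from collections import Counter
import io


def count_residues(lines):
    """Count sequence characters, skipping FASTA header lines."""
    counts = Counter()
    for line in lines:
        if line.startswith(">"):  # header line, not sequence data
            continue
        counts.update(line.strip())
    return counts


def format_report(counts):
    """Return 'X: count (pct%)' rows, most frequent residue first."""
    total = sum(counts.values())
    return ["{}: {} ({:.1%})".format(aa, n, n / total)
            for aa, n in counts.most_common()]


# Hypothetical stand-in for an open FASTA file.
demo = io.StringIO(">demo header with CAPITALS\nLLLA\nLGG\n")
report = format_report(count_residues(demo))
for row in report:
    print(row)
```

To run it on a real file, pass the open file object to count_residues in place of demo.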

Answer 2

You are correct that, the way you are approaching this, you will be counting instances of the characters even in the description lines.

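But your code won't even run. Have you tried it? You call line.split(), but line is not defined (along with many other errors). Also, because ecoli is a string, looping over it with for already goes character by character.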
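A small sketch of the character-by-character point, using a made-up two-line string rather than a real proteome file: iterating over the result of read() yields single characters, while splitlines() yields whole lines.

```python
# Hypothetical FASTA snippet, used only for illustration.
sample = ">sp|TEST protein\nMKV\n"

# read() returns one big string, and iterating over a string
# yields single characters, not lines.
chars = [c for c in sample]

# splitlines() is what gives one item per line.
lines = sample.splitlines()

print(chars[:4])
print(lines)
```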

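A simple approach is to read the file, split it on newlines, skip the lines that start with ">", count each character you care about, and keep a running total of all the characters analyzed.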

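#!/usr/bin/python
# the 20 standard single-letter amino acid codes
amino_acids = ["A", "C", "D", "E", "F", "G", "H", "I", "K", "L", "M", "N", "P", "Q", "R", "S", "T", "V", "W", "Y"]
counts = dict()
for acid in amino_acids:
    counts[acid] = 0
total = 0

with open("/home/file_pathway.faa") as fasta:  # closes the file automatically
    ecoli = fasta.read()

for line in ecoli.split('\n'):
    if not line.startswith(">"):  # skip description lines
        total += len(line)  # running total of all sequence characters
        for acid in counts.keys():
            counts[acid] += line.count(acid)

# report each count and its share of the running total
for acid in amino_acids:
    print("{}: {} ({:.1%})".format(acid, counts[acid], counts[acid] / float(total)))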

Answer 3

Using Counter makes this even easier and avoids the manual dictionary bookkeeping (I like dicts, but in this case Counter really makes sense).

from collections import Counter

filename = "input.fasta"        # placeholder: path to your FASTA file
acids = ""                      # dunno if this is the right terminology
with open(filename, 'r') as ecoli_file:
    for line in ecoli_file:
        if line.startswith('>'):
            continue
        # from what I saw in the FASTA files, the character-check is
        # not necessary anymore...
        acids += line.strip()   # stripping newline and possible whitespaces
counter = Counter(acids)        # and all the magic is done.
total = float(sum(counter.values()))
for k, v in counter.items():
    print("{}: {} ({:.1%})".format(k, v, v / total))

Since Counter accepts any iterable, it should also be possible to do this with a generator:

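from collections import Counter

filename = "input.fasta"  # placeholder: path to your FASTA file
with open(filename) as f:
    # feed every character of every non-header line straight into Counter
    counter = Counter(c for line in f if not line.startswith('>')
                      for c in line.strip())
# and now as above
total = float(sum(counter.values()))
for k, v in counter.items():
    print("{}: {} ({:.1%})".format(k, v, v / total))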
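If the file might contain letters outside the 20 standard codes (UniProt sequences can include symbols such as X or U), the generator can filter them out. This is only a sketch under that assumption; the names STANDARD and count_standard and the demo data are invented for illustration.

```python
from collections import Counter
import io

# The 20 standard single-letter amino acid codes from the question.
STANDARD = frozenset("ACDEFGHIKLMNPQRSTVWY")


def count_standard(lines):
    """Count only standard residues on non-header lines."""
    return Counter(
        c
        for line in lines
        if not line.startswith(">")   # skip FASTA description lines
        for c in line.strip()
        if c in STANDARD              # drop X, U and anything else
    )


# Hypothetical stand-in for an open FASTA file.
demo = io.StringIO(">demo\nMKXU\nAM\n")
counts = count_standard(demo)
print(counts)
```

Whether to drop or keep the non-standard symbols changes the denominator of the percentages, so it is worth deciding explicitly.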

Python Dict and For Loop with a FASTA File
