Question

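I am writing a function that should go through a .fasta file of DNA sequences and create a dictionary of nucleotide (nt) and dinucleotide (dnt) frequencies for each sequence in the file. I then store each dictionary in a list called "frequency". This is the piece of code that is acting strangely: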

for fasta in seq_file:
    freq = {}
    dna = str(fasta.seq)
    for base1 in ['A', 'T', 'G', 'C']:
        onefreq = float(dna.count(base1)) / len(dna)
        freq[base1] = onefreq
        for base2 in ['A', 'T', 'G', 'C']:
            dinucleotide = base1 + base2
            twofreq = float(dna.count(dinucleotide)) / (len(dna) - 1) 
            freq[dinucleotide] = twofreq
    frequency.append(freq)

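(By the way, I am using Biopython so that I don't have to load the whole FASTA file into memory. Also, this is for ssDNA, so I don't have to account for antisense dnt.)

The frequencies recorded for the single nt add up to 1.0, but the frequencies for the dnt do not. This is odd, since the method of calculating the two kinds of frequencies looks the same to me.

I left the diagnostic print statements and "check" variables in: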

for fasta in seq_file:
    freq = {}
    dna = str(fasta.seq)
    check = 0.0
    check2 = 0.0
    for base1 in ['A', 'T', 'G', 'C']:
        onefreq = float(dna.count(base1)) / len(dna)
        freq[base1] = onefreq
        check2 += onefreq
        for base2 in ['A', 'T', 'G', 'C']:
            dinucleotide = base1 + base2
            twofreq = float(dna.count(dinucleotide)) / (len(dna) - 1) 
            check += twofreq
            print(twofreq)
            freq[dinucleotide] = twofreq
    print("\n")
    print(check, check2)
    print(len(dna))
    print("\n")
    frequency.append(freq)

This gives the following output (for a single sequence):

0.0894168466523 
0.0760259179266
0.0946004319654
0.0561555075594
0.0431965442765
0.0423326133909
0.0747300215983
0.0488120950324
0.0976241900648
0.0483801295896
0.0539956803456
0.0423326133909
0.0863930885529
0.0419006479482
0.0190064794816
0.031101511879


(0.9460043196544274, 1.0)
2316

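Here we can see the frequency of each of the 16 different dnt, the sum of all dnt frequencies (0.946), the sum of all nt frequencies (1.0), and the length of the sequence.

Why do the dnt frequencies not add up to 1.0?

Thanks for your help. I am very new to Python and this is my first question, so I hope this submission is acceptable.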

Answer 1

To see the problem, try the following FASTA:

>test
AAAAAA

"AAAAAA".count("AA")

You get 3, although the expected count is 5.

Reason

From the documentation: count returns the number of (non-overlapping) occurrences of substring sub in string s[start:end].
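To make the non-overlapping behaviour concrete, here is a minimal sketch of an overlap-aware counter built on str.find. The helper name count_overlapping is my own and is not part of the standard library.

```python
def count_overlapping(text, sub):
    """Count occurrences of sub in text, allowing overlaps (hypothetical helper)."""
    count = 0
    start = text.find(sub)
    while start != -1:
        count += 1
        # Advance by one character only, so overlapping matches are found.
        start = text.find(sub, start + 1)
    return count

print("AAAAAA".count("AA"))               # non-overlapping: 3
print(count_overlapping("AAAAAA", "AA"))  # overlapping: 5
```

str.count resumes its search after the end of each match, whereas this loop resumes one character after the start of each match, which is why the two numbers differ.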

Solution: use Counter and a chunks function

from collections import Counter

from Bio import SeqIO

def chunks(l, n):
  # yield every overlapping window of length n: l[0:n], l[1:n+1], ...
  for i in range(0, len(l)-(n-1)):
    yield l[i:i+n]

frequency = []
input_file = "test.fasta"
# SeqIO.parse accepts a filename, so no file handle is left open
for fasta in SeqIO.parse(input_file, "fasta"):
  dna = str(fasta.seq)
  freq = Counter(dna)   #get counter of single bases
  freq.update(Counter(chunks(dna,2))) #update with counter of dinucleotide
  frequency.append(freq)

For "AAAAAA" this gives:

Counter({'A': 6, 'AA': 5})
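Note that the counters hold raw counts, not frequencies. A minimal sketch of the conversion follows; the function name to_frequencies is hypothetical, and the denominators (len(dna) for nt, len(dna) - 1 for dnt) are taken from the question.

```python
from collections import Counter

def to_frequencies(dna):
    """Return nt and dnt frequencies for one sequence (hypothetical helper)."""
    nt_counts = Counter(dna)
    # Overlapping windows of length 2: dna[0:2], dna[1:3], ...
    dnt_counts = Counter(dna[i:i + 2] for i in range(len(dna) - 1))
    freq = {nt: n / float(len(dna)) for nt, n in nt_counts.items()}
    freq.update({dnt: n / float(len(dna) - 1) for dnt, n in dnt_counts.items()})
    return freq

print(to_frequencies("AAAAAA"))  # {'A': 1.0, 'AA': 1.0}
```

Only the bases and pairs that actually occur appear as keys, so a lookup for an absent dinucleotide would need freq.get(key, 0.0).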

Answer 2

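str.count() does not count overlapping occurrences of the motif it is looking for.

Example:

If your sequence contains 'AAAA' and you are looking for the dinucleotide 'AA', you would expect 'AAAA'.count('AA') to return 3, but it returns 2. So: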

print(float('AAAA'.count('AA')) / (len('AAAA') - 1))
0.666666

instead of 1.

You only need to change the line that computes the frequency:

twofreq = len([i for i in range(len(dna)-1) if dna[i:i+2] == dinucleotide]) / float((len(dna) - 1))
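As an alternative to slicing at every index, a zero-width regular-expression lookahead also finds overlapping matches. This is a different technique from the one-liner in this answer, shown only as a sketch; the helper name dnt_frequency and the test sequence are made up for illustration.

```python
import re

def dnt_frequency(dna, dinucleotide):
    """Overlap-aware frequency of one dinucleotide (hypothetical helper)."""
    # A zero-width lookahead matches at every start position, so overlaps count.
    hits = len(re.findall("(?=" + re.escape(dinucleotide) + ")", dna))
    return hits / float(len(dna) - 1)

bases = ["A", "T", "G", "C"]
dna = "AAAATTGCGC"  # made-up test sequence, not from the question
total = sum(dnt_frequency(dna, b1 + b2) for b1 in bases for b2 in bases)

print(dnt_frequency("AAAA", "AA"))  # 1.0
print(abs(total - 1.0) < 1e-9)      # True: the 16 frequencies now sum to one
```

Like the original loop, this still scans the sequence once per dinucleotide, so it fixes the overlap problem but not the repeated scanning.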

Answer 3

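You are scanning the string far more often than you need to: 20 times, in fact (once for each of the 4 nucleotides and once for each of the 16 dinucleotides). For small test sequences this may not matter, but it will become noticeable as they get larger. I would recommend a different approach, which solves the overlap problem as a side effect: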

nucleotides = [ 'A', 'T', 'G', 'C' ]
dinucleotides = [ x+y for x in nucleotides for y in nucleotides ]
counts = { x : 0 for x in nucleotides + dinucleotides }

# count the first nucleotide, which has no previous one
n_nucl = 1
prevn = dna[0]
counts[prevn] += 1

# count the rest, along with the pairs made with each previous one
for nucl in dna[1:]:
    counts[nucl] += 1
    counts[prevn + nucl] += 1
    n_nucl += 1
    prevn = nucl

total = 0.0
for nucl in nucleotides:
    pct = counts[nucl] / float(n_nucl)
    total += pct
    print("{} : {} {}%".format(nucl, counts[nucl], pct))
print("Total : {}%".format(total))

total = 0.0
for dnucl in dinucleotides:
    pct = counts[dnucl] / float(n_nucl - 1)
    total += pct
    print("{} : {} {}%".format(dnucl, counts[dnucl], pct))
print("Total : {}%".format(total))

This approach scans the string only once, although it is admittedly more code...
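The single-pass idea can also be packaged as a function that returns the same kind of dictionary the question builds. This is only a sketch: the name single_pass_frequencies is hypothetical, and it assumes the sequence has at least two bases and contains only uppercase A, T, G and C (anything else, such as N, would raise a KeyError).

```python
def single_pass_frequencies(dna):
    """nt and dnt frequencies from one pass over dna (hypothetical helper).

    Assumes len(dna) >= 2 and only the characters A, T, G, C.
    """
    nucleotides = ["A", "T", "G", "C"]
    counts = {x: 0 for x in nucleotides}
    counts.update({x + y: 0 for x in nucleotides for y in nucleotides})

    # The first base has no predecessor, so it only adds to the nt counts.
    counts[dna[0]] += 1
    # Pair each later base with its predecessor so overlapping pairs are counted.
    for prev, nucl in zip(dna, dna[1:]):
        counts[nucl] += 1
        counts[prev + nucl] += 1

    n = len(dna)
    # Single bases are divided by n, pairs by the n - 1 possible positions.
    return {key: value / float(n if len(key) == 1 else n - 1)
            for key, value in counts.items()}

freq = single_pass_frequencies("ATGCAT")
print(freq["AT"])  # 0.4
```

Because every one of the 20 keys is initialised to zero, the result always has the same shape, which can be convenient when comparing sequences.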
