我想从fasta文件中计算kmers。我有以下脚本:
import operator
seq = open('file', 'r')
kmers = {}
k = 5
for i in range(len(seq) - k + 1):
kmer = seq[i:i+k]
if kmer in kmers:
kmers[kmer] += 1
else:
kmers[kmer] = 1
for kmer, count in kmers.items():
print (kmer + "\t" + str(count))
sortedKmer = sorted(kmers.items(), key=itemgetter(1), reverse=True)
for item in sortedKmer:
print (item[0] + "\t" + str(item[1]))
这适用于只有一个序列的文件,但现在我有一个带有几个重叠群的fasta文件。 我的fasta文件看起来像这样:
>1
GTCTTCCGGCGAGCGGGCTTTTCACCCGCTTTATCGTTACTTATGTCAGCATTCGCACTT
CTGATACCTCCAGCAACCCTCACAGGCCACCTTCGCAGGCTTACAGAACGCTCCCCTACC
CAACAACGCATAAACGTCGCTGCCGCAGCTTCGGTGCATGGTTTAGCCCCGTTACATCTT
CCGCGCAGGCCGACTCGACCAGTGAGCTATTACGCTTTCTTTAAATGATGGCTGCTTCTA
AGCCAACATCCTGGCTGTCTGG
>2
AAAGAAAGCGTAATAGCTCACTGGTCGAGTCGGCCTGCGCGGAAGATGTAACGGGGCTAA
ACCATGCACCGAAGCTGCGGCAGCGACACTCAGGTGTTGTTGGGTAGGGGAGCGTTCTGT
AAGCCTGTGAAGGTGGCCTGTGAGGGTTGCTGGAGGTATCAGAAGTGCGAATGCTGACAT
AAGTAACGATAAAGCGGGTGAAAAGCCCGCTCGCCGGAAGACCAAGGGTTCCTGTCCAAC
GTTAATCGGGGCAGG
如何在“> 1”之后更改首先执行序列的脚本,打印输出,转到“> 2”,打印输出等?
答案 0 :(得分:0)
我从未听说过kmer或fasta,但我想我明白你要做什么。
您可以尝试拆分涉及'>'的正则表达式,但我建议逐行处理文件并累积kmers,然后在到达'> 1'行时正确打印它们。请参阅以下带注释的代码
import operator
def printSeq(name, seq):
# Extract your code into a function and print header for current kmer
print("%s\n################################" %name)
kmers = {}
k = 5
for i in range(len(seq) - k + 1):
kmer = seq[i:i+k]
if kmer in kmers:
kmers[kmer] += 1
else:
kmers[kmer] = 1
for kmer, count in kmers.items():
print (kmer + "\t" + str(count))
sortedKmer = sorted(kmers.items(), reverse=True)
for item in sortedKmer:
print (item[0] + "\t" + str(item[1]))
with open('file', 'r') as f:
seq = ""
key = ""
for line in f.readlines():
# Loop over lines in file
if line.startswith(">"):
# if we get '>' it is time for a new sequence
if key and seq:
# if it wasn't the first we should print it before overwriting the variables
printSeq(key, seq)
# store name after '>' and reset sequence
key = line[1:].strip()
seq = ""
else:
# accumulate kmer until we hit another '>'
seq += line.strip()
# when we are done with all the lines, print the last sequence
printSeq(key, seq)
答案 1 :(得分:0)
我尝试使用您的示例FASTA文件,它应该可以工作:
def count_kmers(seq, k, kmers):
for i in range(len(seq) - k + 1):
kmr = seq[i:i + k]
if kmr in kmers:
kmers[kmr] += 1
else:
kmers[kmr] = 1
filename = raw_input('File name/path: ')
k = input('Value for k: ')
kmers = {}
# Put each line of the file into a list (avoid empty lines)
with open(filename) as f:
lines = [l.strip() for l in f.readlines() if l.strip() != '']
# Find the line indices where a new sequence starts
idx = [i for (i, l) in enumerate(lines) if l[0] == '>']
idx += [len(lines)]
for i in xrange(len(idx) - 1):
start = idx[i] + 1
stop = idx[i + 1]
sequence = ''.join(lines[start:stop])
count_kmers(sequence, k, kmers)
print kmers
希望有所帮助:)