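I am trying to use Biopython to extract information (e.g. A/C/G/T counts, CG%) from a multi-FASTA file. Whenever I try to iterate over the sequences in the file I run into trouble: only the first record gets printed.
I suspect this has to do with my file's format, since it is not a real FASTA file, but I don't know how to change it.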
from Bio import SeqIO

input_file = open("inputfile.fa", 'r')
output_file = open('nucleotide_counts.txt', 'w')
output_file.write('Gene\tA\tC\tG\tT\tLength\tCG%\n')
for cur_record in SeqIO.parse(input_file, "fasta"):
    # count nucleotides in this record
    gene_name = cur_record.name
    A_count = cur_record.seq.count('A')
    C_count = cur_record.seq.count('C')
    G_count = cur_record.seq.count('G')
    T_count = cur_record.seq.count('T')
    length = len(cur_record.seq)
    cg_percentage = (float(C_count + G_count) / length) * 100
    output_line = '%s\t%i\t%i\t%i\t%i\t%i\t%f\n' % \
        (gene_name, A_count, C_count, G_count, T_count, length, cg_percentage)
    output_file.write(output_line)
output_file.close()
input_file.close()
This is what my multi-FASTA looks like (with start and end specified):
>1:start-end
CGCCCCAGTGATGTAGCCGAA
>1:start-end
CGGCCACCCCGAAGCGTGGGG
My output file contains only one line:
Gene A C G T Length CG%
1:start-end 85 115 180 59 439 67.198178
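One way to test the file-format suspicion before changing any parsing code is to look at the raw bytes. The sketch below is only a diagnostic and does not use Biopython; `inspect_fasta` is a hypothetical helper name, and the path is whatever file you pass in. It reports how many lines start with `>` and which line endings the file contains.

```python
def inspect_fasta(path):
    """Return (header_count, line_ending_counts) for the file at `path`.

    Hypothetical diagnostic helper: it only inspects raw bytes and does
    not parse or validate the sequences themselves.
    """
    with open(path, "rb") as handle:
        data = handle.read()

    # Count each line-ending style separately (CRLF must not be double counted).
    crlf = data.count(b"\r\n")
    endings = {
        "CRLF": crlf,
        "LF": data.count(b"\n") - crlf,
        "CR": data.count(b"\r") - crlf,
    }

    # Normalise every line-ending style to LF, then count lines starting with '>'.
    normalised = data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")
    headers = sum(1 for line in normalised.split(b"\n") if line.startswith(b">"))
    return headers, endings
```

Calling `inspect_fasta("inputfile.fa")` returns the header count and a dictionary of line-ending counts. A header count that differs from the number of records the loop yields would point at the file rather than at the loop.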
Answer 0 (score: 0)
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Created on Thu Dec 31 15:11:53 2020
@author: Pietro
"""
from Bio import SeqIO

# path to the input FASTA file (here a file literally named 'fasta')
input_file = 'fasta'
output_file_name = input_file + 'out'

with open(output_file_name, 'w') as output_file:
    output_file.write('Gene\tA\tC\tG\tT\tLength\tCG%\n')
    for cur_record in SeqIO.parse(input_file, "fasta"):
        # count nucleotides in this record
        gene_name = cur_record.name
        A_count = cur_record.seq.count('A')
        C_count = cur_record.seq.count('C')
        G_count = cur_record.seq.count('G')
        T_count = cur_record.seq.count('T')
        length = len(cur_record.seq)
        cg_percentage = (float(C_count + G_count) / length) * 100
        # one tab-separated row per record; the write must stay inside the loop
        output_line = '%s\t%i\t%i\t%i\t%i\t%i\t%f\n' % (
            gene_name, A_count, C_count, G_count, T_count, length, cg_percentage)
        output_file.write(output_line)
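The per-record counting can also be pulled into a small function, which makes it easy to check on a known sequence without any input file. This is a sketch, not part of the code above: `nucleotide_stats` and `format_row` are hypothetical names. It assumes that lowercase (soft-masked) bases should be counted too, which is why it upper-cases first (`count('A')` on its own is case-sensitive), and it returns 0.0 for an empty sequence instead of dividing by zero.

```python
def nucleotide_stats(sequence):
    """Return (counts, length, cg_percent) for a DNA sequence.

    `sequence` can be a plain string or anything str() turns into one.
    Assumption: lowercase bases count the same as uppercase ones.
    """
    seq = str(sequence).upper()
    counts = {base: seq.count(base) for base in "ACGT"}
    length = len(seq)
    # Guard against empty records so the percentage never divides by zero.
    cg_percent = 100.0 * (counts["C"] + counts["G"]) / length if length else 0.0
    return counts, length, cg_percent


def format_row(name, sequence):
    """Build one tab-separated output row in the same layout as the header."""
    counts, length, cg_percent = nucleotide_stats(sequence)
    return "%s\t%d\t%d\t%d\t%d\t%d\t%f\n" % (
        name, counts["A"], counts["C"], counts["G"], counts["T"],
        length, cg_percent)
```

Inside the `SeqIO.parse` loop this would be used as `output_file.write(format_row(cur_record.name, cur_record.seq))`. For the first sample sequence in the question, `CGCCCCAGTGATGTAGCCGAA`, it gives A=5, C=7, G=6, T=3 and a length of 21.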