I'm having trouble removing all the whitespace from a FASTA file. This is my program so far:
import re
for line in f:
    line = line.rstrip(' \n\r')
    if line.startswith(">"):
        seqid = re.search('Segment:[(0-9)]',line).group()
        seqID.append(seqid)
    else:
        numSeq = len(line)
This is what the test file looks like (I only used the first pair to show the seqID):
>gb:CY170782|Organism:Influenza A virus A/Santa Clara/YGA_03044/2013|Segment:1|Subtype:H3N2|Host:Human
ATTATATTCAGTATGGAAAGAATAAAAGAACTACGGAATCTGATGTCGCAGTCTCGCACTCGCGAGATAC
TGACAAAAACCACAGTGGACCATATGGCCATAATTAAGAAGTACACATCGGGGAGACAGGAAAAGAACCC
GTCACTTAGGATGAAATGGATGATGGCAATGAAATATCCAATCACTGCTGACAAAAGGGTAACAGAAATG
>gb:CY171006|Organism:Influenza A virus A/Santa Clara/YGA_03075/2013|Segment:1|Subtype:H3N2|Host:Human
ATTATATTCAGTATGGAAAGAATAAAAGAATTACGGAATCTGATGTCGCAATCTCGCACTCGCGAGATAC
TGACAAAAACCACAGTGGACCATATGGCCATAATTAAGAAGTACACATCGGGGAGACAGGAAAAGAACCC
GTCACTTAGGATGAAATGGATGATGGCAATGAAATACCCAATCACTGCTGACAAAAGAATAACAGAAATG
When I print it, the output looks like this:
ATTATATTCAGTATGGAAAGAATAAAAGAACTACGGAATCTGATGTCGCAGTCTCGCACTCGCGAGATAC 70
TGACAAAAACCACAGTGGACCATATGGCCATAATTAAGAAGTACACATCGGGGAGACAGGAAAAGAACCC 70
GTCACTTAGGATGAAATGGATGATGGCAATGAAATATCCAATCACTGCTGACAAAAGGGTAACAGAAATG 70
0
ATTATATTCAGTATGGAAAGAATAAAAGAATTACGGAATCTGATGTCGCAATCTCGCACTCGCGAGATAC 70
TGACAAAAACCACAGTGGACCATATGGCCATAATTAAGAAGTACACATCGGGGAGACAGGAAAAGAACCC 70
GTCACTTAGGATGAAATGGATGATGGCAATGAAATACCCAATCACTGCTGACAAAAGAATAACAGAAATG 70
0
ATTATATTCAGTATGGAAAGAATAAAAGAACTACGGAATCTGATGTCGCAGTCTCGCACTCGCGAGATAC 70
TGACAAAAACCACAGTGGACCATATGGCCATAATTAAGAAGTACACATCGGGGAGACAGGAAAAGAACCC 70
GTCACTTAGGATGAAATGGATGATGGCAATGAAATATCCAATCACTGCTGACAAAAGGGTAACAGAAATG 70
0
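How can I get it to join the lines of each sequence and drop the lines with 0 nucleotides? Sorry for the poor wording, I'm sleep-deprived. If anything about my question is unclear, feel free to ask.
Here is the full program: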
from __future__ import division
import re
f = open('fastatest.fasta','r')
numGC = 0;
allGC = []; #array that contains all the GC%'s
sequences = []; #The array that contains all the sequences
seqID = []; #The array that contains all seqIds
seqLen = [];
numSeq = 0
GCPercent = 0
#Concatenating the FASTA file
for line in f:
    line = line.rstrip(' \n\r')
    if line.startswith(">"):
        seqid = re.search('Segment:[(0-9)]',line).group()
        seqID.append(seqid)
    else: #Find the Length and GC%
        numSeq = len(line)
        #print seqid, numSeq
        GCPercent = (( line.count('G') + line.count('C') ) / (numSeq)*100)
        allGC.append(GCPercent);
        sequences.append(line)
        seqLen.append(numSeq)
        print "%s\t%d\t%.2f" % (seqid,numSeq,GCPercent)
The output I get:
Segment:1 70 40.00
Segment:1 70 44.29
Segment:1 70 38.57
Traceback (most recent call last):
  File "blah", line 20, in <module>
    GCPercent = (( line.count('G') + line.count('C') ) / (numSeq)*100)
ZeroDivisionError: division by zero
Answer 0 (score: 2)
Try using Biopython:
from Bio import SeqIO
for record in SeqIO.parse("fasta.fas","fasta"):
    print record.id
    print record.seq
This should remove all the newlines...
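If installing Biopython is not an option, the same idea (one joined sequence per header) can be sketched with the standard library alone. The following is only an illustrative sketch, not Biopython's implementation: the function name read_fasta and the sample data are made up for this example, and it assumes every record starts with a ">" line.

```python
from itertools import groupby


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines.

    Hypothetical helper for illustration; not part of Biopython.
    """
    # Strip surrounding whitespace and drop blank lines entirely.
    stripped = (line.strip() for line in lines)
    non_blank = (line for line in stripped if line)
    header = None
    # groupby yields alternating runs of header lines and sequence lines.
    for is_header, group in groupby(non_blank, key=lambda l: l.startswith(">")):
        if is_header:
            # If several headers appear in a row, only the last one is kept.
            for line in group:
                header = line[1:]
        elif header is not None:
            # Join all lines of this run into a single sequence string.
            yield header, "".join(group)


# Made-up sample data with blank lines between records.
sample = [">seq1|Segment:1\n", "ATGC\n", "GGAT\n", "\n",
          ">seq2|Segment:2\n", "TTTT\n", "\n"]
records = list(read_fasta(sample))
print(records)
```

Passing an open file object instead of the sample list should work the same way, since a file iterates line by line.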
Answer 1 (score: 1)
Maybe a conditional append would work?
if not seqid.strip().startswith('0'):
    seqID.append(seqid)
If not, it would help to know what seqid looks like.
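To see what seqid actually contains, the regex from the question can be run against the first header of the test file. This is a small sketch; the alternative pattern with \d+ and the name segment_number are suggestions for illustration, not something from the original post.

```python
import re

header = (">gb:CY170782|Organism:Influenza A virus A/Santa Clara/YGA_03044/2013"
          "|Segment:1|Subtype:H3N2|Host:Human")

# Pattern from the question: the character class [(0-9)] matches exactly ONE
# character that is either a digit, '(' or ')'.
seqid = re.search('Segment:[(0-9)]', header).group()
print(seqid)  # Segment:1

# seqid begins with "Segment", so a startswith('0') check is never true here
# and cannot be used to filter out the empty lines.
print(seqid.strip().startswith('0'))  # False

# Suggested alternative (an assumption): capture one or more digits.
match = re.search(r'Segment:(\d+)', header)
segment_number = match.group(1) if match else None
print(segment_number)  # 1
```

Since seqid comes from the header line, the zero-length lines have to be detected on the sequence lines themselves.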
Answer 2 (score: 0)
When the length of the line is 0, you can skip straight to the next iteration of the loop:
numSeq = len(line) # from your code for reference
if not numSeq:
    continue
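As a self-contained illustration of that guard, here is a short sketch on made-up lines. It uses strip() rather than rstrip(' \n\r') so that lines containing only whitespace are skipped as well; note that it still computes GC% per line, not per joined sequence.

```python
# Made-up input: sequence lines mixed with empty and whitespace-only lines.
lines = ["ATGC\n", "\n", "GGCC\n", "  \n", "ATAT\n"]

gc_values = []
for line in lines:
    line = line.strip()
    if not line:
        # Empty line: skip it, so len(line) can never be 0 below.
        continue
    gc_values.append(100.0 * (line.count("G") + line.count("C")) / len(line))

print(gc_values)  # [50.0, 100.0, 0.0]
```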
Answer 3 (score: 0)
Given that the file has an empty line after each sequence (including after the last one!), this should work:
if line.startswith(">"):
    seqid = re.search('Segment:[(0-9)]',line).group()
    seqID.append(seqid)
    sequence = ""
elif len(line.strip()):
    sequence += line.strip() # three lines will make a sequence
else: #Find the Length and GC%
    numSeq = len(sequence)
    #print seqid, numSeq
    GCPercent = (( sequence.count('G') + sequence.count('C') ) / (numSeq)*100)
    allGC.append(GCPercent);
    sequences.append(sequence)
    seqLen.append(numSeq)
    print "%s\t%d\t%.2f" % (seqid,numSeq,GCPercent)
I just added three lines and replaced line with sequence in four places. It looks like a minimal-change solution, but I haven't tested it.
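The approach above depends on a blank line closing every record. A variant that does not need that assumption finishes a record whenever the next header appears and once more at the end of the input. The sketch below is one possible way to write it; the function names, the \d+ pattern and the sample data are assumptions made for illustration.

```python
import re


def gc_percent(sequence):
    """Percentage of G and C characters in a non-empty sequence."""
    return 100.0 * (sequence.count("G") + sequence.count("C")) / len(sequence)


def summarize_fasta(lines):
    """Return a list of (seqid, length, gc_percent), one entry per record."""
    results = []

    def finish(seqid, chunks):
        # Join the collected lines; records without nucleotides are skipped,
        # so the division in gc_percent never sees a zero length.
        sequence = "".join(chunks)
        if sequence:
            results.append((seqid, len(sequence), gc_percent(sequence)))

    seqid, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            finish(seqid, chunks)  # close the previous record, if any
            match = re.search(r"Segment:\d+", line)
            # Fall back to the whole header if no Segment field is present.
            seqid = match.group() if match else line[1:]
            chunks = []
        elif line:
            chunks.append(line)  # blank lines are simply ignored
    finish(seqid, chunks)  # close the last record at end of input
    return results


# Made-up sample: the last record has no trailing blank line or newline.
sample = [">gb:X|Segment:1|Host:Human\n", "ATGC\n", "GGCC\n", "\n",
          ">gb:Y|Segment:2|Host:Human\n", "ATAT\n", "ATGC"]
summary = summarize_fasta(sample)
for seqid, length, gc in summary:
    print("%s\t%d\t%.2f" % (seqid, length, gc))
```

With this layout each header yields one output row for the whole joined sequence rather than one row per 70-character line.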
Answer 4 (score: 0)
You can ignore empty lines by checking for them:
from __future__ import division
import re
numGC = 0;
allGC = []; #array that contains all the GC%'s
sequences = []; #The array that contains all the sequences
seqID = []; #The array that contains all seqIds
seqLen = [];
numSeq = 0
GCPercent = 0
with open('fastatest.fasta', 'r') as f:
    #Concatenating the FASTA file
    for line in f:
        line = line.rstrip(' \n\r')
        if line: # non-empty line?
            if line.startswith(">"):
                seqid = re.search('Segment:[(0-9)]',line).group()
                seqID.append(seqid)
            else: #Find the Length and GC%
                numSeq = len(line)
                #print seqid, numSeq
                GCPercent = ((line.count('G') + line.count('C')) /
                             (numSeq)*100)
                allGC.append(GCPercent);
                sequences.append(line)
                seqLen.append(numSeq)
                print "%s\t%d\t%.2f" % (seqid,numSeq,GCPercent)
Output:
Segment:1 70 40.00
Segment:1 70 44.29
Segment:1 70 38.57
Segment:1 70 37.14
Segment:1 70 44.29
Segment:1 70 37.14