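Bioinformaticians!
I am trying to write a Python script that takes files containing sequences in a non-FASTA format, converts them to FASTA format, and then writes them all to a single file containing all of the formatted sequences.
For example, given two sequences that are not in FASTA format...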
1 tcacatctct acgtactgaa tttaaaggct ttttgtcttt ttctcgtttc tttgcttttc
61 aatgatgttc aagcgtaacc tcggaaaatg tgtacaaact tgagtacaaa tcgccatatt
and
1 tcaggagaat gcagatgaca gcagtagcgc accaagtaac cccttttcta acgtcttacg
61 aagttatggc tcgttaccac attagctata cgacgctctg gcgaagaata aaagatggca
I would like to convert them to:
>seq1
TCACATCTCTACGTACTGAATTTAAAGGCTTTTTGTCTTTTTCTCGTTTCTTTGCTTTTC
AATGATGTTCAAGCGTAACCTCGGAAAATGTGTACAAACTTGAGTACAAATCGCCATATT
TACCGTTTTTAGCCAAATTCCATGACACAAACCTAGCTGTAGGCCTTGTTCCTACTGGGT
TTTAGCCAAAACTTGCCTATATTTTTTATGCCAAAAATCGAGAAATGATGGTAAGACGTT
CGCGATTATCTCTAATTGTTTGCCGGTTGAGTTGGTTACCGGTTGCTTTCTTGCTGTCC
>seq2
TCAGGAGAATGCAGATGACAGCAGTAGCGCACCAAGTAACCCCTTTTCTAACGTCTTACG
AAGTTATGGCTCGTTACCACATTAGCTATACGACGCTCTGGCGAAGAATAAAAGATGGCA
GCTTGCCGCAACCTCGTATCAACCGAAATACACGAAACAAGCTGTGGCACATTGAAGACT
TGGAGGAGTATGAGAAGAATTAGGAATAGATAGCGTAGCTTAGTTTTTCTGTTGGAGCTT
GGACTAACGCTTTGAAACGCCGGCTTGTGCCAACAATATAGTTAATATGTACACCAACTT
AGGCTAAGATAGCAGCATGGATTTTTTATTGATTGGATGGATAGGTAAGTGACGACTCCT
CAAGAACGGACAACAGGTATTACAAATGCGTCGATAAAAA
So far, I have this:
import re

def cleanandFormat(filename, seqName, seq):
    """
    writes out the sequence of an irregular sequence format to a file, while cleaning and formatting it into the standard form
    inputs:
    filename - string of a filename
    seqName - string of sequence description
    seq - string of the sequence
    output: clean and standard-formatted data to a file.
    """
    # sets the block length for the max number of characters in a line
    blockLength = 60
    with open(filename, 'w') as fh:
        # write out the header and sequence name
        fh.write('>' + seqName + '\n')
        for i in range(0, len(seq), blockLength):
            fh.write(seq[i:i+blockLength].upper() + '\n')

# defines the pattern as any digit or any whitespace character
pattern = r'\d|\s'
# this will replace the pattern found in the sequence with an empty string
replace = ''
seq = ''
filename = 'seqCleanup2.txt'
with open(filename) as fh:
    for line in fh:
        seq += re.sub(pattern, replace, line)
cleanandFormat('testfasta.txt', 'seqX', seq)
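Answer 0: (score: 0)
I don't know Python, but here it is in Ruby (untested). Just download Ruby, save this in a file, and run it with your input file as the argument: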
count = 0
while line = gets
  if(line =~ /^1\s[a-z\s]+$/)
    count += 1
    puts
    puts ">seq#{count}"
  end
  if(line =~ /^\d+\s([a-z\s]+)$/)
    puts $1.gsub(/\s/, "").upcase
  end
end
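
Since the question asks for Python, here is a rough sketch of the same idea in Python. It assumes, as the Ruby version does, that every record starts with a line numbered 1 and that anything not matching "number followed by bases" (blank lines, the "and" separator) can be skipped. The names parse_records, to_fasta and convert_file are made up for this illustration; the seqN headers and the 60-character line width come from the question.

```python
import re

# A data line looks like: "61 aatgatgttc aagcgtaacc ..."
LINE_RE = re.compile(r"\s*(\d+)\s+([a-zA-Z\s]+)$")

def parse_records(lines):
    """Group numbered sequence lines into records.

    Assumption: a line whose leading number is 1 starts a new record.
    """
    records = []
    for line in lines:
        match = LINE_RE.match(line)
        if not match:
            continue  # skip blank lines, separators, anything unrecognised
        position, bases = match.groups()
        if position == "1" or not records:
            records.append([])
        # drop the spaces between the 10-base blocks and upper-case the rest
        records[-1].append(re.sub(r"\s", "", bases).upper())
    return ["".join(chunks) for chunks in records]

def to_fasta(sequences, width=60, prefix="seq"):
    """Render sequences as FASTA text with headers >seq1, >seq2, ..."""
    out = []
    for n, seq in enumerate(sequences, start=1):
        out.append(f">{prefix}{n}")
        out.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(out) + "\n" if out else ""

def convert_file(src_path, dst_path):
    """Read the irregular file and write every record to one FASTA file."""
    with open(src_path) as src:
        sequences = parse_records(src)
    with open(dst_path, "w") as dst:
        dst.write(to_fasta(sequences))
```

With the file names from the question this would be called as convert_file('seqCleanup2.txt', 'testfasta.txt'). The difference from the code in the question is that the sequence is not accumulated into one long string; a new record is opened each time the line number resets to 1, so each sequence gets its own header.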