Formatting FASTA sequences with regular expressions

Time: 2011-05-03 03:40:47

Tags: python regex

Bioinformaticians!

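I am trying to write a Python script that takes a file containing sequences that are not in FASTA format, converts each of them to FASTA format, and then writes them all to a single file containing all of the formatted sequences.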

For example, here are two sequences that are not FASTA-formatted...

Unformatted 1

1 tcacatctct acgtactgaa tttaaaggct ttttgtcttt ttctcgtttc tttgcttttc 
61 aatgatgttc aagcgtaacc tcggaaaatg tgtacaaact tgagtacaaa tcgccatatt 

Unformatted 2

1 tcaggagaat gcagatgaca gcagtagcgc accaagtaac cccttttcta acgtcttacg 
61 aagttatggc tcgttaccac attagctata cgacgctctg gcgaagaata aaagatggca

and I would like to convert them to:

>seq1
TCACATCTCTACGTACTGAATTTAAAGGCTTTTTGTCTTTTTCTCGTTTCTTTGCTTTTC
AATGATGTTCAAGCGTAACCTCGGAAAATGTGTACAAACTTGAGTACAAATCGCCATATT
TACCGTTTTTAGCCAAATTCCATGACACAAACCTAGCTGTAGGCCTTGTTCCTACTGGGT
TTTAGCCAAAACTTGCCTATATTTTTTATGCCAAAAATCGAGAAATGATGGTAAGACGTT
CGCGATTATCTCTAATTGTTTGCCGGTTGAGTTGGTTACCGGTTGCTTTCTTGCTGTCC

>seq2
TCAGGAGAATGCAGATGACAGCAGTAGCGCACCAAGTAACCCCTTTTCTAACGTCTTACG
AAGTTATGGCTCGTTACCACATTAGCTATACGACGCTCTGGCGAAGAATAAAAGATGGCA
GCTTGCCGCAACCTCGTATCAACCGAAATACACGAAACAAGCTGTGGCACATTGAAGACT
TGGAGGAGTATGAGAAGAATTAGGAATAGATAGCGTAGCTTAGTTTTTCTGTTGGAGCTT
GGACTAACGCTTTGAAACGCCGGCTTGTGCCAACAATATAGTTAATATGTACACCAACTT
AGGCTAAGATAGCAGCATGGATTTTTTATTGATTGGATGGATAGGTAAGTGACGACTCCT
CAAGAACGGACAACAGGTATTACAAATGCGTCGATAAAAA

So far, I have this:

import re


def cleanandFormat(filename, seqName, seq):
    """
    Writes a sequence from an irregular format to a file, cleaning it and
    formatting it into the standard FASTA form.
    inputs:
        filename - string of a filename
        seqName - string of sequence description
        seq - string of the sequence
    output: clean and standard-formatted data to a file.
    """
    # sets the block length: the max number of characters in a line
    blockLength = 60
    with open(filename, 'w') as fh:
        # write out the header and sequence name
        fh.write('>' + seqName + '\n')
        for i in range(0, len(seq), blockLength):
            fh.write(seq[i:i+blockLength].upper() + '\n')


# defines the pattern as any digit or any whitespace character
pattern = r'\d|\s'
# matches of the pattern are replaced with an empty string
replace = ''

seq = ''
filename = 'seqCleanup2.txt'
with open(filename) as fh:
    for line in fh:
        seq += re.sub(pattern, replace, line)
# write the output once, after every line has been read
cleanandFormat('testfasta.txt', 'seqX', seq)

1 answer:

Answer 0: (score: 0)

I don't know Python, but here it is in Ruby (untested). Just download Ruby, save this in a file, and run it:

count = 0
while line = gets
  if(line =~ /^1\s[a-z\s]+$/)
    count += 1
    puts
    puts ">seq#{count}"
  end
  if(line =~ /^\d+\s([a-z\s]+)$/)
    puts $1.gsub(/\s/, "").upcase
  end
end
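
For readers who want to stay in Python, the same idea can be sketched roughly as follows. This is only an illustration under a few assumptions: a line whose leading number is 1 starts a new sequence, records are named seq1, seq2, ... in order of appearance, and lines that do not start with a number (labels, blank lines) are skipped. The function name `to_fasta` and its `width` parameter are made up for this example and do not come from the question.

```python
import re


def to_fasta(lines, width=60):
    """Convert numbered sequence lines into FASTA text.

    Assumption: a line whose leading number is 1 starts a new record.
    """
    records = []  # one list of cleaned chunks per sequence
    for line in lines:
        # a leading position number, then letters and whitespace only
        match = re.match(r'^\s*(\d+)\s+([a-zA-Z\s]+)$', line)
        if not match:
            continue  # skip labels, blank lines, anything unexpected
        if match.group(1) == '1' or not records:
            records.append([])
        # drop all whitespace inside the line and upper-case the bases
        records[-1].append(re.sub(r'\s', '', match.group(2)).upper())

    out = []
    for n, chunks in enumerate(records, start=1):
        seq = ''.join(chunks)
        out.append('>seq%d' % n)
        # re-wrap the joined sequence at `width` characters per line
        out.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return '\n'.join(out) + '\n' if out else ''
```

To use it with the file names from the question, one could read `seqCleanup2.txt` with `open(...)`, pass the file object to `to_fasta`, and write the returned string to `testfasta.txt`. Unlike the Ruby version, this sketch joins each sequence before re-wrapping it, so the output lines are 60 characters wide even if the input lines are not.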