合并contigs.fa生成的两行

时间:2012-12-21 10:43:00

标签: python linux sed awk fasta

我有汇编程序生成的文件。它看起来像是在跟随。

>NODE_1_length_211_cov_22.379147
CATTTGCTGAAGAAAAATTACGAGAAATGGAGCACAAGGCTGTTTTTGTGAATGTCAAAC
CAAGTGACAACTCTATAGCGTTTGTATAAGACTCTCATACTAATCCCAAGCAAACTCTAT
ACTGACGCATGAACATGGAAGAGAAATGCTGCTCGTGTATGTATTATGGACCAGCTTGGA
ACACCATGTTAGGACTTTATAGATGTCTTACGATTTTTTCGACGTGATGAAGAAGTCTAT
TCAGCATTTGA
>NODE_2_length_85_cov_19.094118
TACTCCTGAGCACTTTGTGCTCTTAGTTCTTACTAGAACTGTTACAGCTCCACGAACTTG
TCGACTCTTTGAGTCAATTTCTGTTAGTTCCTACGAACTAAGAGGCTCTCTGAGCCCAGT
CTTCC

我想使用python或linux sed命令合并这些行,并希望以这种方式获得结果。

>NODE_1_length_211_cov_22.379147
CATTTGCTGAAGAAAAATTACGAGAAATGGAGCACAAGGCTGTTTTTGTGAATGTCAAACCAAGTGACAACTCTATAGCGTTTGTATAAGACTCTCATACTAATCCCAAGCAAACTCTATACTGACGCATGAACATGGAAGAGAAATGCTGCTCGTGTATGTATTATGGACCAGCTTGGAACACCATGTTAGGACTTTATAGATGTCTTACGATTTTTTCGACGTGATGAAGAAGTCTATTCAGCATTTGA
>NODE_2_length_85_cov_19.094118
TACTCCTGAGCACTTTGTGCTCTTAGTTCTTACTAGAACTGTTACAGCTCCACGAACTTGTCGACTCTTTGAGTCAATTTCTGTTAGTTCCTACGAACTAAGAGGCTCTCTGAGCCCAGTCTTCC

就像每个seqeunce都认为是单行,节点名称视为其他行。

5 个答案:

答案 0 :(得分:2)

trsed的小管道会执行此操作:

$ tr -d '\n' < contigser.fa | sed 's/\(>[^.]\+\.[0-9]\+\)/\n\1\n/g' > newfile.fa 

python

file = open('contigser.fa','r+')
lines= file.read().splitlines()

file.seek(0)
file.truncate()

for line in lines:
    if line.startswith('>'):
        file.write('\n'+line+'\n')
    else:
        file.write(line)

注意:python解决方案将更改存储回contigser.fa

答案 1 :(得分:1)

您可以使用awk来完成这项工作:

awk < input_file '/^>/ {print ""; print; next} {printf "%s", $0} END {print ""}'

这只会启动一个进程(awk)。唯一的缺点:它增加了一个空的第一行。你可以通过添加一个状态变量来避免这种情况(代码属于一行,只是为了使其更易读):

awk < input_file '/^>/ { if (flag) print ""; print; flag=0; next }
    { printf "%s", $0; flag=1 } END { if (flag) print "" }'

@how将其存储在新文件中:

awk < input_file > output_file '/^>/ { .... }'

答案 2 :(得分:0)

$ awk '/^>/{printf "%s%s\n",(NR>1?ORS:""),$0; next} {printf "%s",$0} END{print ""}' file
>NODE_1_length_211_cov_22.379147
CATTTGCTGAAGAAAAATTACGAGAAATGGAGCACAAGGCTGTTTTTGTGAATGTCAAACCAAGTGACAACTCTATAGCGTTTGTATAAGACTCTCATACTAATCCCAAGCAAACTCTATACTGACGCATGAACATGGAAGAGAAATGCTGCTCGTGTATGTATTATGGACCAGCTTGGAACACCATGTTAGGACTTTATAGATGTCTTACGATTTTTTCGACGTGATGAAGAAGTCTATTCAGCATTTGA
>NODE_2_length_85_cov_19.094118
TACTCCTGAGCACTTTGTGCTCTTAGTTCTTACTAGAACTGTTACAGCTCCACGAACTTGTCGACTCTTTGAGTCAATTTCTGTTAGTTCCTACGAACTAAGAGGCTCTCTGAGCCCAGTCTTCC

答案 3 :(得分:0)

$ awk 'NR==1;ORS="";{sub(/>.*$/,"\n&\n");print (NR>1)?$0:""}END{print"\n"}' file
>NODE_1_length_211_cov_22.379147
CATTTGCTGAAGAAAAATTACGAGAAATGGAGCACAAGGCTGTTTTTGTGAATGTCAAACCAAGTGACAACTCTATAGCGTTTGTATAAGACTCTCATACTAATCCCAAGCAAACTCTATACTGACGCATGAACATGGAAGAGAAATGCTGCTCGTGTATGTATTATGGACCAGCTTGGAACACCATGTTAGGACTTTATAGATGTCTTACGATTTTTTCGACGTGATGAAGAAGTCTATTCAGCATTTGA
>NODE_2_length_85_cov_19.094118
TACTCCTGAGCACTTTGTGCTCTTAGTTCTTACTAGAACTGTTACAGCTCCACGAACTTGTCGACTCTTTGAGTCAATTTCTGTTAGTTCCTACGAACTAAGAGGCTCTCTGAGCCCAGTCTTCC

答案 4 :(得分:0)

这可能适合你(GNU sed):

sed '/^>/n;:a;$!N;s/\n\([^>]\)/\1/;ta;P;D' file

在以>开头的行之后,删除除>符号之外的任何字符之前的所有换行符。