Merge multiple lines into a single line in a file while skipping the header

Time: 2017-04-27 09:32:02

Tags: linux awk sed

I have several thousand files in one folder. The content of each file looks like the example below. The file name in this example is AAB08704.1.fasta:

   >gi|1117824|gb|AAB08704.1| ecdysteroid regulated 16 kDa [Manduca sexta]
   MLFYITVTVLLVSAQAKFYTDCGSKLATVQSVGVSGWPENARECVLKRNSNVTISIDFSPTTDVSAITTE
   VHGVIMSLPVPFPCRSPDACKDNGLTCPIKAGVVANYKTTLPVLKSYPKVSVDVKWELKKDEEDLVCILI
   PARIH

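I want to skip the first line and merge the remaining lines into a single line. All of my files start with ">", which marks the header line; the lines that follow are the sequence data I want to merge into one line.

I tried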

    sed -i '2,$s/\n//g' AAB08704.1.fasta

I even tried the following approach for converting multiline FASTA to single-line FASTA:

   awk '/^>/ {printf("\n%s\n",$0);next; } { printf("%s",$0);}  END {printf("\n");}' < AAB08704.1.fasta 

Neither of these commands did what I expected. Any clues?

Expected output:

   >gi|1117824|gb|AAB08704.1| ecdysteroid regulated 16 kDa [Manduca sexta]
   MLFYITVTVLLVSAQAKFYTDCGSKLATVQSVGVSGWPENARECVLKRNSNVTISIDFSPTTDVSAITTEVHGVIMSLPVPFPCRSPDACKDNGLTCPIKAGVVANYKTTLPVLKSYPKVSVDVKWELKKDEEDLVCILIPARIH

cat -A AAB08704.1.fasta gives this:

  M-oM-;M-?>gi|1117824|gb|AAB08704.1| ecdysteroid regulated 16 kDa [Manduca sexta]^M$
  MLFYITVTVLLVSAQAKFYTDCGSKLATVQSVGVSGWPENARECVLKRNSNVTISIDFSPTTDVSAITTE^M$
  VHGVIMSLPVPFPCRSPDACKDNGLTCPIKAGVVANYKTTLPVLKSYPKVSVDVKWELKKDEEDLVCILI^M$

4 answers:

Answer 0: (score: 1)

Using perl:

$ perl -pe 's/\n// if $. > 1 && !eof' AAB08704.1.fasta 
>gi|1117824|gb|AAB08704.1| ecdysteroid regulated 16 kDa [Manduca sexta]
MLFYITVTVLLVSAQAKFYTDCGSKLATVQSVGVSGWPENARECVLKRNSNVTISIDFSPTTDVSAITTEVHGVIMSLPVPFPCRSPDACKDNGLTCPIKAGVVANYKTTLPVLKSYPKVSVDVKWELKKDEEDLVCILIPARIH
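  • s/\n// removes the newline character
    • if $. > 1 && !eof applies it only when the line number is greater than 1 and the line is not the last one in the file
  • Use perl -i -pe for in-place editing. See Command Switches for the documentation of -i, -p and -e.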
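
The cat -A output in the question shows a UTF-8 byte order mark (M-oM-;M-?) before ">" and a carriage return (^M) at the end of every line. s/\n// removes only the line feed, so the carriage returns would stay embedded in the joined sequence. One possible way to normalise a file first is sketched below; the file names sample_crlf.fasta and sample_clean.fasta are hypothetical and the sample content is made up for illustration.

```shell
# Sample input with a UTF-8 BOM and CRLF line endings (hypothetical file name).
printf '\357\273\277>hdr demo\r\nAAA\r\nBBB\r\n' > sample_crlf.fasta

# The three BOM bytes (EF BB BF), which cat -A displays as M-oM-;M-?
bom=$(printf '\357\273\277')

# tr deletes every carriage return; sed removes a BOM at the start of line 1 only.
tr -d '\r' < sample_crlf.fasta | sed "1s/^$bom//" > sample_clean.fasta

cat sample_clean.fasta
```

The cleaned file starts directly with ">" and has plain LF line endings, so it can be passed to any of the joining commands in this thread.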

Answer 1: (score: 0)

Like this? With GNU awk:

$ awk '{p=p $0 (FNR==1?ORS:"")}ENDFILE{print p;p=""}' file file
>gi|1117824|gb|AAB08704.1| ecdysteroid regulated 16 kDa [Manduca sexta]
MLFYITVTVLLVSAQAKFYTDCGSKLATVQSVGVSGWPENARECVLKRNSNVTISIDFSPTTDVSAITTEVHGVIMSLPVPFPCRSPDACKDNGLTCPIKAGVVANYKTTLPVLKSYPKVSVDVKWELKKDEEDLVCILIPARIH
>gi|1117824|gb|AAB08704.1| ecdysteroid regulated 16 kDa [Manduca sexta]
MLFYITVTVLLVSAQAKFYTDCGSKLATVQSVGVSGWPENARECVLKRNSNVTISIDFSPTTDVSAITTEVHGVIMSLPVPFPCRSPDACKDNGLTCPIKAGVVANYKTTLPVLKSYPKVSVDVKWELKKDEEDLVCILIPARIH

This one also removes all characters before > on the first record:

$ awk 'FNR==1{sub(/^[^>]*/,"");p=$0 ORS;next}{p=p $0}ENDFILE{print p;p=""}' file file
>gi|1117824|gb|AAB08704.1| ecdysteroid regulated 16 kDa [Manduca sexta]
MLFYITVTVLLVSAQAKFYTDCGSKLATVQSVGVSGWPENARECVLKRNSNVTISIDFSPTTDVSAITTEVHGVIMSLPVPFPCRSPDACKDNGLTCPIKAGVVANYKTTLPVLKSYPKVSVDVKWELKKDEEDLVCILIPARIH
>gi|1117824|gb|AAB08704.1| ecdysteroid regulated 16 kDa [Manduca sexta]
MLFYITVTVLLVSAQAKFYTDCGSKLATVQSVGVSGWPENARECVLKRNSNVTISIDFSPTTDVSAITTEVHGVIMSLPVPFPCRSPDACKDNGLTCPIKAGVVANYKTTLPVLKSYPKVSVDVKWELKKDEEDLVCILIPARIH
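
ENDFILE is a GNU awk extension. For other awk implementations, a rough equivalent can be sketched with FNR and a flag; the variable name pending and the demo file names are assumptions made for this illustration, not part of the answer above.

```shell
# Two small demo files (hypothetical names and content).
printf '>seq1 demo\nAAA\nBBB\n' > demo1.fasta
printf '>seq2 demo\nCCC\nDD\n' > demo2.fasta

result=$(awk '
    FNR == 1 {                         # first line of each input file: the header
        if (pending) print ""          # finish the previous file sequence line
        pending = 0
        print                          # header is printed unchanged
        next
    }
    { printf "%s", $0; pending = 1 }   # sequence line: append without a newline
    END { if (pending) print "" }      # finish the last sequence line
' demo1.fasta demo2.fasta)

printf '%s\n' "$result"
```

Each input file contributes one header line followed by one joined sequence line, all written to standard output.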

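Answer 2: (score: 0)

sed is line-oriented, so the lines have to be collected in a buffer first instead of trying to delete \n line by line (note that 1d deletes the header line):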

sed -i -n -e '1d' -e 'H;${x;s/\n//g;p}' AAB08704.1.fasta
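
To illustrate the point about sed being line-oriented: the pattern space never contains the trailing newline of the current line, which is why the substitution tried in the question changes nothing. The second command is one possible variant that keeps the header and joins the rest with an N loop. It is a sketch on made-up sample data, not the answerer's command.

```shell
# sed passes each line to the script without its trailing newline,
# so this substitution never matches and the text comes out unchanged.
unchanged=$(printf '>hdr\nAAA\nBBB\n' | sed '2,$s/\n//g')
printf '%s\n' "$unchanged"

# Line 1 branches to the end of the script and is printed as is.
# From line 2 on, N appends the next line until the last line is reached,
# then all embedded newlines are removed.
joined=$(printf '>hdr\nAAA\nBBB\nCC\n' |
    sed -e '1b' -e ':a' -e '$!{' -e 'N' -e 'ba' -e '}' -e 's/\n//g')
printf '%s\n' "$joined"
```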

awk's behaviour can be adapted for this as well:

awk 'BEGIN{FS="\n";RS="()";OFS=""}{$1="";$0=$0 ""}' AAB08704.1.fasta > tmp && mv tmp AAB08704.1.fasta

# or 
awk '!a++{next}{printf( "%s", $0) > (FILENAME ".tmp")}' AAB08704.1.fasta && mv AAB08704.1.fasta.tmp AAB08704.1.fasta
# or
awk 'NR>1{printf("%s",$0)}' AAB08704.1.fasta > tmp && mv tmp AAB08704.1.fasta
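
Since the question mentions several thousand files, the "write to a temporary file, then mv" pattern can be wrapped in a loop. The sketch below keeps the header line, unlike the commands above; the directory name fasta_demo and the .tmp suffix are assumptions for illustration.

```shell
# Demo directory with two small files (hypothetical names and content).
mkdir -p fasta_demo
printf '>seq1 demo\nAAA\nBBB\n' > fasta_demo/one.fasta
printf '>seq2 demo\nCCC\nDD\nE\n' > fasta_demo/two.fasta

for f in fasta_demo/*.fasta; do
    # Print line 1 unchanged, append all later lines without newlines,
    # and end the joined line once at the end of the file.
    awk 'NR == 1 { print; next }
         { printf "%s", $0 }
         END { if (NR > 1) print "" }' "$f" > "$f.tmp" &&
        mv "$f.tmp" "$f"       # replace the original only if awk succeeded
done

cat fasta_demo/one.fasta fasta_demo/two.fasta
```

The && ensures a file is overwritten only when awk exits successfully, so a failure leaves the original untouched.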

Answer 3: (score: 0)

This also works:

awk 'BEGIN{ ORS = "" }/^>/{ print $0, "\n"}NR>1{ print $0 }' file

Output:

>gi|1117824|gb|AAB08704.1| ecdysteroid regulated 16 kDa [Manduca sexta] 
MLFYITVTVLLVSAQAKFYTDCGSKLATVQSVGVSGWPENARECVLKRNSNVTISIDFSPTTDVSAITTEVHGVIMSLPVPFPCRSPDACKDNGLTCPIKAGVVANYKTTLPVLKSYPKVSVDVKWELKKDEEDLVCILIPARIH
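
After converting many files with any of the approaches above, it may be worth checking that each one really ended up as a header plus a single sequence line. A minimal sketch, assuming a hypothetical directory check_demo and made-up sample files:

```shell
# One already-joined file and one that still has a multiline sequence.
mkdir -p check_demo
printf '>ok\nAAABBB\n' > check_demo/good.fasta
printf '>bad\nAAA\nBBB\n' > check_demo/bad.fasta

# Collect the name of every file that does not consist of exactly two lines.
# awk counts a final line even if it lacks a trailing newline.
unjoined=$(for f in check_demo/*.fasta; do
    awk 'END { exit (NR != 2) }' "$f" || printf '%s\n' "$f"
done)

printf '%s\n' "$unjoined"
```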