我在一个文件夹中有几千个文件。每个文件的内容如下所示。我在这个例子中的文件名是:AAB08704.1.fasta
>gi|1117824|gb|AAB08704.1| ecdysteroid regulated 16 kDa [Manduca sexta]
MLFYITVTVLLVSAQAKFYTDCGSKLATVQSVGVSGWPENARECVLKRNSNVTISIDFSPTTDVSAITTE
VHGVIMSLPVPFPCRSPDACKDNGLTCPIKAGVVANYKTTLPVLKSYPKVSVDVKWELKKDEEDLVCILI
PARIH
我想跳过第一行,然后将剩余的行合并为一行。我的所有文件都以">"开头这是标题信息,以下行是我想要合并为一行的序列信息。
我试过
sed -i '2,$s/\n//g' AAB08704.1.fasta
我甚至尝试使用以下方法将multiline fasta转换为单行fasta:
awk '/^>/ {printf("\n%s\n",$0);next; } { printf("%s",$0);} END {printf("\n");}' < AAB08704.1.fasta
这些命令都没有达到我的预期。任何线索?
预期产出:
>gi|1117824|gb|AAB08704.1| ecdysteroid regulated 16 kDa [Manduca sexta]
MLFYITVTVLLVSAQAKFYTDCGSKLATVQSVGVSGWPENARECVLKRNSNVTISIDFSPTTDVSAITTEVHGVIMSLPVPCRSPDACKDNGLTCPIKAGVVANYKTTLPVLKSYPKVSVDVKWELKKDEEDLVCILIPARIH
cat -A AAB08704.1.fasta给出了这个:
M-oM-;M-?>gi|1117824|gb|AAB08704.1| ecdysteroid regulated 16 kDa [Manduca sexta]^M$
MLFYITVTVLLVSAQAKFYTDCGSKLATVQSVGVSGWPENARECVLKRNSNVTISIDFSPTTDVSAITTE^M$
VHGVIMSLPVPFPCRSPDACKDNGLTCPIKAGVVANYKTTLPVLKSYPKVSVDVKWELKKDEEDLVCILI^M$
答案 0 :(得分:1)
使用perl
$ perl -pe 's/\n// if $. > 1 && !eof' AAB08704.1.fasta
>gi|1117824|gb|AAB08704.1| ecdysteroid regulated 16 kDa [Manduca sexta]
MLFYITVTVLLVSAQAKFYTDCGSKLATVQSVGVSGWPENARECVLKRNSNVTISIDFSPTTDVSAITTEVHGVIMSLPVPFPCRSPDACKDNGLTCPIKAGVVANYKTTLPVLKSYPKVSVDVKWELKKDEEDLVCILIPARIH
s/\n//
删除换行符
if $. > 1 && !eof
仅当行号大于1而不是文件结尾perl -i -pe
进行就地编辑。有关-i
,-p
和-e
答案 1 :(得分:0)
喜欢这个?对于GNU awk:
$ awk '{p=p $0 (FNR==1?ORS:"")}ENDFILE{print p;p=""}' file file
>gi|1117824|gb|AAB08704.1| ecdysteroid regulated 16 kDa [Manduca sexta]
MLFYITVTVLLVSAQAKFYTDCGSKLATVQSVGVSGWPENARECVLKRNSNVTISIDFSPTTDVSAITTEVHGVIMSLPVPFPCRSPDACKDNGLTCPIKAGVVANYKTTLPVLKSYPKVSVDVKWELKKDEEDLVCILIPARIH
>gi|1117824|gb|AAB08704.1| ecdysteroid regulated 16 kDa [Manduca sexta]
MLFYITVTVLLVSAQAKFYTDCGSKLATVQSVGVSGWPENARECVLKRNSNVTISIDFSPTTDVSAITTEVHGVIMSLPVPFPCRSPDACKDNGLTCPIKAGVVANYKTTLPVLKSYPKVSVDVKWELKKDEEDLVCILIPARIH
这个删除第一条记录上>
之前的所有字符:
$ awk 'FNR==1{sub(/^[^>]*/,"");p=$0 ORS;next}{p=p $0}ENDFILE{print p;p=""}' file file
>gi|1117824|gb|AAB08704.1| ecdysteroid regulated 16 kDa [Manduca sexta]
MLFYITVTVLLVSAQAKFYTDCGSKLATVQSVGVSGWPENARECVLKRNSNVTISIDFSPTTDVSAITTEVHGVIMSLPVPFPCRSPDACKDNGLTCPIKAGVVANYKTTLPVLKSYPKVSVDVKWELKKDEEDLVCILIPARIH
>gi|1117824|gb|AAB08704.1| ecdysteroid regulated 16 kDa [Manduca sexta]
MLFYITVTVLLVSAQAKFYTDCGSKLATVQSVGVSGWPENARECVLKRNSNVTISIDFSPTTDVSAITTEVHGVIMSLPVPFPCRSPDACKDNGLTCPIKAGVVANYKTTLPVLKSYPKVSVDVKWELKKDEEDLVCILIPARIH
答案 2 :(得分:0)
sed是面向行的,因此需要在缓冲区中加载行而不是删除\n
sed -i -e '1d' -e 'H;${x;s/\n//g}' AAB08704.1.fasta
awk可以在行为中进行调整
awk 'BEGIN{FS="\n";RS="()";OFS=""}{$1="";$0=$0 ""}' AAB08704.1.fasta > tmp && mv tmp AAB08704.1.fasta
# or
awk '!a++{next}{printf( "%s", $0) > (FILENAME ".tmp")}' AAB08704.1.fasta && mv AAB08704.1.fasta.tmp AAB08704.1.fasta
# or
awk 'NR>1{printf("%s",$0)}' AAB08704.1.fasta > tmp && mv tmp AAB08704.1.fasta
答案 3 :(得分:0)
这也有效:
awk 'BEGIN{ ORS = "" }/^>/{ print $0, "\n"}NR>1{ print $0 }' file
输出:
>gi|1117824|gb|AAB08704.1| ecdysteroid regulated 16 kDa [Manduca sexta]
MLFYITVTVLLVSAQAKFYTDCGSKLATVQSVGVSGWPENARECVLKRNSNVTISIDFSPTTDVSAITTEVHGVIMSLPVPFPCRSPDACKDNGLTCPIKAGVVANYKTTLPVLKSYPKVSVDVKWELKKDEEDLVCILIPARIH