How do I delete x entries that share the same string and keep only one entry with a modified header?

Date: 2014-09-27 04:07:37

Tags: bash perl awk sed fasta

I have a question for all you awk/sed/perl experts. I have come across a file with the following format, for example:

>GALHOMG00000016026_1 GALHOMT00000016026_1 GALHOMP00000016026_1 JH556633.1:35740-45316 1
MPKKKTGARKKAENRREREKQIRASRANIDLAKHPCNASMECDKCQRRQKNRAFCYFCNS
VQKLPICAQCGKTKCMMKSSDCVIKHAGVYSTGLAMVGAICDFCEAWVCHGRKCLSTHAC
TCPLADAECIECERSVWDHGGRIFACSFCHDFLCEDDQFEHQASCQVLEAETFKCVSCNR
LGQHSCLRCKACFCGDHVRSKVFKQEKGKEPPCPKCGHETQQTKDLSMSTRSLKFGRQTG
GEDADGASGYDAYWKNLSSSKPGDAGDREDEYDEYEAEDDDEDDNDEGGKDSDTETTDLF
SNLNLGRTYASGYAHYEEPED

>HUMHOMG00000262990_1 HUMHOMT00000262990_1 HUMHOMP00000262990_1 JH556633.1:35740-45316 1
MPKKKTGARKKAENRREREKQIRASRANIDLAKHPCNASMECDKCQRRQKNRAFCYFCNS
VQKLPICAQCGKTKCMMKSSDCVIKHAGVYSTGLAMVGAICDFCEAWVCHGRKCLSTHAC
TCPLADAECIECERSVWDHGGRIFACSFCHDFLCEDDQFEHQASCQVLEAETFKCVSCNR
LGQHSCLRCKACFCGDHVRSKVFKQEKGKEPPCPKCGHETQQTKDLSMSTRSLKFGRQTG
GEDADGASGYDAYWKNLSSSKPGDAGDREDEYDEYEAEDDDEDDNDEGGKDSDTETTDLF
SNLNLGRTYASGYAHYEEPED

>TGUHOMG00000002432_1 TGUHOMT00000002432_1 TGUHOMP00000002432_1 JH556633.1:35740-45316 1
MPKKKTGARKKAENRREREKQIRASRANIDLAKHPCNASMECDKCQRRQKNRAFCYFCNS
VQKLPICAQCGKTKCMMKSSDCVIKHAGVYSTGLAMVGAICDFCEAWVCHGRKCLSTHAC
TCPLADAECIECERSVWDHGGRIFACSFCHDFLCEDDQFEHQASCQVLEAETFKCVSCNR
LGQHSCLRCKACFCGDHVRSKVFKQEKGKEPPCPKCGHETQQTKDLSMSTRSLKFGRQTG
GEDADGASGYDAYWKNLSSSKPGDAGDREDEYDEYEAEDDDEDDNDEGGKDSDTETTDLF
SNLNLGRTYASGYAHYEEPED

I would like to turn this file into the following:

>JH556633.1:35740-45316
MPKKKTGARKKAENRREREKQIRASRANIDLAKHPCNASMECDKCQRRQKNRAFCYFCNS
VQKLPICAQCGKTKCMMKSSDCVIKHAGVYSTGLAMVGAICDFCEAWVCHGRKCLSTHAC
TCPLADAECIECERSVWDHGGRIFACSFCHDFLCEDDQFEHQASCQVLEAETFKCVSCNR
LGQHSCLRCKACFCGDHVRSKVFKQEKGKEPPCPKCGHETQQTKDLSMSTRSLKFGRQTG
GEDADGASGYDAYWKNLSSSKPGDAGDREDEYDEYEAEDDDEDDNDEGGKDSDTETTDLF
SNLNLGRTYASGYAHYEEPED

I know I can modify what I call the header (by which I mean the line starting with >) as follows:

awk 'NF > 1{$0=">"$4}; {print $0}' file.fa > file2.fa

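My question is: how do I remove the other two paragraphs? The file may also contain cases where the character sequences of the paragraphs (i.e. not counting the header line) are not identical. In that case, I would like to append a suffix based on the number of entries that share the same identifier (for example, here JH556633.1-1:35740-45316 for the first and JH556633.1-2:35740-45316 for the second, or something similar). The point is to make identical headers (the lines starting with >) distinct, while keeping the original character sequences whenever they differ.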

If anyone has an idea for solving this, I would really appreciate the help. Thanks!

3 Answers:

Answer 0: (score: 1)

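This should work for you. It does not rely on blank lines between sequences, since not all FASTA files have them. It appends _N to each ID, where N is the number of times that ID has been found. An ID associated with only a single sequence gets _1. If an ID is associated with several different sequences, all of them are printed.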

#!/usr/bin/env perl
use strict;
use warnings;

## The field of the ID line you want to keep.
## Since we start counting from 0, to get the 4th
## field, set this to 3.
my $want=3;

my (@fields,%seqs,%seen,$seq);
## Read the input file
while (<>) {
    ## Skip blank lines
    next if /^\s*$/;
    ## remove trailing newlines
    chomp;
    ## Is this an ID line?
    if (/^\s*>(.*)/) {
        ## Save the previous sequence (if any). The %seqs 
        ## hash has the sequence as a key and the desired 
        ## ID as a value.
        if ($fields[0]) {
            $seqs{$seq}=$fields[$want];                 
            ## Clear the previous sequence and IDs
            $seq="";
            @fields=();
        }
        ## Split the ID fields into @fields.
        @fields=split(/\s+/);
    }
    ## If this is a sequence, add to $seq
    else {
        $seq.=$_;
    }
}
## Get the last sequence
$seqs{$seq}=$fields[$want];                 

foreach my $sequence (sort keys(%seqs)) {
    ## Add an identifier.
    $seen{$seqs{$sequence}}++;
    print ">$seqs{$sequence}_$seen{$seqs{$sequence}}\n";
    ## Convert the sequence back to FASTA
    ## (the lookahead avoids a stray blank line when the length is a multiple of 60)
    $sequence=~s/(.{60})(?=.)/$1\n/g;
    print "$sequence\n";
}

Save the script as foo.pl (or whatever you like), make it executable with chmod 744 foo.pl, and run it as:

$ ./foo.pl file.fa 
>JH556633.1:35740-45316_1
MPKKKTGARKKAENRREREKQIRASRANIDLAKHPCNASMECDKCQRRQKNRAFCYFCNS
VQKLPICAQCGKTKCMMKSSDCVIKHAGVYSTGLAMVGAICDFCEAWVCHGRKCLSTHAC
TCPLADAECIECERSVWDHGGRIFACSFCHDFLCEDDQFEHQASCQVLEAETFKCVSCNR
LGQHSCLRCKACFCGDHVRSKVFKQEKGKEPPCPKCGHETQQTKDLSMSTRSLKFGRQTG
GEDADGASGYDAYWKNLSSSKPGDAGDREDEYDEYEAEDDDEDDNDEGGKDSDTETTDLF
SNLNLGRTYASGYAHYEEPED

Answer 1: (score: 0)

Assuming, based on the input you posted, that $4 cannot contain & or \<digit> (if it can, it's a trivial tweak):

$ awk -v RS= '!seen[$4]++{sub(/[^\n]+/,">"$4);print}' file
>JH556633.1:35740-45316
MPKKKTGARKKAENRREREKQIRASRANIDLAKHPCNASMECDKCQRRQKNRAFCYFCNS
VQKLPICAQCGKTKCMMKSSDCVIKHAGVYSTGLAMVGAICDFCEAWVCHGRKCLSTHAC
TCPLADAECIECERSVWDHGGRIFACSFCHDFLCEDDQFEHQASCQVLEAETFKCVSCNR
LGQHSCLRCKACFCGDHVRSKVFKQEKGKEPPCPKCGHETQQTKDLSMSTRSLKFGRQTG
GEDADGASGYDAYWKNLSSSKPGDAGDREDEYDEYEAEDDDEDDNDEGGKDSDTETTDLF
SNLNLGRTYASGYAHYEEPED

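It looks like you have a second problem as well, so please post a new question with some representative input and the expected output for that problem.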
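For the second part of the question (the same identifier appearing with different sequences), one possible approach is sketched below. It is not part of the original answer. It assumes the identifier is always the 4th whitespace-separated header field, and it borrows the `_N` suffix convention from Answer 0 rather than the `-N` form suggested in the question. The function name `dedupe_fasta` is hypothetical.

```shell
# dedupe_fasta FILE...: hypothetical helper, a sketch only.
# Assumption: the identifier to keep is field 4 of each ">" header line.
dedupe_fasta() {
  awk '
    # Print the buffered record unless this exact (id, sequence) pair
    # has been printed before. n, i and key are local variables.
    function emit(   n, i, key) {
      if (id == "") return
      key = id SUBSEP seq
      if (!(key in seen)) {
        seen[key] = 1
        n = ++count[id]              # running number per identifier
        printf ">%s_%d\n", id, n
        # Re-wrap the sequence at 60 characters per line
        for (i = 1; i <= length(seq); i += 60) print substr(seq, i, 60)
      }
      seq = ""
    }
    /^[ \t]*$/ { next }                  # blank lines are optional in FASTA
    /^>/       { emit(); id = $4; next } # new header: flush previous record
               { seq = seq $0 }          # sequence line: accumulate
    END        { emit() }
  ' "$@"
}
```

Run as `dedupe_fasta file.fa > file2.fa`. Because records are compared on identifier plus full sequence, exact duplicates collapse to one entry, while differing sequences under the same identifier are all kept and numbered in order of first appearance.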

Answer 2: (score: 0)

sed -n 's/^>\([^ ]\{1,\} \)\{3\}/>/;/^ *$/q;p' YourFile

Based on your sample (POSIX version, so use --posix with GNU sed).
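Note that the one-liner above quits at the first blank line, so it prints only the first record, and it leaves the trailing field (` 1`) in the header. A possible variant, offered only as a sketch under the same assumptions (single-space field separators, a blank line after the first record), captures the fourth field and drops everything after it. The wrapper name `first_record` is hypothetical.

```shell
# first_record FILE: hypothetical wrapper, a sketch only.
# Keeps only field 4 of the header, then prints lines up to the first blank line.
first_record() {
  sed -n 's/^>\([^ ]\{1,\} \)\{3\}\([^ ]*\).*/>\2/;/^ *$/q;p' "$1"
}
```

Like the original command, this does not compare sequences at all, so it is only suitable when every record in the file is known to be identical.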