斯德哥尔摩到快速格式 - 在每个标题中包含加入ID

时间:2015-03-11 07:00:05

标签: perl bioinformatics fasta

您好我有多个斯德哥尔摩格式的序列,在每个路线的顶部都有一个入藏ID,例如:'#= GF AC PF00406 '和' // < / strong>' - &gt;这是对齐的结束。当我将斯德哥尔摩格式转换为fasta格式时,我需要在特定对齐的每个序列的标题中使用 PF00406 。有时在一个文件中会有多个斯德哥尔摩对齐。我试着修改下面的perl脚本,它给了我奇怪的结果,任何帮助都将不胜感激。

my $columns = 60;
my $gapped = 0;

my $progname = $0;
$progname =~ s/^.*?([^\/]+)$/$1/;

my $usage = "Usage: $progname [<Stockholm file(s)>]\n";
$usage .=   "             [-h] print this help message\n";
$usage .=   "             [-g] write gapped FASTA output\n";
$usage .=   "             [-s] sort sequences by name\n";
$usage .=   "      [-c <cols>] number of columns for FASTA output (default is $columns)\n";
# parse cmd-line opts
my @argv;
while (@ARGV) {
    my $arg = shift;
    if ($arg eq "-h") {
    die $usage;
    } elsif ($arg eq "-g") {
    $gapped = 1;
    } elsif ($arg eq "-s"){
    $sorted = 1;
    } elsif ($arg eq "-c") {
    defined ($columns = shift) or die $usage;
    } else {
    push @argv, $arg;
    }
}
@ARGV = @argv;

my %seq;
while (<>) {
    next unless /\S/;
    next if /^\s*\#/;
    if (/^\s*\/\//) { printseq() }
    else {
    chomp;
    my ($name, $seq) = split;
    #seq =~ s/[\.\-]//g unless $gapped;
    $seq{$name} .= $seq;
    }
}
printseq();

sub printseq {
    if($sorted){
        foreach $key (sort keys %seq){
            print ">$key\n";
            for (my $i = 0; $i < length $seq{$key}; $i += $columns){
                print substr($seq{$key}, $i, $columns), "\n";
            }
        }
    } else{
            while (my ($name, $seq) = each %seq) {
            print ">$name\n";
            for (my $i = 0; $i < length $seq; $i += $columns) {
                    print substr ($seq, $i, $columns), "\n";
            }
        }
    }
    %seq = ();
}

1 个答案:

答案 0 :(得分:1)

根据accessionID行中的变化程度,您可能需要修改正则表达式,但这适用于您的示例文件

my %seq;
my $aln;
while (<>) {
    if ($_ =~ /#=GF AC (\w+)/) {
       $aln = $1;
    }
    elsif ($_ =~ /^\s*\/\/\s*$/){
       $aln = '';
    }
    next unless /\S/;
    next if /^\s*\#/;
    if (/^\s*\/\//) { printseq() }
    else {
       chomp;
       my ($name, $seq) = split;
       $name = $name . ' ' . $aln;
       $seq{$name} .= $seq;
    }
}
printseq();