Perl hangs up, likely print related

Date: 2016-07-08 00:47:24

Tags: perl read-eval-print-loop genetics

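I am at a loss with this one. I have a Perl script that:
1. Processes Genbank files in a directory (they are very messy and inconsistent; example GBK file: ftp://ftp.ncbi.nlm.nih.gov/genomes/archive/old_genbank/Bacteria/Acetobacter_pasteurianus_IFO_3283_01_uid31129/AP011121.gbk)
2. Splits each file by gene
3. In a foreach loop, gets the relevant information about each gene
4. At the end of each loop, prints the information about that gene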

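Problem: It hangs at random in the middle of particular files, but neither those files nor the genes it stops on are noticeably different from the rest, and the affected files are scattered throughout the >72K total files. When it hangs, the output printed to the command line is several loops (genes) ahead of the output printed to the file (see image).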

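Troubleshooting: the variable it stops on differs from file to file; it sometimes hangs in the middle of printing a variable; RAM/CPU usage is low, and the process is still using memory/CPU while it hangs; it hangs on both Windows (latest Strawberry Perl) and Linux (Flux HPC) systems; and there is nothing wrong with the output itself (because the command-line output is ahead of the file output, I can see that it can and does process the gene it hangs on while printing to the file).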

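Sorry for the length of the code. I am a microbiologist, so it is not as clever or concise as some of the code I see on Stack Overflow (also, Genbank files are very inconsistent in format and language, so I have to program around that). I am happy to implement other code suggestions.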

#!/usr/bin/perl
use warnings;

##### open output files #####
$GBKINFO    = 'L:\NCBI_DAT\GBFF\Bacteria_tablist.txt';
open(GBKINFO,'>', $GBKINFO)||die "unable to open $GBKINFO:$!\n";
$debug      = 'L:\NCBI_DAT\GBFF\Bacteria_debug.txt';
open(DEBUG,'>', $debug)||die "unable to open $debug:$!\n";
$GBKGENOMES = 'L:\NCBI_DAT\GBFF\Bacteria_taxonomy.txt';

##### load taxonomy hash #####
$taxons = 'R:\1_Downloads\taxa.Bacteria.dat';
open(TAXONS, $taxons) || die "unable to open $taxons: $!\n";
my %TAXhash; 
while(<TAXONS>){ 
(my $orgID, my $phylog)=split('\t',$_); $TAXhash{$orgID}=$phylog;}
close(TAXONS);

##### load protIDs hash #####
$protid = 'R:\1_Downloads\geneinfo.Bacteria.dat';
open(PROTID, $protid) || die "unable to open $protid: $!\n";
my %IDhash; 
while(<PROTID>){ 
(my $prot, my $ID)=split('\t',$_); 
$prot =~ s/\s//g; $ID =~ s/\n//g;
$IDhash{$prot}=$ID;}
close(PROTID);

##### get genbank files #####
my $dir = 'L:\NCBI_DAT\GBFF\Bacteria';
die unless opendir DIR, $dir;
foreach my $file (readdir DIR) {
    $/="\n//\n";
    next if $file eq '.' or $file eq '..';
    $gbk = $dir.'/'.$file;
    $gbk =~ s/\//\\/g; 
    $gbk =~ /GCA_(\d+)/;

#######  open .gbk file and split contigs   #######
if(-f $gbk && $gbk =~ /\.(gbk|gbff)/ && $gbk !~ /gz$/ ){
    open(GBK, $gbk) || die "unable to open $gbk: $!\n";
    $i=1; $count =0; 

    while(<GBK>){
        $BIG=$_;
        $BIG =~ s/[\@<>\%\n]//g;
        if($BIG =~ /(\s+CDS\s{5}|\s+\S*RNA\s{5})/){

            # Get Phylogeny
            $BIG =~ /db_xref\=\"taxon:(\d+)/; 
            $taxID=$1; $count = 1;
            $BIG =~ /ORGANISM\s+(.*?)\s+(\w+\;.*?)\./; 
            $SPECIES = $1; $PHYLOGENY = $2;
            $PHYLOGENY =~ s/\(.*?\)//g;  $PHYLOGENY =~ s/[^\w\;]//g; 
            $taxonomy = $TAXhash{$taxID}; 
            if($taxonomy =~ /\w/){ 
                $taxonomy =~ s/(\s+\;$|\s+$|\s+\;\s+$)//g; 
                $PHYLOGENY = $taxonomy;} 
            else{$PHYLOGENY=$PHYLOGENY."\;".$SPECIES;} 
            $PHYLOGENY =~ s/[\t\n]//g;

            if($PHYLOGENY =~ /Bacteria/i){$org="B";}
            elsif($PHYLOGENY =~ /Virus/i){$org="V";}
            elsif($PHYLOGENY =~ /Fungi/i){$org="F";}
            elsif($PHYLOGENY =~ /Archaea/i){$org="A";}
            elsif($PHYLOGENY =~ /Chordata/i){$org="C";}
            elsif($PHYLOGENY =~ /(Viridiplantae|Stramenopiles|Rhodophyta)/i){$org="P";}
            elsif($PHYLOGENY =~ /Eukaryota/i && $org !~ /[ABCFHPRV]/){$org="I";}
            else{$org="U";}

            # Print Phylogeny
            open(GENO, '>>', $GBKGENOMES)||die "unable to open $GBKGENOMES:$!\n";
            print GENO "$taxID\t$PHYLOGENY\n"; close(GENO);

            # get genome seq ###
            $BIG =~ /ORIGIN(.+)/;
            $GenomeSeq=$1;
            $GenomeSeq =~ s/[^a-z]//ig;
            $GenomeSeq = uc($GenomeSeq);
            if(length($GenomeSeq)<100){next;} # eg Bos Taurus genome had no gene seqs

            # split file by genes 
            $BIG =~ /VERSION\s{5,}(\w.*?)\s/; $Accession = $1;
            $BIG =~ s/(\s+gene\s{5})/\%$1/g;
            $BIG =~ s/(\s+[a-z]RNA\s{5})/\%$1/g;
            $BIG =~ s/(\s+CDS\s{5})/\%$1/g;
            $BIG =~  s/order\((\d+)\W*.*?\W(\d+)\)+/$1\.\.$2/g;
            $BIG =~  s/join\((\d+)\W*.*?\W(\d+)\)+/$1\.\.$2/g;
            @genes = split("\%",$BIG); $junk=shift(@genes);

            # get gene info 
            foreach(@genes){ 
                $gline = $_;
                if($gline =~ /^\s+gene\s+/){next;}

                # get gene type
                if($gline =~ /\s+CDS\s+[\dc]/)      {$type = "Protein";}
                elsif($gline =~ /\s\/pseudo/)           {$type = "Pseudo";}
                elsif($gline =~ /\s+\S*[^mr]RNA\s{5}/){$type = "ncRNA"; 
                    $gline =~ /\/note\=\".*\;*(.*)\"/; $LOC=$1; 
                    if($LOCUS !~ /\w/){$LOCUS=$LOC;}}
                elsif($gline =~ /\s+rRNA\s{5}/)     {$type = "rRNA";}
                elsif($gline =~ /\s+tRNA\s{5}/)     {$type = "tRNA";}
                else{next;}

                # get gene names and ids
                if($gline =~ /\/note\=\".*(COG\d\d\d\d)/)       {$COG = $1; $COG =~ s/\s//g;}   else{$COG ='';}
                if($gline =~ /\/note\=\".*:(K\d\d\d\d\d)/)  {$KO = $1; $KO =~ s/\s//g;} else{$KO ='';}
                if($gline =~ /\/locus_tag\=\"(.*?)\"/)      {$LOCUS = $1; $LOCUS =~ s/\s//g;}   else{$LOCUS ='';}
                if($gline =~ /\/protein_id\=\"(.*?)\"/)     {$ProtID = $1; $ProtID =~ s/\s//g;} else{$ProtID ='';}
                if($gline =~ /\/product\=\"(.*?)\"/)            {$Product = $1;}    else{$Product ='';}
                if($gline =~ /\/gene\=\"(.*?)\"/)               {$GName = $1;}  else{$GName ='';}
                if($gline =~ /\/inference\=\".*(RF\d+)\"/)  {$Rfam = $1;}   else{$Rfam ='';}
                if($gline =~ /\/translation\=\"([\w\s]+)\"/)    {$AAseq = $1; $AAseq =~ s/\s//g;}   else{$AAseq ='';}

                # get gene seq and coords                   
                if($gline =~/(RNA|CDS)\s+(\d+)\D*\.\.\D*(\d+)/){ 
                    if($2>$3){$start = $3; $end = $2;} 
                    else{$start = $2; $end = $3;}
                    $strand = "\+"; $seq= substr $GenomeSeq, $start-1, $end-$start+1;}
                elsif($gline =~/(RNA|CDS)\s+compl\S*?(\d+)\D*\.\.\D*(\d+)/){
                    if($2>$3){$start = $3; $end = $2;}
                    else{$start = $2; $end = $3;} 
                    $strand = "\-"; $seq= substr $GenomeSeq, $start-1, $end-$start+1;
                    $seq =~ tr/atgcrykmbvhdATGCRYKMBVHD/tacgyrmkvbdhTACGYRMKVBDH/; 
                    $rseq=reverse($seq); $seq=$rseq;}
                else{print DEBUG "no coords $gline\t$gbk\n"; next;}

                $seq=uc($seq); $Glen = length($seq); $coords = "$start\.\.$end";
                if($Glen < 5){print DEBUG "gene length issue $gline\t$gbk\n"; next;}

                # get gene IDs
                print "prot id $ProtID and $gbk\n";
                $IDS = $IDhash{$ProtID}; $IDS =~ s/\n//g; $IDS =~ s/.*\&//; $Func = ''; $DATname = '';
                if($IDS =~ /\#/ ){($DATname, $Func) = split("\#", $IDS); $Func =~ s/(\s+$|^\s+)//;} 
                if($Func !~ /$COG/ && $COG =~ /COG\d\d\d\d/){$Func = $COG."\@".$Func;}
                if($Func !~ /$KO/ && $KO =~ /K\d\d\d\d\d/){$Func = $KO."\@".$Func;}
                $Func =~ s/(\@$|^\@)//g;

                # fix gene name issues
                if($GName =~ /((hypothetical|uncharacterized|conserved|predicted)\s+protein|unknown function|scaffold|contig)/i || length($GName)<3 || $GName !~ /\w/){
                       if(length($Product)>length($GName)   && $Product !~ /((hypothetical|uncharacterized|conserved|predicted)\s+protein|unknown function|scaffold|contig)/i){$GName=$Product;}
                    elsif(length($DATname)>length($GName)   && $DATname !~ /((hypothetical|uncharacterized|conserved|predicted)\s+protein|unknown function|scaffold|contig)/i){$GName=$DATname;}
                else{ if($type =~ /(protein|pseudo)/i){$GName = "Uncharacterized protein";} else{$GName = "Uncharacterized gene";}}}                
                $GName =~ s/([\;\,\.\@\<\>\%\|]|\(.*\))//g; $GName =~ s/(\s$|^\s)//g; $GName =~ s/\s+/_/g;
                if($LOCUS !~ /\w/){$LOCUS = $Accession."&".$coords;}

                $FINAL = "$LOCUS\t$ProtID\t$GName\t$type\t$Glen\t$strand\t$taxID\t$org\t$Func\t$AAseq\t$seq"; $FINAL =~ s/\n//g;
                if($LOCUS =~ /\w/){print GBKINFO "$FINAL\n";}

            } # foreach gene
                last;
            } # if big matches protein
            else{$i++; print "no protein $i\n"; next;}
        } # close while gbk
        close(GBK)||die "unable to close GBK:$!\n"; #just added to check it is closing

        if($count==0){ print "no genes unlinked $gbk\n"; unlink $gbk or warn    "Could not unlink $gbk: $!"; next;}
    } # closes 1st if for getting genomes
} # closes 1st foreach file

close(GBKINFO);
close(DEBUG);

Image showing command line print is ahead of print to file and memory use is fine

1 Answer:

Answer 0 (score: 0)

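Thanks! I tried flushing the buffer with the global $| = 1; but that did not cover all the nested filehandles, so I then set it per filehandle with:
select((select(FH), $| = 1)[0]);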
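The per-filehandle idiom above can also be written with the `autoflush` method from the core `IO::Handle` module. The following is a minimal sketch, not the script's actual code: the temporary file and the lexical handle `$out` are hypothetical stand-ins for the `GBKINFO` and `DEBUG` output files.

```perl
use strict;
use warnings;
use IO::Handle;                 # provides the autoflush method on filehandles
use File::Temp qw(tempfile);

# Hypothetical stand-in for the script's GBKINFO/DEBUG output files.
my ($out, $path) = tempfile(UNLINK => 1);

# Same effect as select((select($out), $| = 1)[0]):
# flush this handle after every print.
$out->autoflush(1);

print {$out} "gene 1 processed\n";

# With autoflush on, the line should already be readable from the file
# even though $out has not been closed yet.
open(my $check, '<', $path) or die "unable to open $path: $!\n";
my $seen = <$check>;
close($check);

print "on disk before close: ", (defined $seen ? $seen : "(nothing yet)\n");
```

With every output handle unbuffered, the last line written to the file and the last line shown on the command line should refer to the same gene, which makes it easier to see which statement the script is stuck on.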
Flushing helped me find where it was hanging: a regex that did not cope well with the messiness of some gbk files. The bad regex -> $gline =~ /\/note\=\".*\;*(.*)\"/;
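One plausible reading of why that pattern stalls: it contains two unbounded `.*` runs, and the script strips newlines from each record, so both runs can scan across everything that follows the note, including the genome sequence when the record's last gene is being processed. The sketch below is an assumption-laden illustration rather than the poster's fix: the sample `$gline` is made up, and it assumes the intent was to capture the text after the last semicolon inside the `/note="..."` qualifier.

```perl
use strict;
use warnings;

# Hypothetical one-line gene record, mimicking $gline after newlines are stripped.
my $gline = '     ncRNA           100..250 /locus_tag="ABC_0001" '
          . '/note="small RNA; sRNA-X" /db_xref="X:1" ORIGIN '
          . ('acgt' x 50);

# Original pattern: both .* runs may extend to the end of the record,
# so the work grows with the square of the text after the last quote.
my ($orig) = $gline =~ /\/note\=\".*\;*(.*)\"/;

# Bounded alternative (assumed intent): [^"]* cannot run past the closing
# quote of the note, so the match stays inside the qualifier.
my ($note) = $gline =~ /\/note="([^"]*)"/;
my $last = defined $note ? (split /\s*;\s*/, $note)[-1] : undef;

printf "original capture: '%s'\n", defined $orig ? $orig : '';
printf "bounded capture:  '%s'\n", defined $last ? $last : '';
```

On this sample the original pattern matches but captures an empty string, because the first greedy `.*` consumes everything up to the last quote in the record, while the bounded version returns `sRNA-X`. The sample is small, so both finish instantly; the difference in running time would only show on records with a long tail after the last quote.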