Definition error when using Perl to look up columns from one file in the matching columns of another file

Time: 2018-08-20 14:49:50

Tags: perl loops foreach pattern-matching global-variables

I have a tab-delimited input file in the following format:

+    Chr1    www
-    Chr2    zzz
...

I want to check it, line by line, against a tab-delimited reference file (TRANSCRIPTS in the code below) in the following format:

Chr1    +    xxx    UsefulInfo1
Chr2    -    yyy    UsefulInfo2
...

and I want the output to look like this:

+    Chr1    UsefulInfo1
-    Chr2    UsefulInfo2
...

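This is my attempt at taking the file names from the command line, pulling certain information from the input file, and adding the useful information from the reference file: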

#!/usr/bin/perl

use strict;
use warnings;
use diagnostics;

my $inFile = $ARGV[0];
my $outFile = $ARGV[1];

open(INFILE, "<$inFile") || die("Couldn't open $inFile: $!\n");
open(OUTFILE, ">$outFile") || die("Couldn't create $outFile: $!\n");

open(TRANSCRIPTS, "</path/TranscriptInfo") || die("Couldn't open reference file!");
my @transcripts = split(/\t+/, <TRANSCRIPTS>);
chomp @transcripts;

#Define desired information from input for later
while (my @columns = split(/\t+/, <INFILE>)) {
    chomp @columns;
    my $strand = $columns[0];
    my $chromosome = $columns[1];

    #Attempt to search reference file line by line for matching criteria and copying a column of matching lines
    foreach my $reference(@transcripts) {
        my $refChr = $reference[0]; #Error for this line
        my $refStrand = $reference[1]; #Error for this line
        if ($refChr eq $chromosome && $refStrand eq $strand) {
            my $info = $reference[3]; #Error for this line
            print OUTFILE "$strand\t$chromosome\t\$info\n";
        }
    }
}

close(OUTFILE); close(INFILE);

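Currently I get the error: Global symbol "@reference" requires explicit package name. What is the correct way to define this? Even with the symbol defined correctly, I am not entirely sure that my foreach loop will work as intended.

Any general advice would also be appreciated. Forgive my ignorance; I am a long-time lurker with no formal coding training beyond what I have been able to teach myself.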

1 answer:

Answer 0: (score: 1)

Fixed:

use strict;
use warnings;
use feature qw( say );

my $in_qfn          = $ARGV[0];
my $out_qfn         = $ARGV[1];
my $transcripts_qfn = "/path/TranscriptInfo";

my @transcripts;
{
   open(my $transcripts_fh, "<", $transcripts_qfn)
      or die("Can't open \"$transcripts_qfn\": $!\n");
   while (<$transcripts_fh>) {
      chomp;
      push @transcripts, [ split(/\t/, $_, -1) ];
   }    
}

{
   open(my $in_fh, "<", $in_qfn)
      or die("Can't open \"$in_qfn\": $!\n");
   open(my $out_fh, ">", $out_qfn)
      or die("Can't create \"$out_qfn\": $!\n");
   while (<$in_fh>) {
      chomp;
      my ($strand, $chr) = split(/\t/, $_, -1);
      for my $transcript (@transcripts) {
         my $ref_chr    = $transcript->[0];
         my $ref_strand = $transcript->[1];
         if ($chr eq $ref_chr && $strand eq $ref_strand) {
            my $info = $transcript->[3];   # UsefulInfo is the fourth column of the reference file
            say $out_fh join("\t", $strand, $chr, $info);
         }
      }
   }
}
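
As for the error itself: `$reference[0]` means element 0 of an array named `@reference`, which was never declared; the loop variable `$reference` is an unrelated scalar. In addition, `split(/\t+/, <TRANSCRIPTS>)` reads only the first line of the reference file and returns its fields, not a list of rows. The following is a minimal sketch of the idea used in the fix, storing each row as an array reference and indexing it with `->`. The helper name `parse_rows` and the sample data are made up for illustration and are not part of the original post.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original post): turn tab-separated
# lines into a list of array references, one reference per row.
sub parse_rows {
    my @lines = @_;
    my @rows;
    for my $line (@lines) {
        chomp(my $copy = $line);                  # work on a copy, strip the newline
        push @rows, [ split(/\t/, $copy, -1) ];   # -1 keeps trailing empty fields
    }
    return @rows;
}

# Made-up sample data mirroring the layout of the reference file.
my @rows = parse_rows("Chr1\t+\txxx\tUsefulInfo1\n", "Chr2\t-\tyyy\tUsefulInfo2\n");

for my $row (@rows) {
    # $row is a scalar holding an array reference, so its fields are
    # reached with $row->[N]. Writing $row[N] would instead look for an
    # array named @row, which is what triggered the original error.
    print join("\t", $row->[1], $row->[0], $row->[3]), "\n";
}
```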

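That said, the nested-loop approach is very inefficient. Let N be the number of lines in $transcripts_qfn and M the number of lines in $in_qfn. The body of the inner loop executes N * M times, when it really only needs to execute N times: load the strand/chromosome pairs from the input file into a hash first, then make a single pass over the reference file.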

use strict;
use warnings;
use feature qw( say );

my $in_qfn          = $ARGV[0];
my $out_qfn         = $ARGV[1];
my $transcripts_qfn = "/path/TranscriptInfo";

my %to_print;
{
   open(my $in_fh, "<", $in_qfn)
      or die("Can't open \"$in_qfn\": $!\n");
   while (<$in_fh>) {
      chomp;
      my ($strand, $chr) = split(/\t/, $_, -1);
      ++$to_print{$strand}{$chr};
   }    
}

{
   open(my $transcript_fh, "<", $transcripts_qfn)
      or die("Can't open \"$transcripts_qfn\": $!\n");
   open(my $out_fh, ">", $out_qfn)
      or die("Can't create \"$out_qfn\": $!\n");
   while (<$transcript_fh>) {
      chomp;
      my ($ref_chr, $ref_strand, undef, $info) = split(/\t/, $_, -1);   # skip the third column
      next if !$to_print{$ref_strand};
      next if !$to_print{$ref_strand}{$ref_chr};
      say $out_fh join("\t", $ref_strand, $ref_chr, $info);
   }
}
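
Note that the hash-based version writes its matches in the order of the reference file, and prints each reference line once even if the input file repeats a strand/chromosome pair. If the output should follow the order of the input file instead, one possible variation is to index the reference file and then stream the input. This is only a sketch under assumptions: `build_index` and `lookup` are hypothetical helper names, the useful information is taken to be the fourth field as in the question's example, and the sample data is made up.

```perl
use strict;
use warnings;

# Hypothetical helper: map "strand<TAB>chr" to the list of info values
# found in the reference lines (fourth column assumed to hold the info).
sub build_index {
    my @ref_lines = @_;
    my %index;
    for my $line (@ref_lines) {
        chomp(my $copy = $line);
        my ($chr, $strand, undef, $info) = split(/\t/, $copy, -1);
        push @{ $index{"$strand\t$chr"} }, $info;
    }
    return \%index;
}

# Hypothetical helper: return the output lines for one input line,
# or an empty list when the reference has no matching entry.
sub lookup {
    my ($index, $line) = @_;
    chomp(my $copy = $line);
    my ($strand, $chr) = split(/\t/, $copy, -1);
    my $infos = $index->{"$strand\t$chr"} or return;
    return map { join("\t", $strand, $chr, $_) } @$infos;
}

# Made-up sample data: two reference lines, two input lines.
my $index = build_index("Chr1\t+\txxx\tUsefulInfo1\n", "Chr2\t-\tyyy\tUsefulInfo2\n");
print "$_\n" for map { lookup($index, $_) } ("-\tChr2\tzzz\n", "+\tChr1\twww\n");
```

The trade-off is memory: this variation keeps the reference file in a hash, while the version above keeps the input file's pairs, so whichever file is smaller is the more natural one to index.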