Question

I am trying to find the lines that two tab-delimited files have in common, based on one field. A line from the first file:

@Optional @Input String conversion

A line from the second file:

1       52854   s64199.1        A       .       .       .       PR      GT      0/0

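The match is based on the second field (52854 in this example), and I have many such lines. Here is my code, which finds the common ones, but my files are very large and it takes a long time. Is there a way to speed this up? Thank you very much in advance.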

chr1    52854     .       C       T       215.302 .       AB=0.692308;ABP=7.18621;AC=1;AF=0.5;AN=2;AO=9;CIGAR=1X;DP=13;DPB=13;DPRA=0;EPP=3.25157;EPPR=3.0103;GTI=0;LEN=1;MEANALT=1;MQM=60;MQMR=60;NS=1;NUMALT=1;ODDS=17.5429;PAIRED=0;PAIREDR=0.25;PAO=0;PQA=0;PQR=0;PRO=0;QA=318;QR=138;RO=4;RPP=3.25157;RPPR=5.18177;RUN=1;SAF=0;SAP=22.5536;SAR=9;SRF=1;SRP=5.18177;SRR=3;TYPE=snp;technology.illumina=1;BVAR  GT:DP:RO:QR:AO:QA:GL    0/1:13:4:138:9:318:-5,0,-5

Answer 1

Please find below a minimal modification of your script that uses a hash-based lookup.

use strict;
use warnings;
my $map_file = $ARGV[0];
my $vcf_file = $ARGV[1];

my %vcf_hash;
open( my $vcf_info, '<', $vcf_file) or die "Could not open $vcf_file: $!";
while( my $line = <$vcf_info>)  {
    next if $line =~ m/^#/; # Skip comment lines
    chomp $line;
    my (@data) = split(/\t/, $line);
    die "Too few fields in $vcf_file line $.\n" unless @data >= 10; # Check number of fields in the input line
    my ($pos) = $data[1];
    # $. - line number in the file
    $vcf_hash{$pos}{$.} = \@data;
}

open( my $map_info, '<', $map_file) or die "Could not open $map_file: $!";
while( my $mline = <$map_info>)  {
    chomp $mline;
    my (@data) = split(/\t/, $mline);
    die "Too few fields in $map_file line $.\n" unless @data >= 2; # Check number of fields in the input line
    my ($pos) = $data[1];
    if( exists $vcf_hash{$pos}) {
      my $hash_ref = $vcf_hash{$pos};
      for my $n (sort{$a<=>$b} keys %$hash_ref) {
        my $array_ref = $hash_ref->{$n};
        my $pos2     = $array_ref->[1];
        my $ref2     = $array_ref->[3];
        my $allele   = $array_ref->[4];
        my $genotype = $array_ref->[9];
        print $pos2 . "\t" . $ref2. "\t".$allele."\t".$genotype. "\n";
      }
    }
}

If the data files are very large, the script could be improved further to reduce memory usage.
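One possible way to cut memory, sketched here under assumptions rather than taken from the answer: store only the four fields that get printed, as one joined string per record, instead of a reference to the full field array. The helper name `index_vcf_lines` and the sample records are hypothetical. Because records are pushed in file order, the `$.` sub-key and the sort are no longer needed.

```perl
use strict;
use warnings;

# Hypothetical helper: build a lookup from position to the output lines
# for that position, keeping only the four fields that get printed.
sub index_vcf_lines {
    my ($lines) = @_;    # array ref of raw VCF lines
    my %by_pos;
    for my $line (@$lines) {
        next if $line =~ /^#/;    # skip header/comment lines
        chomp( my $copy = $line );
        my @data = split /\t/, $copy;
        next unless @data >= 10;    # this sketch ignores short lines
        # Keep one small string per record instead of the full field array.
        push @{ $by_pos{ $data[1] } }, join( "\t", @data[ 1, 3, 4, 9 ] );
    }
    return \%by_pos;
}

# Tiny in-memory demo with made-up records.
my @vcf = (
    "##fileformat=VCFv4.1\n",
    join( "\t", 'chr1', 52854, '.', 'C', 'T', '215.3', '.', 'DP=13', 'GT', '0/1' ) . "\n",
    join( "\t", 'chr1', 60000, '.', 'G', 'A', '50.0',  '.', 'DP=8',  'GT', '1/1' ) . "\n",
);
my $index = index_vcf_lines( \@vcf );
print "$_\n" for @{ $index->{52854} || [] };
```

The long INFO column is discarded as soon as each line has been split, which is where most of the per-record memory goes in a file like the one in the question.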

Answer 2

Here is a version that should run much faster than your own.

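It reads the map file and stores each pos field in the hash %wanted. Then it reads the second file and checks whether each record's position is among the wanted values. If it is, it splits the record and prints the fields you need.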

Note that I have been unable to test this beyond making sure that it compiles.

use strict;
use warnings;
use 5.010;
use autodie;

my ( $map_file, $vcf_file ) = @ARGV;

my %wanted;

{
    open my $map_fh, '<', $map_file;

    while ( <$map_fh> ) {
        chomp;
        my $pos = ( split /\t/, $_, 3 )[1];
        ++$wanted{$pos};
    }
}

{
    open my $vcf_fh, '<', $vcf_file;

    while ( <$vcf_fh> ) {

        next if /^#/;

        chomp;
        my $pos = ( split /\t/, $_, 3 )[1];
        next unless $wanted{$pos};

        my ( $ref, $allele, $genotype ) = ( split /\t/ )[3, 4, 9];
        print join("\t", $pos, $ref, $allele, $genotype), "\n";

    }
}
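The script above splits each line twice: once with a limit of 3 to get the position cheaply, and once in full only for records that matched. A minimal sketch of what the limit does, using a made-up ten-field record shaped like the VCF line in the question:

```perl
use strict;
use warnings;

# A made-up ten-field record, shaped like the VCF line in the question.
my $record = join "\t", 'chr1', 52854, '.', 'C', 'T', '215.3', '.', 'DP=13', 'GT', '0/1';

# With a limit of 3, split stops after two separators: the third element
# holds the whole unsplit remainder of the line.
my @head = split /\t/, $record, 3;
my $pos  = $head[1];

# The full split is only needed for records that pass the lookup.
my @all = split /\t/, $record;

printf "limited split: %d fields, pos = %s\n", scalar @head, $pos;
printf "full split:    %d fields\n", scalar @all;
```

When most records do not match, the limited split avoids breaking up the long trailing columns of every rejected line.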

Answer 3

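There is no need to keep the map_file in memory; keep only its keys. It is best to put them in a hash that is used for existence checks. You don't have to keep the vcf_file in memory either; you can decide line by line whether to output it.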

#!/app/languages/perl/5.14.2/bin/perl
use strict;
use warnings;
use autodie;

use constant KEY => 1;
use constant FIELDS => ( 1, 3, 4, 9 );

my ( $map_file, $vcf_file ) = @ARGV;

my %map;
{
    my $fh;
    open $fh, '<', $map_file;

    while (<$fh>) {
        $map{ ( split /\t/, $_, KEY + 2 )[KEY] } = undef;
    }
}

{
    my $fh;
    open $fh, '<', $vcf_file;
    while (<$fh>) {
        next if /^#/;
        chomp;
        my @data = split /\t/;
        # The line was chomped above, so add the newline back on output
        print join( "\t", @data[FIELDS] ), "\n" if exists $map{ $data[KEY] };
    }
}
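The two styles of lookup hash seen in these answers behave differently under a plain truth test. A small sketch with made-up positions; the hash names `%seen_undef` and `%seen_count` are illustrative only:

```perl
use strict;
use warnings;

my %seen_undef;    # key present, value undef (tested with exists)
my %seen_count;    # value counts occurrences (tested for truth)

for my $pos ( 52854, 60000, 52854 ) {    # made-up positions
    $seen_undef{$pos} = undef;
    ++$seen_count{$pos};
}

# An undef value is false, so a plain truth test misses the key;
# exists checks only for the presence of the key.
my $truthy  = $seen_undef{52854} ? 1 : 0;
my $present = exists( $seen_undef{52854} ) ? 1 : 0;

# With counters, the truth test works because every stored count is >= 1.
my $counted = $seen_count{52854} ? 1 : 0;

print "truthy=$truthy present=$present counted=$counted count=$seen_count{52854}\n";
```

Storing undef keeps the values as small as possible, but it only works if every lookup uses `exists`; the counter style tolerates a truth test and also records how often each position occurred.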

How can I quickly find the common items in two arrays?
