Finding matches between 2 files (how to improve efficiency)

Time: 2017-07-12 03:17:03

Tags: regex perl grep compare match

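@file1 contains only Startpoint-Endpoint pairs, and each index holds one pair. file2 is a text file, and each index of @file2 holds one of its lines. I want to search @file2 line by line for each pair in @file1. Once an exact match is found, I will try to extract information1 from file2 and print it. For now, though, I am only trying to find the matching pairs in file2. The pattern to match has the following format: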

Matching case

From $file1[0]:

Startpoint: /source/in_out/map (positive-triggered) 
Endpoint: /output/end/scan_all (positive-triggered)

It is a match if file2 contains:

Line with other stuff
Startpoint: /source/in_out/map (positive-triggered) 
Endpoint: /output/end/scan_all (positive-triggered)
information1:
information2:
Lines with other stuff

Non-matching case:

From file1:

Startpoint: /source/in_out/map (positive-triggered) 
Endpoint: /output/end/scan_all (positive-triggered)

From file2:

Startpoint: /source/in_out/map (positive-triggered)
Endpoint: /different endpoint pair/ (positive-triggered)
information1:
information2:

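The text of file2 is stored in @file2. For file1, I have successfully extracted each Startpoint line together with the Endpoint line that follows it and stored them in @file1 in the format shown above. (Extracting and storing the pairs works fine, so I won't show that code; it takes about 4 minutes.) I then split each element of @address into its Startpoint and Endpoint. I check file2 line by line: if the Startpoint matches, I move on to the next line and check the Endpoint. It counts as a match only if the line immediately after the Startpoint matches the Endpoint; otherwise I keep searching until the last line of file2. The script does the job, but it takes three and a half hours to finish (file1 has about 60k pairs and file2 has 800k lines). Is there a more efficient way?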

I'm new to Perl scripting, and I apologize for any silly mistakes in my explanation and code. Here is the code:

#!/usr/bin/perl
use warnings;

my $report = '/home/dir/file2';
open ( my $DATA, '<', $report ) || die "Error when opening $report: $!";
chomp (@file2 = <$DATA>);
close $DATA;
#No problem in extracting Start-Endpoint pair from file1 into @file1, so I won't include
#the code for this
$size = scalar @file1;
$size2 = scalar @file2;

for ( $total=0; $total<$size; $total++ ) {
   my @file1_split = split(/\n/, $file1[$total]);   #[0] = Startpoint, [1] = Endpoint
   chomp @file1_split;
   my $match_endpoint = 0;
   my $split = 0;
LABEL2: for ( $count=0; $count<$size2; $count++ ) {
           if ( $match_endpoint == 1 ) {
              if ( grep { $_ eq $file1_split[$split] } $file2[$count] ) {
                 print "Pair($total):Match Pair\n";
                 last LABEL2;      #move on to check next start-endpoint pair
              }
              else {
                 #reset to check the same startpoint again and continue
                 #searching until match found or end line of file2
                 $split = 0;
                 $match_endpoint = 0;
              }
           }
           elsif ( grep { $_ eq $file1_split[$split] } $file2[$count] ) {
              $match_endpoint = 1; #enable search for endpoint in next line
              $split = 1;          #move on next line to match endpoint
              next;
           }
           elsif ( $count==$size2-1 ) {
              print "no matching found for Path($total)\n";
           }
        }
}

2 answers:

Answer 0: (score: -1)

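If I understand your spec (show the matches), I'd bet this finishes in under 5 seconds, unless you are on an old Dell D333. To cut the run time further, you could write a little extra code so that the while loop is driven by the hash with the fewest keys (file1, as you implied). If you use references to the hashes, a small if-else statement can swap the hash references, so you don't have to write the while statement twice.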

use strict;
use warnings;

sub makeHash($) {
    my ($filename) = @_;
    open(DATA, '<', $filename) || die "Cannot open $filename: $!";
    my %result;
    my ($start, $line);
    while (<DATA>) {
        if ($_ =~ /^Startpoint: (.*)/) {
            $start = $1;    # captured group in regular expression
            $line = $.;     # current line number
        } elsif ($_ =~ /^Endpoint: (.*)/) {
            my $end = $1;
            if (defined $line && $. == ($line + 1)) {
                my $key = "${start}::$end";    # braces keep Perl from reading "$start::" as a package variable
                # can distinguish start and end lines if necessary
                $result{$key} = {start=>$start, end=>$end, line=>$line};
            }
        }
    }
    close(DATA);
    return %result;
}

my %file1 = makeHash("file1");
my %file2 = makeHash("file2");

my $fmt = "%10s %10s %s\n";
my $nmatches = 0;

printf $fmt, "File1", "File2", "Key";

while (my ($key, $f1h) = each %file1) {
    my $f2h = $file2{$key};
    if (defined $f2h) {
        # You have access to hash members start and end if you need to distinguish further
        printf $fmt, $f1h->{line}, $f2h->{line}, $key;
        $nmatches++;
    }
}
print "Found $nmatches matches\n";
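The hash-reference swap mentioned in the text above could look roughly like the following. This is only a sketch: `common_keys` is a hypothetical helper name that does not appear in the answer, and the two small hashes stand in for what `makeHash` would return.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the answer above): returns the keys that
# appear in both hashes, looping over whichever hash has fewer keys.
sub common_keys {
    my ($h1, $h2) = @_;

    # Swap the references so that $small always points at the smaller hash.
    my ($small, $large) =
        scalar(keys %$h1) <= scalar(keys %$h2) ? ($h1, $h2) : ($h2, $h1);

    # Probe the larger hash once per key of the smaller one.
    my @found = grep { exists $large->{$_} } keys %$small;
    return sort @found;
}

# Tiny in-memory stand-ins for the hashes built by makeHash.
my %file1 = ( 'a::b' => { line => 1 }, 'c::d' => { line => 3 } );
my %file2 = ( 'a::b' => { line => 7 }, 'x::y' => { line => 9 }, 'c::e' => { line => 11 } );

my @common = common_keys( \%file1, \%file2 );
print "Found ", scalar(@common), " matches: @common\n";
```

Because only the references are swapped, the loop body is written once and the number of lookups is bounded by the size of the smaller hash.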

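Below is my test data generator (thanks). I generated the worst case: 1,000,000 matches between two identical files. On my MBP, the matching code above finished within 20 seconds on the generated test data.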

use strict;
use warnings;

sub rndStr { join'', @_[ map{ rand @_ } 1 .. shift ] }

open(F1, ">file1") || die;
open(F2, ">file2") || die;

for (1..1000000) {
    my $start = rndStr(30, 'A'..'Z');
    my $end = rndStr(30, 'A'..'Z');
    print F1 "Startpoint: $start\n";
    print F1 "Endpoint: $end\n";
    print F2 "Startpoint: $start\n";
    print F2 "Endpoint: $end\n";
}
close(F1);
close(F2);

Answer 1: (score: -1)

If I understand what your code is trying to do, it looks like it would be more efficient to do it this way:

my %split=@file1;
my %total;
@total{@file1}=(0..$#file1);
my $split;
for( @file2 ){
    if( $split ){
      if( $_ eq $split ){
         print"Pair($total{$split}):Match Pair\n";
      }else{
         $split{$split}="";
      }
    }
    $split=$split{$_};
    delete $split{$_};
}
for( keys %split ){
  print"no matching found for Path($total{$_})\n";
}
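For comparison, here is a minimal sketch of a single-pass lookup that uses the data layout described in the question, where each element of @file1 is one "Startpoint line\nEndpoint line" string. It is an illustration under assumptions, not either answerer's method:

- `find_pairs` is a hypothetical helper name.
- Trailing whitespace is assumed to be insignificant.
- The information1 line is assumed to come right after the Endpoint line, as in the sample.
- If the same pair occurs more than once in file1, only its last index is kept.

```perl
use strict;
use warnings;

# Hypothetical helper: takes the pairs from file1 and the lines of file2 and
# returns a hash mapping the index of each matched pair to the line that
# follows the match in file2.
sub find_pairs {
    my ($pairs, $lines) = @_;

    # Index every pair by its two lines, trailing whitespace removed.
    my %index_of;
    for my $i (0 .. $#$pairs) {
        my ($start, $end) = split /\n/, $pairs->[$i];
        next unless defined $end;           # skip malformed elements
        s/\s+$// for $start, $end;
        $index_of{"$start\n$end"} = $i;
    }

    # One pass over file2: look up each pair of adjacent lines in the hash.
    my %found;
    for my $n (0 .. $#$lines - 1) {
        my ($start, $end) = @{$lines}[$n, $n + 1];
        s/\s+$// for $start, $end;
        my $i = $index_of{"$start\n$end"};
        next unless defined $i;
        # Assumption: information1 is the line right after the Endpoint line.
        $found{$i} = $lines->[$n + 2] // '';
    }
    return %found;
}

my @file1 = (
    "Startpoint: /source/in_out/map (positive-triggered)\nEndpoint: /output/end/scan_all (positive-triggered)",
    "Startpoint: /source/in_out/map (positive-triggered)\nEndpoint: /no/such/endpoint (positive-triggered)",
);
my @file2 = (
    'Line with other stuff',
    'Startpoint: /source/in_out/map (positive-triggered) ',
    'Endpoint: /output/end/scan_all (positive-triggered)',
    'information1:',
    'information2:',
);

my %found = find_pairs( \@file1, \@file2 );
for my $i (0 .. $#file1) {
    if ( exists $found{$i} ) {
        print "Pair($i):Match Pair, next line: $found{$i}\n";
    }
    else {
        print "no matching found for Path($i)\n";
    }
}
```

The nested loops in the question compare up to 60,000 × 800,000 = 4.8 × 10^10 line pairs, whereas this version does one hash insert per pair and one hash lookup per line of file2.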