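@file1 contains only Startpoint-Endpoint pairs; each index holds one pair. file2 is a text file, and in @file2 each index holds one line. I want to search for each pair from @file1 line by line in @file2. Once an exact match is found, I will try to extract information1 from file2 and print it out. For now, though, I am only trying to find the matching pairs in file2. The pattern to match has the following format: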
$file1[0]
Startpoint: /source/in_out/map (positive-triggered)
Endpoint: /output/end/scan_all (positive-triggered)
file2 contains:
Line with other stuff
Startpoint: /source/in_out/map (positive-triggered)
Endpoint: /output/end/scan_all (positive-triggered)
information1:
information2:
Lines with other stuff
Startpoint: /source/in_out/map (positive-triggered)
Endpoint: /output/end/scan_all (positive-triggered)
Startpoint: /source/in_out/map (positive-triggered)
Endpoint: /different endpoint pair/ (positive-triggered)
information1:
information2:
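The text file file2 is stored in @file2. For file1, I have successfully extracted each Startpoint and the Endpoint on the line after it, and stored each pair in @file1 in the format shown above. (Extracting and storing the pairs works fine, so I won't show the code for that; it takes about 4 minutes.) I then split each element of @file1 into its startpoint and endpoint and check file2 line by line. If the startpoint matches, I move to the next line to check the endpoint. It only counts as a match if the line immediately after the Startpoint matches the endpoint; otherwise the search continues until the last line of file2. This script does the job, but it takes three and a half hours to finish (file1 has about 60k pairs and file2 has 800k lines). Is there a more efficient way?
I am new to Perl scripting, and I apologize for any silly mistakes in my explanation and code. Here is the code: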
#!/usr/bin/perl
use strict;
use warnings;

my $report = '/home/dir/file2';
open(my $DATA, '<', $report) or die "Error when opening $report: $!";
chomp(my @file2 = <$DATA>);
close($DATA);

# No problem in extracting Start-Endpoint pairs from file1 into @file1,
# so I won't include the code for this
my @file1;    # filled by the omitted extraction code

my $size  = scalar @file1;
my $size2 = scalar @file2;
for (my $total = 0; $total < $size; $total++) {
    my @file1_split = split(/\n/, $file1[$total]);
    chomp @file1_split;
    my $match_endpoint = 0;
    my $split = 0;
    LABEL2: for (my $count = 0; $count < $size2; $count++) {
        if ($match_endpoint == 1) {
            if ($file2[$count] eq $file1_split[$split]) {
                print "Pair($total):Match Pair\n";
                last LABEL2;    # move on to check next start-endpoint pair
            }
            else {
                # reset to check the same startpoint again and continue
                # searching until a match is found or file2 ends
                $split = 0;
                $match_endpoint = 0;
            }
        }
        elsif ($file2[$count] eq $file1_split[$split]) {
            $match_endpoint = 1;    # enable search for endpoint in next line
            $split = 1;             # move on to next line to match endpoint
            next;
        }
        elsif ($count == $size2 - 1) {
            print "no matching found for Path($total)\n";
        }
    }
}
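Answer 0 (score: -1)
If I understand your specification (showing the matches), I would bet this finishes in under 5 seconds unless you are using an old Dell D333. To reduce the run time further, you could write some extra code so that the while loop is driven by the hash with the fewest keys (file1, as you implied). If you use references to the hashes, a small if-else statement can swap the hash references, so you do not have to write a duplicate while statement.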
use strict;
use warnings;

# Build a hash of the Startpoint/Endpoint pairs found on consecutive lines
sub makeHash {
    my ($filename) = @_;
    open(my $fh, '<', $filename) or die "Cannot open $filename: $!";
    my %result;
    my ($start, $line);
    while (<$fh>) {
        if (/^Startpoint: (.*)/) {
            $start = $1;    # captured group in regular expression
            $line  = $.;    # current line number
        } elsif (/^Endpoint: (.*)/) {
            my $end = $1;
            if (defined $line && $. == ($line + 1)) {
                # Braces are needed here: "$start::$end" would be parsed
                # as the package variable $start:: followed by $end
                my $key = "${start}::${end}";
                # can distinguish start and end lines if necessary
                $result{$key} = {start=>$start, end=>$end, line=>$line};
            }
        }
    }
    close($fh);
    return %result;
}

my %file1 = makeHash("file1");
my %file2 = makeHash("file2");
my $fmt = "%10s %10s %s\n";
my $nmatches = 0;
printf $fmt, "File1", "File2", "Key";
while (my ($key, $f1h) = each %file1) {
    my $f2h = $file2{$key};
    if (defined $f2h) {
        # You have access to hash members start and end if you need to distinguish further
        printf $fmt, $f1h->{line}, $f2h->{line}, $key;
        $nmatches++;
    }
}
print "Found $nmatches matches\n";
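Below is my test data generator (thanks). I generated the worst case: 1,000,000 matches between two identical files. With the generated test data, the matching code above finished in 20 seconds on my MBP.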
use strict;
use warnings;

# Returns a random string of the given length built from the given characters
sub rndStr { join '', @_[ map { rand @_ } 1 .. shift ] }

open(my $f1, '>', 'file1') or die "Cannot write file1: $!";
open(my $f2, '>', 'file2') or die "Cannot write file2: $!";
for (1 .. 1000000) {
    my $start = rndStr(30, 'A' .. 'Z');
    my $end   = rndStr(30, 'A' .. 'Z');
    print $f1 "Startpoint: $start\n";
    print $f1 "Endpoint: $end\n";
    print $f2 "Startpoint: $start\n";
    print $f2 "Endpoint: $end\n";
}
close($f1);
close($f2);
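The idea from the start of this answer, driving the loop with whichever hash has fewer keys by swapping hash references, could look roughly like the sketch below. The helper name `common_keys` and the tiny sample hashes are made up for illustration; they are not part of the code above.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the answer above): returns the sorted list
# of keys present in both hashes, looping over the smaller hash only.
sub common_keys {
    my ($h1, $h2) = @_;
    # Swap the references so that $small always points at the smaller hash
    my ($small, $large) = ($h1, $h2);
    if (scalar(keys %$h1) > scalar(keys %$h2)) {
        ($small, $large) = ($h2, $h1);
    }
    my @common = grep { exists $large->{$_} } keys %$small;
    my @sorted = sort @common;
    return @sorted;
}

# Made-up sample data, keyed the same way as in makeHash ("start::end")
my %file1 = ('a::b' => 1, 'c::d' => 2);
my %file2 = ('a::b' => 7, 'e::f' => 8, 'g::h' => 9);
print "$_\n" for common_keys(\%file1, \%file2);
```

Because only the references are swapped, the loop body is written once, whichever file turns out to have fewer pairs.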
Answer 1 (score: -1)
If I understand what your code is trying to do, it looks like this would be more efficient:
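# Assumes @file1 and @file2 are filled as in the question, i.e. each
# element of @file1 is "Startpoint line\nEndpoint line"
my %total;
@total{@file1} = (0 .. $#file1);    # pair => index in @file1
my %found;
# Join each line of @file2 with the line after it and look the pair up
for my $count (0 .. $#file2 - 1) {
    my $split = "$file2[$count]\n$file2[$count + 1]";
    next unless exists $total{$split};
    print "Pair($total{$split}):Match Pair\n" unless $found{$split}++;
}
for (grep { !$found{$_} } @file1) {
    print "no matching found for Path($total{$_})\n";
}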
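The question's eventual goal is to print information1 for each matched pair. Below is a minimal sketch of how that might be folded into the same single pass, reading file2 line by line instead of loading it into an array. It rests on assumptions that go beyond the question: the helper name `find_pairs` is made up, each pair is stored as "Startpoint line\nEndpoint line", and information1 is taken to be the first line starting with `information1:` that follows a matched Endpoint before the next Startpoint line.

```perl
use strict;
use warnings;

# Hypothetical helper: $pairs is an array ref of "Startpoint...\nEndpoint..."
# strings and $fh is a filehandle for file2. Returns a hash that maps the
# index of each matched pair to the first "information1:" line seen after it
# (an empty string if no such line was seen).
sub find_pairs {
    my ($pairs, $fh) = @_;
    my %index_of;
    $index_of{ $pairs->[$_] } = $_ for 0 .. $#$pairs;

    my (%info, $prev, $current);
    while (my $line = <$fh>) {
        chomp $line;
        if (defined $prev && exists $index_of{"$prev\n$line"}) {
            # The previous line and this line together form a known pair
            $current = $index_of{"$prev\n$line"};
            $info{$current} = '' unless exists $info{$current};
        }
        elsif ($line =~ /^Startpoint:/) {
            $current = undef;    # a new block starts, stop collecting
        }
        elsif (defined $current
            && $info{$current} eq ''
            && $line =~ /^information1:/) {
            $info{$current} = $line;
        }
        $prev = $line;
    }
    return %info;
}

# Made-up sample data; an in-memory filehandle stands in for file2
my @file1 = ("Startpoint: /a (positive-triggered)\nEndpoint: /b (positive-triggered)");
my $report = join "\n",
    'Line with other stuff',
    'Startpoint: /a (positive-triggered)',
    'Endpoint: /b (positive-triggered)',
    'information1: example',
    'information2: example', '';
open(my $fh, '<', \$report) or die "Cannot open in-memory file: $!";
my %info = find_pairs(\@file1, $fh);
close($fh);
for my $i (0 .. $#file1) {
    if (exists $info{$i}) {
        print "Pair($i):Match Pair $info{$i}\n";
    }
    else {
        print "no matching found for Path($i)\n";
    }
}
```

Each line of file2 costs one hash lookup, so the work is a single pass over the 800k lines rather than one pass per pair.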