Perl regex to capture everything between two identical tags across multiple lines

Time: 2016-09-08 18:13:51

Tags: regex perl

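I have many tab-delimited text files, and I need to capture everything between occurrences of the same word throughout each file. The input looks like this: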

H string
H string
H string
SCAN 00001 00001
I string
I string
432.203 194090 0
SCAN 00002 00002

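The same pattern repeats (a few lines starting with I, then the numbers I need to capture). The scans are sorted from lowest to highest and are unique. Between two "SCAN" lines there are only numbers, split into 3 space-separated columns, and I need to extract the first and second numbers. There are about two to three thousand lines of 3 numbers between two scans.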

I'm no regex god, but this is what I'm trying:

while (<$fh_2>)
{
chomp;
next if (/^H/);

my $sc;

if (/(^S.+[\d]+)/../^S.+[\d]+/ms) #while we are between two ^S
{
my @sc_line= split /\s/, $1; #capture the scan number
$sc= pop @sc_line;
if (/(^[\d]+\.?[\d]*)/) # if there are numbers (m) at the start 
   {
    my @lines = split /\s/, $_;
    push @ms, $1; #capture the first number
    push @int, $lines[1]; #capture the second number (i)
    $m{$sc} = [@ms]; #create hash of array
    $in{$sc}= [@int];
   }
}

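The main problem is that I want everything after a particular scan to be associated with that scan somehow, but because the start and end patterns are the same, I find it hard to write.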

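The output must be a hash of arrays or a multidimensional hash in which, for each scan, I can associate each first number (m) with its second number (i). It can be two separate hashes or a single one; that doesn't matter as long as I can retrieve the arrays from the scan number.

Edit: I solved it another way: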

while (<$fh_2>)
{
chomp;

next if (/^H/);

if (/^S/) 
{ 
my @sc_line= split /\s/, $_;
my $sc_= pop @sc_line;
push @sc, $sc_;
push @count, scalar @int;
}
    elsif (/(^[\d]+\.?[\d]*)/)
    {
    my @lines = split /\s/, $_;
    push @ms, $1;
    push @int, $lines[1];
    }

}
close $fh_2;
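Each time @sc gets a new element, I record the number of elements currently in @int (or @ms) and use it as an index; it was silly of me not to think of this at first. I'm still interested to see whether there is any TIMTOWTDI magic to be had.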
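The bookkeeping described above can be sketched as follows. Each entry of @count records how many values had been read when a scan started, so scan number i owns the elements from $count[i] up to just before $count[i+1], or up to the end of the array for the last scan. The hash names %m and %in are borrowed from the first attempt, and the array contents are made-up sample values standing in for what the loop would have collected; this is an illustration under those assumptions, not the code actually used.

```perl
use strict;
use warnings;

# Hypothetical contents of the arrays after the read loop has finished
my @sc    = ('00001', '00002');                 # scan numbers, in file order
my @count = (0, 2);                             # size of @int when each scan began
my @ms    = ('432.203', '221.332', '521.193');  # first column of every data line
my @int   = ('194090', '983451', '182233');     # second column of every data line

my (%m, %in);
for my $i (0 .. $#sc) {
    my $first = $count[$i];
    # The next scan's offset marks the end of this one; the last scan runs to the end
    my $last  = $i < $#sc ? $count[$i + 1] - 1 : $#int;
    $m{ $sc[$i] }  = [ @ms[$first .. $last] ];
    $in{ $sc[$i] } = [ @int[$first .. $last] ];
}

print "$_: @{ $m{$_} } / @{ $in{$_} }\n" for sort keys %m;
```

A scan without any data lines ends up with an empty array, because the range `$first .. $last` is empty when `$last` is smaller than `$first`.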

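2 Answers:

Answer 0: (score: 0)

Try the following. I hope it provides a solution that at least comes close to what you need. Feel free to suggest changes (or modify it yourself) so that it fits your requirements exactly.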

Here is the sample data:

H string
H string
H string
SCAN 00001 00001
I string
I string
432.203 194090 0
221.332 983451 0
SCAN 00002 00002
H string
H string
H string
SCAN 00001 00003
I string
I string
521.193 182233 0
522.103 171211 0
SCAN 00004 00004

Here is the script:

#!/usr/bin/perl -w
use strict;


# Store information about scans in the form of hash of hashes
my %scans=();

# The current scan number
my $scannumber="";

while (my $line=<>) {

   chomp($line);
   #print "Current Line: $line\n";

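   # \g1 requires the second number to repeat the first, so a line such as
   # "SCAN 00001 00003" is not treated as the start of a new scan
   if ($line=~m/^SCAN (\d+) (\g1)/) {
      $scannumber="$1";
      #print "New Scan: $scannumber\n";
   }
   # Data line: the dot is escaped so that it matches only a literal "."
   elsif ($line=~m/^(\d+\.\d+) (\d+)/) {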
      my ($key,$val)=("$1","$2");
      #print "$key : $val\n";
      $scans{$scannumber}{$key}=$val;
   }

}


## You are ready to print the hash of hashes now
#
for my $scannumber (sort keys %scans) {

   for my $key (sort keys %{$scans{$scannumber}}) {

      my $val=$scans{$scannumber}{$key};
      print "$scannumber : $key : $val\n";
   }
}


## You could also print the hash of hashes like this
#
use Data::Dumper;

$Data::Dumper::Terse = 1;
$Data::Dumper::Indent = 2;

print "\n\n";
print Dumper(\%scans);

Sample run of the script:

~> cat data1 | ./script.pl
00001 : 221.332 : 983451
00001 : 432.203 : 194090
00002 : 521.193 : 182233
00002 : 522.103 : 171211


{
  '00002' => {
               '522.103' => '171211',
               '521.193' => '182233'
             },
  '00001' => {
               '221.332' => '983451',
               '432.203' => '194090'
             }
}
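Two properties of this hash of hashes are worth keeping in mind: `sort keys` compares the first-column values as strings, and a first-column value that repeats within one scan overwrites the earlier entry. A minimal sketch, assuming the same %scans layout as in the script above and using made-up values, that contrasts string order with numeric order for one scan:

```perl
use strict;
use warnings;

# Made-up data in the same layout as %scans: scan number => { first => second }
my %scans = (
    '00001' => { '1000.5' => '10', '99.25' => '20', '432.203' => '30' },
);

# String sort puts "1000.5" before "99.25"; <=> compares the keys as numbers
my @by_string = sort keys %{ $scans{'00001'} };
my @by_number = sort { $a <=> $b } keys %{ $scans{'00001'} };

print "string order:  @by_string\n";
print "numeric order: @by_number\n";
```

If the original line order or repeated first-column values matter, a hash of arrays is likely the safer structure.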

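Answer 1: (score: 0)

Here is a script that builds two hashes of arrays instead. They preserve the order of the data lines within each scan number. Give it a try and see whether it meets your needs.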

#!/usr/bin/perl -w
use strict;

my %m=();
my %in=();
my $sn="";

while (my $line=<>) {
   chomp($line);

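   # \g1 requires the second number to repeat the first, so a line such as
   # "SCAN 00001 00003" does not start a new scan
   if ($line=~m/^SCAN (\d+) (\g1)/) {
     $sn="$1";
   }
   # Data line: the dot is escaped so that it matches only a literal "."
   elsif ($line=~m/^(\d+\.\d+) (\d+)/) {
     push(@{$m{$sn}},$1);
     push(@{$in{$sn}},$2);
   }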
}


## Dump both hashes of arrays with Data::Dumper
#
use Data::Dumper;
$Data::Dumper::Terse = 1;
$Data::Dumper::Indent = 2;
$Data::Dumper::Sortkeys = 1;

print "Here is the dump of results:\n";
print "m = ".(Dumper(\%m))."\nin = ".(Dumper(\%in));


## Printing manually
#
print "\nHere is how you can print them manually:\n";
for my $sn (sort keys %m) {
   for my $i (0..$#{$m{$sn}}) {
      print "ScanNumber<$sn> First<$m{$sn}[$i]> Second<$in{$sn}[$i]>\n";
   }
}

Here is the sample data:

H string
H string
H string
SCAN 00001 00001
I string
I string
100.100 100000 0
200.200 200000 0
SCAN 00002 00002
H string
H string
H string
300.300 300000 0
400.400 400000 0
500.500 500000 0
600.600 600000 0
700.700 700000 0
800.800 800000 0
900.900 900000 0
SCAN 00001 00003

Here is the output of running the command ./script.pl < data:

Here is the dump of results:
m = {
  '00001' => [
               '100.100',
               '200.200'
             ],
  '00002' => [
               '300.300',
               '400.400',
               '500.500',
               '600.600',
               '700.700',
               '800.800',
               '900.900'
             ]
}

in = {
  '00001' => [
               '100000',
               '200000'
             ],
  '00002' => [
               '300000',
               '400000',
               '500000',
               '600000',
               '700000',
               '800000',
               '900000'
             ]
}

Here is how you can print them manually:
ScanNumber<00001> First<100.100> Second<100000>
ScanNumber<00001> First<200.200> Second<200000>
ScanNumber<00002> First<300.300> Second<300000>
ScanNumber<00002> First<400.400> Second<400000>
ScanNumber<00002> First<500.500> Second<500000>
ScanNumber<00002> First<600.600> Second<600000>
ScanNumber<00002> First<700.700> Second<700000>
ScanNumber<00002> First<800.800> Second<800000>
ScanNumber<00002> First<900.900> Second<900000>
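Since the question asks for a way to retrieve the arrays from a scan number, here is a small sketch of such a lookup on top of %m and %in. The hash contents are an excerpt copied from the dump above, and the helper name values_for_scan is hypothetical, introduced only for this illustration.

```perl
use strict;
use warnings;

# A small excerpt of the two hashes of arrays, copied from the dump above
my %m  = ('00001' => ['100.100', '200.200']);
my %in = ('00001' => ['100000', '200000']);

# Hypothetical helper: returns the [m, i] pairs of one scan, in file order
sub values_for_scan {
    my ($scan) = @_;
    my $m_ref  = $m{$scan}  || [];   # an unknown scan yields an empty list
    my $in_ref = $in{$scan} || [];
    return map { [ $m_ref->[$_], $in_ref->[$_] ] } 0 .. $#{$m_ref};
}

for my $pair (values_for_scan('00001')) {
    print "m=$pair->[0] i=$pair->[1]\n";
}
```

Because the two arrays are filled in lockstep, the same index in %m and %in always refers to the same data line.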