Perl: how to find the position of a capture

Time: 2012-07-16 09:47:54

Tags: regex perl

I have a space-delimited file like this:

 First        Second        Third       Forth
 It               is        possible    to   
 do             this                    task
 with          regex        but         i
 don't          know        how         to 

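My task is to capture all the words on each line and build a hash from them.

But here is my problem: a field in any column may be empty (for example line 3, third field).

The words on each line are aligned with either the start or the end of their column. (The column names are the words on the first line, i.e. First, Second, Third, Forth.)

In my example, the words in the First, Third and Forth columns are left-aligned (with the start of the column name), and the words in the Second column are right-aligned (with the end of the column name).

Using the hash for each line, I have to produce output in the following format: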

$hash{First} has Second-property $hash{Second}. It also has $hash{Third} and $hash{Forth}.

use File::Basename;
use locale;
open my $file, "<", $ARGV[0];
open my $file2,">>",fileparse($ARGV[0])."2.txt";
my @alls = <$file>;

sub Main{
my $first = shift @alls;
my $poses = First_And_Last($first);
my $curr_poses;
my $curr_hash;
#do{OutputLine($_->[0],$_->[1],$first)}for (@$poses);
my $result_array=[];
my @keys = qw(# Variable Type Len Format Informat Label);
for $word(@alls){
    $curr_poses=First_And_Last($word);
    undef ($curr_hash);
    $curr_hash = Take_Words($poses, $word, $curr_poses);
    push @{$result_array},$curr_hash; #AoH  
    }

#end of main
}

sub First_And_Last{
    #First_And_Last($str)
    my $str = shift;    
    my $begin;
    my $end;
    my $ref=[];
    while ($str=~m/(([\S\.]\s?)+\b|#)/g){       
        $begin = pos($str) - length($1);
        $end = pos($str);       
        push @{$ref},[$begin,$end];
        }               
    return $ref;
    }

sub Take_Words{
    #Take_Words($poses, $line,$current) 
    my $outref = {};
    my $ref = shift; #take the ref of offsets of words
    my $line = shift;# and the next line in file
    my $current = shift; # and this is the poses of current line
    my @keys = qw(# Variable Type Len Format Informat Label);
    do{$outref->{$_}=undef;}for(@keys);
    my $ethalon; #for $ref
    my $relativity; #for $current
    my $key; #for key in $outref
    my @ethalon = @{$ref};

    $ethalon = shift @ethalon;
    $relativity = shift @{$current};
    $key = shift @keys;

    while (defined($key) && defined($relativity)){
        if ($ethalon->[0] == $relativity->[0] || $ethalon->[1] == $relativity->[1]){    
                $outref->{$key} = substr($line, $relativity->[0],$relativity->[1] - $relativity->[0]);          

                $relativity = shift @{$current};
            }
            $ethalon = shift @ethalon;
            $key = shift @keys;         
        }


    return $outref;
    }

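1 answer:

Answer 0 (score: 2)

Here is my algorithm, although it is somewhat C-ish:

  1. Determine the starting position of each column header and store it.

  2. For each column: go to the header's starting position.

  3. Go left until you hit two consecutive spaces.

  4. Go right two characters and remember the position.

  5. Go right until you hit two consecutive spaces.

  6. Go left two characters and remember this position.

  7. Extract everything between the boundaries found.

  8. Strip leading and trailing whitespace.

  9. Store it in the hash.

  10. Repeat from step 2.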
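Step 1 of the list is essentially the question in the title: where in the string did a match land? Perl exposes this through the built-in arrays `@-` and `@+`, which hold the start and end offsets of the last successful match and of each capture group. The following is only a minimal sketch, assuming header names contain no whitespace; `header_spans` is a hypothetical helper name, not something from the question or the answer.

```perl
use strict;
use warnings;

# Hypothetical helper: return a [name, start, end] triple for every
# whitespace-delimited word in a header line (end is exclusive).
sub header_spans {
    my ($header) = @_;
    my @spans;
    while ($header =~ /(\S+)/g) {
        # $-[1] is the offset where capture group 1 starts,
        # $+[1] is the offset just past its last character.
        push @spans, [ $1, $-[1], $+[1] ];
    }
    return @spans;
}

# Header line from the question's sample file.
my $header = ' First        Second        Third       Forth';
printf "%-6s start=%2d end=%2d\n", @$_ for header_spans($header);
```

Keeping the end offset as well as the start is useful here because the question says some columns are aligned with the end of the column name rather than its start.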

Now we have to look at the implementation:

Step 1:

    my @starting;
    {
      my @char = split m{}, <$file>; # split the first line into char array
      my $spacecount = 0;
      my $state = 1; # 1 : find start -- 0 : find end
      for (my $i = 0; $i < @char; $i++) {
        if ($state) { # find next non-space
          if ($char[$i] =~ /\s/) {
            next;
          } else {
            $state = not $state; # flip
            $spacecount = 0;
            push @starting, $i;
            next;
          }
        } else {
          if ($char[$i] =~ /\s/) {
            $spacecount++;
            if ($spacecount >= 2) {
              $state = not $state; # flip
              next;
            }
          } else {
            $spacecount = 0; # reset consecutive space counter
            next;
          }
        }
      }
    }
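For orientation, steps 2 to 10 could be sketched roughly as follows. This is not the answer author's code, and it deviates from the list in two ways: it finds the double-space boundaries with `index` and `rindex` instead of a character loop, and it probes each column at both the first and the last offset of its header so that right-aligned cells are found too. The names `cell_at` and `parse_row`, the hard-coded `@spans` offsets (which match the sample header, with exclusive end offsets), and the `sprintf` layout used to build the sample rows are all assumptions made for illustration. The sketch also assumes that cells are separated by at least two spaces and that no cell strays under a neighbouring header.

```perl
use strict;
use warnings;

# Return the trimmed cell that covers offset $pos in $line, or nothing
# if $pos falls on whitespace or past the end of the line.
sub cell_at {
    my ($line, $pos) = @_;
    return if $pos >= length $line or substr($line, $pos, 1) =~ /\s/;
    # Steps 3-4: the last double space at or before $pos ends the previous cell.
    my $gap  = rindex $line, '  ', $pos;
    my $left = $gap < 0 ? 0 : $gap + 2;
    # Steps 5-6: the next double space after $pos ends this cell.
    my $right = index $line, '  ', $pos;
    $right = length $line if $right < 0;
    # Steps 7-8: extract, then strip leading and trailing whitespace.
    my $cell = substr $line, $left, $right - $left;
    $cell =~ s/^\s+|\s+$//g;
    return $cell;
}

# Steps 2, 9 and 10: build one hash per line; empty cells stay undef.
sub parse_row {
    my ($line, @spans) = @_;
    my %row;
    for my $span (@spans) {
        my ($name, $start, $end) = @$span;
        $row{$name} = undef;
        # Probe the header's first and last offset: a left-aligned cell
        # covers the first, a right-aligned cell covers the last.
        for my $probe ($start, $end - 1) {
            my $cell = cell_at($line, $probe);
            next unless defined $cell;
            $row{$name} = $cell;
            last;
        }
    }
    return \%row;
}

# Assumed header offsets for the sample file: [name, start, end).
my @spans = ( [ First => 1, 6 ], [ Second => 14, 20 ],
              [ Third => 28, 33 ], [ Forth => 40, 45 ] );

# Lay two sample rows out like the file: Second is right-aligned,
# the other columns are left-aligned.
my $fmt   = " %-13s%6s        %-12s%s";
my @lines = map { sprintf $fmt, @$_ }
    [ 'It', 'is', 'possible', 'to' ],
    [ 'do', 'this', '',       'task' ];

for my $line (@lines) {
    my $row = parse_row($line, @spans);
    print join('|', map { $row->{ $_->[0] } // '' } @spans), "\n";
}
```

Because a cell is delimited by double spaces rather than by any whitespace, a value containing single spaces stays in one piece, which is the same convention the header scan above relies on.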