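Search for and mark paired patterns on a line

Time: 2012-03-12 15:19:38

Tags: ruby perl bash python-2.7

I need to search for and mark pairs of patterns that are separated somewhere on a line. Here is a short sample list of patterns, which is kept in a separate file: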

CAT,TREE
LION,FOREST
OWL,WATERFALL

A match occurs when the item in column 2 appears after the item in column 1 on the same line. For example:

THEREISACATINTHETREE. (matches)

If the item in column 2 appears first on the line, there is no match. For example:

THETREEHASACAT. (does not match)

Also, if the items in column 1 and column 2 touch, there is no match. For example:

THECATTREEHASMANYBIRDS. (does not match)
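
The three rules above can be sketched as a small Python predicate. This is only one reading of the rules, and the helper name `has_match` is made up for illustration: take the earliest column 1 item, then look for a column 2 item that starts at least one character after it ends.

```python
def has_match(line, first, second):
    """Return True if `second` starts at least one character after some `first` ends."""
    start = line.find(first)  # earliest occurrence of the column 1 item
    if start == -1:
        return False
    # Searching from end-of-first + 1 rules out both "touching" and "second comes first".
    return line.find(second, start + len(first) + 1) != -1


print(has_match("THEREISACATINTHETREE.", "CAT", "TREE"))    # True
print(has_match("THETREEHASACAT.", "CAT", "TREE"))          # False
print(has_match("THECATTREEHASMANYBIRDS.", "CAT", "TREE"))  # False
```

Using the earliest column 1 item is enough here, because any column 2 item that qualifies for a later column 1 item also starts after the earliest one ends.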

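Once a match is found, I need to mark it with \start{n} (placed after the column 1 item) and \end{n} (placed before the column 2 item), where n is a simple counter that increases by one with each match. E.g.: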

THEREISACAT\start{1}INTHE\end{1}TREE.

Here is a more complex example:

THECATANDLIONLEFTTHEFORESTANDMETANDOWLINTREENEARTHEWATERFALL.

This becomes:

THECAT\start{1}ANDLION\start{2}LEFTTHE\end{2}FORESTANDMETANDOWL\start{3}IN\end{1}TREENEARTHE\end{3}WATERFALL.

Sometimes there are multiple matches at the same place:

 THECATDOESNOTLIKETALLTREES,BUTINSTEADLIKESSHORTTREES.

This becomes:

 THECAT\start{1}\start{2}DOESNOTLIKETALL\end{1}TREES,BUTINSTEADLIKESSHORT\end{2}TREES.
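  • There are no spaces in the file.
  • Many non-Latin characters appear in the file.
  • Pattern matches only need to be found on the same line (e.g. "CAT" on line 1 does not match "TREE" on line 2, because they are on different lines).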
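
A minimal Python sketch of reading the pattern list and treating every line on its own. The function name `load_patterns` and the file name in the comment are assumptions, and `io.StringIO` stands in for real files; the question does not say which encoding the files use, so UTF-8 would need to be confirmed.

```python
import io


def load_patterns(handle):
    """Parse lines such as 'CAT,TREE' into (first, second) tuples, skipping blank lines."""
    pairs = []
    for raw in handle:
        raw = raw.strip()
        if not raw:
            continue
        first, second = raw.split(",", 1)
        pairs.append((first, second))
    return pairs


# io.StringIO stands in for open("patterns.txt", encoding="utf-8"); that name is hypothetical.
patterns = load_patterns(io.StringIO("CAT,TREE\nLION,FOREST\nOWL,WATERFALL\n"))
print(patterns)  # [('CAT', 'TREE'), ('LION', 'FOREST'), ('OWL', 'WATERFALL')]

text = io.StringIO("THECAT\nTREE\n")
# Each line is examined separately, so CAT on line 1 never pairs with TREE on line 2.
for line in text:
    line = line.rstrip("\n")
    candidates = [(a, b) for a, b in patterns if a in line and b in line]
    print(line, candidates)
```

Python 3 strings are Unicode, so non-Latin characters need no special handling once the file is decoded correctly.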

How can I find these matches and mark them this way?

7 answers:

Answer 0 (score: 6)

Here is one way to do it in Perl:

#!/usr/bin/perl
use strict;
use warnings;
use 5.010;

# couples of patterns to search for
my @patterns = (
    ['CAT', 'TREE'],
    ['LION', 'FOREST'],
    ['OWL', 'WATERFALL'],
);

# loop over all sentences
while (my $line = <DATA>) {
    chomp $line;    #remove linefeed
    my $count = 1;  #counter of start/end
    foreach my $pats (@patterns) {
        #$p1=first pattern, $p2=second
        my ($p1, $p2) = @$pats;

        #split on patterns, keep them, remove empty strings
        #(\Q...\E quotes regex metacharacters; length keeps a literal "0")
        my @s = grep { length } split /(\Q$p1\E|\Q$p2\E)/, $line;

        #$start=position where to put the \start
        #$end=position where to put the \end
        my ($start, $end) = (undef, undef);

        #loop on all elements given by split
        for my $i (0 .. $#s) {
            # current element
            my $cur = $s[$i];

            #if = first pattern, keep its position in the array
            if ($cur eq $p1) {
                $start = $i;
            }

            #if = second pattern, keep its position in the array
            if ($cur eq $p2) {
                $end = $i;
            }

            #if both are defined and second pattern after first pattern
            # insert \start and \end
            if (defined($start) && defined($end) && $end > $start + 1) {
                $s[$start] .= "\\start{$count}";
                $s[$end] = "\\end{$count}" . $s[$end];
                undef $end;
                $count++;
            }
        }
        # recompose the line
        $line = join '', @s;
    }
    say $line;
}

__DATA__
THETREEHASACAT. (does not match)
THECATTREEHASMANYBIRDS. (does not match)
THEREISACATINTHETREE.
THECATANDLIONLEFTTHEFORESTANDMETANDOWLINATREENEARTHEWATERFALL.
THECATDOESNOTLIKETALLTREES,BUTINSTEADLIKESSHORTTREES.
CAT...TREE...CAT...TREE

Output:

THETREEHASACAT. (does not match)
THECATTREEHASMANYBIRDS. (does not match)
THEREISACAT\start{1}INTHE\end{1}TREE.
THECAT\start{1}ANDLION\start{2}LEFTTHE\end{2}FORESTANDMETANDOWL\start{3}INA\end{1}TREENEARTHE\end{3}WATERFALL.
THECAT\start{1}\start{2}DOESNOTLIKETALL\end{1}TREES,BUTINSTEADLIKESSHORT\end{2}TREES.
CAT\start{1}...\end{1}TREE...CAT\start{2}...\end{2}TREE
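
The Perl code above relies on splitting with a capturing group, so that the patterns themselves stay in the resulting list. As a rough illustration of that one step only, Python's `re.split` behaves the same way; the helper name `tokenize` is made up here.

```python
import re


def tokenize(line, first, second):
    """Split `line` on either pattern, keeping the patterns as separate tokens."""
    # The capturing group makes re.split return the delimiters too;
    # re.escape guards against regex metacharacters in the patterns.
    pattern = "(" + re.escape(first) + "|" + re.escape(second) + ")"
    return [tok for tok in re.split(pattern, line) if tok != ""]


print(tokenize("THEREISACATINTHETREE.", "CAT", "TREE"))
# ['THEREISA', 'CAT', 'INTHE', 'TREE', '.']
print(tokenize("THECATTREEHASMANYBIRDS.", "CAT", "TREE"))
# ['THE', 'CAT', 'TREE', 'HASMANYBIRDS.']
```

In the second call the two patterns end up as neighbouring tokens, which is the touching case that the `$end > $start + 1` test in the Perl code rejects.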

Answer 1 (score: 5)

Check this out (Ruby):

#!/usr/bin/env ruby
patterns = [
  ['CAT', 'TREE'],
  ['LION', 'FOREST'],
  ['OWL', 'WATERFALL']
]

lines = [
  'THEREISACATINTHETREE.',
  'THETREEHASACAT.',
  'THECATTREEHASMANYBIRDS.',
  'THECATANDLIONLEFTTHEFORESTANDMETANDOWLINTREENEARTHEWATERFALL.',
  'THECATDOESNOTLIKETALLTREES,BUTINSTEADLIKESSHORTTREES.',
  'CAT...TREE...CAT...TREE'
]

lines.each do |line|
  puts line
  matches = Hash.new{|h,e| h[e] = [] }
  match_indices = []
  patterns.each do |first,second|
    offset = 0
    while new_offset = line.index(first,offset) do
      # map second element of the pattern to minimal position it might be matched
      matches[second] << new_offset + first.size + 1
      offset = new_offset + 1
    end
  end
  global_counter = 1
  matches.each do |second,offsets|
    offsets.each do |offset|
      second_offset = offset
      while new_offset = line.index(second,second_offset) do
        # register the end index of the first pattern and 
        # the start index of the second pattern with the global match count
        match_indices << [offset-1,new_offset,global_counter]
        second_offset = new_offset + 1
        global_counter += 1
      end
    end
  end
  indices = Hash.new{|h,e| h[e] = ""}
  match_indices.each do |first,second,global_counter|
    # build the insertion string for the string positions the 
    # start and end tags should be placed in
    indices[first] << "\\start{#{global_counter}}"
    indices[second] << "\\end{#{global_counter}}"
  end
  inserted_length = 0
  indices.sort_by{|k,v| k}.each do |position,insert|
    # insert the tags at their positions
    line.insert(position + inserted_length,insert)
    inserted_length += insert.size
  end
  puts line
end

Result:

THEREISACATINTHETREE.
THEREISACAT\start{1}INTHE\end{1}TREE.
THETREEHASACAT.
THETREEHASACAT.
THECATTREEHASMANYBIRDS.
THECATTREEHASMANYBIRDS.
THECATANDLIONLEFTTHEFORESTANDMETANDOWLINTREENEARTHEWATERFALL.
THECAT\start{1}ANDLION\start{2}LEFTTHE\end{2}FORESTANDMETANDOWL\start{3}IN\end{1}TREENEARTHE\end{3}WATERFALL.
THECATDOESNOTLIKETALLTREES,BUTINSTEADLIKESSHORTTREES.
THECAT\start{1}\start{2}DOESNOTLIKETALL\end{1}TREES,BUTINSTEADLIKESSHORT\end{2}TREES.
CAT...TREE...CAT...TREE
CAT\start{1}\start{2}...\end{1}TREE...CAT\start{3}...\end{2}\end{3}TREE

Edit

I added some comments and clarified some variable names.
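
The `line.index(first, offset)` loops in this answer advance by one character after each hit, so every occurrence is collected. A minimal Python sketch of the same scanning idea, with the hypothetical helper name `find_all`:

```python
def find_all(line, needle):
    """Return the start index of every occurrence of `needle`, overlaps included."""
    positions = []
    offset = 0
    while True:
        found = line.find(needle, offset)
        if found == -1:
            return positions
        positions.append(found)
        offset = found + 1  # step by one so overlapping occurrences are not skipped


print(find_all("CAT...TREE...CAT...TREE", "CAT"))   # [0, 13]
print(find_all("CAT...TREE...CAT...TREE", "TREE"))  # [6, 19]
print(find_all("AAAA", "AA"))                       # [0, 1, 2]
```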

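Answer 2 (score: 2)

First you have to find all occurrences of the start and end strings from your patterns. Then you need to work out which ones belong together (a pair does not qualify if the end string comes before the start string, or sits at the same position and therefore touches it). Then you can generate your tags and insert them into the output string. Note that you need to add the number of characters already inserted to each position, because the length of the string changes as tags are inserted. You also have to sort the tags by position before inserting them; otherwise it gets very complicated to work out how far each position must be shifted. Here is a short example in Ruby: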

patterns = [['CAT','TREE'], ['LION','FOREST'], ['OWL','WATERFALL']]
strings = ['THEREISACATINTHETREE.', 'THETREEHASACAT.', 'THECATTREEHASMANYBIRDS.', 'THECATANDLIONLEFTTHEFORESTANDMETANDOWLINTREENEARTHEWATERFALL.', 'THECATDOESNOTLIKETALLTREES,BUTINSTEADLIKESSHORTTREES.', 'ACATONATREEANDANOTHERCATONANOTHERTREE.', 'ACATONATREEBUTNOCATTREE.']

strings.each do |string|
  matches = {}; tags = []
  counter = shift = 0
  output = string.dup

  patterns.each do |sstr,estr|                # loop through all patterns
    posa = []; posb = [];                     #
    string.scan(sstr){posa << $~.end(0)}      # remember found positions and
    string.scan(estr){posb << $~.begin(0)}    # find all valid combinations (next line)
    matches[[sstr,estr]] = posa.product(posb).reject{|s,e|s>=e}
  end

  matches.each do |pat,pos|                   # loop through all matches
    pos.each do |s,e|                         # 
      tags << [s,"\\start{#{counter += 1}}"]  # generate and remember \start{}
      tags << [e,"\\end{#{counter}}"]         # and \end{} tags
    end
  end

  tags.sort.each do |pos,tag|                 # sort and loop through tags
    output.insert(pos+shift,tag)              # insert tag and increment
    shift += tag.chars.count                  # shift by num. of inserted chars
  end

  puts string, output                         # print result
end

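It is not pretty, but it meets all of your requirements. I think the next example is more readable and reusable; it is implemented as a Ruby class with corresponding unit tests to make sure it works: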

require 'English' # provides $LAST_MATCH_INFO; the library file name is capitalized

class PatternMarker

  attr_reader :input, :output, :matches

  def initialize patterns
    @patterns = patterns
    raise ArgumentError, 'no patterns given' unless @patterns.any?
    @patterns.each do |p|
      raise ArgumentError, 'every pattern must have exactly two strings' unless p.count == 2
    end
  end

  def parse input
    @input = input.dup
    match_patterns
    generate_output
    self
  end

  def match?
    @matches.any?
  end

private

  def match_patterns
    @matches = {}
    @patterns.each do |start_str,end_str|
      pos = { :start => [], :end => [] }
      @input.scan(start_str){ pos[:start] << $LAST_MATCH_INFO.end(0)   }
      @input.scan(end_str  ){ pos[:end]   << $LAST_MATCH_INFO.begin(0) }
      @matches[[start_str,end_str]] = pos[:start].product(pos[:end])
      @matches[[start_str,end_str]].reject!{ |s,e| e <= s }
      @matches.reject!{ |p,pos| pos.none? }
    end
  end

  def generate_output
    tags = []
    counter = shift = 0
    @output = @input.dup

    @matches.each do |pattern,positions|
      positions.each do |s,e|
        counter += 1
        tags << [s, "\\start{#{counter}}"]
        tags << [e, "\\end{#{counter}}"  ]
      end
    end

    tags.sort!.each do |position,tag|
      @output.insert(position+shift,tag)
      shift += tag.chars.count
    end
  end
end

In action:

patterns = [
  ['CAT' , 'TREE'     ],
  ['LION', 'FOREST'   ],
  ['OWL' , 'WATERFALL']
]

strings = [
  'THEREISACATINTHETREE.',
  'THETREEHASACAT.',
  'THECATTREEHASMANYBIRDS.',
  'THECATANDLIONLEFTTHEFORESTANDMETANDOWLINTREENEARTHEWATERFALL.',
  'THECATDOESNOTLIKETALLTREES,BUTINSTEADLIKESSHORTTREES.',
  'ACATONATREEANDANOTHERCATONANOTHERTREE.',
  'ACATONATREEBUTNOCATTREE.'
]

marker = PatternMarker.new(patterns)

strings.each do |string|
  marker.parse(string)

  puts "input: #{marker.input}"

  if marker.match?
    puts "output: #{marker.output}"
  else
    puts "(does not match)"
  end
  puts
end

Output:

input: THEREISACATINTHETREE.
output: THEREISACAT\start{1}INTHE\end{1}TREE.

input: THETREEHASACAT.
(does not match)

input: THECATTREEHASMANYBIRDS.
(does not match)

input: THECATANDLIONLEFTTHEFORESTANDMETANDOWLINTREENEARTHEWATERFALL.
output: THECAT\start{1}ANDLION\start{2}LEFTTHE\end{2}FORESTANDMETANDOWL\start{3}IN\end{1}TREENEARTHE\end{3}WATERFALL.

input: THECATDOESNOTLIKETALLTREES,BUTINSTEADLIKESSHORTTREES.
output: THECAT\start{1}\start{2}DOESNOTLIKETALL\end{1}TREES,BUTINSTEADLIKESSHORT\end{2}TREES.

input: ACATONATREEANDANOTHERCATONANOTHERTREE.
output: ACAT\start{1}\start{2}ONA\end{1}TREEANDANOTHERCAT\start{3}ONANOTHER\end{2}\end{3}TREE.

input: ACATONATREEBUTNOCATTREE.
output: ACAT\start{1}\start{2}ONA\end{1}TREEBUTNOCAT\end{2}TREE.

Tests:

require 'test/unit'

class TestPatternMarker < Test::Unit::TestCase
  def setup
    @patterns = [
      ['CAT' , 'TREE'     ],
      ['LION', 'FOREST'   ],
      ['OWL' , 'WATERFALL']
    ]

    @marker = PatternMarker.new(@patterns)
  end

  def test_should_parse_simple
    @marker.parse 'THEREISACATINTHETREE.'
    assert @marker.match?
    assert_equal 'THEREISACAT\start{1}INTHE\end{1}TREE.', @marker.output
  end

  def test_should_parse_reverse
    @marker.parse 'THETREEHASACAT.'
    assert !@marker.match?
    assert_equal @marker.input, @marker.output
  end

  def test_should_parse_touching
    @marker.parse 'THECATTREEHASMANYBIRDS.'
    assert !@marker.match?
    assert_equal @marker.input, @marker.output
  end

  def test_should_parse_multiple_patterns
    @marker.parse 'THECATANDLIONLEFTTHEFORESTANDMETANDOWLINATREENEARTHEWATERFALL.'
    assert @marker.match?
    assert_equal 'THECAT\start{1}ANDLION\start{2}LEFTTHE\end{2}FORESTANDMETANDOWL\start{3}INA\end{1}TREENEARTHE\end{3}WATERFALL.', @marker.output
  end

  def test_should_mark_multiple_matches_at_same_place
    @marker.parse 'THECATDOESNOTLIKETALLTREES,BUTINSTEADLIKESSHORTTREES.'
    assert @marker.match?
    assert_equal 'THECAT\start{1}\start{2}DOESNOTLIKETALL\end{1}TREES,BUTINSTEADLIKESSHORT\end{2}TREES.', @marker.output
  end

  def test_should_mark_all_possible_matches
    @marker.parse 'CATFOOTREEFOOCATFOOTREE.'
    assert @marker.match?
    assert_equal 'CAT\start{1}\start{2}FOO\end{1}TREEFOOCAT\start{3}FOO\end{2}\end{3}TREE.', @marker.output
  end

  def test_should_accept_input
    @marker.parse 'CATINTREE'
    assert @marker.match?
    assert_equal 'CATINTREE', @marker.input
    @marker.parse 'FOOBAR'
    assert !@marker.match?
    assert_equal 'FOOBAR', @marker.input
  end

  def test_should_only_accept_valid_patterns
    assert_raise ArgumentError do PatternMarker.new([])                                end
    assert_raise ArgumentError do PatternMarker.new(['FOO','BAR'])                     end
    assert_raise ArgumentError do PatternMarker.new(['FOO','BAR'],['FOO','BAR','BAZ']) end
    assert_raise ArgumentError do PatternMarker.new(['FOO','BAR'],['BAZ'])             end
    assert_nothing_raised      do PatternMarker.new([['FOO','BAR']])                   end
  end
end

Test output:

Loaded suite pattern
Started
........
Finished in 0.003910 seconds.

8 tests, 21 assertions, 0 failures, 0 errors, 0 skips

Test run options: --seed 31173

Edit: added tests and simplified some of the code.
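
The offset bookkeeping this answer describes (sort the tags by position, then shift each position by the number of characters already inserted) can also be sketched in Python. This is not a port of the class above, just the insertion step; `insert_tags` is a hypothetical name, and it takes positions that refer to the original, unmodified line.

```python
def insert_tags(line, pairs):
    """Insert \\start{n} and \\end{n} for each (start_pos, end_pos) pair, numbered from 1."""
    tags = []
    for n, (start_pos, end_pos) in enumerate(pairs, 1):
        tags.append((start_pos, n, "\\start{%d}" % n))
        tags.append((end_pos, n, "\\end{%d}" % n))

    out = line
    shift = 0  # number of characters inserted so far
    for pos, _, tag in sorted(tags):  # left to right, ties ordered by counter
        out = out[:pos + shift] + tag + out[pos + shift:]
        shift += len(tag)
    return out


# 11 is the index just after CAT, 16 is the index where TREE begins.
print(insert_tags("THEREISACATINTHETREE.", [(11, 16)]))
# THEREISACAT\start{1}INTHE\end{1}TREE.
```

An alternative would be to insert from right to left, which avoids the running `shift` entirely because earlier positions are never disturbed.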

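Answer 3 (score: 1)

This is a partial answer. It meets all of your requirements except the last one, which does not have a simple solution. I will leave that one for you to figure out :-)

I chose a rule-based approach rather than regular expressions. I have found in earlier, similar projects that simple rule-based parsers are easier to maintain, more portable, and usually faster than regular expressions. I did not use any truly Ruby-specific features here, so it should port easily to Python or Perl. It could even be ported to C without much effort.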

patterns = [
  ['CAT', 'TREE'],
  ['LION', 'FOREST'],
  ['OWL', 'WATERFALL']
]

lines = [
  'THEREISACATINTHETREE.',
  'THETREEHASACAT.',
  'THECATTREEHASMANYBIRDS.',
  'THECATANDLIONLEFTTHEFORESTANDMETANDOWLINTREENEARTHEWATERFALL.',
  'THECATDOESNOTLIKETALLTREES,BUTINSTEADLIKESSHORTTREES.'
]

newlines = []

START_TAG_LENGTH = 9
END_TAG_LENGTH = 7

lines.each do |line|

  newline = line.dup
  before = {}
  n = 1

  patterns.each do |pair|

    a = 0

    matches = [[], []]
    len = pair[0].length

    pair.each do |pattern|
      b = 0
      while (c = line.index(pattern, b))
        matches[a] << c
        b = c + 1
      end
      break if b == 0 && a > 0
      a += 1
    end

    matches[0].each_with_index do |d, f|
      bd = 0; be = 0
      e = matches[1][f]
      # skip when there is no f-th second item, it comes first, or the two touch
      next if e.nil? || (d > e) || (d + len == e)
      d = d + len
      before.each { |g, h| bd += h if g <= d }
      newline.insert(d + bd, "\\start{#{n}}")
      before[d] ||= 0
      before[d] += START_TAG_LENGTH
      before.each { |g, h| be += h if g <= e }
      newline.insert(e + be, "\\end{#{n}}")
      before[e] ||= 0
      before[e] += END_TAG_LENGTH
    end

    n += 1

  end

  newlines << newline

end

puts newlines

Output:

THEREISACAT\start{1}INTHE\end{1}TREE.
THETREEHASACAT.
THECATTREEHASMANYBIRDS.
THECAT\start{1}ANDLION\start{2}LEFTTHE\end{2}FORESTANDMETANDOWL\start{3}IN\end{1}TREENEARTHE\end{3}WATERFALL.
THECAT\start{1}DOESNOTLIKETALL\end{1}TREES,BUTINSTEADLIKESSHORTTREES.

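Note that the last one fails. Still, this should give you a good start. If you need help figuring out what some of the code does, do not hesitate to ask.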
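
As far as I can tell, the last line fails because the code pairs the f-th first item with the f-th second item (`matches[1][f]`), so a single CAT can never be paired with a second TREE. A rough Python sketch of the other pairing, in which each first item is combined with every later second item, which is how the other Ruby answers read the requirement. The name `valid_pairs` is made up, and the inputs are plain lists of start positions.

```python
def valid_pairs(first_starts, first_len, second_starts):
    """Pair each column 1 occurrence with every column 2 occurrence that begins
    at least one character after it ends."""
    pairs = []
    for start in first_starts:
        first_end = start + first_len
        for second_start in second_starts:
            if second_start > first_end:  # strictly greater: touching does not count
                pairs.append((first_end, second_start))
    return pairs


# In THECATDOESNOTLIKETALLTREES,BUTINSTEADLIKESSHORTTREES. the word CAT starts
# at 3 (length 3) and TREE starts at 21 and 47.
print(valid_pairs([3], 3, [21, 47]))  # [(6, 21), (6, 47)]
# Touching (THECATTREE...): TREE starts exactly where CAT ends, so there is no pair.
print(valid_pairs([3], 3, [6]))       # []
```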

On another note, just out of curiosity: what are you using this for?

Answer 4 (score: 1)

Here is one written entirely in bash (no external commands). Not too hard! It expects the input lines on stdin.

#!/bin/bash

words=("CAT TREE" "LION FOREST" "OWL WATERFALL")

function doit () {
  if [[ "$line" =~ (.*)$word1(.*)$word2(.*) ]]; then
    # \end{n} goes before the second word, as the question asks
    line="${BASH_REMATCH[1]}$alt_w1\\start{$count}${BASH_REMATCH[2]}\\end{$count}$word2${BASH_REMATCH[3]}"
    (( count += 1 ))
    doit
  elif [[ "$line" =~ $alt_w1 ]]; then
    line=${line//$alt_w1/$word1}
    [[ "$line" =~ (.*)$word2(.*) ]]
    line="${BASH_REMATCH[1]}$alt_w2${BASH_REMATCH[2]}"
    doit
  elif [[ "$line" =~ $alt_w2 ]]; then
    line=${line//$alt_w2/$word2}
  fi
}

while read line; do
  count=1
  for pair in "${words[@]}"; do
    word1=${pair% *}
    word2=${pair#* }
    alt_w1="${word1:0:1}XYZZYX${word1:1}"
    alt_w2="${word2:0:1}XYZZYX${word2:1}"
    doit
  done
  echo "$line"
done

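Assumptions:

  1. The text never contains "XYZZYX" (the string can be changed).
  2. The words never contain characters that are special in regular expressions.
    • e.g. . * [ ] ^ $ +
    • (It is fine for those to appear in the lines.)
  3. The words are always at least two characters long.
  4. None of the words is ever a substring of another word you are looking for.
    • e.g. cat and cattle
    • Actually, this would probably work, but the results would be confusing.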
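
Assumption 2 exists because the words are interpolated into a regular expression as they are. As a hedged illustration in Python (not a change to the Bash script), `re.escape` is the usual way to lift that kind of restriction. The helper name `pair_regex` and the sample words are made up.

```python
import re


def pair_regex(first, second):
    """Build a regex matching `first`, at least one character, then `second`,
    with both words treated as literal text."""
    return re.compile(re.escape(first) + "(.+?)" + re.escape(second))


# Hypothetical words containing regex metacharacters.
rx = pair_regex("C.T", "TR*E")
print(bool(rx.search("THEC.TINTHETR*E")))  # True: the words match literally
print(bool(rx.search("THECATINTHETREE")))  # False: '.' and '*' are not wildcards here
```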

Answer 5 (score: 1)

Here is my solution in the very popular Python.

patterns = [u'CAT,TREE', u'LION,FOREST', u'OWL,WATERFALL']

strings = [u'THEREISACATINTHETREE.',
           u'THETREEHASACAT.',
           u'THECATTREEHASMANYBIRDS.',
           u'THECATANDLIONLEFTTHEFORESTANDMETANDOWLINTREENEARTHEWATERFALL.',
           u'THECATDOESNOTLIKETALLTREES,BUTINSTEADLIKESSHORTTREES.',
           u'ACATONATREEANDANOTHERCATONANOTHERTREE.',
           u'ACATONATREEBUTNOCATTREE.' ]

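def findMatch(needles, haystack, label):
    needles = needles.split(',')
    # Split on the first occurrence only, so the rest of the line is kept.
    matches = haystack.split(needles[0], 1)

    if len(matches) > 1:
        submatches = matches[1].split(needles[1], 1)

        # submatches[0] is the text between the two needles;
        # if it is empty the needles touch, which does not count as a match.
        if len(submatches) > 1 and submatches[0]:
            return u''.join([matches[0], needles[0], u'\\start{'+label+'}', submatches[0], u'\\end{'+label+'}', needles[1], submatches[1]])

    return False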

for s in strings:
    i = 0
    res = s
    for pat in patterns:
        i = i + 1
        temp = findMatch(pat, res, str(i))

        if (temp):
            res = temp

    print ('searching in '+s+' yields '+res).encode('utf-8')

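Answer 6 (score: 0)

Here is my Perl approach. It is quick and dirty.

It would probably be much better if I used Marpa for parsing instead of regular expressions.

Anyway, it gets the job done.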

use strict;
use Test::More;
use Data::Dumper;

# patterns to search for
my @patterns = (
    'CAT,TREE',
    'LION,FOREST',
    'OWL,WATERFALL',
);
#lines
my @lines = qw(
THEREISACATINTHETREE.
THETREEHASACAT.
THECATTREEHASMANYBIRDS.
THECATANDLIONLEFTTHEFORESTANDMETANDOWLINATREENEARTHEWATERFALL.
THECATDOESNOTLIKETALLTREES,BUTINSTEADLIKESSHORTTREES.
THECATDOESNOTLIKETALLTREES,BUTINSTEADLIKESSHORTTREESORBIGTREES.
);


my @expected_output = (
'THEREISACAT\start{1}INTHE\end{1}TREE.',
'Does not Match',
'Does not Match',
'THECAT\start{1}ANDLION\start{2}LEFTTHE\end{2}FORESTANDMETANDOWL\start{3}INA\end{1}TREENEARTHE\end{3}WATERFALL.',
'THECAT\start{1}\start{2}DOESNOTLIKETALL\end{1}TREES,BUTINSTEADLIKESSHORT\end{2}TREES.',
'THECAT\start{1}\start{2}\start{3}DOESNOTLIKETALL\end{1}TREES,BUTINSTEADLIKESSHORT\end{2}TREESORBIG\end{3}TREES.',
);

#is(check_line($lines[0]),$expected_output[0]);die;

my $no=0;
for(my $i=0;$i<scalar(@lines );$i++){   
    is(check_line($lines[$i]),$expected_output[$i]);
    $no++;
}
done_testing( $no );

sub check_line{
    my $in      = shift;
    my $out = '';
    my $match = 1;
    foreach my $pattern_line (@patterns){
        my ($first,$second) = split(/,/,$pattern_line);
        #warn "$first,$second,$in\n";
        if ($in !~ m#$first.+?$second#is){
            next;
        }
        #matched    

        while ($in =~ s#($first)(.+?)($second)#$1\\start\{$match\}$2\\end\{$match\}_SECOND_#is){
            $match++;
            #warn "Found match: $match\n";
        }
        $in =~ s#_SECOND_#$second#gis;
        #$in =~ s#\\start\{(\d+)\}\\start\{(\d+)\}#\\start\{$2\}\\start\{$1\}#gis;
        my ($end,$start) = $in =~ m#\\start\{(\d+)\}(?:\\start\{(\d+)\})+#gis;

        my $stmp = join("",map {"\\start\{$_\}"} ($start..$end));
        #print Dumper($in,$start,$end,$stmp);
        $in =~ s#\\start\{($end)\}.*?\\start\{($start)\}#$stmp#is;


    }
    return 'Does not Match' if $match ==1;
    $out = $in;
    return $out;
}