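I need to search for and mark patterns that are split somewhere on a line. Below is a short list of sample patterns, which are kept in a separate file, e.g.: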
CAT,TREE
LION,FOREST
OWL,WATERFALL
A match occurs if the item in column 2 appears after the item in column 1 on the same line. For example:
THEREISACATINTHETREE. (matches)
If the item in column 2 appears first on the line, there is no match, e.g.:
THETREEHASACAT. (does not match)
Also, if the items in column 1 and column 2 touch, there is no match, e.g.:
THECATTREEHASMANYBIRDS. (does not match)
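Once a match is found, I need to mark it with \start{n} (placed after the column 1 item) and \end{n} (placed before the column 2 item), where n is a simple counter that increases with each match. E.g.: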
THEREISACAT\start{1}INTHE\end{1}TREE.
Here is a more complex example:
THECATANDLIONLEFTTHEFORESTANDMETANDOWLINATREENEARTHEWATERFALL.
This becomes:
THECAT\start{1}ANDLION\start{2}LEFTTHE\end{2}FORESTANDMETANDOWL\start{3}INA\end{1}TREENEARTHE\end{3}WATERFALL.
Sometimes there are multiple matches in the same place:
THECATDOESNOTLIKETALLTREES,BUTINSTEADLIKESSHORTTREES.
This becomes:
THECAT\start{1}\start{2}DOESNOTLIKETALL\end{1}TREES,BUTINSTEADLIKESSHORT\end{2}TREES.
How can I find these matches and mark them this way?
Answer 0 (score: 6)
Here is a Perl approach:
#!/usr/bin/perl
use strict;
use warnings;
use 5.010;

# couples of patterns to search for
my @patterns = (
    ['CAT', 'TREE'],
    ['LION', 'FOREST'],
    ['OWL', 'WATERFALL'],
);

# loop over all sentences
while (my $line = <DATA>) {
    chomp $line;    # remove linefeed
    my $count = 1;  # counter of start/end
    foreach my $pats (@patterns) {
        # $p1 = first pattern, $p2 = second
        my ($p1, $p2) = @$pats;
        # split on patterns, keep them, remove empty
        my @s = grep {$_} split /($p1|$p2)/, $line;
        # $start = position where to put the \start
        # $end   = position where to put the \end
        my ($start, $end) = (undef, undef);
        # loop on all elements given by split
        for my $i (0 .. $#s) {
            # current element
            my $cur = $s[$i];
            # if = first pattern, keep its position in the array
            if ($cur eq $p1) {
                $start = $i;
            }
            # if = second pattern, keep its position in the array
            if ($cur eq $p2) {
                $end = $i;
            }
            # if both are defined and second pattern after first pattern
            # (with at least one element in between), insert \start and \end
            if (defined($start) && defined($end) && $end > $start + 1) {
                $s[$start] .= "\\start{$count}";
                $s[$end] = "\\end{$count}" . $s[$end];
                undef $end;
                $count++;
            }
        }
        # recompose the line
        $line = join '', @s;
    }
    say $line;
}
__DATA__
THETREEHASACAT. (does not match)
THECATTREEHASMANYBIRDS. (does not match)
THEREISACATINTHETREE.
THECATANDLIONLEFTTHEFORESTANDMETANDOWLINATREENEARTHEWATERFALL.
THECATDOESNOTLIKETALLTREES,BUTINSTEADLIKESSHORTTREES.
CAT...TREE...CAT...TREE
Output:
THETREEHASACAT. (does not match)
THECATTREEHASMANYBIRDS. (does not match)
THEREISACAT\start{1}INTHE\end{1}TREE.
THECAT\start{1}ANDLION\start{2}LEFTTHE\end{2}FORESTANDMETANDOWL\start{3}INA\end{1}TREENEARTHE\end{3}WATERFALL.
THECAT\start{1}\start{2}DOESNOTLIKETALL\end{1}TREES,BUTINSTEADLIKESSHORT\end{2}TREES.
CAT\start{1}...\end{1}TREE...CAT\start{2}...\end{2}TREE
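The Perl code above hinges on `split` with a capturing group, which keeps the matched patterns in the resulting list. As a rough illustration of that one step only (not a port of the whole script), Python's `re.split` behaves the same way. The function name `split_keep` is made up for this sketch.

```python
import re

def split_keep(line, first, second):
    """Split line on either pattern, keeping the patterns as list items."""
    # A capturing group makes re.split return the delimiters too.
    pattern = "(" + re.escape(first) + "|" + re.escape(second) + ")"
    # Drop the empty strings that appear when two patterns touch
    # or when a pattern starts or ends the line.
    return [piece for piece in re.split(pattern, line) if piece]

print(split_keep("THEREISACATINTHETREE.", "CAT", "TREE"))
print(split_keep("THECATTREEHASMANYBIRDS.", "CAT", "TREE"))
```

In the second call the two patterns end up as neighbouring list items, which is the situation the `$end > $start + 1` test in the Perl code rejects.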
Answer 1 (score: 5)
Check this out (Ruby):
#!/usr/bin/env ruby
patterns = [
['CAT', 'TREE'],
['LION', 'FOREST'],
['OWL', 'WATERFALL']
]
lines = [
'THEREISACATINTHETREE.',
'THETREEHASACAT.',
'THECATTREEHASMANYBIRDS.',
'THECATANDLIONLEFTTHEFORESTANDMETANDOWLINTREENEARTHEWATERFALL.',
'THECATDOESNOTLIKETALLTREES,BUTINSTEADLIKESSHORTTREES.',
'CAT...TREE...CAT...TREE'
]
lines.each do |line|
  puts line
  matches = Hash.new{|h,e| h[e] = [] }
  match_indices = []
  patterns.each do |first,second|
    offset = 0
    while new_offset = line.index(first,offset) do
      # map second element of the pattern to minimal position it might be matched
      matches[second] << new_offset + first.size + 1
      offset = new_offset + 1
    end
  end
  global_counter = 1
  matches.each do |second,offsets|
    offsets.each do |offset|
      second_offset = offset
      while new_offset = line.index(second,second_offset) do
        # register the end index of the first pattern and
        # the start index of the second pattern with the global match count
        match_indices << [offset-1,new_offset,global_counter]
        second_offset = new_offset + 1
        global_counter += 1
      end
    end
  end
  indices = Hash.new{|h,e| h[e] = ""}
  match_indices.each do |first,second,global_counter|
    # build the insertion string for the string positions the
    # start and end tags should be placed in
    indices[first] << "\\start{#{global_counter}}"
    indices[second] << "\\end{#{global_counter}}"
  end
  inserted_length = 0
  indices.sort_by{|k,v| k}.each do |position,insert|
    # insert the tags at their positions
    line.insert(position + inserted_length,insert)
    inserted_length += insert.size
  end
  puts line
end
Result:
THEREISACATINTHETREE.
THEREISACAT\start{1}INTHE\end{1}TREE.
THETREEHASACAT.
THETREEHASACAT.
THECATTREEHASMANYBIRDS.
THECATTREEHASMANYBIRDS.
THECATANDLIONLEFTTHEFORESTANDMETANDOWLINTREENEARTHEWATERFALL.
THECAT\start{1}ANDLION\start{2}LEFTTHE\end{2}FORESTANDMETANDOWL\start{3}IN\end{1}TREENEARTHE\end{3}WATERFALL.
THECATDOESNOTLIKETALLTREES,BUTINSTEADLIKESSHORTTREES.
THECAT\start{1}\start{2}DOESNOTLIKETALL\end{1}TREES,BUTINSTEADLIKESSHORT\end{2}TREES.
CAT...TREE...CAT...TREE
CAT\start{1}\start{2}...\end{1}TREE...CAT\start{3}...\end{2}\end{3}TREE
Edit
I added some comments and clarified some variable names.
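The last step of the script (collect the tag text per position, then insert in ascending order while tracking how much has already been inserted) can also be done without any offset bookkeeping, by rebuilding the string from slices. A minimal Python sketch of that variation, assuming the tag positions are already known; `insert_tags` and its argument format are invented here for illustration.

```python
def insert_tags(line, tags):
    """Insert (position, text) tags into line; positions refer to the original line."""
    # Group tag text by position, preserving the order in which tags were given.
    by_position = {}
    for position, text in tags:
        by_position[position] = by_position.get(position, "") + text

    pieces = []
    previous = 0
    for position in sorted(by_position):
        pieces.append(line[previous:position])  # untouched text before the tags
        pieces.append(by_position[position])    # all tags that belong at this spot
        previous = position
    pieces.append(line[previous:])              # remainder of the line
    return "".join(pieces)

tags = [(11, "\\start{1}"), (16, "\\end{1}")]
print(insert_tags("THEREISACATINTHETREE.", tags))
```

Because every slice is taken from the original line, the positions never need to be shifted.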
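Answer 2 (score: 2)
First, you have to find all occurrences of the start and end strings from your patterns. Then you need to work out which of them belong together (they don't fit if the end string is located before the start string, or at the same position and therefore touching). Then you can generate your tags and insert them into the output string. Note that you need to add the number of inserted characters to your positions, because the length of the string changes as tags are inserted. You also have to sort the tags by position before inserting them; otherwise it gets very complicated to work out how far each position has to be shifted. Here is a short example in Ruby: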
patterns = [['CAT','TREE'], ['LION','FOREST'], ['OWL','WATERFALL']]
strings = ['THEREISACATINTHETREE.', 'THETREEHASACAT.', 'THECATTREEHASMANYBIRDS.', 'THECATANDLIONLEFTTHEFORESTANDMETANDOWLINTREENEARTHEWATERFALL.', 'THECATDOESNOTLIKETALLTREES,BUTINSTEADLIKESSHORTTREES.', 'ACATONATREEANDANOTHERCATONANOTHERTREE.', 'ACATONATREEBUTNOCATTREE.']
strings.each do |string|
  matches = {}; tags = []
  counter = shift = 0
  output = string.dup
  patterns.each do |sstr,estr|                   # loop through all patterns
    posa = []; posb = []
    string.scan(sstr){posa << $~.end(0)}         # remember found positions and
    string.scan(estr){posb << $~.begin(0)}       # find all valid combinations (next line)
    matches[[sstr,estr]] = posa.product(posb).reject{|s,e|s>=e}
  end
  matches.each do |pat,pos|                      # loop through all matches
    pos.each do |s,e|
      tags << [s,"\\start{#{counter += 1}}"]     # generate and remember \start{}
      tags << [e,"\\end{#{counter}}"]            # and \end{} tags
    end
  end
  tags.sort.each do |pos,tag|                    # sort and loop through tags
    output.insert(pos+shift,tag)                 # insert tag and increment
    shift += tag.chars.count                     # shift by num. of inserted chars
  end
  puts string, output                            # print result
end
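It's not pretty, but it meets all your requirements. I think the next example is more readable and reusable; it is implemented as a Ruby class with corresponding unit tests to make sure it works: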
class PatternMarker
  # the library name is case-sensitive; it provides $LAST_MATCH_INFO
  require 'English'
  attr_reader :input, :output, :matches

  def initialize patterns
    @patterns = patterns
    raise ArgumentError, 'no patterns given' unless @patterns.any?
    @patterns.each do |p|
      raise ArgumentError, 'every pattern must have exactly two strings' unless p.count == 2
    end
  end

  def parse input
    @input = input.dup
    match_patterns
    generate_output
    self
  end

  def match?
    @matches.any?
  end

  private

  def match_patterns
    @matches = {}
    @patterns.each do |start_str,end_str|
      pos = { :start => [], :end => [] }
      @input.scan(start_str){ pos[:start] << $LAST_MATCH_INFO.end(0) }
      @input.scan(end_str  ){ pos[:end]   << $LAST_MATCH_INFO.begin(0) }
      @matches[[start_str,end_str]] = pos[:start].product(pos[:end])
      @matches[[start_str,end_str]].reject!{ |s,e| e <= s }
    end
    # drop patterns that produced no valid combination
    @matches.reject!{ |_pattern,positions| positions.none? }
  end

  def generate_output
    tags = []
    counter = shift = 0
    @output = @input.dup
    @matches.each do |pattern,positions|
      positions.each do |s,e|
        counter += 1
        tags << [s, "\\start{#{counter}}"]
        tags << [e, "\\end{#{counter}}"  ]
      end
    end
    tags.sort!.each do |position,tag|
      @output.insert(position+shift,tag)
      shift += tag.chars.count
    end
  end
end
In action:
patterns = [
['CAT' , 'TREE' ],
['LION', 'FOREST' ],
['OWL' , 'WATERFALL']
]
strings = [
'THEREISACATINTHETREE.',
'THETREEHASACAT.',
'THECATTREEHASMANYBIRDS.',
'THECATANDLIONLEFTTHEFORESTANDMETANDOWLINTREENEARTHEWATERFALL.',
'THECATDOESNOTLIKETALLTREES,BUTINSTEADLIKESSHORTTREES.',
'ACATONATREEANDANOTHERCATONANOTHERTREE.',
'ACATONATREEBUTNOCATTREE.'
]
marker = PatternMarker.new(patterns)
strings.each do |string|
  marker.parse(string)
  puts "input: #{marker.input}"
  if marker.match?
    puts "output: #{marker.output}"
  else
    puts "(does not match)"
  end
  puts
end
Output:
input: THEREISACATINTHETREE.
output: THEREISACAT\start{1}INTHE\end{1}TREE.
input: THETREEHASACAT.
(does not match)
input: THECATTREEHASMANYBIRDS.
(does not match)
input: THECATANDLIONLEFTTHEFORESTANDMETANDOWLINTREENEARTHEWATERFALL.
output: THECAT\start{1}ANDLION\start{2}LEFTTHE\end{2}FORESTANDMETANDOWL\start{3}IN\end{1}TREENEARTHE\end{3}WATERFALL.
input: THECATDOESNOTLIKETALLTREES,BUTINSTEADLIKESSHORTTREES.
output: THECAT\start{1}\start{2}DOESNOTLIKETALL\end{1}TREES,BUTINSTEADLIKESSHORT\end{2}TREES.
input: ACATONATREEANDANOTHERCATONANOTHERTREE.
output: ACAT\start{1}\start{2}ONA\end{1}TREEANDANOTHERCAT\start{3}ONANOTHER\end{2}\end{3}TREE.
input: ACATONATREEBUTNOCATTREE.
output: ACAT\start{1}\start{2}ONA\end{1}TREEBUTNOCAT\end{2}TREE.
Tests:
require 'test/unit'

class TestPatternMarker < Test::Unit::TestCase
  def setup
    @patterns = [
      ['CAT' , 'TREE'     ],
      ['LION', 'FOREST'   ],
      ['OWL' , 'WATERFALL']
    ]
    @marker = PatternMarker.new(@patterns)
  end

  def test_should_parse_simple
    @marker.parse 'THEREISACATINTHETREE.'
    assert @marker.match?
    assert_equal 'THEREISACAT\start{1}INTHE\end{1}TREE.', @marker.output
  end

  def test_should_parse_reverse
    @marker.parse 'THETREEHASACAT.'
    assert !@marker.match?
    assert_equal @marker.input, @marker.output
  end

  def test_should_parse_touching
    @marker.parse 'THECATTREEHASMANYBIRDS.'
    assert !@marker.match?
    assert_equal @marker.input, @marker.output
  end

  def test_should_parse_multiple_patterns
    @marker.parse 'THECATANDLIONLEFTTHEFORESTANDMETANDOWLINATREENEARTHEWATERFALL.'
    assert @marker.match?
    assert_equal 'THECAT\start{1}ANDLION\start{2}LEFTTHE\end{2}FORESTANDMETANDOWL\start{3}INA\end{1}TREENEARTHE\end{3}WATERFALL.', @marker.output
  end

  def test_should_mark_multiple_matches_at_same_place
    @marker.parse 'THECATDOESNOTLIKETALLTREES,BUTINSTEADLIKESSHORTTREES.'
    assert @marker.match?
    assert_equal 'THECAT\start{1}\start{2}DOESNOTLIKETALL\end{1}TREES,BUTINSTEADLIKESSHORT\end{2}TREES.', @marker.output
  end

  def test_should_mark_all_possible_matches
    @marker.parse 'CATFOOTREEFOOCATFOOTREE.'
    assert @marker.match?
    assert_equal 'CAT\start{1}\start{2}FOO\end{1}TREEFOOCAT\start{3}FOO\end{2}\end{3}TREE.', @marker.output
  end

  def test_should_accept_input
    @marker.parse 'CATINTREE'
    assert @marker.match?
    assert_equal 'CATINTREE', @marker.input
    @marker.parse 'FOOBAR'
    assert !@marker.match?
    assert_equal 'FOOBAR', @marker.input
  end

  def test_should_only_accept_valid_patterns
    assert_raise ArgumentError do PatternMarker.new([]) end
    assert_raise ArgumentError do PatternMarker.new(['FOO','BAR']) end
    # pass a single list of patterns so the error comes from the
    # two-strings check rather than from the argument count
    assert_raise ArgumentError do PatternMarker.new([['FOO','BAR'],['FOO','BAR','BAZ']]) end
    assert_raise ArgumentError do PatternMarker.new([['FOO','BAR'],['BAZ']]) end
    assert_nothing_raised do PatternMarker.new([['FOO','BAR']]) end
  end
end
Test output:
Loaded suite pattern
Started
........
Finished in 0.003910 seconds.
8 tests, 21 assertions, 0 failures, 0 errors, 0 skips
Test run options: --seed 31173
Edit: added tests and simplified some of the code.
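For readers who do not use Ruby, the steps described at the start of this answer (find all positions, keep the pairs where the end string starts after the start string ends, generate tags, sort, insert with a running shift) can be sketched in Python. This is a loose translation under stated assumptions, not a port of the class: `mark` is a hypothetical name, and the counter is assigned in pattern order and, within a pattern, in the order `itertools.product` yields the pairs.

```python
import re
from itertools import product

def mark(line, patterns):
    """Return line with start/end tags inserted for every valid pair."""
    tags = []
    counter = 0
    for first, second in patterns:
        # End offsets of the first word, start offsets of the second word.
        starts = [m.end() for m in re.finditer(re.escape(first), line)]
        ends = [m.start() for m in re.finditer(re.escape(second), line)]
        for s, e in product(starts, ends):
            if s >= e:  # second word comes first, or the two words touch
                continue
            counter += 1
            tags.append((s, counter, "\\start{%d}" % counter))
            tags.append((e, counter, "\\end{%d}" % counter))

    output = line
    shift = 0
    # Sort by position, then by counter, so tags at one spot stay in numeric order.
    for position, _, tag in sorted(tags):
        output = output[:position + shift] + tag + output[position + shift:]
        shift += len(tag)  # later positions move right by the inserted length
    return output

patterns = [("CAT", "TREE"), ("LION", "FOREST"), ("OWL", "WATERFALL")]
print(mark("ACATONATREEANDANOTHERCATONANOTHERTREE.", patterns))
```

Sorting on the numeric counter instead of the tag text is a deliberate deviation: it keeps the tags in numeric order once the counter reaches two digits.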
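Answer 3 (score: 1)
This is a partial answer. It meets all of your requirements except the last one, for which there is no simple solution. I'll leave that one for you to figure out :-)
I chose a rule-based approach rather than regular expressions. In previous, similar projects I have found that simple rule-based parsers are easier to maintain, more portable, and often faster than regular expressions. I haven't used any really Ruby-specific features here, so it should port easily to Python or Perl. It could even be ported to C without much effort.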
patterns = [
['CAT', 'TREE'],
['LION', 'FOREST'],
['OWL', 'WATERFALL']
]
lines = [
'THEREISACATINTHETREE.',
'THETREEHASACAT.',
'THECATTREEHASMANYBIRDS.',
'THECATANDLIONLEFTTHEFORESTANDMETANDOWLINTREENEARTHEWATERFALL.',
'THECATDOESNOTLIKETALLTREES,BUTINSTEADLIKESSHORTTREES.'
]
newlines = []
START_TAG_LENGTH = 9
END_TAG_LENGTH = 7
lines.each do |line|
  newline = line.dup
  before = {}
  n = 1
  patterns.each do |pair|
    a = 0
    matches = [[], []]
    len = pair[0].length
    # collect the positions of both words of the pair
    pair.each do |pattern|
      b = 0
      while (c = line.index(pattern, b))
        matches[a] << c
        b = c + 1
      end
      break if b == 0 && a > 0
      a += 1
    end
    matches[0].each_with_index do |d, f|
      bd = 0; be = 0
      e = matches[1][f]
      # skip if there is no second word, it comes first, or the words touch
      next if e.nil? || (d > e) || (d + len == e)
      d = d + len
      # account for tags already inserted at or before this position
      before.each { |g, h| bd += h if g <= d }
      newline.insert(d + bd, "\\start{#{n}}")
      before[d] ||= 0
      before[d] += START_TAG_LENGTH
      before.each { |g, h| be += h if g <= e }
      newline.insert(e + be, "\\end{#{n}}")
      before[e] ||= 0
      before[e] += END_TAG_LENGTH
    end
    n += 1
  end
  newlines << newline
end
puts newlines
Output:
THEREISACAT\start{1}INTHE\end{1}TREE.
THETREEHASACAT.
THECATTREEHASMANYBIRDS.
THECAT\start{1}ANDLION\start{2}LEFTTHE\end{2}FORESTANDMETANDOWL\start{3}IN\end{1}TREENEARTHE\end{3}WATERFALL.
THECAT\start{1}DOESNOTLIKETALL\end{1}TREES,BUTINSTEADLIKESSHORTTREES.
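Note that the last one fails. Still, this should give you a good start. If you need help figuring out what some of the code does, don't hesitate to ask.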
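As a rough idea of how the failing case could be approached in the same regex-free style, the sketch below (in Python, since the approach is meant to be portable) collects every occurrence with `str.find` and pairs each first-word occurrence with every later, non-touching second-word occurrence, instead of pairing them index by index. `find_all` and `valid_pairs` are made-up names, and the sketch only computes tag positions; it does not insert the tags.

```python
def find_all(line, word):
    """Return the start index of every occurrence of word in line."""
    positions = []
    index = line.find(word)
    while index != -1:
        positions.append(index)
        index = line.find(word, index + 1)  # also finds overlapping occurrences
    return positions

def valid_pairs(line, first, second):
    """Pair each occurrence of first with every later, non-touching second."""
    pairs = []
    for start in find_all(line, first):
        tag_start = start + len(first)  # the start tag goes right after the first word
        for end in find_all(line, second):
            if end > tag_start:         # strictly after, so the words do not touch
                pairs.append((tag_start, end))
    return pairs

line = "THECATDOESNOTLIKETALLTREES,BUTINSTEADLIKESSHORTTREES."
print(valid_pairs(line, "CAT", "TREE"))
```

Each returned pair gives the position for a `\start{n}` tag and the position for the matching `\end{n}` tag in the original line.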
On another note, just out of curiosity: what are you using this for?
Answer 4 (score: 1)
Here is one written entirely in bash (no external commands). Not too hard! It expects the input lines on stdin.
#!/bin/bash
words=("CAT TREE" "LION FOREST" "OWL WATERFALL")
function doit () {
    if [[ "$line" =~ (.*)$word1(.*)$word2(.*) ]]; then
        # mask the first word so it is not matched again;
        # \end{n} goes before the second word, as the question requires
        line="${BASH_REMATCH[1]}$alt_w1\\start{$count}${BASH_REMATCH[2]}\\end{$count}$word2${BASH_REMATCH[3]}"
        (( count += 1 ))
        doit
    elif [[ "$line" =~ $alt_w1 ]]; then
        # restore the first words, then mask the last remaining second word
        line=${line//$alt_w1/$word1}
        [[ "$line" =~ (.*)$word2(.*) ]]
        line="${BASH_REMATCH[1]}$alt_w2${BASH_REMATCH[2]}"
        doit
    elif [[ "$line" =~ $alt_w2 ]]; then
        # nothing left to match: restore the second words
        line=${line//$alt_w2/$word2}
    fi
}

while read line; do
    count=1
    for pair in "${words[@]}"; do
        word1=${pair% *}
        word2=${pair#* }
        alt_w1="${word1:0:1}XYZZYX${word1:1}"
        alt_w2="${word2:0:1}XYZZYX${word2:1}"
        doit
    done
    echo "$line"
done
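Assumptions:
The words contain none of the regex special characters . * [ ] ^ $ +
No word is contained in another, as with cat and cattle.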
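The script interpolates `$word1` and `$word2` into the regex unescaped, which is why regex special characters in the words are a problem. In a language with an escaping helper that restriction can be lifted. A small Python illustration using `re.escape`; the word `C.T` and the sample line are contrived for this example.

```python
import re

word = "C.T"          # contrived word containing a regex metacharacter
line = "ACATANDAC.T"

# Unescaped, "." matches any character, so "CAT" is found as well.
unescaped = re.findall(word, line)
print(unescaped)      # ['CAT', 'C.T']

# Escaped, only the literal text "C.T" matches.
escaped = re.findall(re.escape(word), line)
print(escaped)        # ['C.T']
```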
Answer 5 (score: 1)
Here is my solution in the very popular Python.
patterns = [u'CAT,TREE', u'LION,FOREST', u'OWL,WATERFALL']
strings = [u'THEREISACATINTHETREE.',
           u'THETREEHASACAT.',
           u'THECATTREEHASMANYBIRDS.',
           u'THECATANDLIONLEFTTHEFORESTANDMETANDOWLINTREENEARTHEWATERFALL.',
           u'THECATDOESNOTLIKETALLTREES,BUTINSTEADLIKESSHORTTREES.',
           u'ACATONATREEANDANOTHERCATONANOTHERTREE.',
           u'ACATONATREEBUTNOCATTREE.']

def findMatch(needles, haystack, label):
    needles = needles.split(',')
    # split only at the first occurrence so the rest of the line is kept
    matches = haystack.split(needles[0], 1)
    if len(matches) > 1:
        submatches = matches[1].split(needles[1], 1)
        # a non-empty first part means the two words do not touch
        if len(submatches) > 1 and submatches[0]:
            return u''.join([matches[0], needles[0], u'\\start{'+label+'}', submatches[0], u'\\end{'+label+'}', needles[1], submatches[1]])
    return False

for s in strings:
    i = 0
    res = s
    for pat in patterns:
        i = i + 1
        # marks only the first occurrence of each pattern pair
        temp = findMatch(pat, res, str(i))
        if temp:
            res = temp
    print('searching in ' + s + ' yields ' + res)
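Answer 6 (score: 0)
Here is my Perl approach. It is quick and dirty.
It would probably be much better if I used Marpa for the parsing instead of regular expressions.
Anyway, it gets the job done.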
use strict;
use Test::More;
use Data::Dumper;
# patterns to search for
my @patterns = (
'CAT,TREE',
'LION,FOREST',
'OWL,WATERFALL',
);
#lines
my @lines = qw(
THEREISACATINTHETREE.
THETREEHASACAT.
THECATTREEHASMANYBIRDS.
THECATANDLIONLEFTTHEFORESTANDMETANDOWLINATREENEARTHEWATERFALL.
THECATDOESNOTLIKETALLTREES,BUTINSTEADLIKESSHORTTREES.
THECATDOESNOTLIKETALLTREES,BUTINSTEADLIKESSHORTTREESORBIGTREES.
);
my @expected_output = (
'THEREISACAT\start{1}INTHE\end{1}TREE.',
'Does not Match',
'Does not Match',
'THECAT\start{1}ANDLION\start{2}LEFTTHE\end{2}FORESTANDMETANDOWL\start{3}INA\end{1}TREENEARTHE\end{3}WATERFALL.',
'THECAT\start{1}\start{2}DOESNOTLIKETALL\end{1}TREES,BUTINSTEADLIKESSHORT\end{2}TREES.',
'THECAT\start{1}\start{2}\start{3}DOESNOTLIKETALL\end{1}TREES,BUTINSTEADLIKESSHORT\end{2}TREESORBIG\end{3}TREES.',
);
#is(check_line($lines[0]),$expected_output[0]);die;
my $no=0;
for(my $i=0;$i<scalar(@lines );$i++){
    is(check_line($lines[$i]),$expected_output[$i]);
    $no++;
}
done_testing( $no );

sub check_line{
    my $in = shift;
    my $out = '';
    my $match = 1;
    foreach my $pattern_line (@patterns){
        my ($first,$second) = split(/,/,$pattern_line);
        #warn "$first,$second,$in\n";
        if ($in !~ m#$first.+?$second#is){
            next;
        }
        #matched
        while ($in =~ s#($first)(.+?)($second)#$1\\start\{$match\}$2\\end\{$match\}_SECOND_#is){
            $match++;
            #warn "Found match: $match\n";
        }
        $in =~ s#_SECOND_#$second#gis;
        #$in =~ s#\\start\{(\d+)\}\\start\{(\d+)\}#\\start\{$2\}\\start\{$1\}#gis;
        my ($end,$start) = $in =~ m#\\start\{(\d+)\}(?:\\start\{(\d+)\})+#gis;
        my $stmp = join("",map {"\\start\{$_\}"} ($start..$end));
        #print Dumper($in,$start,$end,$stmp);
        $in =~ s#\\start\{($end)\}.*?\\start\{($start)\}#$stmp#is;
    }
    return 'Does not Match' if $match ==1;
    $out = $in;
    return $out;
}