Perl regexp: how to match a substring within words without alternation patterns?

Date: 2018-10-11 15:25:42

Tags: regex perl

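Good afternoon,

I have a string of whitespace-separated words. From that string I need to find the words that match an alphanumeric pattern, either as part of the word or as the whole word. I only want words made up entirely of alphanumeric characters.

To make my goal clearer, I have the following string: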

'foo bar quux foofoo foobar fooquux barfoo barbar barquux ' .
'quuxfoo quuxbar quuxquux [foo] (foo) {foo} foofoo barfoo ' .
'quuxfoo foo2foo foo2bar foo2quux foo2foo bar2foo quux2foo'

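I want to find all the words containing 'foo' (each word only once), but not the words that carry special (non-alphanumeric) characters, such as "[foo]", "{foo}" ...

I did this in Perl with the following code: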

use feature 'say';   # say() is only available once this feature is enabled

my $s=
'foo bar quux foofoo foobar fooquux barfoo barbar barquux quuxfoo quuxbar quuxquux ' .
'[foo] (foo) {foo} foofoo barfoo quuxfoo foo2foo foo2bar foo2quux foo2foo bar2foo quux2foo';
my @m = ($s=~/(\w+foo|foo\w+|^foo|foo$)/g) ;
say "@m";
say "Number of sub-strings matching the pattern: ", scalar @m;
print( sprintf("%02d: ",$_),
       ($s=~/(\w+foo|foo\w+|^foo|foo$)/g)[$_],
       qq(\n) )
    for (0..@m-1);

I get the result I want:

foo foofoo foobar fooquux barfoo quuxfoo foofoo barfoo quuxfoo foo2foo foo2bar foo2quux foo2foo bar2foo quux2foo
Number of sub-strings matching the pattern: 15 
00: foo
01: foofoo
02: foobar
03: fooquux
04: barfoo
05: quuxfoo
06: foofoo
07: barfoo
08: quuxfoo
09: foo2foo
10: foo2bar
11: foo2quux
12: foo2foo
13: bar2foo
14: quux2foo

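However, if I need to (and I will) add more patterns to search in more complex strings, this quickly becomes messy, and I get lost in the series of alternatives ('|').

Could someone help me write a shorter/cleaner regular expression that isolates the 'foo' (or any other) words/sub-words in a way that can be written as a single pattern?

Thanks.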

GM

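Strawberry Perl 5.022 on W7/64, but I think this is generic to any Perl from 5.016 or even 5.008 onward.

I find that dawg's solution (and steffen's too) suits me well. The grep version is not the most readable, although it is closer to my level of Perl; still, I think the following one, being based on a pure regex, will be better able to handle word restrictions added in the future: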

$s=~/(?:(?<=\h)|^)(\w*foo\w*)(?=\h|$)/g


(?:(?<=\h)|^)  Assert either after a \h (horizontal space) or at start of line ^
(\w*foo\w*)    Capture a 'word' with 'foo' and only \w characters (or, [a-zA-Z0-9_] characters)
(?=\h|$)       Assert before either a \h horizontal space or end of line $

I would like to write down here what I understand of it, so that you can correct me if I am wrong before I extend it to my real needs.

(?:         # You start a non capturing group.
(?<=        # You start a lookbehind (so non capturing BY NATURE, am I right ?, because
            # if not, as it is being enclosed in round-brackets '()' it restarts to be
            # capturing even inside a non capturing group, isn't it?)
 \h         # In the lookbehind you look for a horizontal space (could \s have been used
            # there?)
 ^          # in the non capturing group but outside of the lookbehind you look for the
            # start of string anchor. Must not be present in the lookbehind group because
            # it requires a same length pattern size and ^ has length==0 while \h is
            # non zero.
\w*foo\w*   # You look for foo within an alphanum word. No problem having '*' rather than
            # '+' because your left (and right, as we'll see below) bound has been well
            # restricted.
(?=         # You start a lookahead pattern (non capturing by nature here again, right?),
            # to look for:
\h or $     # horiz space or end of string anchor. However the lookaround size is
            # different here as $ is still 0 length (as ^ anchor) and \h still non
            # zero. "AND YET IT MOVES" (I tested your regexp and it worked) because
            # only the lookbehind has the 'same-size' pattern restriction, right?

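Thank you all for your help. After this last point I will stop bothering you with my little problem and will consider my question fully solved. G.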

3 Answers:

Answer 0: (score: 4)

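It depends: if you also want to get foo out of bracketed tokens such as (foo), this is easy. You simply match foo with optional word characters before and after it, and put a word boundary \b on both sides (which matches at the start or end of the input, or next to a non-word character):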

my @m = ($s=~/(\b\w*foo\w*\b)/g);
print( sprintf("%02d: ",$_),
    ($s=~/(\b\w*foo\w*\b)/g)[$_],
    qq(\n) )
for (0..@m-1);

Output:

00: foo
01: foofoo
02: foobar
03: fooquux
04: barfoo
05: quuxfoo
06: foo
07: foo
08: foo
09: foofoo
10: barfoo
11: quuxfoo
12: foo2foo
13: foo2bar
14: foo2quux
15: foo2foo
16: bar2foo
17: quux2foo
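
The three extra foo entries (06 to 08) come from [foo], (foo) and {foo}, because \b is already satisfied between a bracket and a letter. A minimal sketch of that effect on a shortened input; the string and the variable name @with_boundary are made up for illustration and are not part of the original answer:

```perl
use strict;
use warnings;

# Shortened, hypothetical input: one plain word, three bracketed ones,
# and one longer word.
my $s = 'foo [foo] (foo) {foo} foobar';

# \b only requires a word character on one side and a non-word
# character (or the edge of the string) on the other, so the position
# between '[' and 'f' counts as a boundary.
my @with_boundary = $s =~ /(\b\w*foo\w*\b)/g;

print scalar(@with_boundary), " matches: @with_boundary\n";
```

Under these assumptions the bracketed tokens each contribute a bare foo, which is why this pattern only fits the case where such matches are wanted.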

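If not, it is harder. Here I match either the start of the input or a whitespace character, then foo surrounded by optional word characters, and then we need a (zero-length) assertion that requires a whitespace character or the end of the input: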

my @m = ($s=~/(?:^|\s)(\w*foo\w*)(?=\s|$)/g);
print( sprintf("%02d: ",$_),
    ($s=~/(?:^|\s)(\w*foo\w*)(?=\s|$)/g)[$_],
    qq(\n) )
for (0..@m-1);

Output:

00: foo
01: foofoo
02: foobar
03: fooquux
04: barfoo
05: quuxfoo
06: foofoo
07: barfoo
08: quuxfoo
09: foo2foo
10: foo2bar
11: foo2quux
12: foo2foo
13: bar2foo
14: quux2foo
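
Since the question mentions searching for more substrings later, here is one possible way to keep the delimiter logic in a single place and interpolate the substrings. This is only a sketch: the list @needles, the shortened input string and the names $alt and $re are assumptions, not something from the question. The only alternation left is the list of substrings itself.

```perl
use strict;
use warnings;

# Shortened, hypothetical input.
my $s = 'foo bar quux foobar [foo] (bar) {quux} foo2bar barquux';

# Hypothetical list of substrings to look for; quotemeta escapes any
# regex metacharacters so that each entry is matched literally.
my @needles = ('foo', 'quux');
my $alt     = join '|', map { quotemeta($_) } @needles;

# Same delimiters as in the pattern above, with the literal 'foo'
# replaced by the interpolated list of substrings.
my $re = qr/(?:^|\s)(\w*(?:$alt)\w*)(?=\s|$)/;

my @m = $s =~ /$re/g;
printf "%02d: %s\n", $_, $m[$_] for 0 .. $#m;
```

Printing from $m[$_] also avoids running the match again for every line of output.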

Answer 1: (score: 3)

You can split the string and filter the array:

use strict;
use warnings;

my $s=
'foo bar quux foofoo foobar fooquux barfoo barbar barquux quuxfoo quuxbar quuxquux ' .
'[foo] (foo) {foo} foofoo barfoo quuxfoo foo2foo foo2bar foo2quux foo2foo bar2foo quux2foo';

my @res = grep {/foo/ && !/\W/}  split /\s/, $s;

print join(" ", @res);
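

The question asks for each word only once, while the list above still contains repeats such as foofoo. A minimal sketch of one way to drop them, assuming a hash named %seen (a hypothetical helper, not part of the original answer) and a shortened input:

```perl
use strict;
use warnings;

# Shortened, hypothetical input with repeated words.
my $s = 'foo foofoo [foo] barfoo foofoo barfoo foo2foo';

# The post-increment returns 0 (false) the first time a word is met,
# so !$seen{$_}++ lets only the first occurrence of each word through.
my %seen;
my @unique = grep { /foo/ && !/\W/ && !$seen{$_}++ } split ' ', $s;

print join(' ', @unique), "\n";
```

This sketch uses split ' ' rather than split /\s/; the former skips leading whitespace and treats runs of whitespace as one separator, which makes no difference for this input.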

Answer 2: (score: 2)

Perhaps first filter out the unwanted words, then use grep on the words that remain:

use strict;
use warnings;

my $s=
'foo bar quux foofoo foobar fooquux barfoo barbar barquux quuxfoo quuxbar quuxquux ' .
'[foo] (foo) {foo} foofoo barfoo quuxfoo foo2foo foo2bar foo2quux foo2foo bar2foo quux2foo';

my @words = ( $s=~/(?:(?<=\h)|^)(\w+)(?=\h|$)/g );

my @foos = grep(/foo/, @words);

while (my ($i, $v) = each @foos) {
    printf "%02d: %s\n", $i,$v;
}

Prints:

00: foo
01: foofoo
02: foobar
03: fooquux
04: barfoo
05: quuxfoo
06: foofoo
07: barfoo
08: quuxfoo
09: foo2foo
10: foo2bar
11: foo2quux
12: foo2foo
13: bar2foo
14: quux2foo

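Alternatively, you can combine the filter conditions over the list of words split on horizontal whitespace, testing that each resulting word contains foo and consists only of word characters: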

@foos=grep {/foo/ && /^\w+$/} split /\h/, $s;  # same result

Or,

@foos=grep {/^\w*foo\w*$/} split /\h/, $s; 

Or, in a single regex:

@foos=($s=~/(?:(?<=\h)|^)(\w*foo\w*)(?=\h|$)/g);

As requested in the comments, an explanation of:

$s=~/(?:(?<=\h)|^)(\w*foo\w*)(?=\h|$)/g


(?:(?<=\h)|^)  Assert either after a \h (horizontal space) or at start of line ^
(\w*foo\w*)    Capture a 'word' with 'foo' and only \w characters (or, [a-zA-Z0-9_] characters)
(?=\h|$)       Assert before either a \h horizontal space or end of line $

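The only tricky part is (?:(?<=\h)|^). Because \h has a width of one and ^ has a width of zero, (?<=\h|^) is a variable-width lookbehind, which is illegal in Perl (at least in the versions discussed here, 5.22 and earlier). (Interestingly, the regex (?<=\h|^) is legal in the PCRE library.) So (?:(?<=\h)|^) puts the two zero-width assertions side by side as alternatives inside one non-capturing group.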
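
A small sketch that may help check two points raised in the question: that the lookarounds neither capture nor consume text, and how the running interpreter treats the variable-width form. The test string and the names $re, @caps, $pat and $ok are assumptions made for the example. The last line only reports what your Perl does, since the outcome depends on the version.

```perl
use strict;
use warnings;

my $re = qr/(?:(?<=\h)|^)(\w*foo\w*)(?=\h|$)/;

# In list context a successful match returns one value per capturing
# group. Only (\w*foo\w*) captures; the lookarounds and (?:...) do not.
my @caps = 'bar foobar baz' =~ $re;
print scalar(@caps), " capture: '$caps[0]'\n";

# The lookarounds are zero-width: the overall match ($&) contains no
# surrounding spaces.
print "whole match: '$&'\n";

# The pattern is kept in a variable so that it is compiled at run time,
# where eval can trap a compilation error. No result is assumed here.
my $pat = '(?<=\h|^)foo';
my $ok  = eval { no warnings; qr/$pat/; 1 };
print '(?<=\h|^) ', ($ok ? 'compiled' : 'was rejected'), ' on Perl ', $], "\n";
```

The lookahead (?=\h|$) has no such restriction in any case, which is why \h and $ can share one group there.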