Question

我试图计算文档中12个字窗口内3个单词短语出现的次数，但难点在于我搜索的关键字可以在整个范围内传播窗口。

例如：

我想找到短语＆＃34;期待恶劣天气＆＃34;在12个单词的短语中，只要包含3个单词的总短语不超过12个单词，就可以在3个所需单词之间插入其他单词。

可行的短语：

我希望天气不好。
他们期待恶劣多风的天气。
我希望，虽然没有人证实这一点，但是天气恶劣了方式。

我一直在努力弄清楚如何做到这一点。我知道如何计算可能存在差距的双字短语的出现次数。例如，如果我计算频率＆＃34;期望＆＃34;和＆＃34;天气＆＃34;发生在一个12字的短语中，我可以这样做：

$mycount =()= $text =~ /\b(?:expect\W+(?:\w+\W+){0,10}?weather)\b/gi;

然而，当我想用3个单词做这个时，它并不那么简单，因为我最终得到了两个必须相加的间隙，这样我的窗口就不会超过12个单词。理想情况下，我可以做类似的事情：

$mycount =()= $text =~ /\b(?:expect\W+(?:\w+\W+){0,$Gap1}?bad\W+(?:\w+\W+){0,$Gap2}?weather)\b/gi;

$ Gap2 = 9 - $ Gap1，但我认为没有办法做到这一点。

我还想过创建一个循环，以便在循环的一次迭代中，$ Gap1 = 0和$ Gap2 = 9，在第二次迭代中$ Gap1 = 1和$ Gap2 = 8等，然后添加计数所有的循环。但是，这样做会使该短语的一些实例重复计算。

我不知所措。有没有人有任何想法？我无法在任何地方找到相关的例子。

Answer 1

注意这篇文章解决了在窗口中查找分散的单词的问题。它没有考虑更常见的文本解析或语言分析问题。

下面的代码搜索第一个单词，然后继续使用另外两个正则表达式。它在那里逐字扫描文本并保留一个计数器，因此它可以停在12个单词。它使用pos来控制在检查窗口后它应该继续的位置。

一旦发现，这个12长的窗口就会从单词expect开始，正如评论中所阐明的那样。搜索从完成的短语之后继续搜索，用于下一个短语。

如果在接下来的11个单词中找不到该短语，则引擎将返回expect之后的位置继续进行搜索（因为在选中的11个单词中可能有另一个expect）

use warnings;
use strict;
use feature 'say';

my $s = q(I expect, although no one confirmed, that bad weather is on the way.);
$s   .= q(  Expect that we cannot expect to escape the bad, bad weather.);

my $word_range = 12;
my ($w1, $w2, $w3) = qw(expect bad weather);

FIRST_WORD: while ($s =~ /\b($w1)\b/gi) {
    #say "SEARCH, at ", pos $s;
    my ($one, $pos_one) = ($1, pos $s);

    my ($two, $three, $cnt);

    while ($s =~ /(\w+)/g) {
        my $word = $1; 
        #say "\t$word  ... (at ", pos $s, ")";

        $two = $1  if $word =~ /\b($w2)\b/i; 

        if ( $two and (($three) = $word =~ /\b($w3)\b/i) ) { 
            say "$one + $two + $three  (pos ", pos $s, ')';
            next FIRST_WORD;
        }
        last if ++$cnt == $word_range-1;  # failed (these 11 + 'expect') 
    }
    pos $s = $pos_one;         # return to position in string after 'expect'
}

请注意，无法在循环条件内分配匹配（对于$one），因为这会将匹配放在列表上下文中，从而干扰/g和pos所需的行为。

注释掉的打印件可用于跟踪操作。现在就是这个版画

expect + bad + weather  (pos 53)
Expect + bad + weather  (pos 128)

我扩展字符串以测试多次出现的短语。可以通过削弱关键字并跟踪搜索中的位置来测试匹配失败的操作。

短语内部可能的额外关键字，如第二句中所示，将被忽略，如果存在，则接受该短语，因为这是未指定但在问题中隐含的。这很容易改变。

如果短语中有更多的单词，他们将在内部while循环中寻找所有单词，就像现在的最后两个单词一样，通过顺序匹配它们（要求每个单词表示所有前面的单词）已被发现）。仅需要外部while循环来启动窗口。

在窗口扫描失败后，外部while继续从窗口开始位置搜索expect，从而再次扫描相同的11个字。

通过在窗口扫描期间检查expect，可以减少对文本的重复搜索。然后使用内部while

从该位置重新扫描

# First sentence shortened and now does not contain the phrase
my $s = q(I expect, although no one confirmed, that bad expect.);
$s   .= q( Expect that we cannot expect to escape the bad, bad weather.);    
...
FIRST_WORD: while ($s =~ /\b($w1)\b/gi) {
    my ($one, $pos_one) = ($1, pos $s);

    my ($two, $three, $cnt, $pos_one_new);

    while ($s =~ /(\w+)/g) {
        my $word = $1;
        #say "\t$word  ... (at ", pos $s, ")";

        $pos_one_new = pos $s
            if not $pos_one_new and $word =~ /\b$w1\b/i;

        $two = $1  if $word =~ /\b($w2)\b/i;

        if ( $two and (($three) = $word =~ /\b($w3)\b/i) ) {
            say "$one + $two + $three  (pos ", pos $s, ')';
            next FIRST_WORD;
        } 

        if (++$cnt == $word_range-1) {
            last  if not $pos_one_new;

            #say "Scan window anew from $pos_one_new";
            pos $s   = $pos_one_new;
            $pos_one = $pos_one_new;
            $pos_one_new = 0;
            $two = $three = '';
            $cnt = 0;
        }
    }
    pos $s = $pos_one;
}

打印

expect + bad + weather  (pos 113)

请注意，窗口中的{em>第一次出现expect。

Answer 2

由于您提到处理文档，我假设您正在使用一长串句子。所以你可以：

我不确定为什么我总是对人们不好。天气不是指示器。

我认为这不适合标记为目标短语的出现＆＃34;期待恶劣的天气＆＃34;。

您已经获得了一个纯粹的正则表达式的答案。您可以通过拆分句子轻松修复它所具有的交叉句短语检测错误，就像我在这里做的那样。尽管如此，我还是想到了另一种思考这个问题的方法。

关键概念是标记化和标准化。

首先，我们将语料库转换为句子列表。这是一个标记化级别。

秒我们将每个句子变成一串小写单词，删除标点符号（撇号除外）。标记化级别2和标准化。

现在你所要做的就是筛选所有成堆的代币，看看是否有任何包含目标代币。

我通过循环语料库文本寻找我们匹配目标的第一个单词的地方，以非常懒惰的方式处理回溯。如果发生这种情况，我会从语料库中获取最大数量的单词，并检查目标列表是否包含在该列表中。这提供了一个很好的回溯行为，没有所有的簿记。

use strict;
use warnings;
use feature 'say';
use Lingua::Sentence;

my $doc = "I am unsure why I always expect bad from people. Weather isn't an indicator. My mood is fine whether it is sunny or I expect to see some bad weather.";

my @targets = (
    [qw/ expect bad weather /],
    [qw/ my mood is bad /],
);
my $max_phrase_length = 12;

my $splitter = Lingua::Sentence->new('en');
my @sentences = $splitter->split_array( $doc );

my %counter;

for my $sentence ( @sentences ) {
    say "Checking sentence:  $sentence";
    my @words = map lc,                  # Normalize to lowercase
                map /(['\w]*)/,          # get rid of punctuation
                split /\s+/, $sentence;  # Break sentence into words

    for my $target ( @targets ) {
        say "    Checking target:  @$target";

        for my $i (0..$#words ) { 
            my $word = @words[$i];
            say "        Checking $word";

            next if $word ne $target->[0];

            my $first_word = $i; 
            my $last_word = $i + $max_phrase_length -1; 
            $last_word = $#words if $last_word > $#words;

            if ( has_phrase( $target, [ @words[$first_word..$last_word] ] ) ) { 
                say "BINGO!  @$target";
                $counter{ "@$target" }++;
            }
        }
    }
}

use Data::Dumper;

print Dumper \%counter;


sub has_phrase {
    my ( $target, $text ) = @_;
    return if @$target > $text;

    my $match_idx = 0;
    for my $idx ( 0..$#$text ) {
        if ($target->[$match_idx] eq $text->[$idx]) {
            $match_idx++;
            return 1 if $match_idx eq @$target;
        }
    }

    return;
}

Answer 3

你的要求对我来说有点模糊。就像我不知道你是否想要接受任何单词序列，然后计算“期望。*糟糕。*天气”，或者如果你只想取12个单词而忽略其余部分，或者你想要滑动一次一个字，一次不超过12个字。

我想我会简化它：我拿全线输入;我抛出任何不期望，不好或天气的词;然后我计算其后“预期恶劣天气”发生的次数。如果有什么说“期待恶劣的恶劣天气”这不是一场比赛。我确信你可以用更准确的要求修改它，因为你比我更了解它们。

while(<>){
$_=lc;
@w=split(/\W+/);
@w=map {
  if    (/expect/)      {1}
  elsif (/bad/)         {2}
  elsif (/weather/)     {3}
  else                  {0}
  } @w;
$_ = join("", @w);
print;
@w=grep {+$_>0} @w;
$_ = join("", @w);
print "=>$_";
@r=/123/g;
print "=".scalar(@r)."\n";
}

示例：

hi! Expect really bad weather man.
010230=>123=1
hi! Expect really bad weather man.hi! Expect really bad weather man.hi! Expect really bad weather man.
010230010230010230=>123123123=3
Expect expect bad weather, expect bad bad bad weather, expect bad expect weather.
1123122231213=>1123122231213=1

您也可以对此进行排序，但我认为scalar(/123/g)表示与@r=/123/g;scalar @r;不同的内容，因此我提出了scalar(@_=/123/g)。

$ perl -nE '$_=lc;$_=join("",grep{+$_>0}map{if(/expect/){1}elsif(/bad/){2}elsif(/weather/){3}else{0}}split(/\W+/));say scalar(@_=/123/g)."\n";'
hi! Expect really bad weather man.
1

hi! Expect really bad weather man. hi! Expect really bad weather man.
2

Expect Sad and Bad Weather today. Maybe Expect bad weather tomorrow too, because scalar is not helping.
2

Perl：如何计算N字窗口{3}中出现3字短语（带间隙）的次数

3 个答案: