Perl paragraph n-grams

Time: 2010-08-18 20:58:52

Tags: perl n-gram

Suppose I have a piece of text:

$body = 'the quick brown fox jumps over the lazy dog';

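I want to put this sentence into a hash of 'keywords', but I also want to allow multi-word keywords. I have the following to get the single-word keywords: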

$words{$_}++ for $body =~ m/(\w+)/g;

After this runs, my hash looks like this:

'the' => 2,
'quick' => 1,
'brown' => 1,
'fox' => 1,
'jumps' => 1,
'over' => 1,
'lazy' => 1,
'dog' => 1

Next, I can get two-word keywords like this:

$words{$_}++ for $body =~ m/(\w+ \w+)/g;

But this only gets every other pair, because each match consumes both words; the result looks like this:

'the quick' => 1,
'brown fox' => 1,
'jumps over' => 1,
'the lazy' => 1

I also need the pairs that are offset by one word:

'quick brown' => 1,
'fox jumps' => 1,
'over the' => 1

Is there an easier way than this?

my $orig_body = $body;
# single word keywords
$words{$_}++ for $body =~ m/(\w+)/g;
# double word keywords
$words{$_}++ for $body =~ m/(\w+ \w+)/g;
$body =~ s/^(\w+)//;
$words{$_}++ for $body =~ m/(\w+ \w+)/g;
$body = $orig_body;
# triple word keywords
$words{$_}++ for $body =~ m/(\w+ \w+ \w+)/g;
$body =~ s/^(\w+)//;
$words{$_}++ for $body =~ m/(\w+ \w+ \w+)/g;
$body = $orig_body;
$body =~ s/^(\w+ \w+)//;
$words{$_}++ for $body =~ m/(\w+ \w+ \w+)/g;

5 answers:

Answer 0 (score: 5)

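While the task described might be interesting to code by hand, wouldn't it be better to use an existing CPAN module that handles n-grams? It looks like Text::Ngrams (as opposed to Text::Ngram) can handle word-based n-gram analysis.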

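Answer 1 (score: 2)

I would use a look-ahead to collect everything except the first word. Only the first word (and the whitespace after it) is consumed, so the match position automatically advances correctly: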

my $body = 'the quick brown fox jumps over the lazy dog';

my %words;

++$words{$1}         while $body =~ m/(\w+)/g;
++$words{"$1 $2"}    while $body =~ m/(\w+) \s+ (?= (\w+) )/gx;
++$words{"$1 $2 $3"} while $body =~ m/(\w+) \s+ (?= (\w+) \s+ (\w+) )/gx;

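If you want to stick with single spaces instead of \s+, you can simplify this a bit (don't forget to remove the /x modifier if you do), because you can then collect any number of words in $2 instead of using one group per word.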
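A minimal sketch of that simplification, assuming the words are separated by exactly one space. Here $2 holds one word for the pairs and two words for the triples, so the hash key is always "$1 $2":

```perl
use strict;
use warnings;

my $body = 'the quick brown fox jumps over the lazy dog';
my %words;

# Single words.
++$words{$1}      while $body =~ m/(\w+)/g;
# Pairs: consume one word and a space, look ahead at the next word.
++$words{"$1 $2"} while $body =~ m/(\w+) (?=(\w+))/g;
# Triples: same idea, but the lookahead captures two words at once.
++$words{"$1 $2"} while $body =~ m/(\w+) (?=(\w+ \w+))/g;

print "$_ => $words{$_}\n" for sort keys %words;
```

Without /x the literal spaces in the patterns are significant, which is why the modifier has to go.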

Answer 2 (score: 2)

You can do something a little funky with lookaheads:

If I do this:

$words{$_}++ for $body =~ m/(?=(\w+ \w+))\w+/g;

That expression says to look ahead for two words (and capture them), but consume only one.

I get:

%words: {
          'brown fox' => 1,
          'fox jumps' => 1,
          'jumps over' => 1,
          'lazy dog' => 1,
          'over the' => 1,
          'quick brown' => 1,
          'the lazy' => 1,
          'the quick' => 1
        }

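It seems I can generalize this by putting in a variable for the count:

# Note: {$n} counts the words after the first, so $n = 4 captures 5-word phrases
my $n    = 4;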
$words{$_}++ for $body =~ m/(?=(\w+(?: \w+){$n}))\w+/g;
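As a sketch of how this could cover every phrase length in one pass, the pattern can be wrapped in a loop. The helper name count_ngrams and the variable $extra are made up for this illustration; $extra is the number of words after the first, so it is one less than the phrase length:

```perl
use strict;
use warnings;

# Hypothetical helper: count all phrases of 1 .. $max_n words,
# assuming words are separated by single spaces.
sub count_ngrams {
    my ($body, $max_n) = @_;
    my %words;
    for my $n (1 .. $max_n) {
        my $extra = $n - 1;    # words that follow the first one
        # Look ahead at $n words (captured), but consume only one.
        $words{$_}++ for $body =~ m/(?=(\w+(?: \w+){$extra}))\w+/g;
    }
    return \%words;
}

my $counts = count_ngrams('the quick brown fox jumps over the lazy dog', 3);
print "$_ => $counts->{$_}\n" for sort keys %$counts;
```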

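Answer 3 (score: 1)

Use the pos operator:

pos SCALAR

Returns the offset of where the last m//g search left off for the variable in question ($_ is used when the variable is not specified).

and the @- special array:

@LAST_MATCH_START
@-

$-[0] is the offset of the start of the last successful match. $-[n] is the offset of the start of the substring matched by the n-th subpattern, or undef if the subpattern did not match.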
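To make those offsets concrete, here is a small sketch on a shortened sample string (the variable names are chosen only for this illustration):

```perl
use strict;
use warnings;

my $s = 'the quick brown';
my ($end, @starts);

if ($s =~ /(\w+) (\w+)/g) {
    $end    = pos($s);                  # where the m//g match stopped
    @starts = ($-[0], $-[1], $-[2]);    # start of the match and of each group
}

print "pos: $end\n";                    # pos: 9
print "starts: @starts\n";              # starts: 0 0 4
```

The match covers 'the quick', so it ends at offset 9, while the second group ('quick') starts at offset 4. Assigning that 4 back to pos makes the next m//g search begin at 'quick'.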

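For example, the program below grabs the second word of every pair in its own capture and rewinds the match position to it, so that the second word becomes the first word of the next pair: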

#! /usr/bin/perl

use warnings;
use strict;

my $body = 'the quick brown fox jumps over the lazy dog';

my %words;
while ($body =~ /(\w+ (\w+))/g) {
  ++$words{$1};
  pos($body) = $-[2];
}

for (sort { index($body,$a) <=> index($body,$b) } keys %words) {
  print "'$_' => $words{$_}\n";
}

Output:

'the quick' => 1
'quick brown' => 1
'brown fox' => 1
'fox jumps' => 1
'jumps over' => 1
'over the' => 1
'the lazy' => 1
'lazy dog' => 1
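The same rewinding trick appears to extend to longer phrases. As a sketch for three-word keywords, the inner capture still marks the second word, and the position is reset to it after every match:

```perl
use strict;
use warnings;

my $body = 'the quick brown fox jumps over the lazy dog';

my %words;
while ($body =~ /(\w+ (\w+) \w+)/g) {
    ++$words{$1};
    pos($body) = $-[2];    # restart the search at the second word
}

print "'$_' => $words{$_}\n" for sort keys %words;
```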

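Answer 4 (score: 1)

Is there any particular reason to use regexes alone? To me, the obvious approach is to split the text into an array, then use a pair of nested loops to extract the counts from it. Something like: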

#!/usr/bin/env perl

use strict;
use warnings;

my $text = 'the quick brown fox jumps over the lazy dog';
my $max_words = 3;

my @words = split / /, $text;
my %counts;

for my $pos (0 .. $#words) {
  for my $phrase_len (0 .. ($pos >= $max_words ? $max_words - 1 : $pos)) {
    my $phrase = join ' ', @words[($pos - $phrase_len) .. $pos];
    $counts{$phrase}++;
  }
}

use Data::Dumper;
print Dumper(\%counts);

Output:

$VAR1 = {
          'over the lazy' => 1,
          'the' => 2,
          'over' => 1,
          'brown fox jumps' => 1,
          'brown fox' => 1,
          'the lazy dog' => 1,
          'jumps over' => 1,
          'the lazy' => 1,
          'the quick brown' => 1,
          'fox jumps' => 1,
          'over the' => 1,
          'brown' => 1,
          'fox jumps over' => 1,
          'quick brown' => 1,
          'jumps' => 1,
          'lazy' => 1,
          'jumps over the' => 1,
          'lazy dog' => 1,
          'dog' => 1,
          'quick brown fox' => 1,
          'fox' => 1,
          'the quick' => 1,
          'quick' => 1
        };

Edit: Fixed the $phrase_len loop to prevent the use of negative indexes, which was causing incorrect results, per cjm's comment.
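A variant of the same idea, sketched as a reusable function that builds each phrase forward from its first word. The name ngram_counts is made up for this illustration, and using split ' ' (rather than split / /) is an assumption that runs of whitespace should count as a single separator:

```perl
use strict;
use warnings;

# Hypothetical helper: count every phrase of 1 .. $max_words words.
sub ngram_counts {
    my ($text, $max_words) = @_;
    my @words = split ' ', $text;    # splits on any run of whitespace
    my %counts;
    for my $start (0 .. $#words) {
        for my $len (1 .. $max_words) {
            my $end = $start + $len - 1;
            last if $end > $#words;  # phrase would run past the last word
            $counts{ join ' ', @words[$start .. $end] }++;
        }
    }
    return \%counts;
}

my $counts = ngram_counts("the quick  brown\tfox jumps over the lazy dog", 3);
print "$_ => $counts->{$_}\n" for sort keys %$counts;
```

Because the inner loop stops as soon as a phrase would pass the end of the array, no negative or out-of-range indexes are ever used.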