How to detect and replicate word co-occurrences in a sentence in Perl?

Time: 2011-11-29 23:01:57

Tags: perl text

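I have a set of words, and I want to produce copies of a sentence based on the occurrences of two or more of those words:

Example:

I want to find 'boy' or 'boys' and 'girl' or 'girls' in a sentence, so that I get these sets: (boy and girl), (boy and girls), (girl and boys) and (boys and girls).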

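Sentence:

  The boy is going to school with a girl, because the boys like the girls so much.

Sentence representation:

  The WORD1 is going to school with a WORD2, because the WORD3 like the WORD4 so much.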

How can I get four (4) different forms of the sentence, so that it looks like this:

Output:

The WORD1 is going to school with a WORD2, because the WORD like the WORD so much.
The WORD1 is going to school with a WORD, because the WORD like the WORD4 so much.
The WORD is going to school with a WORD2, because the WORD3 like the WORD so much.
The WORD is going to school with a WORD, because the WORD3 like the WORD4 so much.

NB:

The number of words is dynamic and can be 2 or more; in this example I have 4 words.

2 Answers:

Answer 0: (score: 1)

Use a backreference:

# The trailing \b keeps "boy" from matching the start of "boys".
if ($sentence =~ m/\b(\w+)\b.*\b\1\b/) {
  print "repeated use of the word $1\n";
}
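
To see what a backreference test can and cannot do here, the following is a minimal sketch that lists every word repeated later in the sentence. The sub name repeated_words is hypothetical, and the /i flag and the lookahead are additions for illustration, not part of the answer above. Note that it only finds exact repeats, so 'boy' and 'boys' are not treated as the same word.

```perl
use strict;
use warnings;

# Hypothetical helper: return each word that occurs again later in the
# sentence, compared case-insensitively and as a whole word.
sub repeated_words {
    my ($sentence) = @_;
    my (%seen, @repeated);

    # The lookahead (?=...) checks for a later whole-word copy of \1
    # without consuming text, so //g keeps scanning from the next word.
    while ($sentence =~ /\b(\w+)\b(?=.*\b\1\b)/gi) {
        my $word = lc $1;
        push @repeated, $word unless $seen{$word}++;
    }
    return @repeated;
}

my $sentence = 'The boy is going to school with a girl, '
             . 'because the boys like the girls so much.';

# Only "the" is an exact repeat in the example sentence.
print join(', ', repeated_words($sentence)), "\n";
```

For the example sentence this reports only "the", which suggests that singular and plural forms need to be normalized first if they should count as the same word.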

Answer 1: (score: 1)

Although it still needs a lot of improvement, the following should get you started and point you in the right direction:

#!/usr/bin/env perl

use strict;
use warnings;

use Algorithm::Permute;
use Lingua::EN::Tagger;
use Lingua::EN::Inflect::Number qw(to_S);

my $text = q{The boy is going to school with a girl, because the boys
like the girls so much.};

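# Tag each token of the text with its part of speech.
my $tagger = Lingua::EN::Tagger->new;

my $tagged_text = $tagger->add_tags( $text );

# Collect the nouns found in the tagged text (noun => number of occurrences).
my %nouns = $tagger->get_nouns( $tagged_text );

# Group the nouns by their singular form, e.g. boy => { boy, boys }.
my %normalized;
for my $noun (keys %nouns) {
    $normalized{ to_S($noun) }{ $noun } = undef;
}

# Print every ordering of the forms found in each group.
for my $nouns (values %normalized) {
    my $p = Algorithm::Permute->new([ keys %$nouns ]);

    while (my @tuple = $p->next) {
        print join(', ', @tuple), "\n";
    }
}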

Output (the order of the lines may vary, because Perl does not guarantee hash ordering):

boy, boys
boys, boy
school
girl, girls
girls, girl
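
Once the word forms are grouped, the four sentence forms asked for in the question could be produced roughly as follows. This is only a sketch in core Perl under stated assumptions: the groups are hard-coded here rather than taken from a tagger, and sentence_forms is a hypothetical name. It numbers every occurrence of a target word, then picks one occurrence per group and labels only the picked ones.

```perl
use strict;
use warnings;

# Hypothetical helper: @groups holds one array ref of word forms per group.
sub sentence_forms {
    my ($sentence, @groups) = @_;

    # Map every word form to the index of its group.
    my %group_of;
    for my $g (0 .. $#groups) {
        $group_of{ lc $_ } = $g for @{ $groups[$g] };
    }

    # Split into words and separators; the capture keeps the separators.
    my @tokens = split /(\W+)/, $sentence;

    my @hits;                              # token index of each target word
    my @by_group = map { [] } @groups;     # occurrence numbers per group
    for my $i (0 .. $#tokens) {
        my $g = $group_of{ lc $tokens[$i] };
        next unless defined $g;
        push @hits, $i;
        push @{ $by_group[$g] }, scalar @hits;   # 1-based: WORD1, WORD2, ...
    }

    # Cartesian product: choose one occurrence from every group.
    my @combos = ([]);
    for my $occurrences (@by_group) {
        my @next;
        for my $combo (@combos) {
            push @next, [ @$combo, $_ ] for @$occurrences;
        }
        @combos = @next;
    }

    # Chosen occurrences become WORDn, all other target words plain WORD.
    my @forms;
    for my $combo (@combos) {
        my %chosen = map { $_ => 1 } @$combo;
        my @out = @tokens;
        for my $k (1 .. scalar @hits) {
            $out[ $hits[$k - 1] ] = $chosen{$k} ? "WORD$k" : 'WORD';
        }
        push @forms, join '', @out;
    }
    return @forms;
}

my $text = 'The boy is going to school with a girl, '
         . 'because the boys like the girls so much.';
print "$_\n" for sentence_forms($text, [qw(boy boys)], [qw(girl girls)]);
```

Because the combinations are built group by group, the same sub handles more than two groups, which covers the case where the number of words is dynamic.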