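How to list multiple sentences that contain the same word, with that word as the heading

Date: 2011-05-10 16:28:31

Tags: perl

Currently this prints each noun with the sentence right underneath it.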

#!/usr/bin/perl
use strict;
use warnings FATAL => "all";
my $search_key = "expend";    ## CHANGE "..." to <>

open(my $tag_corpus, '<', "ch13tagged.txt") or die $!;

my @sentences = <$tag_corpus>;    # This breaks up each line into list
my @words;
my %seens = ();
my %seenw = ();

for (my $i = 0; $i <= @sentences; $i++) {
    if (defined($sentences[$i]) and $sentences[$i] =~ /($search_key)_VB.*/i) {
        @words = split /\s/, $sentences[$i];    ## \s is a whitespace
        for (my $j = 0; $j <= @words; $j++) {
            #FILTER if word is noun, and therefore will end with _NN:
            if (defined($words[$j]) and $words[$j] =~ /_NN/) {
                #PRINT word (without _NN) and sentence (without any _ENDING):
                next if $seenw{$words[$j]}++;    ## How to include plural etc
                push @words, $words[$j];
                print "**", split(/_\S+/, $words[$j]), "**", "\n";
                ## next if $seens{ $sentences[$i] }++;
                ## push @sentences, $sentences[$i];
                print split(/_\S+/, $sentences[$i]), "\n"
                ## HOW PRINT bold or specifically word bold?
                #FILTER if word has been output, add sentence under that heading
            }
        }    ## put print sentences here to print each sentence after all the nouns inside
    }
}
close $tag_corpus || die "Can't close $tag_corpus: $!";

1 answer:

Answer 0 (score: 1)

Your original:

#!/usr/bin/perl
use strict;
use warnings FATAL => "all";

This is a good start...

my $search_key = "expend";    ## CHANGE "..." to <>

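Since you are going to use this in a regex inside a loop, it is better to compile it as a regex: my $verb_regex = qr/\bexpend_VB\b/i. I put the word boundaries in there because it looks like you need them.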
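As a minimal sketch of what the precompiled pattern does, the following runs it against a few made-up tagged lines (the sample sentences and the VBP tag are assumptions for illustration, not taken from ch13tagged.txt). Because letters and the underscore are word characters, the trailing \b also rejects longer tags that merely start with VB.

```perl
use strict;
use warnings;

# Compile the pattern once, outside any loop.
my $verb_regex = qr/\bexpend_VB\b/i;

# Made-up sample lines; the real corpus may look different.
my @lines = (
    "They_PRP expend_VB energy_NN ._.",
    "We_PRP overexpend_VB money_NN ._.",    # no word boundary before "expend"
    "Expend_VB less_JJR ._.",               # matches thanks to /i
    "They_PRP expend_VBP effort_NN ._.",    # "VBP" continues the word, so \b fails
);

my @matched = grep { $_ =~ $verb_regex } @lines;
print "$_\n" for @matched;
```

If tags longer than VB should also count, the trailing \b would have to be relaxed; whether that is wanted depends on the tag set in the corpus.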

open(my $tag_corpus, '<', "ch13tagged.txt") or die $!;

my @sentences = <$tag_corpus>;    # This breaks up each line into list
my @words;
my %seens = ();
my %seenw = ();

for (my $i = 0; $i <= @sentences; $i++) {

This is roughly the same, with less overhead:

while ( <$tag_corpus> ) { 
    ...

Back to yours:

    if (defined($sentences[$i]) and $sentences[$i] =~ /($search_key)_VB.*/i) {

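Each line includes the record separator (and always will unless you chomp it), so every line you read is defined until the end of the file. There is no need to test for defined.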
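One reason the defined test appears to be needed in the original is the loop bound: $i <= @sentences runs one index past the end of the array. A small sketch with made-up lines shows the difference between that loop and iterating over the elements directly.

```perl
use strict;
use warnings;

my @sentences = ("one_CD\n", "two_CD\n");    # made-up lines

# The C-style loop with <= also visits index 2, which does not exist.
my $undefined_seen = 0;
for (my $i = 0; $i <= @sentences; $i++) {
    $undefined_seen++ unless defined $sentences[$i];
}

# Iterating over the elements themselves never goes past the end.
my $visited = 0;
for my $sentence (@sentences) {
    $visited++;
}

print "undefined elements seen by the C-style loop: $undefined_seen\n";
print "elements visited by foreach: $visited\n";
```

With < instead of <=, or with a foreach or while loop, the defined test becomes unnecessary.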

Also, you don't need the .* after the search term, and capturing $search_key has no effect.

        @words = split /\s/, $sentences[$i];    ## \s is a whitespace

You don't want to split on a single whitespace character. You should use /\s+/, or better still: @words = split ' ', $sentences[$i];
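To make the difference between the three forms concrete, here is a small sketch on a made-up line with leading and doubled spaces. The counts differ because /\s/ produces an empty field for every extra whitespace character, /\s+/ still produces one empty leading field, and ' ' skips leading whitespace altogether.

```perl
use strict;
use warnings;

my $line = "  the_DT  cat_NN sat_VBD\n";    # made-up line with extra spaces

my @single = split /\s/,  $line;    # splits at every whitespace character
my @plus   = split /\s+/, $line;    # splits on runs, keeps an empty leading field
my @awk    = split ' ',   $line;    # skips leading whitespace, splits on runs

printf "%-6s gives %d fields\n", '/\s/',  scalar @single;
printf "%-6s gives %d fields\n", '/\s+/', scalar @plus;
printf "%-6s gives %d fields\n", "' '",   scalar @awk;
```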

But you don't even need that.

        for (my $j = 0; $j <= @words; $j++) {
            #FILTER if word is noun, and therefore will end with _NN:
            if (defined($words[$j]) and $words[$j] =~ /_NN/) {
                #PRINT word (without _NN) and sentence (without any _ENDING):

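But what you are after is words that end in _NN, so the match should be anchored at the end (/_NN$/). Also, every element of the list that split returns is defined, so there is no need to test for that.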

                next if $seenw{$words[$j]}++;    ## How to include plural etc

Unless you reset %seenw after each sentence, you will only process each _NN word once per file.

                push @words, $words[$j];

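I can't see what purpose this push could serve by appending the noun back onto the list of words. Sure, the uniqueness check just before it saves you from an infinite loop whenever there is an _NN word, but it only means you end up with all the words in the sentence followed by all the "nouns". Not only that, you will simply test again that each one is a noun and then do nothing with it. Not to mention that you clobber the list with the next sentence.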

                print "**", split(/_\S+/, $words[$j]), "**", "\n";

                ## next if $seens{ $sentences[$i] }++; 

You don't want to do this inside the loop over words.

                ## push @sentences, $sentences[$i];

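Likewise, if this were not commented out, I think you would want it outside the loop over words. It seems that everything from two lines earlier onward belongs after that loop.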

                print split(/_\S+/, $sentences[$i]), "\n"
                ## HOW PRINT bold or specifically word bold?
                #FILTER if word has been output, add sentence under that heading
            }
        }    ## put print sentences here to print each sentence after all the nouns inside
    }
}
close $tag_corpus || die "Can't close $tag_corpus: $!";

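No. That will not handle a bad return from close. The || operator "binds" too tightly: you are closing either $tag_corpus or the output of die. Fortunately (or perhaps unfortunately), die will never be called, because if we have got this far, $tag_corpus should be a true value.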

Here is a sort of cleaned-up version of what you are trying to do, using the parts I could understand.

my @sentences;
# We're processing a single line at a time.
while ( <$tag_corpus> ) { 
    # Test if we want to work with the line
    next unless m/$verb_regex/;
    # If we do, then test that we haven't dealt with it before
    # Although I suspect that this may not be needed as much if we're not 
    # pushing to a queue that we're reading from.
    next if    $seens{ $_ }++;

    # split -> split ' ', $_
    # pass through only those words that match _NN at the end and
    # are unique so far. We test on a substitution, because the result
    # still uniquely identifies a noun
    foreach my $noun ( grep { s/_NN$// && !$seenw{ $_ }++ } split ) { 
        print "**$noun**\n";
    }
    # This will omit any adjacent punctuation you have after the word--if 
    # that's a problem.
    print split( /_\S+/ ), "\n";
    # Here we save the sentence.
    push @sentences, $_;
}
close $tag_corpus or die "Can't close ch13tagged.txt: $!";
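
The version above still prints each noun once and the sentence right after it. If the goal in the title is to collect every matching sentence under each noun heading, one possible approach is a hash of arrays. This is only a sketch under assumptions: the sample lines are made up and stand in for ch13tagged.txt, and %sentences_for and @noun_order are hypothetical names that do not appear in the original.

```perl
use strict;
use warnings;

# Made-up tagged lines standing in for the contents of ch13tagged.txt.
my @lines = (
    "They_PRP expend_VB energy_NN on_IN work_NN ._.",
    "We_PRP expend_VB energy_NN ._.",
    "Cats_NNS sleep_VBP ._.",
);

my $verb_regex = qr/\bexpend_VB\b/i;
my %sentences_for;    # noun => [ readable sentences that contain it ]
my @noun_order;       # nouns in order of first appearance

for my $line (@lines) {
    next unless $line =~ $verb_regex;

    # Strip the _TAG suffixes to get the readable sentence.
    (my $plain = $line) =~ s/_\S+//g;

    my %in_line;      # avoid listing one sentence twice under the same noun
    for my $word (split ' ', $line) {
        next unless $word =~ s/_NN$//;    # keep only words tagged exactly _NN
        next if $in_line{$word}++;
        push @noun_order, $word unless exists $sentences_for{$word};
        push @{ $sentences_for{$word} }, $plain;
    }
}

# Print each noun as a heading, followed by its sentences.
for my $noun (@noun_order) {
    print "**$noun**\n";
    print "$_\n" for @{ $sentences_for{$noun} };
}
```

Collecting first and printing afterwards is what allows a sentence read later in the file to appear under a heading that was first seen earlier.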