Question

I have this text:

 This is a test. This is only a test! If there were an emergency, then Information would be provided for you.

I want to be able to determine which words start a sentence. What I have so far is:

 $ cat <FILE> | perl -pe 's/[\s.?!]/\n/g;'

This just replaces the whitespace and the sentence punctuation with newlines, which gives me:

 This 
 is 
 a 
 test 

 This
 is
 only
 a
 test

 If
 there
 were
 an
 emergency,
 then
 Information
 would
 be
 provided
 for
 you

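From here I could somehow extract the words that have nothing above them (the start of the file) or a blank line above them, but I'm not sure how to do that.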

Answer 1

If your Perl is at least version 5.22.1 (or 5.22.0, provided this case is not affected by the bug described here), you can use a sentence boundary in the regex.

use feature 'say';

# Assumes the text is in $_; each capture is the first word of a sentence.
foreach my $word (m/\b{sb}(\w+)/g) {
    say $word;
}

Or, as a one-liner:

perl -nE 'say for /\b{sb}(\w+)/g'

When run on the sample text, the output is:

This
This
If

This uses \b{sb}, which matches at a sentence boundary. You can read a tutorial at brian d foy's blog. The \b{} constructs are called Unicode boundaries and are described in perlrebackslash.

Answer 2

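#!/usr/bin/env perl

use strict;
use warnings;
use Data::Dumper;

# Undefine the input record separator so <DATA> returns everything in one read.
local $/;

# Capture the word at the start of the text or after sentence-ending punctuation.
# ^\s* makes the leading space optional; ? is included alongside . and !
my @words = <DATA> =~ m/(?:^\s*|[.?!]+\s+)(\w+)/g;

print Dumper \@words;

__DATA__
 This is a test. This is only a test! If there were an emergency, then Information would be provided for you.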

So, as a command line (reading line by line, and printing one word per line):

perl -ne 'print "$_\n" for m/(?:^\s*|[.?!]+\s+)(\w+)/g;' somefile

Answer 3

You can use this GNU grep command to extract the first word after each period, ! or ?:

grep -oP '(?:^|[.?!])\s*\K[A-Z][a-z]+' file

This
This
If

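I must caution you, though, that cases such as Mr. Smith can produce wrong results.

Regex breakdown:

(?:^|[.?!]) - matches the start of the line, or a literal ., ! or ?
\s* - matches zero or more whitespace characters
\K - resets the start of the reported match, discarding everything matched so far
[A-Z][a-z]+ - matches a word that starts with an uppercase letter followed by one or more lowercase letters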
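The Mr. Smith caveat is easy to reproduce. The sketch below wraps the grep command in a function (the name `first_words` is made up for illustration, and GNU grep built with `-P` support is assumed) and runs it on the sample text and on a sentence that contains an abbreviation.

```shell
#!/bin/sh
# Hypothetical helper wrapping the grep command from this answer.
# Assumes GNU grep with PCRE support (-P).
first_words() {
    grep -oP '(?:^|[.?!])\s*\K[A-Z][a-z]+'
}

# Sample text from the question.
printf '%s\n' 'This is a test. This is only a test! If there were an emergency, then Information would be provided for you.' | first_words
# prints:
# This
# This
# If

# The abbreviation caveat: "Smith" is reported as if it started a sentence.
printf '%s\n' 'We met Mr. Smith today. He left.' | first_words
# prints:
# We
# Smith
# He
```

Because the pattern requires at least one lowercase letter after the capital, single-letter sentence openers such as "I" or "A" are not reported either.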

Using the command line and a regex to determine the words that start sentences
