I have this text:
This is a test. This is only a test! If there were an emergency, then Information would be provided for you.
I want to be able to determine which words start a sentence. What I have so far is:
$ cat <FILE> | perl -pe 's/[\s.?!]/\n/g;'
This just strips the punctuation and whitespace and replaces them with newlines, giving me:
This
is
a
test

This
is
only
a
test

If
there
were
an
emergency,
then
Information
would
be
provided
for
you
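From here I could somehow extract the words that have nothing above them (start of file) or whitespace (a blank line) above them, but I'm not sure how to do that.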
Answer 0 (score: 6)
If your Perl is at least version 5.22.1 (or 5.22.0 and your case is not affected by the bug described here), you can use a sentence boundary in the regular expression.
use feature 'say';

# The match runs against $_; each capture is the first word after a sentence boundary.
foreach my $word (m/\b{sb}(\w+)/g) {
    say $word;
}
Or, as a one-liner:
perl -nE 'say for /\b{sb}(\w+)/g'
When run on the example text, the output is:
This
This
If
It uses \b{sb}, which is a sentence boundary. You can read a tutorial at brian d foy's blog. \b{} is called a Unicode boundary and is described in perlrebackslash.
Answer 1 (score: 1)
#!/usr/bin/env perl
use strict;
use warnings;
use Data::Dumper;
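local $/;    # slurp mode: read all of DATA as one string
# First word at the start of the text, or after ., ? or ! followed by whitespace.
# (Requiring whitespace after ^ as well would skip the very first word.)
my @words = <DATA> =~ m/(?:^\s*|[.?!]+\s+)(\w+)/g;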
print Dumper \@words;
__DATA__
This is a test. This is only a test! If there were an emergency, then Information would be provided for you.
So, as a command line (-0777 reads the whole file at once, so ^ matches only at the very start of the text):
perl -0777 -nE 'say for /(?:^\s*|[.?!]+\s+)(\w+)/g' somefile
Answer 2 (score: 1)
You can use this GNU grep command to extract the first word after each period, ! or ?:
grep -oP '(?:^|[.?!])\s*\K[A-Z][a-z]+' file
This
This
If
I must warn you, though, that cases such as Mr. Smith can produce incorrect results.
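Regex breakdown:
(?:^|[.?!]) - match the start of the line, or a dot, ! or ?
\s* - match 0 or more whitespace characters
\K - reset the match, discarding everything matched so far
[A-Z][a-z]+ - match a word that starts with an uppercase letter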
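One possible way to soften the Mr. Smith problem, as a rough sketch only: add negative lookbehinds so that a dot directly after a known abbreviation is not treated as a sentence end. The function name sentence_starts and the abbreviation list (Mr, Mrs, Dr) are assumptions for illustration, not part of the command above, and this still requires GNU grep with -P support.

```shell
# Hypothetical variant of the grep command: each negative lookbehind rejects a
# ., ? or ! that directly follows one abbreviation. The list (Mr, Mrs, Dr) is
# an illustrative assumption; extend it for your own text.
# Requires GNU grep built with PCRE support (-P).
sentence_starts() {
    grep -oP '(?:^|(?<!\bMr)(?<!\bMrs)(?<!\bDr)[.?!])\s*\K[A-Z][a-z]+' "$@"
}

# Usage: reads the named files, or stdin when none are given.
printf '%s\n' 'We met Mr. Smith. He left.' | sentence_starts
```

On that sample line this should print We and He, whereas the original pattern would also print Smith. Abbreviations outside the list, and single-letter sentence starts such as I or A (which [A-Z][a-z]+ cannot match), are still not handled.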