Using the command line and regex to determine which words start sentences

Date: 2016-09-14 15:08:46

Tags: regex perl grep

I have this text:

 This is a test. This is only a test! If there were an emergency, then Information would be provided for you.

I want to be able to determine which words start sentences. What I have so far is:

 $ cat <FILE> | perl -pe 's/[\s.?!]/\n/g;'

This just replaces every whitespace character and every ., ? or ! with a newline, giving me:

 This 
 is 
 a 
 test 

 This
 is
 only
 a
 test

 If
 there
 were
 an
 emergency,
 then
 Information
 would
 be
 provided
 for
 you

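From here I could somehow extract the words that have either nothing above them (start of the file) or a blank line above them, but I'm not sure how to do that.

3 answers:

Answer 0 (score: 6)

If your Perl is at least version 5.22.1 (or 5.22.0, provided this case is not affected by the bug described here), you can use a sentence boundary in the regex.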

use feature 'say';

foreach my $sentence (m/\b{sb}(\w+)/g) {
    say $sentence;
}

Or, as a one-liner:

perl -nE 'say for /\b{sb}(\w+)/g'

When called with the sample text, the output is:

This
This
If

This uses \b{sb}, which is a sentence boundary. You can read a tutorial at brian d foy's blog. \b{} is called a Unicode boundary and is described in perlrebackslash.

Answer 1 (score: 1)

#!/usr/bin/env perl

use strict;
use warnings;
use Data::Dumper;

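# Undefine the input record separator so <DATA> returns the whole text at once.
local $/;

# Capture the word that follows the start of the text or a run of . or !
my @words = <DATA> =~ m/(?:^|[\.!]+)\s+(\w+)/g;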

print Dumper \@words;

__DATA__
 This is a test. This is only a test! If there were an emergency, then Information would be provided for you.

So, as a command line:

perl -ne 'print "$_\n" for m/(?:^|[\.!])\s+(\w+)/g;' somefile

Answer 2 (score: 1)

You can use this GNU grep command to extract the first word after each period, ! or ?:

grep -oP '(?:^|[.?!])\s*\K[A-Z][a-z]+' file

This
This
If

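I must caution you, though, that cases like Mr. Smith may give wrong results.

Regex breakdown:

  • (?:^|[.?!]) - match the start of the line, or a dot, ! or ?
  • \s* - match 0 or more whitespace characters
  • \K - match reset: forget the data matched so far
  • [A-Z][a-z]+ - match a word starting with an uppercase letter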
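
To see the breakdown and the Mr. Smith caveat in action, here is a small sketch. It assumes GNU grep built with PCRE support (the -P option); grep_sentence_starts is just a name chosen for this example.

```shell
#!/bin/sh
# grep_sentence_starts is a hypothetical wrapper around the grep command above.
grep_sentence_starts() {
    # -o prints only the matched part; -P enables Perl-compatible regex, needed for \K.
    grep -oP '(?:^|[.?!])\s*\K[A-Z][a-z]+'
}

# The sample text from the question.
printf '%s\n' 'This is a test. This is only a test! If there were an emergency, then Information would be provided for you.' | grep_sentence_starts

# The abbreviation "Mr." ends in a dot, so "Smith" is reported as a sentence start.
printf '%s\n' 'Today Mr. Smith left. He was late.' | grep_sentence_starts
```

The second call prints Today, Smith and He, which shows the false positive. Words made of a single capital letter, such as I, are skipped because [a-z]+ needs at least one lowercase letter.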