sed - How do I get the first 2 sentences of a paragraph?

Time: 2011-04-26 13:01:05

Tags: regex bash command-line sed

Suppose I have a paragraph:

  

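Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum.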

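Using sed, how can I get a given number of sentences (2 in this case), where sentences are delimited by periods, so that only the following text is extracted from the paragraph above?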

  

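Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book.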

5 answers:

Answer 0: (score: 3)

sed 's/\(^[^.]*\.[^.]*\.\)\(.*$\)/\1/g'

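Explanation:

\( starts a group

^ matches the start of the line

[^.]* matches any number of non-period characters

\. matches a period

[^.]* matches any number of non-period characters

\. matches a period

\) ends the group

\( starts a second group

.*$ matches everything up to the end of the line

\) ends the second group

\1 replaces the whole line with the contents of the first group.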
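
As a quick check of the walkthrough above, here is a minimal sketch that wraps the same expression in a shell function and feeds it one line of sample text. The function name `first_two` and the sample sentence are made up for illustration, and the sketch assumes a sed that accepts `^` right after `\(` (GNU sed does).

```shell
#!/bin/sh
# first_two: hypothetical helper name. Reads text on stdin and, on each line
# that contains at least two periods, keeps everything up to and including
# the second period. The sed expression is the one from this answer.
first_two() {
    sed 's/\(^[^.]*\.[^.]*\.\)\(.*$\)/\1/g'
}

# Made-up sample input: three sentences on one line.
printf '%s\n' 'One fish. Two fish. Red fish.' | first_two
# prints: One fish. Two fish.
```

A line with fewer than two periods does not match the pattern, so sed passes it through unchanged.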

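Answer 1: (score: 3)

EDIT: Updated for some trickier cases.

This is hard to do in sed for several reasons! First, sed makes it hard to work on the standard multiline paragraphs we have in text. Another reason is that sed is not standardized across all platforms, so you never know what sorts of patterns or options it will support. So perhaps someone else can help you with that part.

But it is easy to do in Perl.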

use 5.10.0;
use strict;
use warnings;

my @texts = split /\R{2,}/, <<'END_OF_TEXT';
This is hard to do in sed for several reasons! First, sed makes it
hard to work on the standard multiline paragraphs we have in text.
Another reason is that sed is not standardized across all platforms,
so you never know what sorts of patterns or options it will support.
So perhaps someone else can help you with that part.

It was a dark and story night. Dr. Jones looked up
at the manor house with trepidation. Lightning
flashes could be seen both outside the house and
inside it, as St. Elmo's fire played across the lofty
spires. Mrs. Smith's fancy-dress party there on St. James's St.
was clearly going to be a lively one! Would anyone even notice
his mischief in time?  Dr. Jones chortled with glee as he scampered
up the step.
END_OF_TEXT


my $sentence_rx = qr{
    (?: (?<= ^ ) | (?<= \s ) )  # after start-of-string or whitespace
    \p{Lu}                      # capital letter
    .*?                         # a bunch of anything
    (?<= \S )                   # that ends in non-whitespace
    (?<! \b [DMS]r  )           # but isn't a common abbreviation
    (?<! \b Mrs )
    (?<! \b Sra )
    (?<! \b St  )
    [.?!]                       # followed by a sentence ender
    (?= $ | \s )                # in front of end-of-string or whitespace
}sx;

for my $paragraph (@texts) {
    say "NEW PARAGRAPH";
    say "Looking for each sentence.";

    my $count = 0;
    while ($paragraph =~ /($sentence_rx)/g) {
        printf "\tgot sentence %d: <%s>\n", ++$count, $1;
    }

    say "\nLooking for exactly two sentences.";

    if ($paragraph =~ / ^ ( (?: $sentence_rx \s*? ){2} ) /x) {
        say "\tgot two sentences: <<$1>>";
    }
    print "\n";
}

When run, this produces the following output:

NEW PARAGRAPH
Looking for each sentence.
        got sentence 1: <This is hard to do in sed for several reasons!>
        got sentence 2: <First, sed makes it
hard to work on the standard multiline paragraphs we have in text.>
        got sentence 3: <Another reason is that sed is not standardized across all platforms,
so you never know what sorts of patterns or options it will support.>
        got sentence 4: <So perhaps someone else can help you with that part.>

Looking for exactly two sentences.
        got two sentences: <<This is hard to do in sed for several reasons! First, sed makes it
hard to work on the standard multiline paragraphs we have in text.>>

NEW PARAGRAPH
Looking for each sentence.
        got sentence 1: <It was a dark and story night.>
        got sentence 2: <Dr. Jones looked up 
at the manor house with trepidation.>
        got sentence 3: <Lightning
flashes could be seen both outside the house and
inside it, as St. Elmo's fire played across the lofty
spires.>
        got sentence 4: <Mrs. Smith's fancy-dress party there on St. James's St.
was clearly going to be a lively one!>
        got sentence 5: <Would anyone even notice
his mischief in time?>
        got sentence 6: <Dr. Jones chortled with glee as he scampered 
up the step.>

Looking for exactly two sentences.
        got two sentences: <<It was a dark and story night. Dr. Jones looked up 
at the manor house with trepidation.>>

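Hope this helps. Every time I try to do this sort of thing in sed, it gets really complicated. Of course, you can only go so far in sed, and I almost always need to go further than it will let me. If nothing else, I need a reliable way to know which flavor of regexes and which switches will be supported, and you cannot get that portably with sed. Writing portable shell scripts is very, very much harder than people usually think. I run on these operating systems:

  • OpenBSD
  • Darwin (meaning Macs)
  • Linux
  • Solaris
  • AIX

The greatest common factor across all of these is so tiny that you can never get anything interesting done with shell tools, at least not portably. It is really frustrating. It is amazing what contortions Perl's Configure shell script has to go through.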

Answer 2: (score: 2)

You can use awk:

 awk -vRS="." 'NR<=2' ORS="." file

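This sets both the input and output record separators to "." and then prints the first and second records (NR<=2). If your sentences do not contain stray periods, as in "Mr. James", the above should be enough for your needs without having to craft a complicated regex.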
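
To make the caveat concrete, here is a small sketch that runs the same awk program on two made-up inputs, one plain and one containing an abbreviation. The function name `first_two_awk` and both sample lines are assumptions for illustration; the only change to the command is passing `ORS` with `-v` and reading from stdin instead of a file.

```shell
#!/bin/sh
# first_two_awk: hypothetical helper name. Splits stdin into records at each
# period and prints the first two records, each followed by a period.
first_two_awk() {
    awk -v RS='.' -v ORS='.' 'NR <= 2'
}

# Plain sentences: the first two come back intact.
printf '%s\n' 'One fish. Two fish. Red fish.' | first_two_awk
echo    # awk emits no trailing newline here, so add one for readability

# An abbreviation counts as a sentence end, so the result is cut short.
printf '%s\n' 'Mr. James left. He returned. The end.' | first_two_awk
echo
```

With the second input the output stops after "Mr. James left." because the period in "Mr." is treated as the end of the first record.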

Answer 3: (score: 1)

This works for your example:

sed 's/^\(\([^.]*\.\)\{2\}\).*/\1/'

Or:

sed -r 's/^(([^.]*\.){2}).*/\1/'
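
Because the sentence count lives in the `\{2\}` interval, the basic-regex form above extends naturally to any count. The following is a minimal sketch under that assumption; the function name `first_n` and the sample text are hypothetical and not part of the original answer.

```shell
#!/bin/sh
# first_n: hypothetical helper name. Usage: first_n COUNT < input
# Keeps the first COUNT period-terminated sentences of each line.
first_n() {
    n=$1
    # Double quotes let the shell expand $n; the backslash sequences used
    # here (\( \) \{ \} \. \1) are passed through to sed unchanged.
    sed "s/^\(\([^.]*\.\)\{$n\}\).*/\1/"
}

# Made-up sample input with four short sentences.
printf '%s\n' 'One. Two. Three. Four.' | first_n 3
# prints: One. Two. Three.
```

Lines with fewer than COUNT periods do not match and are printed unchanged.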

Answer 4: (score: 1)

This might work for you:

 sed 's/\(\.[^.]*\.\).*/\1/' file

if each paragraph is on a single line.

This might work when the sentences span multiple lines:

echo -e "a b c.\nx y z.\na b c" | sed ':a;$!N;/\(\.[^.]*\.\).*/!{$!ba};s//\1/;q'       
a b c.
x y z.
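
A different way to handle input spread over several lines, swapped in here for comparison and not part of this answer, is to join the lines first and then apply a single-line expression. This is only a sketch: the function name `first_two_joined` is hypothetical, it treats the whole input as one paragraph, and it replaces each line break with a space.

```shell
#!/bin/sh
# first_two_joined: hypothetical helper name. Joins all input lines with a
# single space (paste -s), then keeps text up to the second period.
first_two_joined() {
    paste -s -d ' ' - | sed 's/^\(\([^.]*\.\)\{2\}\).*/\1/'
}

# Same sample input as the echo example above.
printf 'a b c.\nx y z.\na b c\n' | first_two_joined
# prints: a b c. x y z.
```

Unlike the loop with `N` above, this does not preserve the original line breaks in the output.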