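PHP sentence boundaries including empty lines?

Time: 2015-12-02 21:06:34

Tags: php regex

This is an extension of the "PHP sentences boundaries" question on SO.

I would like to know how to change the regex so that line breaks are preserved.

The sample code splits some text into sentences, removes one sentence, and then puts the text back together: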

<?php
$re = '/# Split sentences on whitespace between them.
    (?<=                # Begin positive lookbehind.
      [.!?]             # Either an end of sentence punct,
    | [.!?][\'"]        # or end of sentence punct and quote.
    )                   # End positive lookbehind.
    (?<!                # Begin negative lookbehind.
      Mr\.              # Skip either "Mr."
    | Mrs\.             # or "Mrs.",
    | Ms\.              # or "Ms.",
    | Jr\.              # or "Jr.",
    | Dr\.              # or "Dr.",
    | Prof\.            # or "Prof.",
    | Sr\.              # or "Sr.",
    | T\.V\.A\.         # or "T.V.A.",
                        # or... (you get the idea).
    )                   # End negative lookbehind.
    [\s+|^$]            # Split on whitespace between sentences/empty lines.
    /ix';

$text = <<<EOL
This is paragraph one. This is sentence one. Sentence two!

This is paragraph two. This is sentence three. Sentence four!
EOL;

echo "\nBefore: \n" . $text . "\n";

$sentences = preg_split($re, $text, -1);

$sentences[1] = " "; // remove 'sentence one'

// put text back together
$text = implode( $sentences );

echo "\nAfter: \n" . $text . "\n";
?>

Running it outputs:

Before: 
This is paragraph one. This is sentence one. Sentence two!

This is paragraph two. This is sentence three. Sentence four!

After: 
This is paragraph one. Sentence two!
This is paragraph two. This is sentence three. Sentence four!

I am trying to get the 'After' text to be identical to the 'Before' text, with only the one sentence removed:

After: 
This is paragraph one. Sentence two!

This is paragraph two. This is sentence three. Sentence four!

I was hoping this could be done with a small tweak to the regex, but what am I missing?

1 answer:

Answer 0: (score: 1)

The end of the pattern should be replaced with:

  (?:\h+|^$)          # Split on whitespace between sentences\/empty lines.
/mix';

See the IDEONE demo.
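For reference, here is a minimal sketch of the question's script with that ending swapped in. The honorific list is compressed onto one line and the last comment is reworded so it contains no `/`. Removing the sentence with array_splice() and rejoining with a single space is an assumption of this sketch, not something from the answer; the question's code assigns " " to the element and calls implode() without glue.

```php
<?php
// Sketch only: the question's pattern with the answer's ending swapped in.
$re = '/# Split sentences on whitespace between them.
    (?<=                # Begin positive lookbehind.
      [.!?]             # Either an end of sentence punct,
    | [.!?][\'"]        # or end of sentence punct and quote.
    )                   # End positive lookbehind.
    (?<!                # Begin negative lookbehind.
      Mr\. | Mrs\. | Ms\. | Jr\. | Dr\. | Prof\. | Sr\. | T\.V\.A\.
    )                   # End negative lookbehind.
    (?:\h+|^$)          # Horizontal whitespace only, or an empty line.
    /mix';

$text = "This is paragraph one. This is sentence one. Sentence two!\n\n"
      . "This is paragraph two. This is sentence three. Sentence four!";

$sentences = preg_split($re, $text);

// Assumption of this sketch: drop the element and rejoin with one space,
// instead of the " " placeholder and glue-less implode() in the question.
array_splice($sentences, 1, 1);   // remove "This is sentence one."
$result = implode(' ', $sentences);

echo $result, "\n";
```

Because the newlines are no longer consumed by the split, the blank line stays inside the piece "Sentence two!\n\nThis is paragraph two." and survives the round trip. Note that any run of several spaces or tabs between sentences is collapsed to one space by this way of rejoining.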

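Note that [\s+|^$] matches a single whitespace character (horizontal or vertical, such as a line break) or one of the literal symbols +, |, ^ and $, because it is a character class.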

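Instead of a character class, a group is needed (preferably a non-capturing one). Inside a group, written (...), | works as the alternation operator.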
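A quick way to see the difference is to run both forms against a few one-character subjects. This is only an illustrative sketch; the variable names and the list of test subjects are made up for the demonstration.

```php
<?php
// Character class: exactly one character that is whitespace, "+", "|", "^" or "$".
$class = '/[\s+|^$]/';
// Non-capturing group: "|" is alternation and "^", "$" act as anchors again.
$group = '/(?:\s+|^$)/m';

$subjects = ['+', '|', '^', '$', "\n", 'a', ''];

$classHits = [];
$groupHits = [];
foreach ($subjects as $subject) {
    $classHits[$subject] = preg_match($class, $subject);
    $groupHits[$subject] = preg_match($group, $subject);
}

var_dump($classHits, $groupHits);
```

The class reports a match for each of the literal symbols and for the newline, but not for the empty string, since it always needs one character to consume. The group ignores the literal symbols and matches only the newline (via \s+) and the empty string (via ^$).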

Instead of \s, I suggest using \h, which matches only horizontal whitespace (no line breaks).
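The effect is easy to check on a short string. A small sketch, with a made-up sample text:

```php
<?php
$text = "One. Two!\n\nThree. Four!";

$byAny        = preg_split('/\s+/', $text);  // splits on spaces and on newlines
$byHorizontal = preg_split('/\h+/', $text);  // splits on spaces and tabs only

var_dump($byAny, $byHorizontal);
```

With \s+ the blank line is swallowed as a separator and four pieces come back. With \h+ the two newlines stay inside the middle piece, so three pieces come back and the paragraph break can be restored when the text is joined again.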

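If the /m multiline modifier is not used, ^$ will only match an empty string, not an empty line inside a longer text. So I added the /m modifier to the options.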
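To illustrate, here is a small sketch that counts ^$ matches with and without /m; the sample text and variable names are only for the demonstration.

```php
<?php
$text = "first\n\nsecond";   // contains one empty line

$withoutM     = preg_match_all('/^$/', $text);   // anchors apply to the whole subject
$withM        = preg_match_all('/^$/m', $text);  // anchors apply to every line
$emptySubject = preg_match('/^$/', '');          // the empty string itself

var_dump($withoutM, $withM, $emptySubject);
```

Without /m the pattern finds nothing in the sample text, although it does match an empty subject. With /m it finds the one empty line between "first" and "second".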

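Note that I had to escape the / in the last comment, otherwise there was a warning that the regex was incorrect. Alternatively, use different regex delimiters.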
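Both options can be sketched side by side. The patterns below are shortened to a single lookbehind for brevity, and the choice of ~ as the alternative delimiter is just one common possibility, not something the answer prescribes.

```php
<?php
// Option 1: keep "/" as the delimiter and escape the slash in the comment.
$escaped = '/(?<=[.!?])
    (?:\h+|^$)   # sentences\/empty lines (slash escaped)
    /mx';

// Option 2: pick a delimiter that does not occur anywhere in the pattern.
$tilde = '~(?<=[.!?])
    (?:\h+|^$)   # sentences/empty lines (no escape needed)
    ~mx';

$text = "One. Two!\n\nThree.";

$a = preg_split($escaped, $text);
$b = preg_split($tilde, $text);

var_dump($a === $b, $a);
```

Whichever delimiter is chosen, it must not appear unescaped anywhere between the opening and closing delimiter, and that includes the text of comments written in /x mode.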