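This is a follow-up to the PHP sentence boundaries question on SO.
I would like to know how to change the regex so that line breaks are preserved.
The sample code splits some text into sentences, removes one sentence, and puts the text back together: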
<?php
$re = '/# Split sentences on whitespace between them.
(?<= # Begin positive lookbehind.
[.!?] # Either an end of sentence punct,
| [.!?][\'"] # or end of sentence punct and quote.
) # End positive lookbehind.
(?<! # Begin negative lookbehind.
Mr\. # Skip either "Mr."
| Mrs\. # or "Mrs.",
| Ms\. # or "Ms.",
| Jr\. # or "Jr.",
| Dr\. # or "Dr.",
| Prof\. # or "Prof.",
| Sr\. # or "Sr.",
| T\.V\.A\. # or "T.V.A.",
# or... (you get the idea).
) # End negative lookbehind.
[\s+|^$] # Split on whitespace between sentences/empty lines.
/ix';
$text = <<<EOL
This is paragraph one. This is sentence one. Sentence two!
This is paragraph two. This is sentence three. Sentence four!
EOL;
echo "\nBefore: \n" . $text . "\n";
$sentences = preg_split($re, $text, -1);
$sentences[1] = " "; // remove 'sentence one'
// put text back together
$text = implode( $sentences );
echo "\nAfter: \n" . $text . "\n";
?>
Running it prints:
Before:
This is paragraph one. This is sentence one. Sentence two!
This is paragraph two. This is sentence three. Sentence four!
After:
This is paragraph one. Sentence two!
This is paragraph two. This is sentence three. Sentence four!
I am trying to get the 'After' text to match the 'Before' text, with only the one sentence removed:
After:
This is paragraph one. Sentence two!
This is paragraph two. This is sentence three. Sentence four!
I was hoping this could be done with a tweak to the regex, but what am I missing?
Answer 0 (score: 1)
The end of the pattern should be replaced with:
(?:\h+|^$) # Split on whitespace between sentences\/empty lines.
/mix';
See the IDEONE demo.
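As a rough sketch of how the replaced pattern ending behaves on the two-line sample, the script below uses an abridged abbreviation list. Dropping the sentence with unset() and rejoining with a single space are assumptions made for this illustration, not part of the answer; the variable names are likewise illustrative.

```php
<?php
// Abridged version of the pattern: only a few abbreviations are listed.
$re = '/# Split sentences on horizontal whitespace between them.
    (?<=            # Begin positive lookbehind.
      [.!?]         # Either an end of sentence punct,
    | [.!?][\'"]    # or end of sentence punct and quote.
    )               # End positive lookbehind.
    (?<!            # Begin negative lookbehind.
      Mr\.          # Skip "Mr.",
    | Mrs\.         # "Mrs.",
    | Dr\.          # and "Dr." (list abridged).
    )               # End negative lookbehind.
    (?:\h+|^$)      # Horizontal whitespace between sentences, or an empty line.
    /mix';

$text = "This is paragraph one. This is sentence one. Sentence two!\n"
      . "This is paragraph two. This is sentence three. Sentence four!";

// \h+ never matches "\n", so the line break stays inside one of the pieces.
$sentences = preg_split($re, $text);

// Assumption for this sketch: drop the second sentence entirely and
// rejoin with one space, since the split discards the matched spaces.
unset($sentences[1]);
$result = implode(' ', $sentences);

echo $result, "\n";
```

With this input the line break after "Sentence two!" is kept, because it is never part of a split delimiter. If the sentences were separated by more than one space, rejoining with a single space would not restore the original spacing.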
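Note that [\s+|^$] is a character class, so it matches a single character: any whitespace (horizontal or vertical, such as a line break), or a literal +, |, ^, or $.
Instead of a character class, you need a group (preferably a non-capturing one). Inside a group, written (...), | acts as the alternation operator.
I suggest using \h, which matches only horizontal whitespace (no line breaks), instead of \s.
Without the /m multiline modifier, ^$ only matches an empty string, so I added the /m modifier to the options.
Note that I had to escape the / in the last comment; otherwise, there is a warning that the regex is incorrect. Alternatively, use different regex delimiters.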
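The points above can be checked one at a time with a few small calls. This is only a sketch: the sample strings and variable names are made up for illustration.

```php
<?php
// 1. In a character class, +, |, ^ (when not first) and $ are literal characters.
$classMatchesPlus = preg_match('/[\s+|^$]/', '+');     // the class contains "+"
$groupMatchesPlus = preg_match('/(?:\h+|^$)/m', '+');  // here | is alternation

// 2. \s also consumes line breaks; \h matches horizontal whitespace only.
$splitS = preg_split('/\s+/', "one.\ntwo. three.");
$splitH = preg_split('/\h+/', "one.\ntwo. three.");

// 3. ^$ finds an empty line inside the text only with the m modifier.
$emptyLineNoM = preg_match('/^$/', "a\n\nb");
$emptyLineM   = preg_match('/^$/m', "a\n\nb");

// 4. With a different delimiter (here ~), a slash inside an
//    x-mode comment does not need to be escaped.
$tildeDelimiter = preg_match('~\h+ # between sentences/empty lines
~x', 'a b');

var_dump($classMatchesPlus, $groupMatchesPlus);
var_dump($splitS, $splitH);
var_dump($emptyLineNoM, $emptyLineM, $tildeDelimiter);
```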