Question

I'm using PHP's preg_split() to split a paragraph into sentences. This is the regex I'm using:

(?<=[\.\?\!]|(\."))\s(?=[A-Z\s\b])

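It should match a whitespace character that is preceded by punctuation and followed by whitespace or a capital letter. However, it does not match in a case like this: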

A "word. ".

I want it to split this into two parts, A "word. and ". but it does not match. How can I fix the regex?

Answer 1

Since you have already acknowledged that it will not be perfect, here is a regex that should "work for you":

$paragraph = 'This is a sentence. "More sentence." Another? "MORE". Many more. She said "how do you do?" and I said "wtf".';
$sentences = preg_split('~([a-zA-Z]([.?!]"?|"?[.?!]))\K\s+(?=[A-Z"])~',$paragraph);

print_r($sentences);

Output:

Array
(
    [0] => This is a sentence.
    [1] => "More sentence."
    [2] => Another?
    [3] => "MORE".
    [4] => Many more.
    [5] => She said "how do you do?" and I said "wtf".
)

Answer 2

Your regex does not match the example you provided.

You want A "word. ". to be matched by the regex. There are two spaces the regex could match:

A "word. ".
 ^      ^

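Your regex means:

A whitespace character that is preceded by one of [.?!] or by a literal ." (1), and followed by a capital letter or another whitespace character ([A-Z\s\b]) (2).

The first space is preceded by a capital letter (A), not by punctuation, so it fails (1).

The second space is preceded by a dot, so it is a candidate for a match, but it is not followed by a capital letter or another whitespace character as (2) requires, so there is no match.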

The simplest way to fix this is to add " to your lookahead:

(?<=[.?!]|(\."))\s(?=[A-Z\s\b"])
                             ^

However, if you are splitting paragraphs into sentences, I doubt this will be enough, as the comments have already pointed out.
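To make the difference concrete, here is a minimal sketch that runs both patterns against the string from the question. The variable names are only for illustration, and the / delimiters are my choice since the question shows the pattern without any.

```php
<?php
// The example string from the question.
$text = 'A "word. ".';

// Original pattern: the lookahead accepts only A-Z, whitespace, or a
// backspace (inside a character class, \b is a backspace, not a word boundary).
$original = '/(?<=[\.\?\!]|(\."))\s(?=[A-Z\s\b])/';

// Adjusted pattern: the lookahead additionally accepts a double quote.
$adjusted = '/(?<=[.?!]|(\."))\s(?=[A-Z\s\b"])/';

$before = preg_split($original, $text);
$after  = preg_split($adjusted, $text);

print_r($before); // one element: the string is left unsplit
print_r($after);  // two elements: A "word.  and  ".
```

Note that only the second space is used as a delimiter. The space after A still fails the lookbehind, which is what you want here.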

Answer 3

The following expression seems to work fairly well:

$arr = preg_split('#(?<=[.?!](\s|"))\s?(?=[A-Z\b"])#',$str);

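I tested it on:

When my friend said he likes deep dish pizza one day, I immediately set a time to come back to Little Star. Arguably, the best deep dish pizza in SF...though...I don't believe there are many places that do deep dish pizza. That being said...its not the BEST ever, just the best "for the area." They use cornmeal in the crust, or on the baking surface, so there's a bit of extra crunch to it. That being said...I'm not sure how much I like the cornmeal texture to my pizza. I kind of want just a GOOD CRUST, you know? No extra stuff to try to make it more crunchy.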

Result:

Array
(
    [0] => When my friend said he likes deep dish pizza one day, I immediately set a time to come back to Little Star. 
    [1] => Arguably, the best deep dish pizza in SF...though...I don't believe there are many places that do deep dish pizza. 
    [2] => That being said...its not the BEST ever, just the best "for the area."
    [3] => They use cornmeal in the crust, or on the baking surface, so there's a bit of extra crunch to it. 
    [4] => That being said...I'm not sure how much I like the cornmeal texture to my pizza. 
    [5] => I kind of want just a GOOD CRUST, you know? 
    [6] => No extra stuff to try to make it more crunchy.
)

However, it fails when you have something like:

I met Ms. Scarlet in the library.

because ". S" is interpreted as the start of a new sentence.
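Here is a small sketch of that failure, together with one possible workaround. The workaround is my own assumption and not part of the expression above: a negative lookbehind that refuses to split right after a few known titles. The list (Mr, Ms, Dr) is purely illustrative, and the second sentence in the test string is added only to show that a real boundary still splits.

```php
<?php
$text = 'I met Ms. Scarlet in the library. She left.';

// Pattern from this answer: splits after ". " before any capital letter,
// so the title "Ms. " is treated as a sentence end.
$plain = '#(?<=[.?!](\s|"))\s?(?=[A-Z\b"])#';

// Hypothetical variant: skip positions that directly follow "Mr. ", "Ms. "
// or "Dr. ". All three alternatives have the same length, which keeps the
// lookbehind fixed-length.
$guarded = '#(?<=[.?!](\s|"))(?<!(?:Mr|Ms|Dr)\.\s)\s?(?=[A-Z\b"])#';

$plainParts   = preg_split($plain, $text);
$guardedParts = preg_split($guarded, $text);

print_r($plainParts);   // three pieces, the first being "I met Ms. "
print_r($guardedParts); // two pieces, "Ms. Scarlet" stays together
```

A title of a different length, such as Mrs, would need its own lookbehind, e.g. (?<!Mrs\.\s), and no fixed list will cover every abbreviation, so treat this as a patch rather than a real sentence tokenizer.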
