Regex to split sentences

Time: 2014-02-23 04:03:40

Tags: php regex

I am using PHP's preg_split() to split a paragraph into sentences. This is the regular expression I am using:

(?<=[\.\?\!]|(\."))\s(?=[A-Z\s\b])

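It should match a whitespace character that is preceded by punctuation and followed by whitespace or an uppercase letter. However, it does not match in a case like this: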

A "word. ".

I want it to split this into two parts, A "word. and "., but it does not match. How can I fix the regex?

3 answers:

Answer 0: (score: 0)

Since you have already acknowledged that it will not be perfect, here is a regex that should "work for you":

$paragraph = 'This is a sentence. "More sentence." Another? "MORE". Many more. She said "how do you do?" and I said "wtf".';
$sentences = preg_split('~([a-zA-Z]([.?!]"?|"?[.?!]))\K\s+(?=[A-Z"])~',$paragraph);

print_r($sentences);

Output:

Array
(
    [0] => This is a sentence.
    [1] => "More sentence."
    [2] => Another?
    [3] => "MORE".
    [4] => Many more.
    [5] => She said "how do you do?" and I said "wtf".
)
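
As a brief aside on why the punctuation survives the split above: \K discards everything matched so far from the reported match, so preg_split() only removes the whitespace that follows it. The snippet below is a minimal sketch using a simplified version of the pattern and a made-up input string, not the answer's full expression.

```php
<?php
// Simplified pattern (an illustration, not the full regex from the answer):
// a letter, one punctuation mark, then \K, then the whitespace to split on.
$pattern = '~[a-zA-Z][.?!]\K\s+(?=[A-Z"])~';

// PREG_OFFSET_CAPTURE reports the matched text together with its byte offset.
preg_match($pattern, 'One. Two', $m, PREG_OFFSET_CAPTURE);
// Because of \K, $m[0] contains only the space, not the "e." before it.

// The same pattern used for splitting keeps the period on the first piece.
$parts = preg_split($pattern, 'One. Two');
print_r($parts);
```

Without \K the letter and the punctuation would be part of the match and would be removed from the resulting pieces.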

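Answer 1: (score: 0)

Your regex does not match the example you provided.

You want A "word. ". to be matched by the regex. There are two whitespace characters the regex could match: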

A "word. ".
 ^      ^

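Your regex means:

A whitespace character that is preceded by [.?!] or by ." (literally) (1), and followed by an uppercase letter or another whitespace character ([A-Z\s\b]) (2).

The first space is preceded by an uppercase letter, so it does not match according to (1).

The second space is preceded by a dot, so it is a candidate for a match, but it is not followed by an uppercase letter or another whitespace character (according to (2)), so there is no match.

The simplest way to fix this is to add " to your lookahead: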

(?<=[.?!]|(\."))\s(?=[A-Z\s\b"])
                             ^

However, I doubt this will be sufficient for splitting paragraphs into sentences, as the comments have already pointed out.
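
To check the reasoning above, the following sketch runs the question's pattern and the adjusted pattern against the example string. The variable names and the ~ delimiters are choices made for this illustration only.

```php
<?php
$text = 'A "word. ".';

// Pattern from the question: the lookahead accepts only an uppercase
// letter or whitespace, so the space before the closing quote is rejected.
$original = '~(?<=[.?!]|(\."))\s(?=[A-Z\s\b])~';

// Same pattern with " added to the lookahead character class.
$fixed = '~(?<=[.?!]|(\."))\s(?=[A-Z\s\b"])~';

$before = preg_split($original, $text); // no split: one element
$after  = preg_split($fixed, $text);    // split at the second space

print_r($before);
print_r($after);
```

With the original pattern the string comes back as a single element; with the adjusted lookahead it is split into A "word. and ".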

Answer 2: (score: 0)

The following expression seems to work reasonably well:

$arr = preg_split('#(?<=[.?!](\s|"))\s?(?=[A-Z\b"])#',$str);

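I tested it on:

When my friend said he likes deep dish pizza one day, I immediately set a time to come back to Little Star. Arguably, the best deep dish pizza in SF...though...I don't believe there are many places that do deep dish pizza. That being said...its not the BEST ever, just the best "for the area." They use cornmeal in the crust, or on the baking surface, so there's a bit of extra crunch to it. That being said...I'm not sure how much I like the cornmeal texture to my pizza. I kind of want just a GOOD CRUST, you know? No extra stuff to try to make it more crunchy.

Result: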

Array
(
    [0] => When my friend said he likes deep dish pizza one day, I immediately set a time to come back to Little Star. 
    [1] => Arguably, the best deep dish pizza in SF...though...I don't believe there are many places that do deep dish pizza. 
    [2] => That being said...its not the BEST ever, just the best "for the area."
    [3] => They use cornmeal in the crust, or on the baking surface, so there's a bit of extra crunch to it. 
    [4] => That being said...I'm not sure how much I like the cornmeal texture to my pizza. 
    [5] => I kind of want just a GOOD CRUST, you know? 
    [6] => No extra stuff to try to make it more crunchy.
)

However, it fails on something like:

I met Ms. Scarlet in the library.

This is because ". S" is interpreted as the "definition of a new line", that is, as a sentence boundary.
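
The sketch below reproduces that failure and shows one possible mitigation: a negative lookbehind that refuses to split right after a few known abbreviations. The abbreviation list is a hypothetical, non-exhaustive assumption added for illustration, and the second sentence in the input is made up; neither comes from the answer.

```php
<?php
$str = 'I met Ms. Scarlet in the library. She was reading.';

// Expression from the answer above: splits after "Ms. " as well.
$pattern = '#(?<=[.?!](\s|"))\s?(?=[A-Z\b"])#';
$naive = preg_split($pattern, $str);

// Assumed workaround: do not split when the preceding text is one of a
// few listed abbreviations (hypothetical list, extend as needed).
$guarded = '#(?<=[.?!](\s|"))(?<!Mr\. |Ms\. |Mrs\. |Dr\. )\s?(?=[A-Z\b"])#';
$better = preg_split($guarded, $str);

print_r($naive);
print_r($better);
```

This only covers abbreviations that are listed explicitly, so it narrows the problem rather than solving sentence splitting in general.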