I'm using PHP's preg_split() to split a paragraph into sentences. This is the regex I'm using:
(?<=[\.\?\!]|(\."))\s(?=[A-Z\s\b])
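It is supposed to match a whitespace character that is preceded by punctuation and followed by whitespace or an uppercase letter. However, it does not match in a case like this: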
A "word. ".
I want it to split this into two parts, A "word.
and ".
but it does not match. How can I fix the regex?
Answer 0 (score: 0)
Since you have already acknowledged that it won't be perfect, here is a regex that should "work for you":
$paragraph = 'This is a sentence. "More sentence." Another? "MORE". Many more. She said "how do you do?" and I said "wtf".';
$sentences = preg_split('~([a-zA-Z]([.?!]"?|"?[.?!]))\K\s+(?=[A-Z"])~',$paragraph);
print_r($sentences);
Output:
Array
(
[0] => This is a sentence.
[1] => "More sentence."
[2] => Another?
[3] => "MORE".
[4] => Many more.
[5] => She said "how do you do?" and I said "wtf".
)
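For readers who find the one-line pattern hard to parse, here is a sketch that spells out the same expression using PCRE's x (extended) modifier, with a comment on each part. The variable names $compact and $annotated are illustrative only, and the comments are one reading of the pattern rather than an explanation from the answer's author.

```php
<?php
// The one-line pattern from the answer above.
$compact = '~([a-zA-Z]([.?!]"?|"?[.?!]))\K\s+(?=[A-Z"])~';

// Same pattern in extended mode (x): whitespace is ignored and # starts a comment.
$annotated = '~
    (                # the tail of a sentence:
      [a-zA-Z]       #   a letter
      (
        [.?!]"?      #   then punctuation and an optional closing quote
        |
        "?[.?!]      #   or an optional closing quote and then punctuation
      )
    )
    \K               # drop everything matched so far from the reported match
    \s+              # the whitespace that is actually removed by the split
    (?=[A-Z"])       # only if a capital letter or a quote comes next
~x';

$paragraph = 'This is a sentence. "More sentence." Another? "MORE". Many more. '
           . 'She said "how do you do?" and I said "wtf".';

$sentences = preg_split($annotated, $paragraph);
print_r($sentences);
```

Because of \K, only the whitespace counts as the delimiter, so the punctuation and closing quotes stay attached to the sentence they end.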
Answer 1 (score: 0)
Your regex does not match the example you provided.
You want the regex to match in A "word. ".
There are two spaces that it could match:
A "word. ".
 ^      ^
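Your regex means:
A whitespace character that is preceded by [.?!] or ." (literally) (1) and followed by an uppercase letter or another whitespace character ([A-Z\s\b]) (2).
The first space is preceded by an uppercase letter, so it fails condition 1.
The second space is preceded by a dot, so it is a candidate, but it is not followed by an uppercase letter or another whitespace character (condition 2), so there is no match.
The simplest way to fix this is to add " to your lookahead: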
(?<=[.?!]|(\."))\s(?=[A-Z\s\b"])
                             ^
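To check the reasoning, here is a small sketch that runs both patterns against the string from the question. The variable names are illustrative, and the / delimiters are an assumption, since the question does not show which delimiters were used.

```php
<?php
// The sample string from the question.
$input = 'A "word. ".';

// Pattern as posted in the question: the lookahead has no double quote.
$original = '/(?<=[\.\?\!]|(\."))\s(?=[A-Z\s\b])/';

// Pattern with the double quote added to the lookahead.
$fixed = '/(?<=[.?!]|(\."))\s(?=[A-Z\s\b"])/';

$before = preg_split($original, $input); // no split point is found
$after  = preg_split($fixed, $input);    // splits at the space after "word.

print_r($before);
print_r($after);
```

Note that inside a character class PCRE reads \b as a backspace character, not a word boundary, so it has no practical effect in this lookahead.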
However, I doubt this is sufficient for splitting a paragraph into sentences, as the comments have already pointed out.
Answer 2 (score: 0)
The following expression seems to work fairly well:
$arr = preg_split('#(?<=[.?!](\s|"))\s?(?=[A-Z\b"])#',$str);
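I tested it on:
When my friend said he likes deep dish pizza one day, I immediately set a time to come back to Little Star. Arguably, the best deep dish pizza in SF...though...I don't believe there are many places that do deep dish pizza. That being said...its not the BEST ever, just the best "for the area." They use cornmeal in the crust, or on the baking surface, so there's a bit of extra crunch to it. That being said...I'm not sure how much I like the cornmeal texture to my pizza. I kind of want just a GOOD CRUST, you know? No extra stuff to try to make it more crunchy.
Result: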
Array
(
[0] => When my friend said he likes deep dish pizza one day, I immediately set a time to come back to Little Star.
[1] => Arguably, the best deep dish pizza in SF...though...I don't believe there are many places that do deep dish pizza.
[2] => That being said...its not the BEST ever, just the best "for the area."
[3] => They use cornmeal in the crust, or on the baking surface, so there's a bit of extra crunch to it.
[4] => That being said...I'm not sure how much I like the cornmeal texture to my pizza.
[5] => I kind of want just a GOOD CRUST, you know?
[6] => No extra stuff to try to make it more crunchy.
)
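One detail that print_r does not make obvious: with this pattern the lookbehind can consume the space itself, in which case \s? matches nothing and the space stays at the end of the preceding piece. A minimal sketch of trimming the pieces afterwards, using a shorter sample string of my own rather than the review text above:

```php
<?php
// Pattern from the answer above.
$pattern = '#(?<=[.?!](\s|"))\s?(?=[A-Z\b"])#';

// Short illustrative input (not from the original answer).
$str = 'One sentence. Another "quoted one." Last one?';

// Raw pieces may keep a trailing space, so trim each one.
$raw = preg_split($pattern, $str);
$sentences = array_map('trim', $raw);

print_r($sentences);
```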
However, it fails on input such as:
I met Ms. Scarlet in the library.
because . S is interpreted as the start of a new sentence.
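One possible mitigation, sketched here under stated assumptions, is a negative lookbehind that refuses to split right after a known abbreviation. The function name splitSentences and the abbreviation list are hypothetical, and the pattern is a simplified one rather than the expression above, so treat it as a starting point only. Any abbreviation not in the list will still cause a false split.

```php
<?php
// Hypothetical helper: split on sentence-ending punctuation, except after
// a few titles. The abbreviation list is an assumption and far from complete.
function splitSentences(string $text): array
{
    // (?<!...) blocks a split right after a listed title;
    // (?<=[.?!]) requires sentence-ending punctuation before the whitespace.
    $pattern = '~(?<!Mr\.|Ms\.|Dr\.|Mrs\.)(?<=[.?!])\s+(?=[A-Z"])~';
    return preg_split($pattern, $text);
}

print_r(splitSentences('I met Ms. Scarlet in the library. She was reading.'));
```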