Regular expression to split a text document into sentences

Time: 2013-07-15 12:52:52

Tags: java regex split text-segmentation

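I have a large text string that I am trying to split into sentences on ".", "?" and "!". For some reason my regular expression does not work. Can someone help me find the error?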

String str = "When my friend said he likes deep dish pizza one day, I immediately set a time to come back to Little Star. Arguably, the best deep dish pizza in SF...though...I don't believe there are many places that do deep dish pizza. That being said...its not the BEST ever, just the best for the area. They use cornmeal in the crust, or on the baking surface, so there's a bit of extra crunch to it. That being said...I'm not sure how much I like the cornmeal texture to my pizza. I kind of want just a GOOD CRUST, you know? No extra stuff to try to make it more crunchy.";
String[] sentences = str.split("/(?<=[.?!])\\S+(?=[a-z])/i");

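But it does not split the text into sentences. Can someone spot the error?

2 Answers:

Answer 0: (score: 2)

Your regex is wrong. Java does not understand a regex written as a PCRE-style literal with delimiters and a trailing flag, like this one: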

/(?<=[.?!])\\S+(?=[a-z])/i

Use this instead:

String[] sentences = str.split("(?i)(?<=[.?!])\\S+(?=[a-z])");
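
To make the difference concrete, here is a minimal sketch. The class name DelimiterDemo, its constants, and the sample text are made up for illustration. It shows that Java treats the slashes and the trailing "i" as literal characters, so the pattern never matches text that contains no "/". It also shows that the inline (?i) flag behaves the same as passing Pattern.CASE_INSENSITIVE to Pattern.compile.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

class DelimiterDemo {
    // The PCRE-style literal from the question. In Java the slashes and the
    // trailing "i" are ordinary characters that must appear in the input.
    static final String WITH_SLASHES = "/(?<=[.?!])\\S+(?=[a-z])/i";

    // The same pattern without delimiters, using the inline flag.
    static final String INLINE_FLAG = "(?i)(?<=[.?!])\\S+(?=[a-z])";

    // The pattern body alone; the flag is passed to Pattern.compile instead.
    static final String BODY_ONLY = "(?<=[.?!])\\S+(?=[a-z])";

    static String[] splitWithInlineFlag(String text) {
        return text.split(INLINE_FLAG);
    }

    static String[] splitWithCompiledFlag(String text) {
        return Pattern.compile(BODY_ONLY, Pattern.CASE_INSENSITIVE).split(text);
    }

    public static void main(String[] args) {
        String text = "Wait...what? Yes.";

        // No "/" occurs in the text, so nothing matches and the whole
        // string comes back as a single element: prints 1
        System.out.println(text.split(WITH_SLASHES).length);

        // Inline flag and compile-time flag give the same result: prints true
        System.out.println(Arrays.equals(
                splitWithInlineFlag(text), splitWithCompiledFlag(text)));
    }
}
```

Precompiling with Pattern.compile is also the usual choice when the same pattern is applied to many strings, since String.split recompiles the regex on each call.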

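Answer 1: (score: 2)

Here is a small tip:

Slashes have nothing to do with the regular expression itself.

The slashes are a syntax artifact of some languages, which use them to delimit regex literals. Java is not one of them.

Try removing the slashes and replacing "/i" with "(?i)":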

String[] sentences = str.split("(?i)(?<=[.?!])\\S+(?=[a-z])");
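
One further caveat, which rests on an assumption about the asker's intent: uppercase \S matches non-whitespace, so the pattern above never matches the space that follows a sentence-ending period. If the goal is to split on whitespace between sentences, lowercase \s is probably what was meant. The sketch below uses that variant. The class name SentenceSplitSketch and the method splitSentences are hypothetical, and the sample is a fragment of the question's text.

```java
import java.util.regex.Pattern;

class SentenceSplitSketch {
    // Assumption: a sentence boundary is whitespace that follows '.', '?' or '!'
    // and precedes a letter. Lowercase \s matches whitespace; uppercase \S does not.
    private static final Pattern BOUNDARY =
            Pattern.compile("(?<=[.?!])\\s+(?=[a-z])", Pattern.CASE_INSENSITIVE);

    static String[] splitSentences(String text) {
        return BOUNDARY.split(text);
    }

    public static void main(String[] args) {
        String text = "I kind of want just a GOOD CRUST, you know? No extra stuff.";
        // Prints two lines:
        //   I kind of want just a GOOD CRUST, you know?
        //   No extra stuff.
        for (String sentence : splitSentences(text)) {
            System.out.println(sentence);
        }
    }
}
```

Because the boundary requires whitespace, an ellipsis with no space after it, as in "That being said...its not the BEST ever", stays inside a single sentence. Abbreviations such as "Dr. Smith" would still be split, so this remains a rough heuristic rather than a full sentence segmenter.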