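I need to split a string containing sentences, for example:
"this is a sentence. this is another. Rawlings, G. stated foo and bar."
into
["this is a sentence.", "this is another.", "Rawlings, G. stated foo and bar."]
using a regular expression.
The other solutions I found split the third sentence into "Rawlings, G." and "stated foo and bar.", which is not what I want.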
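Answer 0: (score: 6)
Regular expressions generally cannot solve this problem.
You need a sentence detection algorithm. OpenNLP has one, and it is very simple to use: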
String sentences[] = sentenceDetector.sentDetect(yourString);
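It handles a lot of tricky cases.
Answer 1: (score: 3)
Use nested lookbehinds.
Split your input string on the regex below. It matches the zero-width boundary immediately after a dot and checks the character before that dot: the split happens only when the character before the dot is not an uppercase letter.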
String s = "this is a sentence. this is another. Rawlings, G. stated foo and bar.";
String[] tok = s.split("(?<=(?<![A-Z])\\.)");
System.out.println(Arrays.toString(tok));
Output:
[this is a sentence., this is another., Rawlings, G. stated foo and bar.]
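Because the split above is zero-width, every piece after the first keeps the space that followed the dot. A minimal sketch of one way around that, assuming it is acceptable to treat the whitespace after the dot as part of the delimiter. The class name `SentenceSplit` and its `split` method are illustrative and not part of the original answer:

```java
import java.util.Arrays;
import java.util.regex.Pattern;

final class SentenceSplit {
    // Same nested lookbehind as in the answer, followed by \s+ so that the
    // whitespace after the dot is consumed by the delimiter instead of kept.
    private static final Pattern BOUNDARY =
            Pattern.compile("(?<=(?<![A-Z])\\.)\\s+");

    static String[] split(String text) {
        return BOUNDARY.split(text);
    }

    public static void main(String[] args) {
        String s = "this is a sentence. this is another. Rawlings, G. stated foo and bar.";
        for (String sentence : split(s)) {
            // Brackets make any leading or trailing whitespace visible.
            System.out.println("[" + sentence + "]");
        }
        // Limitation: no split happens after a dot that follows an uppercase letter.
        System.out.println(Arrays.toString(split("I work at NASA. It is fun.")));
    }
}
```

The heuristic cuts both ways: it keeps "G. stated" together, but it also fails to split after any sentence whose last character is an uppercase letter, such as one ending in an acronym.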
Explanation:
(?<=(?<![A-Z])\\.)
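Matches the boundary immediately after a dot, provided the character before that dot is not an uppercase letter.
Answer 2: (score: 1)
I tried this: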
import java.text.BreakIterator;
import java.util.Locale;

public class StringSplit {
    public static void main(String args[]) throws Exception {
        BreakIterator iterator = BreakIterator.getSentenceInstance(Locale.US);
        String source = "This is a sentence. This is another. Rawlings, G. stated foo and bar.";
        iterator.setText(source);
        int start = iterator.first();
        for (int end = iterator.next();
                end != BreakIterator.DONE;
                start = end, end = iterator.next()) {
            System.out.println(source.substring(start, end));
        }
    }
}
The output is:
This is a sentence.
This is another.
Rawlings, G. stated foo and bar.
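If the sentences are needed as values rather than printed lines, the same loop can be wrapped in a helper. This is only a sketch: the class name `SentenceList` and the method `sentences` are illustrative, and trimming each piece is an assumption about what the caller wants. Where the boundaries fall is decided by the JDK's sentence rules for the given locale, so no output is shown here.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class SentenceList {
    // Collects the pieces found by BreakIterator into a list, trimming the
    // whitespace that the iterator leaves at the end of each piece.
    static List<String> sentences(String source, Locale locale) {
        BreakIterator iterator = BreakIterator.getSentenceInstance(locale);
        iterator.setText(source);
        List<String> result = new ArrayList<>();
        int start = iterator.first();
        for (int end = iterator.next();
                end != BreakIterator.DONE;
                start = end, end = iterator.next()) {
            String piece = source.substring(start, end).trim();
            if (!piece.isEmpty()) {
                result.add(piece);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String source = "This is a sentence. This is another. Rawlings, G. stated foo and bar.";
        for (String sentence : sentences(source, Locale.US)) {
            System.out.println(sentence);
        }
    }
}
```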