Question

I need to split a string containing sentences, for example:

"this is a sentence. this is another. Rawlings, G. stated foo and bar."

into

["this is a sentence.", "this is another.", "Rawlings, G. stated foo and bar."]

using a regular expression.

The other solutions I found split the third sentence into "Rawlings, G." and "stated foo and bar.", which is not what I want.
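To make the failure concrete, here is a minimal sketch of the kind of naive split described above. The question does not quote the other solutions, so the pattern below (split on whitespace that follows a period) is only an assumed stand-in, and the class and method names are hypothetical.

```java
import java.util.Arrays;

class NaiveSplitDemo {
    // Assumed naive rule: every '.' followed by whitespace ends a sentence.
    static String[] naiveSplit(String text) {
        return text.split("(?<=\\.)\\s+");
    }

    public static void main(String[] args) {
        String s = "this is a sentence. this is another. Rawlings, G. stated foo and bar.";
        // The initial "G." is treated as a sentence end, so four pieces come back instead of three.
        System.out.println(Arrays.toString(naiveSplit(s)));
    }
}
```

The third and fourth elements are "Rawlings, G." and "stated foo and bar.", which is exactly the unwanted split.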

Answer 1

Regular expressions generally cannot solve this problem.

You need a sentence detection algorithm; OpenNLP has one.

It is quite simple to use:

String sentences[] = sentenceDetector.sentDetect(yourString);

It handles many tricky cases:

"Walter White Jr. has money"
"Mr. Pink doesn't tip"

Answer 2

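Use nested lookbehinds.

Split your input string with the regex below. It splits at the boundary that sits immediately after a dot, and it checks the character before that dot: the split happens only when the character preceding the dot is not an uppercase letter.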

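// Requires: import java.util.Arrays;
String s = "this is a sentence. this is another. Rawlings, G. stated foo and bar.";
// Split at the zero-width position right after a '.' that is not preceded by an uppercase letter.
String[] tok = s.split("(?<=(?<![A-Z])\\.)");
System.out.println(Arrays.toString(tok));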

Output:

[this is a sentence.,  this is another.,  Rawlings, G. stated foo and bar.]

Explanation:

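(?<=(?<![A-Z])\\.) matches the boundary that sits just after a dot, provided that the dot is not preceded by an uppercase letter. Because the match is zero-width, the space that follows each dot stays at the start of the next token, as the output shows.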
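As a hedged variation on this answer, the sketch below appends `\\s*` to the same lookbehind so the whitespace after each dot is consumed by the split, and it also runs the pattern on a sentence taken from Answer 1's tricky cases. The class and method names are hypothetical, and the second input string is assembled here for illustration only.

```java
import java.util.Arrays;

class LookbehindSplitSketch {
    // Same nested lookbehind as in the answer, followed by \s* so that the
    // whitespace after the dot is dropped instead of leading the next token.
    static String[] splitSentences(String text) {
        return text.split("(?<=(?<![A-Z])\\.)\\s*");
    }

    public static void main(String[] args) {
        String ok = "this is a sentence. this is another. Rawlings, G. stated foo and bar.";
        // Three tokens, none with a leading space.
        System.out.println(Arrays.toString(splitSentences(ok)));

        String tricky = "Walter White Jr. has money. Mr. Pink doesn't tip.";
        // "Jr." and "Mr." end in a lowercase letter, so the pattern splits after them too.
        System.out.println(Arrays.toString(splitSentences(tricky)));
    }
}
```

The second call yields four tokens ("Walter White Jr.", "has money.", "Mr.", "Pink doesn't tip."), which shows the limit of the single-character check: it protects initials such as "G." but not abbreviations that end in a lowercase letter.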

Answer 3

I tried this:

import java.text.BreakIterator;
import java.util.Locale;

public class StringSplit {
    public static void main(String args[]) throws Exception {
        BreakIterator iterator = BreakIterator.getSentenceInstance(Locale.US);
        String source = "This is a sentence. This is another. Rawlings, G. stated foo and bar.";
        iterator.setText(source);
        int start = iterator.first();
        for ( int end = iterator.next(); 
              end != BreakIterator.DONE; 
              start = end, end = iterator.next()) {
            System.out.println(source.substring(start, end));
        }
    }
}

The output is:

This is a sentence.
This is another.
Rawlings, G. stated foo and bar.
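If the sentences are needed as a collection rather than printed, the same loop can be wrapped in a helper. This is only a sketch: the class name, the method name, and the choice to trim each sentence are assumptions added here, not part of the answer.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class SentenceLister {
    // Collects the sentences found by BreakIterator, trimmed of surrounding whitespace.
    static List<String> sentences(String text, Locale locale) {
        BreakIterator iterator = BreakIterator.getSentenceInstance(locale);
        iterator.setText(text);
        List<String> result = new ArrayList<>();
        int start = iterator.first();
        for (int end = iterator.next(); end != BreakIterator.DONE; start = end, end = iterator.next()) {
            String sentence = text.substring(start, end).trim();
            if (!sentence.isEmpty()) {
                result.add(sentence);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String source = "This is a sentence. This is another. Rawlings, G. stated foo and bar.";
        for (String sentence : sentences(source, Locale.US)) {
            System.out.println(sentence);
        }
    }
}
```

Note that the input in this answer capitalizes the start of each sentence, unlike the string in the question, so the result on the all-lowercase original may differ and is worth checking separately.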
