Question

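I need to split Chinese text into sentences. I tried the Stanford DocumentPreProcessor. It works well for English, but not for Chinese.

Can you recommend any good sentence splitters for Chinese, preferably in Java or Python?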

Answer 1

Use some regex tricks in Python (the pattern is modified from the regex in section 2.3 of http://aclweb.org/anthology/Y/Y11/Y11-1038.pdf):

import re

paragraph = u'\u70ed\u5e26\u98ce\u66b4\u5c1a\u5854\u5c14\u662f2001\u5e74\u5927\u897f\u6d0b\u98d3\u98ce\u5b63\u7684\u4e00\u573a\u57288\u6708\u7a7f\u8d8a\u4e86\u52a0\u52d2\u6bd4\u6d77\u7684\u5317\u5927\u897f\u6d0b\u70ed\u5e26\u6c14\u65cb\u3002\u5c1a\u5854\u5c14\u4e8e8\u670814\u65e5\u7531\u70ed\u5e26\u5927\u897f\u6d0b\u7684\u4e00\u80a1\u4e1c\u98ce\u6ce2\u53d1\u5c55\u800c\u6210\uff0c\u5176\u5b58\u5728\u7684\u5927\u90e8\u5206\u65f6\u95f4\u91cc\u90fd\u5728\u5feb\u901f\u5411\u897f\u79fb\u52a8\uff0c\u9000\u5316\u6210\u4e1c\u98ce\u6ce2\u540e\u7a7f\u8d8a\u4e86\u5411\u98ce\u7fa4\u5c9b\u3002'

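def zng(paragraph):
    # Each match is a run of non-terminator characters followed by at most
    # one sentence terminator (full-width ！？。 or ASCII . ! ?).
    for sent in re.findall(r'[^！？。.!?]+[！？。.!?]?', paragraph):
        yield sent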

list(zng(paragraph))

Regex explanation: https://regex101.com/r/eNFdqM/2
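One limitation of the pattern above is that a closing quotation mark or bracket that follows a terminator ends up at the start of the next sentence. The following is a minimal sketch of one way to keep such closers attached, not a tested production splitter. The names `split_sentences`, `_TERMINATORS` and `_CLOSERS` are hypothetical, the character sets are assumptions that you may need to extend, and the ASCII period is deliberately left out so that decimals such as 3.5 are not split.

```python
import re

# Assumed terminator set: ideographic full stop, full-width and ASCII ! and ?,
# and the ellipsis character. The ASCII period is omitted on purpose.
_TERMINATORS = "。！？!?…"
# Assumed set of closing quotes/brackets that belong to the sentence they close.
_CLOSERS = "”’」』）》】"

# A sentence is: one or more non-terminators, then any run of terminators,
# then any run of closers.
_SENT_RE = re.compile(
    "[^" + _TERMINATORS + "]+[" + _TERMINATORS + "]*[" + _CLOSERS + "]*"
)

def split_sentences(text):
    """Return the non-empty sentences found in text, stripped of whitespace."""
    sentences = []
    for match in _SENT_RE.finditer(text):
        sentence = match.group().strip()
        if sentence:
            sentences.append(sentence)
    return sentences

for s in split_sentences("他说：“你好！”我走了。明天见？"):
    print(s)
```

Text without a final terminator is still returned as a last sentence, which matches the behaviour of the original pattern.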

Answer 2

Any of these open-source projects should be useful:

HanLP https://github.com/hankcs/HanLP
FudanNLP https://github.com/FudanNLP/fnlp

Answer 3

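For unsegmented text, if you are using the Stanford libraries, you probably want their Chinese CoreNLP. It is not documented as well as the basic CoreNLP, but it will work for your task.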

http://nlp.stanford.edu/software/corenlp-faq.shtml#languages
http://nlp.stanford.edu/software/corenlp.shtml

You need the segmenter and the sentence splitter: "segment, ssplit". The other annotators are not relevant.

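Alternatively, you can use the WordToSentenceSplitter class directly, at edu.stanford.nlp.process.WordToSentenceSplitter (in current CoreNLP releases this class appears to be named WordToSentenceProcessor). If you do, look at how it is used in WordsToSentencesAnnotator.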
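The reason the segmenter has to run first is that this kind of sentence splitter works on a token sequence rather than on raw characters. The Python sketch below illustrates that idea only. It is not CoreNLP's implementation: `tokens_to_sentences` and `BOUNDARY` are hypothetical names, and the boundary pattern is an assumption.

```python
import re

# Assumed pattern for tokens that end a sentence (not taken from CoreNLP).
BOUNDARY = re.compile(r"[.。]|[!?！？]+")

def tokens_to_sentences(tokens):
    """Group already-segmented tokens into sentences (lists of tokens)."""
    sentences = []
    current = []
    for token in tokens:
        current.append(token)
        # Close the sentence when the whole token is a boundary marker.
        if BOUNDARY.fullmatch(token):
            sentences.append(current)
            current = []
    # Keep trailing tokens that were not followed by a boundary token.
    if current:
        sentences.append(current)
    return sentences

# The token list stands in for the output of a word segmenter.
tokens = ["我", "喜欢", "猫", "。", "你", "呢", "？"]
for sentence in tokens_to_sentences(tokens):
    print("".join(sentence))
```

Matching whole tokens means that a period inside a token such as "3.5" does not end a sentence, provided the segmenter keeps the number as one token.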

Splitting a Chinese document into sentences
