I am trying to use the Stanford Parser from nltk.parse.stanford for a bunch of NLP tasks. There are certain operations I can perform on sentences when I explicitly pass a sentence or a list of sentences as input, but how do I split a large body of text into sentences? (Obviously a regular expression on full stops and the like won't work well.)
I checked the documentation here but didn't find anything: http://www.nltk.org/api/nltk.parse.html?highlight=stanford#module-nltk.parse.stanford
I found something similar for Java here: How can I split a text into sentences using the Stanford parser?
I think I need something like that for the Python version of the library.
Answer 0 (score: 3)
First set up the Stanford tools and NLTK properly, e.g. on Linux:
alvas@ubi:~$ cd
alvas@ubi:~$ wget http://nlp.stanford.edu/software/stanford-parser-full-2015-12-09.zip
alvas@ubi:~$ unzip stanford-parser-full-2015-12-09.zip
alvas@ubi:~$ ls stanford-parser-full-2015-12-09
bin ejml-0.23.jar lexparser-gui.sh LICENSE.txt README_dependencies.txt StanfordDependenciesManual.pdf
build.xml ejml-0.23-src.zip lexparser_lang.def Makefile README.txt stanford-parser-3.6.0-javadoc.jar
conf lexparser.bat lexparser-lang.sh ParserDemo2.java ShiftReduceDemo.java stanford-parser-3.6.0-models.jar
data lexparser-gui.bat lexparser-lang-train-test.sh ParserDemo.java slf4j-api.jar stanford-parser-3.6.0-sources.jar
DependencyParserDemo.java lexparser-gui.command lexparser.sh pom.xml slf4j-simple.jar stanford-parser.jar
alvas@ubi:~$ export STANFORDTOOLSDIR=$HOME
alvas@ubi:~$ export CLASSPATH=$STANFORDTOOLSDIR/stanford-parser-full-2015-12-09/stanford-parser.jar:$STANFORDTOOLSDIR/stanford-parser-full-2015-12-09/stanford-parser-3.6.0-models.jar
(For details see https://gist.github.com/alvations/e1df0ba227e542955a8a, and for Windows instructions see https://gist.github.com/alvations/0ed8641d7d2e1941b9f9)
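If you would rather do the setup from inside Python than in the shell, a minimal sketch (assuming the parser was unzipped into your home directory as above; adjust the path otherwise):
import os

# Assumed install location: the directory unzipped above; change it if yours differs.
stanford_dir = os.path.join(os.path.expanduser('~'), 'stanford-parser-full-2015-12-09')

# NLTK's Stanford wrappers locate the jars through the CLASSPATH environment
# variable, so setting it here mirrors the shell `export` commands above.
os.environ['CLASSPATH'] = os.pathsep.join([
    os.path.join(stanford_dir, 'stanford-parser.jar'),
    os.path.join(stanford_dir, 'stanford-parser-3.6.0-models.jar'),
])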
Then sentence-tokenize the text into a list of strings, where each item in the list is one sentence, using the Kiss and Strunk (2006) algorithm:
>>> from nltk import sent_tokenize, word_tokenize
>>> sentences = 'This is the first sentence. This is the second. And this is the third'
>>> sent_tokenize(sentences)
['This is the first sentence.', 'This is the second.', 'And this is the third']
Then feed the document stream to the Stanford parser:
>>> from nltk.parse.stanford import StanfordParser
>>> parser = StanfordParser(model_path="edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz")
>>> list(list(parsed_sent) for parsed_sent in parser.raw_parse_sents(sent_tokenize(sentences)))
[[Tree('ROOT', [Tree('S', [Tree('NP', [Tree('DT', ['This'])]), Tree('VP', [Tree('VBZ', ['is']), Tree('NP', [Tree('DT', ['the']), Tree('JJ', ['first']), Tree('NN', ['sentence'])])]), Tree('.', ['.'])])])], [Tree('ROOT', [Tree('S', [Tree('NP', [Tree('DT', ['This'])]), Tree('VP', [Tree('VBZ', ['is']), Tree('NP', [Tree('DT', ['the']), Tree('NN', ['second'])])]), Tree('.', ['.'])])])], [Tree('ROOT', [Tree('S', [Tree('CC', ['And']), Tree('NP', [Tree('DT', ['this'])]), Tree('VP', [Tree('VBZ', ['is']), Tree('NP', [Tree('DT', ['the']), Tree('JJ', ['third'])])])])])]]
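For the original use case of a large body of text, the same two calls can be chained; a minimal sketch (the file name is just a placeholder, and the CLASSPATH setup above is assumed):
from nltk import sent_tokenize
from nltk.parse.stanford import StanfordParser

# Placeholder file name; substitute your own document.
with open('large_document.txt') as fh:
    text = fh.read()

parser = StanfordParser(model_path="edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz")

# sent_tokenize splits the raw text into sentence strings, and raw_parse_sents
# parses each one, yielding an iterator of tree iterators.
trees = [list(parsed_sent) for parsed_sent in parser.raw_parse_sents(sent_tokenize(text))]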
Answer 1 (score: 0)
From the nltk website (http://www.nltk.org/api/nltk.tokenize.html?highlight=split%20sentence):
Punkt Sentence Tokenizer
This tokenizer divides a text into a list of sentences by using an unsupervised algorithm to build a model for abbreviations, collocations, and words that start sentences. It must be trained on a large collection of plain text in the target language before it can be used.
Sample code:
import nltk.data

# Load the pre-trained English Punkt model shipped with NLTK's punkt data package.
sent_detector = nltk.data.load('tokenizers/punkt/english.pickle')
print('\n-----\n'.join(sent_detector.tokenize('hello there. how are you doing today, mr. bojangles?')))
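Because Punkt is unsupervised, it can also be retrained on your own plain text if the pre-trained English model mishandles domain-specific abbreviations; a minimal sketch (the corpus path is a made-up placeholder):
from nltk.tokenize.punkt import PunktSentenceTokenizer

# Placeholder path to a large plain-text corpus in the target language/domain.
with open('my_domain_corpus.txt') as fh:
    train_text = fh.read()

# Passing raw text to the constructor trains the unsupervised Punkt model on it.
sent_detector = PunktSentenceTokenizer(train_text)
print(sent_detector.tokenize('Dr. Smith arrived at 5 p.m. He was not late.'))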