How to split sentences using the nltk.parse.stanford library

Date: 2016-04-09 02:41:27

Tags: python parsing nlp nltk stanford-nlp

I am trying to use the Stanford Parser from nltk.parse.stanford for a number of NLP tasks. I can do certain operations on sentences when I explicitly pass a sentence, or a list of sentences, as input. But how do I split a large block of text into sentences? (Obviously, a regular expression on full stops and the like will not work well.)

I checked the documentation here but did not find anything: http://www.nltk.org/api/nltk.parse.html?highlight=stanford#module-nltk.parse.stanford

I found something similar that works for Java here: How can I split a text into sentences using the Stanford parser?

I think I need something like that for the Python version of the library.

2 Answers:

Answer 0 (score: 3)

First, set up the Stanford tools and NLTK correctly, e.g. on Linux:

alvas@ubi:~$ cd
alvas@ubi:~$ wget http://nlp.stanford.edu/software/stanford-parser-full-2015-12-09.zip
alvas@ubi:~$ unzip stanford-parser-full-2015-12-09.zip
alvas@ubi:~$ ls stanford-parser-full-2015-12-09
bin                        ejml-0.23.jar          lexparser-gui.sh              LICENSE.txt       README_dependencies.txt  StanfordDependenciesManual.pdf
build.xml                  ejml-0.23-src.zip      lexparser_lang.def            Makefile          README.txt               stanford-parser-3.6.0-javadoc.jar
conf                       lexparser.bat          lexparser-lang.sh             ParserDemo2.java  ShiftReduceDemo.java     stanford-parser-3.6.0-models.jar
data                       lexparser-gui.bat      lexparser-lang-train-test.sh  ParserDemo.java   slf4j-api.jar            stanford-parser-3.6.0-sources.jar
DependencyParserDemo.java  lexparser-gui.command  lexparser.sh                  pom.xml           slf4j-simple.jar         stanford-parser.jar
alvas@ubi:~$ export STANFORDTOOLSDIR=$HOME
alvas@ubi:~$ export CLASSPATH=$STANFORDTOOLSDIR/stanford-parser-full-2015-12-09/stanford-parser.jar:$STANFORDTOOLSDIR/stanford-parser-full-2015-12-09/stanford-parser-3.6.0-models.jar

(For more details, see https://gist.github.com/alvations/e1df0ba227e542955a8a; for Windows instructions, see https://gist.github.com/alvations/0ed8641d7d2e1941b9f9)
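If you would rather not rely on shell exports, the same environment can be set from within Python before importing the parser; a minimal sketch, where the paths are assumptions based on the directory layout shown above and should be adjusted to your install:

import os

# Point CLASSPATH at the two Stanford Parser jars, mirroring the shell exports above.
# Assumes the zip was unpacked under the home directory, as in the listing.
stanford_dir = os.path.join(os.path.expanduser('~'), 'stanford-parser-full-2015-12-09')
os.environ['CLASSPATH'] = os.pathsep.join([
    os.path.join(stanford_dir, 'stanford-parser.jar'),
    os.path.join(stanford_dir, 'stanford-parser-3.6.0-models.jar'),
])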

Then tokenize the text into a list of strings, where each item in the list is one sentence, using Kiss and Strunk (2006), i.e. NLTK's sent_tokenize, which is backed by the Punkt algorithm:

>>> from nltk import sent_tokenize, word_tokenize
>>> sentences = 'This is the first sentence. This is the second. And this is the third'
>>> sent_tokenize(sentences)
['This is the first sentence.', 'This is the second.', 'And this is the third']

Then feed the stream of sentences to the Stanford parser:

>>> from nltk.parse.stanford import StanfordParser
>>> parser = StanfordParser(model_path="edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz")
>>> list(list(parsed_sent) for parsed_sent in parser.raw_parse_sents(sent_tokenize(sentences)))
[[Tree('ROOT', [Tree('S', [Tree('NP', [Tree('DT', ['This'])]), Tree('VP', [Tree('VBZ', ['is']), Tree('NP', [Tree('DT', ['the']), Tree('JJ', ['first']), Tree('NN', ['sentence'])])]), Tree('.', ['.'])])])], [Tree('ROOT', [Tree('S', [Tree('NP', [Tree('DT', ['This'])]), Tree('VP', [Tree('VBZ', ['is']), Tree('NP', [Tree('DT', ['the']), Tree('NN', ['second'])])]), Tree('.', ['.'])])])], [Tree('ROOT', [Tree('S', [Tree('CC', ['And']), Tree('NP', [Tree('DT', ['this'])]), Tree('VP', [Tree('VBZ', ['is']), Tree('NP', [Tree('DT', ['the']), Tree('JJ', ['third'])])])])])]]
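The result is a list of lists of nltk.tree.Tree objects, so the parses can be post-processed as usual; a small sketch reusing the parser and sentences defined above:

# Collect the parse trees and print each one in bracketed form.
trees = [list(parsed_sent) for parsed_sent in parser.raw_parse_sents(sent_tokenize(sentences))]
for sentence_trees in trees:
    for tree in sentence_trees:
        print(tree)      # bracketed parse, e.g. (ROOT (S (NP (DT This)) ...))
        # tree.draw()    # opens a GUI viewer, if a display is available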

Answer 1 (score: 0)

From the NLTK website (http://www.nltk.org/api/nltk.tokenize.html?highlight=split%20sentence):

Punkt Sentence Tokenizer

This tokenizer divides a text into a list of sentences by using an unsupervised algorithm to build a model for abbreviation words, collocations, and words that start sentences. It must be trained on a large collection of plaintext in the target language before it can be used.

Example code:

import nltk.data

# load the pretrained English Punkt sentence tokenizer shipped with NLTK
sent_detector = nltk.data.load('tokenizers/punkt/english.pickle')
print('\n-----\n'.join(sent_detector.tokenize('hello there. how are you doing today, mr. bojangles?')))
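As the quoted description notes, the pickle above is an already-trained model; if your text differs a lot from standard English, the tokenizer can also be trained on your own plaintext. A minimal sketch, where my_corpus.txt is a hypothetical plain-text file in the target language:

from nltk.tokenize.punkt import PunktSentenceTokenizer

# my_corpus.txt is a placeholder: any large plain-text file in the target language
with open('my_corpus.txt') as fh:
    training_text = fh.read()

# unsupervised training happens in the constructor when raw text is passed in
custom_detector = PunktSentenceTokenizer(training_text)
print(custom_detector.tokenize('Dr. Smith arrived. He was late.'))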