I am trying to use the Stanford Parser from nltk.parse.stanford for a bunch of NLP tasks. There are certain operations I can perform on sentences when I explicitly pass a sentence or a list of sentences as input, but how do I split a large body of text into sentences? (Obviously a regular expression on full stops and the like won't work well.)
I checked the documentation here but didn't find anything: http://www.nltk.org/api/nltk.parse.html?highlight=stanford#module-nltk.parse.stanford
I found something similar for Java here: How can I split a text into sentences using the Stanford parser?
I think I need something like that for the Python version of the library.
Answer 0 (score: 3)
First set up the Stanford tools and NLTK properly, e.g. on Linux:
alvas@ubi:~$ cd
alvas@ubi:~$ wget http://nlp.stanford.edu/software/stanford-parser-full-2015-12-09.zip
alvas@ubi:~$ unzip stanford-parser-full-2015-12-09.zip
alvas@ubi:~$ ls stanford-parser-full-2015-12-09
bin ejml-0.23.jar lexparser-gui.sh LICENSE.txt README_dependencies.txt StanfordDependenciesManual.pdf
build.xml ejml-0.23-src.zip lexparser_lang.def Makefile README.txt stanford-parser-3.6.0-javadoc.jar
conf lexparser.bat lexparser-lang.sh ParserDemo2.java ShiftReduceDemo.java stanford-parser-3.6.0-models.jar
data lexparser-gui.bat lexparser-lang-train-test.sh ParserDemo.java slf4j-api.jar stanford-parser-3.6.0-sources.jar
DependencyParserDemo.java lexparser-gui.command lexparser.sh pom.xml slf4j-simple.jar stanford-parser.jar
alvas@ubi:~$ export STANFORDTOOLSDIR=$HOME
alvas@ubi:~$ export CLASSPATH=$STANFORDTOOLSDIR/stanford-parser-full-2015-12-09/stanford-parser.jar:$STANFORDTOOLSDIR/stanford-parser-full-2015-12-09/stanford-parser-3.6.0-models.jar
(For details see https://gist.github.com/alvations/e1df0ba227e542955a8a, and for Windows instructions see https://gist.github.com/alvations/0ed8641d7d2e1941b9f9)
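If you would rather do the setup from inside Python than in the shell, a minimal sketch (assuming the parser was unzipped into your home directory as above; adjust the path otherwise):
import os

# Assumed install location: the directory unzipped above; change it if yours differs.
stanford_dir = os.path.join(os.path.expanduser('~'), 'stanford-parser-full-2015-12-09')

# NLTK's Stanford wrappers locate the jars through the CLASSPATH environment
# variable, so setting it here mirrors the shell `export` commands above.
os.environ['CLASSPATH'] = os.pathsep.join([
    os.path.join(stanford_dir, 'stanford-parser.jar'),
    os.path.join(stanford_dir, 'stanford-parser-3.6.0-models.jar'),
])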
Then sentence-tokenize the text into a list of strings, where each item in the list is one sentence, using the Kiss and Strunk (2006) algorithm:
>>> from nltk import sent_tokenize, word_tokenize
>>> sentences = 'This is the first sentence. This is the second. And this is the third'
>>> sent_tokenize(sentences)
['This is the first sentence.', 'This is the second.', 'And this is the third']
Then feed the document stream to the Stanford parser:
>>> from nltk.parse.stanford import StanfordParser
>>> parser = StanfordParser(model_path="edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz")
>>> list(list(parsed_sent) for parsed_sent in parser.raw_parse_sents(sent_tokenize(sentences)))
[[Tree('ROOT', [Tree('S', [Tree('NP', [Tree('DT', ['This'])]), Tree('VP', [Tree('VBZ', ['is']), Tree('NP', [Tree('DT', ['the']), Tree('JJ', ['first']), Tree('NN', ['sentence'])])]), Tree('.', ['.'])])])], [Tree('ROOT', [Tree('S', [Tree('NP', [Tree('DT', ['This'])]), Tree('VP', [Tree('VBZ', ['is']), Tree('NP', [Tree('DT', ['the']), Tree('NN', ['second'])])]), Tree('.', ['.'])])])], [Tree('ROOT', [Tree('S', [Tree('CC', ['And']), Tree('NP', [Tree('DT', ['this'])]), Tree('VP', [Tree('VBZ', ['is']), Tree('NP', [Tree('DT', ['the']), Tree('JJ', ['third'])])])])])]]
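For the original use case of a large body of text, the same two calls can be chained; a minimal sketch (the file name is just a placeholder, and the CLASSPATH setup above is assumed):
from nltk import sent_tokenize
from nltk.parse.stanford import StanfordParser

# Placeholder file name; substitute your own document.
with open('large_document.txt') as fh:
    text = fh.read()

parser = StanfordParser(model_path="edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz")

# sent_tokenize splits the raw text into sentence strings, and raw_parse_sents
# parses each one, yielding an iterator of tree iterators.
trees = [list(parsed_sent) for parsed_sent in parser.raw_parse_sents(sent_tokenize(text))]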
Answer 1 (score: 0)
From the nltk website (http://www.nltk.org/api/nltk.tokenize.html?highlight=split%20sentence):
Punkt Sentence Tokenizer
This tokenizer divides a text into a list of sentences by using an unsupervised algorithm to build a model for abbreviations, collocations, and words that start sentences. It must be trained on a large collection of plain text in the target language before it can be used.
Sample code:
import nltk.data

# Load the pre-trained English Punkt model shipped with NLTK's punkt data package.
sent_detector = nltk.data.load('tokenizers/punkt/english.pickle')
print('\n-----\n'.join(sent_detector.tokenize('hello there. how are you doing today, mr. bojangles?')))
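Because Punkt is unsupervised, it can also be retrained on your own plain text if the pre-trained English model mishandles domain-specific abbreviations; a minimal sketch (the corpus path is a made-up placeholder):
from nltk.tokenize.punkt import PunktSentenceTokenizer

# Placeholder path to a large plain-text corpus in the target language/domain.
with open('my_domain_corpus.txt') as fh:
    train_text = fh.read()

# Passing raw text to the constructor trains the unsupervised Punkt model on it.
sent_detector = PunktSentenceTokenizer(train_text)
print(sent_detector.tokenize('Dr. Smith arrived at 5 p.m. He was not late.'))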