Question

I'm using Stanford CoreNLP in Java on a Windows machine for an NLP project. I want to annotate a large piece of text with it. The code I've written is as follows:

Properties props = new Properties();
props.put("annotators", "tokenize, ssplit, pos, lemma, ner, parse, dcoref, regexner");
StanfordCoreNLP pipeline =   new StanfordCoreNLP(props);
Annotation document = new Annotation("Text to be annotated. This text is very long!");
pipeline.annotate(document); // this line takes a long time

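Annotating the text takes quite a long time: around 60 words take about 16 seconds, which is far too long.

Is there a way to speed this up, or is there something I've missed? Please let me know what I can do. Thanx in advance :-)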

Edit

Code example

    public TextReader() {
        props.put("annotators", "tokenize, ssplit, pos, lemma, ner, parse, regexner");
        pipeline = new StanfordCoreNLP(props);
        extractor = CoreMapExpressionExtractor.createExtractorFromFiles(
                TokenSequencePattern.getNewEnv(),
                "Stanford NLP\\stanford-corenlp-full-2015-01-29\\stanford-corenlp-full-2015-01-30\\tokensregex\\color.rules.txt");
        text = "Barak Obama was born on August 4, 1961,at Kapiolani Maternity & Gynecological Hospital "
                + " in Honolulu, Hawaii, and would become the first President to have been born in Hawaii. His mother, Stanley Ann Dunham,"
                + " was born in Wichita, Kansas, and was of mostly English ancestry. His father, Barack Obama, Sr., was a Luo from Nyang’oma"
                + " Kogelo, Kenya. He studied at the University of Westminster. His favourite colour is red.";
        Logger.getLogger(TextReader.class.getName()).log(Level.INFO, "Annotator starting...", text); // LOG 1
        Annotation document = new Annotation(text);
        pipeline.annotate(document);
        Logger.getLogger(TextReader.class.getName()).log(Level.INFO, "Annotator finished...", props); // LOG 2
        sentences = document.get(SentencesAnnotation.class);
        for (CoreMap sentence : sentences) {
            // the tokens of the sentence are taken and iterated over
            // the NER, POS and lemma of the tokens are stored iteratively
        }
    }

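I noticed that the time between LOG 1 and LOG 2 is about 16 seconds. I need to process much longer texts, which takes a very long time. Can you tell me what I'm doing wrong?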

Thanx =D

Answer 1

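Is the text one long sentence? The parser's runtime is O(n^3) with respect to sentence length, and it gets quite slow for sentences longer than ~40 words. If you remove the "parse, dcoref, regexner" annotators, does it speed up? And if you then re-add "parse", does it slow down again?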
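To check the "one long sentence" question before touching the pipeline, a rough sketch like the following can help. It uses only the JDK, so its sentence splitting and word counts only approximate what CoreNLP's own ssplit and tokenize annotators produce. The class and method names are hypothetical, and the cost figure is a unitless illustration of the O(n^3) growth mentioned above, not a measurement.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class SentenceLengthCheck {

    // Splits text into sentences with the JDK's BreakIterator.
    // This only approximates CoreNLP's ssplit annotator.
    static List<String> splitSentences(String text) {
        List<String> sentences = new ArrayList<>();
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.US);
        it.setText(text);
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String s = text.substring(start, end).trim();
            if (!s.isEmpty()) {
                sentences.add(s);
            }
        }
        return sentences;
    }

    // Counts whitespace-separated words; CoreNLP's tokenizer will count slightly differently.
    static int countWords(String sentence) {
        String trimmed = sentence.trim();
        return trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
    }

    // Relative parse cost under the O(n^3) assumption (unitless, for comparison only).
    static long cubicCost(int words) {
        return (long) words * words * words;
    }

    public static void main(String[] args) {
        String text = "He studied at the University of Westminster. His favourite colour is red.";
        for (String s : splitSentences(text)) {
            int n = countWords(s);
            System.out.println(n + " words, relative cost " + cubicCost(n) + ": " + s);
        }
    }
}
```

Under the cubic assumption, a single 60-word sentence has a relative cost of 60^3 = 216,000, while three 20-word sentences cost 3 × 20^3 = 24,000 in total, about nine times less. That is why a sentence the splitter fails to break up can dominate the runtime.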

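If what you care about is dependency parses rather than constituency parses, the new "depparse" annotator will produce them much faster; however, our coref doesn't yet work with dependency parses (coming soon!).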
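As a minimal sketch of that swap, the annotator list from the edited question can be kept as it is with "parse" replaced by "depparse". This assumes a CoreNLP release that ships the "depparse" annotator. The class and method names here are hypothetical, and the block only builds the properties, so it runs without CoreNLP on the classpath.

```java
import java.util.Properties;

public class DepparseConfig {

    // Builds pipeline properties that use "depparse" instead of "parse".
    // "dcoref" is left out because coref does not yet work with dependency parses.
    static Properties dependencyParseProps() {
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize, ssplit, pos, lemma, ner, depparse, regexner");
        return props;
    }

    public static void main(String[] args) {
        Properties props = dependencyParseProps();
        System.out.println(props.getProperty("annotators"));
        // With CoreNLP on the classpath, the properties would be used as in the question:
        // StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
    }
}
```

Timing pipeline.annotate(document) with these properties and again with the original list would show how much of the 16 seconds the constituency parser accounts for.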

Stanford NLP annotation of text is very slow
