Stanford NLP is very slow at annotating text

Asked: 2015-04-09 15:52:49

Tags: java performance stanford-nlp

I am using Stanford CoreNLP for an NLP project in Java on a Windows machine. I want to annotate a large piece of text with it. The code I wrote is as follows:

Properties props = new Properties();
props.put("annotators", "tokenize, ssplit, pos, lemma, ner, parse, dcoref, regexner");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
Annotation document = new Annotation("Text to be annotated. This text is very long!");
pipeline.annotate(document); // this line takes a long time

Annotating the text takes quite a long time: about 16 seconds for roughly 60 words, which is far too long.

Is there any way to speed this up, or is there something I am missing? Please let me know what I can do. Thanx in advance :-)

Edit

Code sample

public TextReader() {
    props.put("annotators", "tokenize, ssplit, pos, lemma, ner, parse, regexner");
    pipeline = new StanfordCoreNLP(props);
    extractor = CoreMapExpressionExtractor.createExtractorFromFiles(
            TokenSequencePattern.getNewEnv(),
            "Stanford NLP\\stanford-corenlp-full-2015-01-29\\stanford-corenlp-full-2015-01-30\\tokensregex\\color.rules.txt");
    text = "Barak Obama was born on August 4, 1961,at Kapiolani Maternity & Gynecological Hospital "
            + " in Honolulu, Hawaii, and would become the first President to have been born in Hawaii. His mother, Stanley Ann Dunham,"
            + " was born in Wichita, Kansas, and was of mostly English ancestry. His father, Barack Obama, Sr., was a Luo from Nyang’oma"
            + " Kogelo, Kenya. He studied at the University of Westminster. His favourite colour is red.";
    Logger.getLogger(TextReader.class.getName()).log(Level.INFO, "Annotator starting...", text); // LOG 1
    Annotation document = new Annotation(text);
    pipeline.annotate(document);
    Logger.getLogger(TextReader.class.getName()).log(Level.INFO, "Annotator finished...", props); // LOG 2
    sentences = document.get(SentencesAnnotation.class);
    for (CoreMap sentence : sentences) {
        // the tokens of the sentence are taken and iterated over
        // the NER, POS and lemma of the tokens are stored iteratively
    }
}

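I found that the time between LOG 1 and LOG 2 is about 16 seconds. I need to process much longer texts, and that takes a very long time. Please tell me what I am doing wrong.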

Thanx =D

1 answer:

Answer 0 (score: 1)

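Is the text one long sentence? The parser's runtime is O(n^3) in the length of the sentence, and it gets quite slow for sentences longer than ~40 words. If you remove the "parse, dcoref, regexner" annotators, does it speed up? And if you then add "parse" back, does it slow down again?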
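One way to run that comparison is to annotate the same text under each annotator list and time only the annotate step. The following is a minimal sketch, not part of CoreNLP: `AnnotatorTiming`, `propsFor`, `timeConfigs` and the `pipelineFactory` hook are hypothetical names, and the stand-in factory does no real work so that the sketch runs without CoreNLP on the classpath. The comment in `main` shows where a real pipeline would plug in.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Properties;
import java.util.function.Consumer;
import java.util.function.Function;

class AnnotatorTiming {

    /** Builds the Properties for one comma-separated annotator list. */
    static Properties propsFor(String annotators) {
        Properties props = new Properties();
        props.setProperty("annotators", annotators);
        return props;
    }

    /**
     * Builds one pipeline per annotator list and times only the annotate call.
     * pipelineFactory is a hypothetical hook that turns Properties into
     * something that annotates a text.
     */
    static Map<String, Long> timeConfigs(String text, String[] annotatorLists,
            Function<Properties, Consumer<String>> pipelineFactory) {
        Map<String, Long> millis = new LinkedHashMap<>();
        for (String annotators : annotatorLists) {
            // Pipeline construction stays outside the timed region.
            Consumer<String> pipeline = pipelineFactory.apply(propsFor(annotators));
            long start = System.nanoTime();
            pipeline.accept(text);
            millis.put(annotators, (System.nanoTime() - start) / 1_000_000);
        }
        return millis;
    }

    public static void main(String[] args) {
        String[] configs = {
            "tokenize, ssplit, pos, lemma, ner",        // without the parser
            "tokenize, ssplit, pos, lemma, ner, parse"  // with "parse" added back
        };
        // Stand-in factory (assumption). With CoreNLP on the classpath, use instead:
        //   props -> { StanfordCoreNLP p = new StanfordCoreNLP(props);
        //              return t -> p.annotate(new Annotation(t)); }
        Function<Properties, Consumer<String>> standIn = props -> t -> { };
        timeConfigs("Text to be annotated.", configs, standIn)
                .forEach((annotators, ms) -> System.out.println(annotators + " -> " + ms + " ms"));
    }
}
```

If the second configuration is much slower than the first on the real text, the parser is the likely bottleneck.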

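If what you care about is dependency parses rather than constituency parses, the new "depparse" annotator will produce those much faster; however, our coref does not work with dependency parses yet (coming soon!).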
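Switching to "depparse" amounts to changing the annotator list. The sketch below rewrites the list from the question under two assumptions: constituency parses are not needed, and "dcoref" is dropped because, as noted above, coref does not yet work with dependency parses. `DepparseConfig` and `toDependencyOnly` are hypothetical names used only for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

class DepparseConfig {

    /** Replaces "parse" with "depparse" and drops "dcoref" from an annotator list. */
    static String toDependencyOnly(String annotators) {
        List<String> out = new ArrayList<>();
        for (String name : annotators.split(",")) {
            String a = name.trim();
            if (a.isEmpty() || a.equals("dcoref")) {
                continue; // skip blanks and the coref annotator
            }
            out.add(a.equals("parse") ? "depparse" : a);
        }
        return String.join(", ", out);
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty("annotators",
                toDependencyOnly("tokenize, ssplit, pos, lemma, ner, parse, dcoref, regexner"));
        // prints: tokenize, ssplit, pos, lemma, ner, depparse, regexner
        System.out.println(props.getProperty("annotators"));
        // With CoreNLP on the classpath:
        //   new StanfordCoreNLP(props).annotate(new Annotation(text));
    }
}
```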