StanfordCoreNLP parse tree generation gets stuck

Time: 2017-05-10 02:15:07

Tags: scala stanford-nlp parse-tree

When I use StanfordCoreNLP on Spark to generate parse trees over big data, one of the tasks gets stuck for a very long time. I dug into the error, and the stack trace looks like this:

    at edu.stanford.nlp.ling.CoreLabel.<init>(CoreLabel.java:68)
    at edu.stanford.nlp.ling.CoreLabel$CoreLabelFactory.newLabel(CoreLabel.java:248)
    at edu.stanford.nlp.trees.LabeledScoredTreeFactory.newLeaf(LabeledScoredTreeFactory.java:51)
    at edu.stanford.nlp.parser.lexparser.Debinarizer.transformTreeHelper(Debinarizer.java:27)
    at edu.stanford.nlp.parser.lexparser.Debinarizer.transformTreeHelper(Debinarizer.java:34)
    at edu.stanford.nlp.parser.lexparser.Debinarizer.transformTreeHelper(Debinarizer.java:34)
    at edu.stanford.nlp.parser.lexparser.Debinarizer.transformTreeHelper(Debinarizer.java:34)
    at edu.stanford.nlp.parser.lexparser.Debinarizer.transformTreeHelper(Debinarizer.java:34)

I think the relevant code is the following:

    import edu.stanford.nlp.pipeline.Annotation
    import edu.stanford.nlp.pipeline.StanfordCoreNLP
    import java.util.Properties
    import edu.stanford.nlp.ling.CoreAnnotations.SentencesAnnotation
    import edu.stanford.nlp.trees.TreeCoreAnnotations.TreeAnnotation
    import edu.stanford.nlp.util.CoreMap
    import scala.collection.JavaConversions._

    object CoreNLP {
        def transform(content: String): String = {
            val v = new CoreNLP
            // note: the English result is discarded; only the Chinese parse is returned
            v.runEnglishAnnotators(content)
            v.runChineseAnnotators(content)
        }
    }

    class CoreNLP {
        def runEnglishAnnotators(inputContent: String): String = {
            val document = new Annotation(inputContent)
            val props = new Properties
            props.setProperty("annotators", "tokenize, ssplit, parse")
            val coreNLP = new StanfordCoreNLP(props)
            coreNLP.annotate(document)
            parserOutput(document)
        }

        def runChineseAnnotators(inputContent: String): String = {
            val document = new Annotation(inputContent)
            val corenlp = new StanfordCoreNLP("StanfordCoreNLP-chinese.properties")
            corenlp.annotate(document)
            parserOutput(document)
        }

        def parserOutput(document: Annotation): String = {
            val sentences = document.get(classOf[SentencesAnnotation])
            var result = ""
            for (sentence: CoreMap <- sentences) {
                val tree = sentence.get(classOf[TreeAnnotation])
                // append each sentence's parse tree to the output
                result = result + "\n" + tree.toString
            }
            result
        }
    }

My classmate says the test data is recursive, which makes the NLP run endlessly. I don't know whether that is true.

1 answer:

Answer 0 (score: 0)

If you add props.setProperty("parse.maxlen", "100") to your code, the parser will be configured not to parse sentences longer than 100 tokens. This helps prevent crash issues. You should experiment to find the best maximum sentence length for your application.
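
As a minimal sketch of where that setting could go (the wrapper name ParseWithMaxLen and the limit of 100 are illustrative assumptions, not part of the question's code), the property is set on the Properties object before the pipeline is built, and the pipeline is built once and reused:

    import java.util.Properties
    import edu.stanford.nlp.pipeline.{Annotation, StanfordCoreNLP}

    // hypothetical wrapper, for illustration only
    object ParseWithMaxLen {
        // build the pipeline once and reuse it; construction is expensive
        val props = new Properties
        props.setProperty("annotators", "tokenize, ssplit, parse")
        // skip parsing any sentence longer than 100 tokens (tune for your data)
        props.setProperty("parse.maxlen", "100")
        val pipeline = new StanfordCoreNLP(props)

        def annotate(inputContent: String): Annotation = {
            val document = new Annotation(inputContent)
            pipeline.annotate(document)
            document
        }
    }

For the Chinese pipeline, the same parse.maxlen property can presumably be set on top of the properties loaded from StanfordCoreNLP-chinese.properties.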