Question

Question summary: Tokenization with the Stanford parser is slow on my local machine, but much faster on Spark. Why?

I am using the Stanford CoreNLP tool to tokenize sentences.

My Scala script looks like this:

import java.util.Properties
import scala.collection.JavaConversions._ 
import scala.collection.immutable.ListMap
import scala.io.Source

import edu.stanford.nlp.ling.CoreAnnotations.TokensAnnotation
import edu.stanford.nlp.ling.CoreLabel
import edu.stanford.nlp.pipeline.Annotation
import edu.stanford.nlp.pipeline.StanfordCoreNLP
val properties = new Properties()
val coreNLP = new StanfordCoreNLP(properties)

def tokenize(s: String)  = { 
  properties.setProperty("annotators", "tokenize")
  val annotation = new Annotation(s)
  coreNLP.annotate(annotation)
  annotation.get(classOf[TokensAnnotation]).map(_.value.toString)
}

tokenize("Here is my sentence.")

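A single call to the tokenize function takes roughly 0.1 seconds (at least). That is very slow, because I have 3 million sentences. (3M * 0.1 sec = 300K sec = 5,000 minutes, or about 83 hours.)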

As an alternative, I ran the tokenizer on Spark (with four worker machines).

import java.util.List
import java.util.Properties
import scala.collection.JavaConversions._  
import edu.stanford.nlp.ling.CoreAnnotations.TokensAnnotation
import edu.stanford.nlp.ling.CoreLabel
import edu.stanford.nlp.pipeline.Annotation
import edu.stanford.nlp.pipeline.StanfordCoreNLP

val file = sc.textFile("hdfs:///myfiles")

def tokenize(s: String)  = { 
  val properties = new Properties()
  properties.setProperty("annotators", "tokenize")
  val coreNLP = new StanfordCoreNLP(properties)
  val annotation = new Annotation(s)
  coreNLP.annotate(annotation)
  annotation.get(classOf[TokensAnnotation]).map(_.toString)
}

def normalizeToken(t: String) = {
  val ts = t.toLowerCase
  val num = "[0-9]+[,0-9]*".r
  ts match {
    case num() => "NUMBER"
    case _ => ts
  }
}

val tokens = file.map(tokenize(_))
val tokenList = tokens.flatMap(_.map(normalizeToken))
val wordCount = tokenList.map((_,1)).reduceByKey(_ + _).sortBy(_._2, false)
wordCount.saveAsTextFile("wordcount")

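This script finishes tokenizing and word-counting the 3 million sentences in just 5 minutes! And the results seem reasonable. Why is this so fast? Or rather, why is the first Scala script so slow?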

Answer 1

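The problem with the first approach is that the annotators property is set after the StanfordCoreNLP object has been initialized. As a result, CoreNLP is initialized with the default list of annotators, which includes the part-of-speech tagger and the parser. These are orders of magnitude slower than the tokenizer.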

To fix this, simply move the line

properties.setProperty("annotators", "tokenize")

before the line

val coreNLP = new StanfordCoreNLP(properties)
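
With that reordering applied, the first script might look roughly like the sketch below. It assumes the CoreNLP jar is on the classpath and that the code runs as a script or in the Scala REPL. The explicit `Seq[String]` return type and the use of `JavaConverters` with `.asScala` (instead of the implicit `JavaConversions` in the question) are choices made for this sketch, not part of the original code.

```scala
import java.util.Properties
import scala.collection.JavaConverters._

import edu.stanford.nlp.ling.CoreAnnotations.TokensAnnotation
import edu.stanford.nlp.pipeline.{Annotation, StanfordCoreNLP}

// Set the annotators BEFORE building the pipeline, so that only the
// tokenizer is loaded rather than the default annotator list.
val properties = new Properties()
properties.setProperty("annotators", "tokenize")

// Build the pipeline once and reuse it for every sentence.
val coreNLP = new StanfordCoreNLP(properties)

def tokenize(s: String): Seq[String] = {
  val annotation = new Annotation(s)
  coreNLP.annotate(annotation)
  // Convert the Java list of CoreLabel to a Scala list of token strings.
  annotation.get(classOf[TokensAnnotation]).asScala.map(_.value).toList
}

tokenize("Here is my sentence.")
```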

This should even be slightly faster than the second approach, since you don't have to reinitialize CoreNLP for every sentence.
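
The same idea can be carried over to the Spark version. The following is only a sketch, assuming it runs in spark-shell (where `sc` is predefined) and that the CoreNLP jar is available on the workers. It uses `mapPartitions` to build one pipeline per partition instead of one per sentence. This is a suggestion that goes beyond the question's code and has not been benchmarked here.

```scala
import java.util.Properties
import scala.collection.JavaConverters._

import edu.stanford.nlp.ling.CoreAnnotations.TokensAnnotation
import edu.stanford.nlp.pipeline.{Annotation, StanfordCoreNLP}

val file = sc.textFile("hdfs:///myfiles")

val tokens = file.mapPartitions { sentences =>
  // Created on the worker, once per partition, so the pipeline itself
  // never has to be serialized and is not rebuilt for each sentence.
  val properties = new Properties()
  properties.setProperty("annotators", "tokenize")
  val coreNLP = new StanfordCoreNLP(properties)

  sentences.map { s =>
    val annotation = new Annotation(s)
    coreNLP.annotate(annotation)
    annotation.get(classOf[TokensAnnotation]).asScala.map(_.value).toList
  }
}
```

The resulting `tokens` RDD can then be fed into the same `flatMap` and `reduceByKey` steps as in the question.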
