Spark Stanford parser out of memory

Time: 2017-08-09 13:44:17

Tags: scala apache-spark stanford-nlp

I am using StanfordCoreNLP 2.4.1 on Spark 1.5 to parse Chinese sentences, but I keep hitting a Java heap out-of-memory (OOM) exception.

The sentences I feed into the parser are pre-segmented words joined with spaces.

The exception I hit is a Java heap space OutOfMemoryError. My Scala code is:

val modelpath = "edu/stanford/nlp/models/lexparser/xinhuaFactored.ser.gz"
val lp = LexicalizedParser.loadModel(modelpath)

// Each input line is id \t ... \t word_seg, where word_seg holds "word:tag"
// pairs separated by the \1 (\u0001) control character.
val dataWords = data.map { x =>
  val tokens = x.split("\t")
  val id = tokens(0)
  val word_seg = tokens(2)
  val comm_words = word_seg.split("\1")
    .filter(_.split(":").length == 2)
    .map(y => (y.split(":")(0), y.split(":")(1))) // (word, tag)
  (id, comm_words)
}.filter(_._2.nonEmpty)

// Cut each word sequence into sentence slices at punctuation (tag "34"),
// keeping only non-empty slices shorter than 20 characters.
val dataSenSlice = dataWords.map { x =>
  val id = x._1
  val comm_words = x._2
  val punctuationIndex = Array(0) ++ comm_words.zipWithIndex.filter(_._1._2 == "34").map(_._2) ++ Array(comm_words.length - 1)
  val senIndex = (punctuationIndex zip punctuationIndex.tail).filter(z => z._1 != z._2)
  val senSlice = senIndex.map { z =>
    val begin = if (z._1 > 0) z._1 + 1 else z._1
    val end = if (z._2 == comm_words.length - 1) z._2 + 1 else z._2
    if (comm_words.slice(begin, end).filter(_._2 != "34").nonEmpty) {
      comm_words.slice(begin, end).filter(_._2 != "34").map(_._1).mkString(" ").trim
    } else ""
  }.filter(l => l.nonEmpty && l.length < 20)
  (id, senSlice)
}.filter(_._2.nonEmpty)

// Parse every sentence slice and write lines of the form id \t tree\1tree\1...
val dataPoint = dataSenSlice.map { x =>
  val id = x._1
  val senSlice = x._2
  val senParse = senSlice.map { y =>
    StanfordNLPParser.senParse(lp, y) // Java wrapper around the Stanford parser
  }
  id + "\t" + senParse.mkString("\1")
}
dataPoint.saveAsTextFile(PARSED_MERGED_POI)
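
StanfordNLPParser.senParse is not shown in the question. For reference, a minimal sketch of what such a wrapper could look like, assuming the space-joined, pre-segmented sentence is simply split back into tokens and parsed with the LexicalizedParser API used in the ParserDemo that ships with the parser; the object name, method shape and sample sentence are illustrative assumptions, not the asker's actual code:

import edu.stanford.nlp.ling.Sentence
import edu.stanford.nlp.parser.lexparser.LexicalizedParser
import edu.stanford.nlp.trees.Tree

// Hypothetical stand-in for the wrapped Java parser used above.
object StanfordNLPParserSketch {
  // Parse one pre-segmented sentence, e.g. "中国 进出口 银行 与 中国 银行 加强 合作".
  def senParse(lp: LexicalizedParser, sentence: String): String = {
    val words = Sentence.toWordList(sentence.split(" "): _*) // keep the existing segmentation
    val tree: Tree = lp.apply(words)                         // constituency parse
    tree.toString                                            // bracketed tree string
  }
}

Note that lp is loaded once on the driver and captured by the map closure, so the serialized model is shipped with the tasks and the executors need enough heap to hold the deserialized model in addition to the data being parsed.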

I would like to know whether I am using the right approach for sentence parsing here, or whether something else is wrong.

1 Answer:

Answer 0 (score: 1)

Suggestions:

  1. Increase the number of partitions, e.g.

         data.repartition(500)

     This reshuffles the data in the RDD randomly to create either more or fewer partitions and balances it across them; it always shuffles all data over the network. (See the placement sketch after this list.)

  2. Increase executor and driver memory, e.g. by adding these 'spark-submit' arguments (a full example command follows the list):

         --executor-memory 8G
         --driver-memory 4G
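
For illustration, here is how the two suggestions could be wired into the job from the question. This is only a sketch: 500 is the example partition count from suggestion 1, and com.example.ParseJob / parse-job.jar in the spark-submit command are hypothetical placeholders, since the question does not show how the job is submitted.

// Suggestion 1: spread the input over more, smaller partitions before the
// memory-heavy parsing stage, so each task processes fewer sentences at once.
val dataRepartitioned = data.repartition(500) // 500 is illustrative; tune for the cluster
// ... then build dataWords from dataRepartitioned instead of data.

# Suggestion 2: pass the extra memory to spark-submit
# (com.example.ParseJob and parse-job.jar are hypothetical placeholders).
spark-submit \
  --class com.example.ParseJob \
  --executor-memory 8G \
  --driver-memory 4G \
  parse-job.jar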