Spark 2.1: Out of memory when writing a DataFrame to a Parquet file?

Asked: 2017-10-23 17:15:12

Tags: scala apache-spark out-of-memory spark-dataframe

I am trying to write a DataFrame of roughly 14 million rows to a local Parquet file, but I keep running out of memory. I have a large map, val myMap : Map[String,Seq[Double]], which I use inside a UDF to build a very large DataFrame via val newDF = df.withColumn("stuff", udfWithMap). I have 128 GB of RAM available, and after persisting the DataFrame with DISK_ONLY and running df.show I still have about 100 GB free. However, when I call df.write.parquet, Spark's memory usage balloons and the job dies with an out-of-memory error. I have also tried broadcasting myMap (a sketch of that attempt is shown after the code sample below), but it did not seem to make any difference. What is going wrong?
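For context, here is a rough lower bound on the raw size of myMap, using the sizes shown in the code sample below (this ignores JVM object and boxing overhead, so the real heap footprint is considerably larger):

scala> val keys = 150000L          // approximate number of entries in myMap
scala> val vecLen = 200L           // length of each Seq[Double]
scala> keys * vecLen * 8 / 1e6     // ≈ 240 (MB of raw doubles, before JVM overhead)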

Here is a sample of my code:

scala> type LookupMapSeq = (String, Seq[Double])

scala> val myMap = sc.objectFile[LookupMapSeq]("file:///data/dir/myMap").collectAsMap()

/* myMap.size is about 150,000 and each Seq[Double] has length 200 */

scala> import org.apache.spark.sql.functions

scala> val combineudf = functions.udf[Seq[Double], Seq[String]] { v1 =>
  // look up each word's vector (zeros if missing) and sum the vectors element-wise
  val wordVec = v1.map(y => myMap.getOrElse(y, Seq.fill(200)(0.0)))
  wordVec.foldLeft(Seq.fill(200)(0.0)) { case (acc, list) =>
    acc.zipWithIndex.map { case (value, i) => value + list(i) }
  }
}

scala> import org.apache.spark.storage.StorageLevel

scala> val df6 = df3.withColumn("sum", combineudf(df3("filtered"))).persist(StorageLevel.DISK_ONLY)

scala> df6.show

+--------+--------------------+--------------------+--------------------+
|    pmid|            filtered|            TFIDFvec|                 sum|
+--------+--------------------+--------------------+--------------------+
|25393341|[retreatment, rec...|[0.0, 26.21009534...|[4.34963607663623...|
|25394466|[lactate, dehydro...|[21.3762879413052...|[-17.550128685500...|
|25394717|[aim, study, inve...|[3.11641169932197...|[-54.981726214632...|

scala> df6.write.parquet("file:///data/dir/df6")

/* This results in out of memory */
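For completeness, the broadcast attempt mentioned above looked roughly like this (the names myMapBc, combineudfBc and df6b are placeholders for this sketch, not the exact names from my session):

scala> val myMapBc = sc.broadcast(myMap)

scala> val combineudfBc = functions.udf[Seq[Double], Seq[String]] { v1 =>
  // same element-wise sum as before, but reading the map through the broadcast handle
  val m = myMapBc.value
  val wordVec = v1.map(y => m.getOrElse(y, Seq.fill(200)(0.0)))
  wordVec.foldLeft(Seq.fill(200)(0.0)) { case (acc, list) =>
    acc.zipWithIndex.map { case (value, i) => value + list(i) }
  }
}

scala> val df6b = df3.withColumn("sum", combineudfBc(df3("filtered"))).persist(StorageLevel.DISK_ONLY)

scala> df6b.write.parquet("file:///data/dir/df6b")

/* This still runs out of memory, just like the non-broadcast version */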

0 Answers:

There are no answers yet.