I am trying to write a DataFrame (about 14 million rows) to a local Parquet file, but I keep running out of memory. I have a large map, val myMap : Map[String,Seq[Double]], and I use it inside a UDF via val newDF = df.withColumn("stuff", udfWithMap) to produce a very large DataFrame. I have 128 GB of RAM available, and after persisting the DataFrame with DISK_ONLY and running df.show I still have roughly 100 GB free. However, when I try df.write.parquet, Spark's memory usage spikes and it runs out of memory. I also tried broadcasting myMap (see the sketch after the code sample below), but that did not seem to have any effect on memory either. What is going wrong?
Here is a sample of my code:
scala> type LookupMapSeq = (String, Seq[Double])
scala> val myMap = sc.objectFile[LookupMapSeq]("file:///data/dir/myMap").collectAsMap()
/* myMap.size is about 150,000 and each Seq[Double] is of size 200 */
scala> import org.apache.spark.sql.functions
scala> val combineudf = functions.udf[Seq[Double], Seq[String]] { v1 =>
val wordVec = v1.map(y => myMap.getOrElse(y, Seq.fill(200)(0.0)))
wordVec.foldLeft(Seq.fill(200)(0.0)) { case (acc, list) =>
acc.zipWithIndex.map { case (value, i) => value + list(i) }
}
}
scala> import org.apache.spark.storage.StorageLevel
scala> val df6 = df3.withColumn("sum", combineudf(df3("filtered"))).persist(StorageLevel.DISK_ONLY)
scala> df6.show
+--------+--------------------+--------------------+--------------------+
| pmid| filtered| TFIDFvec| sum|
+--------+--------------------+--------------------+--------------------+
|25393341|[retreatment, rec...|[0.0, 26.21009534...|[4.34963607663623...|
|25394466|[lactate, dehydro...|[21.3762879413052...|[-17.550128685500...|
|25394717|[aim, study, inve...|[3.11641169932197...|[-54.981726214632...|
scala> df6.write.parquet("file:///data/dir/df6")
/* This results in an out-of-memory error */
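For reference, the broadcast attempt mentioned above looked roughly like the sketch below; the names bcastMap, combineudfB, and df6b are my own, and the UDF body is otherwise identical to combineudf.
/* Wrap myMap in a broadcast variable and read it through .value inside the UDF,
   so each task captures only the broadcast handle rather than a serialized copy of the map */
scala> val bcastMap = sc.broadcast(myMap)
scala> val combineudfB = functions.udf[Seq[Double], Seq[String]] { v1 =>
val wordVec = v1.map(y => bcastMap.value.getOrElse(y, Seq.fill(200)(0.0)))
wordVec.foldLeft(Seq.fill(200)(0.0)) { case (acc, list) =>
acc.zipWithIndex.map { case (value, i) => value + list(i) }
}
}
scala> val df6b = df3.withColumn("sum", combineudfB(df3("filtered"))).persist(StorageLevel.DISK_ONLY)
scala> df6b.write.parquet("file:///data/dir/df6b")
/* Still fails: broadcasting did not noticeably change memory usage */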