We use Spark at work for a number of batch jobs, but now that we are loading larger data sets, Spark is throwing java.lang.OutOfMemoryError. We run with YARN as the resource manager, but in client mode.
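For context, the session is created roughly like the sketch below; the app name and memory values are placeholders, not our real settings. (In yarn-client mode the driver memory in particular has to be given at launch time, e.g. via spark-submit --driver-memory, because the driver JVM is already running by the time this code executes.)

import org.apache.spark.sql.SparkSession

// Illustrative sketch only; app name and sizes are placeholders, not our production values.
val spark: SparkSession = SparkSession.builder()
  .appName("batch-join-job")
  .config("spark.executor.memory", "8g")
  .config("spark.executor.instances", "8")
  .getOrCreate()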
The driver first makes many network queries to other services to fetch the data, then writes it as snappy-compressed Parquet (using Spark) into many temporary sub-directories of the main batch directory. In total we are talking about 100,000,000 case classes with roughly 40 fields each, written with Spark as case class -> DataFrame -> Parquet. This part works fine.
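A minimal sketch of that write path, assuming a trimmed-down stand-in for our real ~40-field case class (the names and paths here are made up):

import org.apache.spark.sql.{SaveMode, SparkSession}

// Hypothetical, cut-down stand-in for the real case class
case class PersonRecord(id: Long, name: String, birthDate: String)

def writeBatch(spark: SparkSession, records: Seq[PersonRecord], outDir: String): Unit = {
  import spark.implicits._
  records.toDS()                        // case classes -> Dataset/DataFrame
    .write
    .mode(SaveMode.Overwrite)
    .option("compression", "snappy")    // snappy-compressed Parquet
    .parquet(outDir)                    // one temporary sub-directory per batch
}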
Once that work is done we have two sets of data: people and animals. We need to join the animals onto the people they belong to, but first we clean up a bunch of animal date columns to make sure they look right (birth dates, vaccination dates, etc.).
The amount of data in each animal or people sub-directory is not large. It is actually quite small: roughly 100 KB per directory and around 800 directories. Again, it is snappy-compressed Parquet.
Previously, the final size of the joined directory was about 130 megabytes. I believe our new data is only about twice that, so I don't understand how we are running into memory problems during this process.
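For a rough sense of scale: 800 directories at ~100 KB each is only about 80 MB of compressed input, and the previous joined output was ~130 MB, so even at roughly double the data we would expect something on the order of 250-300 MB on disk, which makes the OOM all the more puzzling to us.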
I would greatly appreciate any wisdom anyone can offer.
Here are some of the methods used in the join process:
def fromHdfs(path: String): DataFrame = {
  // Construct Seq of sub-directories with data...
  val dataframes = dirs.map(dir =>
    spark.read.format("parquet").option("header", "true").load(s"$root/$dir"))
  // Concatenate each DataFrame into one resulting DataFrame
  dataframes.reduce(_ union _)
}
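(The sub-directory listing is elided above; purely for illustration, it could be built with the Hadoop FileSystem API along these lines, though this is not our actual listing code.)

import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.SparkSession

// Illustration only; not the listing code elided above.
def listSubDirs(spark: SparkSession, root: String): Seq[String] = {
  val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
  fs.listStatus(new Path(root))
    .filter(_.isDirectory)
    .map(_.getPath.getName)
    .toSeq
}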
private def convertDates(df: DataFrame, dateCols: Seq[String]): DataFrame = {
  // Clean up dates from a pre-determined list of 'dateCol' strings
  df.columns.intersect(dateCols).foldLeft(df)((newDf, col) =>
    newDf
      .withColumn(col, unix_timestamp(df(col).cast("string"), "yyyyMMdd")
        .cast("timestamp")
        .cast("date")
      ))
}
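For example, a column holding strings like "20180614" is parsed with the yyyyMMdd pattern and ends up as a proper DateType column. A tiny usage sketch with made-up column names:

// Hypothetical column names, just to show the call
val dateCols = Seq("birthDate", "vaccinationDate")
val cleanedAnimals = convertDates(animals, dateCols)   // "20180614" -> 2018-06-14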
// Join the people and animals dataframes on 'id'
def peopleWithAnimals(people: DataFrame, animals: DataFrame)(implicit spark: SparkSession): DataFrame = {
  // The only collect in here, just to get the column names to foldLeft over
  val cols = animals.select("animal").distinct.select("animal").rdd.map(r => r(0)).collect().map(_.toString)
  val animalsReshaped = cols.foldLeft(animals) { (newDf, colName) =>
    newDf.withColumn(colName, when($"animal" === colName, animals("value")).otherwise(null))
  }
  val peopleNoDups = people.dropDuplicates()
  val animalsNoDups = animalsReshaped.dropDuplicates()
  convertDates(peopleNoDups.join(animalsNoDups, "id"), dateCols)
}
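To make the reshape concrete: for each distinct animal name the fold adds a column of that name, filled with value on the rows where animal matches and null everywhere else; the result is then de-duplicated and joined to people on id. A toy invocation with made-up rows (the real column sets are much wider):

// Hypothetical toy data, purely to illustrate the shape of the inputs
import spark.implicits._
val people  = Seq((1L, "Alice")).toDF("id", "name")
val animals = Seq((1L, "dog", "fido"), (1L, "cat", "tom")).toDF("id", "animal", "value")

// Adds "dog" and "cat" columns to the animal rows, then joins on id,
// so Alice comes back with one joined row per animal here.
val joined = peopleWithAnimals(people, animals)(spark)
joined.show()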
The above methods are used in the final method, which in its entirety looks like this:
def writeJoined(....) = {
  // Read all of the animal and people data sub-directories into dataframes
  val animals = Read.animalsFromHdfs(sourcePath)
  val people = Read.ownersFromHdfs(sourcePath)
  // Join the dataframes on id
  val joined = Join.peopleWithAnimals(people, animals)
  // Write the result to HDFS.
  joined.write.option("header", "true").parquet(destinationPath)
}
Our application now gets as far as creating the temporary directory that the joined data will be written to, but runs out of memory somewhere after that.
18/06/14 20:37:39 INFO scheduler.TaskSetManager: Finished task 1196.0 in stage 1320.0 (TID 61383) in 1388 ms on machine-name (executor 8) (1242/24967)
JVMDUMP039I Processing dump event "systhrow", detail "java/lang/OutOfMemoryError" at 2018/06/14 20:38:39 - please wait.
JVMDUMP032I JVM requested System dump using '/appfolder/core.20180614.203839.27070.0001.dmp' in response to an event
and
java.lang.OutOfMemoryErrorException in thread "Spark Context Cleaner" Exception in thread "DataStreamer for file /spark2-history/application_1527888960577_0084.inprogress block BP-864704807-49.70.7.82-1525204078667:blk_1073896985_156181" Exception in thread "SparkListenerBus" Exception in thread "dag-scheduler-event-loop" java.lang.OutOfMemoryError: Exception in thread "RequestSenderProxy-ts" Java heap space
java.lang.OutOfMemoryError: Exception in thread "task-result-getter-0"
Exception in thread "task-result-getter-3" Exception in thread "heartbeat-receiver-event-loop-thread"
Exception in thread "LeaseRenewer:valrs_dev_usr@companyhost:companyport" java/lang/OutOfMemoryError: Java heap space