We use Spark at work for a number of batch jobs, but now that we are loading larger data sets, Spark is throwing java.lang.OutOfMemoryError. We run with YARN as the resource manager, but in client mode.
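For context, the session is created roughly like the sketch below; the app name and memory values are placeholders, not our real settings. (In yarn-client mode the driver memory in particular has to be given at launch time, e.g. via spark-submit --driver-memory, because the driver JVM is already running by the time this code executes.)

import org.apache.spark.sql.SparkSession

// Illustrative sketch only; app name and sizes are placeholders, not our production values.
val spark: SparkSession = SparkSession.builder()
  .appName("batch-join-job")
  .config("spark.executor.memory", "8g")
  .config("spark.executor.instances", "8")
  .getOrCreate()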
The driver first makes many network queries to other services to fetch the data, then writes it as snappy-compressed Parquet (using Spark) into many temporary sub-directories of the main batch directory. In total we are talking about 100,000,000 case classes with roughly 40 fields each, written with Spark as case class -> DataFrame -> Parquet. This part works fine.
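A minimal sketch of that write path, assuming a trimmed-down stand-in for our real ~40-field case class (the names and paths here are made up):

import org.apache.spark.sql.{SaveMode, SparkSession}

// Hypothetical, cut-down stand-in for the real case class
case class PersonRecord(id: Long, name: String, birthDate: String)

def writeBatch(spark: SparkSession, records: Seq[PersonRecord], outDir: String): Unit = {
  import spark.implicits._
  records.toDS()                        // case classes -> Dataset/DataFrame
    .write
    .mode(SaveMode.Overwrite)
    .option("compression", "snappy")    // snappy-compressed Parquet
    .parquet(outDir)                    // one temporary sub-directory per batch
}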
Once that work is done we have two sets of data: people and animals. We need to join the animals onto the people they belong to, but first we clean up a bunch of animal date columns to make sure they look right (birth dates, vaccination dates, etc.).
The amount of data in each animal or people sub-directory is not large. It is actually quite small: roughly 100 KB per directory and around 800 directories. Again, it is snappy-compressed Parquet.
Previously, the final size of the joined directory was about 130 megabytes. I believe our new data is only about twice that, so I don't understand how we are running into memory problems during this process.
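For a rough sense of scale: 800 directories at ~100 KB each is only about 80 MB of compressed input, and the previous joined output was ~130 MB, so even at roughly double the data we would expect something on the order of 250-300 MB on disk, which makes the OOM all the more puzzling to us.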
I would greatly appreciate any wisdom anyone can offer.
Here are some of the methods used in the join process:
def fromHdfs(path: String): DataFrame = {
  // Construct Seq of sub-directories with data...
  val dataframes = dirs.map(dir =>
    spark.read.format("parquet").option("header", "true").load(s"$root/$dir"))
  // Concatenate each DataFrame into one resulting DataFrame
  dataframes.reduce(_ union _)
}
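(The sub-directory listing is elided above; purely for illustration, it could be built with the Hadoop FileSystem API along these lines, though this is not our actual listing code.)

import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.SparkSession

// Illustration only; not the listing code elided above.
def listSubDirs(spark: SparkSession, root: String): Seq[String] = {
  val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
  fs.listStatus(new Path(root))
    .filter(_.isDirectory)
    .map(_.getPath.getName)
    .toSeq
}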
private def convertDates(df: DataFrame, dateCols: Seq[String]): DataFrame = {
  // Clean up dates from a pre-determined list of 'dateCol' strings
  df.columns.intersect(dateCols).foldLeft(df)((newDf, col) =>
    newDf
      .withColumn(col, unix_timestamp(df(col).cast("string"), "yyyyMMdd")
        .cast("timestamp")
        .cast("date")
      ))
}
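For example, a column holding strings like "20180614" is parsed with the yyyyMMdd pattern and ends up as a proper DateType column. A tiny usage sketch with made-up column names:

// Hypothetical column names, just to show the call
val dateCols = Seq("birthDate", "vaccinationDate")
val cleanedAnimals = convertDates(animals, dateCols)   // "20180614" -> 2018-06-14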
// Join the people and animals dataframes on 'id'
def peopleWithAnimals(people: DataFrame, animals: DataFrame)(implicit spark: SparkSession): DataFrame = {
  // The only collect in here, just to get the column names to foldLeft over
  val cols = animals.select("animal").distinct.select("animal").rdd.map(r => r(0)).collect().map(_.toString)
  val animalsReshaped = cols.foldLeft(animals) { (newDf, colName) =>
    newDf.withColumn(colName, when($"animal" === colName, animals("value")).otherwise(null))
  }
  val peopleNoDups = people.dropDuplicates()
  val animalsNoDups = animalsReshaped.dropDuplicates()
  convertDates(peopleNoDups.join(animalsNoDups, "id"), dateCols)
}
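To make the reshape concrete: for each distinct animal name the fold adds a column of that name, filled with value on the rows where animal matches and null everywhere else; the result is then de-duplicated and joined to people on id. A toy invocation with made-up rows (the real column sets are much wider):

// Hypothetical toy data, purely to illustrate the shape of the inputs
import spark.implicits._
val people  = Seq((1L, "Alice")).toDF("id", "name")
val animals = Seq((1L, "dog", "fido"), (1L, "cat", "tom")).toDF("id", "animal", "value")

// Adds "dog" and "cat" columns to the animal rows, then joins on id,
// so Alice comes back with one joined row per animal here.
val joined = peopleWithAnimals(people, animals)(spark)
joined.show()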
The above methods are used in the final method, which in its entirety looks like this:
def writeJoined(....) = {
  // Read all of the animal and people data sub-directories into dataframes
  val animals = Read.animalsFromHdfs(sourcePath)
  val people = Read.ownersFromHdfs(sourcePath)
  // Join the dataframes on id
  val joined = Join.peopleWithAnimals(people, animals)
  // Write the result to HDFS.
  joined.write.option("header", "true").parquet(destinationPath)
}
Our application now gets as far as creating the temporary directory that the joined data will be written to, but runs out of memory somewhere after that.
18/06/14 20:37:39 INFO scheduler.TaskSetManager: Finished task 1196.0 in stage 1320.0 (TID 61383) in 1388 ms on machine-name (executor 8) (1242/24967)
JVMDUMP039I Processing dump event "systhrow", detail "java/lang/OutOfMemoryError" at 2018/06/14 20:38:39 - please wait.
JVMDUMP032I JVM requested System dump using '/appfolder/core.20180614.203839.27070.0001.dmp' in response to an event
and
java.lang.OutOfMemoryErrorException in thread "Spark Context Cleaner" Exception in thread "DataStreamer for file /spark2-history/application_1527888960577_0084.inprogress block BP-864704807-49.70.7.82-1525204078667:blk_1073896985_156181" Exception in thread "SparkListenerBus" Exception in thread "dag-scheduler-event-loop" java.lang.OutOfMemoryError: Exception in thread "RequestSenderProxy-ts" Java heap space
java.lang.OutOfMemoryError: Exception in thread "task-result-getter-0"
Exception in thread "task-result-getter-3" Exception in thread "heartbeat-receiver-event-loop-thread"
Exception in thread "LeaseRenewer:valrs_dev_usr@companyhost:companyport" java/lang/OutOfMemoryError: Java heap space