Is Spark cache() in this case causing a collect() on the driver?

Time: 2018-05-22 23:49:44

Tags: apache-spark

I need some help interpreting a Spark error log. My understanding is that caching should not trigger all of the data being sent to the driver. I have an abbreviated stack trace that looks like this:

Exception in thread "main" org.apache.spark.SparkException: Exception thrown in awaitResult: 
  at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:205)
  at org.apache.spark.sql.execution.exchange.BroadcastExchangeExec.doExecuteBroadcast(BroadcastExchangeExec.scala:136)
   ...
  at org.apache.spark.sql.Dataset.persist(Dataset.scala:2902)
  at org.apache.spark.sql.Dataset.cache(Dataset.scala:2912)
   ...
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Total size of serialized results of 16 tasks (1076.4 MB) is bigger than spark.driver.maxResultSize (1024.0 MB)
  at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1750)

   ...
  at org.apache.spark.rdd.RDD.collect(RDD.scala:938)
  at org.apache.spark.sql.execution.SparkPlan.executeCollectIterator(SparkPlan.scala:304)
  at org.apache.spark.sql.execution.exchange.BroadcastExchangeExec$$anonfun$relationFuture$1$$anonfun$apply$1.apply(BroadcastExchangeExec.scala:76)
  at org.apache.spark.sql.execution.exchange.BroadcastExchangeExec$$anonfun$relationFuture$1$$anonfun$apply$1.apply(BroadcastExchangeExec.scala:73)
  at org.apache.spark.sql.execution.SQLExecution$.withExecutionId(SQLExecution.scala:97)
  at org.apache.spark.sql.execution.exchange.BroadcastExchangeExec$$anonfun$relationFuture$1.apply(BroadcastExchangeExec.scala:72)
  at org.apache.spark.sql.execution.exchange.BroadcastExchangeExec$$anonfun$relationFuture$1.apply(BroadcastExchangeExec.scala:72)
   ...
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)

It looks like the cache kicks off a broadcast exchange, which eventually calls collect on the underlying RDD, and that collect triggers the "Job aborted due to stage failure: Total size of serialized results of 16 tasks (1076.4 MB) is bigger than spark.driver.maxResultSize (1024.0 MB)" error.

Why I am seeing this error at all is confusing to me - everything I have read about .cache says it persists the data on the nodes without having to move it all to the driver.
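(As a side note added for reference, not part of the original report: one way to see where the broadcast comes from is to print the physical plan of the frame before caching it; a BroadcastExchange / BroadcastHashJoin node there means Spark plans to collect one side of the join to the driver and broadcast it. A minimal sketch, reusing the generateDataFrame method shown further down:)

  // Sketch only: inspect the physical plan of the frame that gets cached later.
  val df = generateDataFrame()

  // A BroadcastExchange / BroadcastHashJoin node in this output indicates that
  // Spark intends to collect one side of the join to the driver and broadcast it.
  df.explain()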

The code looks like this. We have a job that aggregates a series of events by visit_id. It reads the events, projects a handful of fields, and then rolls them up like this:

  def aggregateRows: sql.DataFrame = {
    projected
      .orderBy("headerTimestamp")
      .groupBy(groupBys.head, groupBys.tail: _*)
      .agg(
        first("accountState", ignoreNulls = true).alias("accountState"),
        first("userId", ignoreNulls = true).alias("userId"),
        first("subaffiliateId", ignoreNulls = true).alias("subaffiliateId"),
        first("clientPlatform", ignoreNulls = true).alias("clientPlatform"),
        first("localTimestamp", ignoreNulls = true).alias("localTimestamp"),
        first("page", ignoreNulls = true).alias("firstPage")
      )
  }

(As an aside, I think this code is incorrect w/r/t getting the first row, since groupBy apparently does not preserve the ordering, but this is the code I was running when I got this error. One alternative is sketched below.)
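(For reference, a sketch of one commonly used alternative, not the code that was actually run: rank events within each group by headerTimestamp with a window function and keep the first row. Column and group-by names are reused from the snippet above.)

  import org.apache.spark.sql.expressions.Window
  import org.apache.spark.sql.functions.{col, row_number}

  // Rank rows within each group by headerTimestamp ...
  val w = Window.partitionBy(groupBys.head, groupBys.tail: _*).orderBy("headerTimestamp")

  // ... and keep only the earliest row per group.
  val firstPerVisit = projected
    .withColumn("rn", row_number().over(w))
    .filter(col("rn") === 1)
    .drop("rn")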

We then join the visit rollup to user_id like this (we create a temp view called "visits" using createOrReplaceTempView and Spark SQL):

  SELECT
    u.days_since_last_active,
    u.user_id,
    v.appName as app_name,
    v.clientPlatform as client_platform,
    v.countryCode as country_code,
    v.llChannel as ll_channel,
    v.llSource as ll_source,
    v.referralKey as referral_key,
    v.visitTimestamp as resurrection_time,
    v.subaffiliateId as subaffiliateId,
    v.visitDate as resurrection_date,
    v.accountState as account_state,
    v.ipAddress as ip_address,
    v.localTimestamp as resurrection_local_time,
    v.visitId as visit_id,
    v.firstPage as resurrection_page,
    row_number() OVER (PARTITION BY u.days_since_last_active, u.user_id ORDER BY v.visitTimestamp) as rn
  FROM ubdm u
  LEFT OUTER JOIN visits v ON v.userId = u.user_id
    AND u.date = '$dateStr'
    AND (u.days_since_last_active > 30
      OR (u.days_since_signup > 30 AND u.days_since_last_active IS NULL))

We then call cache on the result and write the dataframe out as both TSV and parquet:

val cached = generateDataFrame().cache()

writeParquet(cached.write, parquetPath)
writeTsv(cached.write, tsvPath)

.write returns a DataFrameWriter. Finally, for parquet for example, we call the following on the DataFrameWriter:
  def writeParquet[A](df: DataFrameWriter[A], outputPath: String, saveMode: SaveMode = SaveMode.Overwrite): Unit = {
    df.mode(saveMode)
      .parquet(outputPath)
  }
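(The writeTsv helper used above is not shown in the original post; a plausible sketch, under the assumption that it writes a tab-delimited CSV with the same save-mode handling, would be:)

  def writeTsv[A](df: DataFrameWriter[A], outputPath: String, saveMode: SaveMode = SaveMode.Overwrite): Unit = {
    // Assumed implementation: tab-separated CSV with a header row.
    df.mode(saveMode)
      .option("sep", "\t")
      .option("header", "true")
      .csv(outputPath)
  }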

2 Answers:

Answer 0 (score: 0)

As far as I can tell, everything is working as expected. No, cache does not trigger a collect.

Remember - Spark has transformations and actions. Transformations are only computed when an action is triggered. collect is an action, and it is what triggered computation of the RDD; cache sits somewhere in between.
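(A minimal sketch of that distinction, assuming a SparkSession named spark; nothing here comes from the question's code:)

  // Nothing is computed yet: map is a lazy transformation.
  val doubled = spark.sparkContext.parallelize(1 to 1000).map(_ * 2)

  // count is an action: only here does the computation actually run on the executors.
  val n = doubled.count()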

You are getting the error because you tried to collect too much data - more than fits on your driver node.

P.S. It would be nice if you could share the code, by the way.

Answer 1 (score: 0)

On Spark 2.3, cache() does trigger collecting the broadcast data on the driver. This is a bug (SPARK-23880) - it was fixed in version 2.4.0.
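(Possible mitigations besides upgrading, not stated in the original answer: raise spark.driver.maxResultSize, or keep the planner from choosing a broadcast join in the first place. A sketch, assuming the SparkSession is built in your own code so these settings take effect at startup:)

  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder()
    // Allow larger serialized results to be collected by the driver ...
    .config("spark.driver.maxResultSize", "2g")
    // ... or prevent the planner from choosing a broadcast join at all, so that
    // no BroadcastExchange (and hence no driver-side collect) is planned.
    .config("spark.sql.autoBroadcastJoinThreshold", "-1")
    .getOrCreate()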

Regarding transformations vs. actions: some Spark transformations involve an additional action under the hood, e.g. sortByKey on an RDD. So dividing all Spark operations into either transformations or actions is a bit of an oversimplification.
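(A small sketch of that point, assuming a SparkContext named sc: sortByKey looks like a lazy transformation, but building its RangePartitioner samples the input, which already runs a job:)

  val pairs = sc.parallelize(Seq(("b", 2), ("a", 1), ("c", 3)))

  // sortByKey needs a RangePartitioner; computing the partition boundaries
  // samples the input, so a Spark job runs here before any explicit action.
  val sorted = pairs.sortByKey()

  // collect is the explicit action that materializes the sorted result.
  sorted.collect()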