I am trying to join two DataFrames, one with about 10 million records and the other roughly a third of that size. Since the smaller DataFrame fits comfortably in executor memory, I perform a broadcast join and then write out the result:
val df = spark.read.parquet("/plablo/data/tweets10M")
  .select("id", "content", "lat", "lon", "date")

val fullResult = FilterAndClean.performFilter(df, spark)
  .select("id", "final_tokens")
  .filter(size($"final_tokens") > 1)

val fullDFWithClean = {
  df.join(broadcast(fullResult), "id")
}

fullDFWithClean
  .write
  .partitionBy("date")
  .mode(saveMode = SaveMode.Overwrite)
  .parquet("/plablo/data/cleanTokensSpanish")
After a while, I get this error:
org.apache.spark.SparkException: Exception thrown in awaitResult:
at org.apache.spark.util.ThreadUtils$.awaitResultInForkJoinSafely(ThreadUtils.scala:215)
at org.apache.spark.sql.execution.exchange.BroadcastExchangeExec.doExecuteBroadcast(BroadcastExchangeExec.scala:125)
at org.apache.spark.sql.execution.InputAdapter.doExecuteBroadcast(WholeStageCodegenExec.scala:231)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeBroadcast$1.apply(SparkPlan.scala:124)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeBroadcast$1.apply(SparkPlan.scala:124)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:135)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:132)
at org.apache.spark.sql.execution.SparkPlan.executeBroadcast(SparkPlan.scala:123)
at org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.prepareBroadcast(BroadcastHashJoinExec.scala:98)
at org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.codegenInner(BroadcastHashJoinExec.scala:197)
at org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.doConsume(BroadcastHashJoinExec.scala:82)
at org.apache.spark.sql.execution.CodegenSupport$class.consume(WholeStageCodegenExec.scala:153)
at org.apache.spark.sql.execution.ProjectExec.consume(basicPhysicalOperators.scala:36)
at org.apache.spark.sql.execution.ProjectExec.doConsume(basicPhysicalOperators.scala:68)
at org.apache.spark.sql.execution.CodegenSupport$class.consume(WholeStageCodegenExec.scala:153)
at org.apache.spark.sql.execution.FilterExec.consume(basicPhysicalOperators.scala:88)
at org.apache.spark.sql.execution.FilterExec.doConsume(basicPhysicalOperators.scala:209)
at org.apache.spark.sql.execution.CodegenSupport$class.consume(WholeStageCodegenExec.scala:153)
at org.apache.spark.sql.execution.FileSourceScanExec.consume(DataSourceScanExec.scala:141)
at org.apache.spark.sql.execution.FileSourceScanExec.doProduceVectorized(DataSourceScanExec.scala:392)
at org.apache.spark.sql.execution.FileSourceScanExec.doProduce(DataSourceScanExec.scala:315)
.....
There is this question about the same problem. The comments there suggest that increasing spark.sql.broadcastTimeout can fix it, but even after setting a large value (5000 seconds) I still get the same error (much later, of course).
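For reference, a sketch of the two places this timeout can be set (the 5000-second value simply mirrors what was tried above; the app name is hypothetical):

```scala
import org.apache.spark.sql.SparkSession

// Set when the session is built; the value is in seconds.
val spark = SparkSession.builder()
  .appName("tweets-clean") // hypothetical app name
  .config("spark.sql.broadcastTimeout", "5000")
  .getOrCreate()

// Or on an already-running session, before the join is planned:
spark.conf.set("spark.sql.broadcastTimeout", "5000")
```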
The original data is partitioned by the date column. The function that returns fullResult performs a series of narrow transformations and filters the data, so I assume partitioning is preserved.
The physical plan confirms that Spark will perform a BroadcastHashJoin:
*Project [id#11, content#8, lat#5, lon#6, date#150, final_tokens#339]
+- *BroadcastHashJoin [id#11], [id#363], Inner, BuildRight
   :- *Project [id#11, content#8, lat#5, lon#6, date#150]
   :  +- *Filter isnotnull(id#11)
   :     +- *FileScan parquet [lat#5,lon#6,content#8,id#11,date#150] Batched: true, Format: Parquet, Location: InMemoryFileIndex[hdfs://geoint1.lan:8020/plablo/data/tweets10M], PartitionCount: 182, PartitionFilters: [], PushedFilters: [IsNotNull(id)], ReadSchema: struct<lat:double,lon:double,content:string,id:int>
   +- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, true] as bigint)))
      +- *Project [id#363, UDF(UDF(UDF(content#360))) AS final_tokens#339]
         +- *Filter (((UDF(UDF(content#360)) = es) && (size(UDF(UDF(UDF(content#360)))) > 1)) && isnotnull(id#363))
            +- *FileScan parquet [content#360,id#363,date#502] Batched: true, Format: Parquet, Location: InMemoryFileIndex[hdfs://geoint1.lan:8020/plablo/data/tweets10M], PartitionCount: 182, PartitionFilters: [], PushedFilters: [IsNotNull(id)], ReadSchema: struct<content:string,id:int>
I believe that, given the size of my data, this operation should be relatively fast (4 executors with 5 cores each and 4g of RAM, running on YARN in cluster mode).
Any help is appreciated.
Answer 0 (score: 2)
The first question in a case like this is: how big is the DataFrame you are trying to broadcast? It is worth estimating its size (see also this SO answer and this).
Note that Spark's default spark.sql.autoBroadcastJoinThreshold is only 10MB, so you really should not be broadcasting very large datasets.
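As a rough sketch of how one might check the build side against that threshold before forcing the hint (this assumes Spark 2.3+, where the optimizer's estimate is exposed as `stats`; in Spark 2.2 it is `stats(conf)`, and in either case sizeInBytes is an estimate, not an exact measurement):

```scala
// Optimizer's size estimate, in bytes, for the DataFrame being broadcast.
val estimatedBytes = fullResult.queryExecution.optimizedPlan.stats.sizeInBytes
println(s"Estimated broadcast size: $estimatedBytes bytes")

// If the estimate is reasonable, raising the auto-broadcast threshold (in bytes)
// lets Spark broadcast on its own, instead of being forced by the hint.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", (100L * 1024 * 1024).toString)
```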
Your explicit use of broadcast takes precedence, and may be forcing Spark to do something it would otherwise choose not to do. A good rule is to force aggressive optimizations only when the default behavior is unacceptable, because aggressive optimizations often create edge conditions of various kinds, like the one you are experiencing.
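Following that rule, a minimal sketch of letting the planner choose the strategy instead of forcing the hint (variable names taken from the question's code):

```scala
// Drop the explicit broadcast() hint and let Spark pick the join strategy
// based on its own size estimates:
val fullDFWithClean = df.join(fullResult, "id")

// While debugging, broadcasting can be ruled out entirely by disabling
// the auto-broadcast threshold; the plan should then show a SortMergeJoin:
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")
fullDFWithClean.explain()
```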
Answer 1 (score: 0)
This can also fail without an increase in spark.task.maxDirectResultSize. The default is 1 megabyte (1m). Try spark.task.maxDirectResultSize=10g.
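For reference, a configuration sketch of how that property could be passed at submit time (only this flag is shown; the application jar and any other flags are omitted):

```shell
spark-submit \
  --conf spark.task.maxDirectResultSize=10g \
  ...
```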