How does Spark determine which dataset to broadcast in an automatic broadcast join?

Time: 2019-12-04 10:13:41

Tags: performance apache-spark join spark-streaming broadcast

I am new to Spark and am using it to join a streaming dataset (S, from Event Hubs) with a batch dataset (B, parquet files in Azure Blob Storage). Before performing the join I filter B and persist it with B.persist(). I am not giving an explicit hint to broadcast B.

a.join(
    b,
    conditions
)
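
Here is a minimal sketch of how the pipeline is wired up, for context; the blob path, filter, Event Hubs connector options, and join key below are placeholders rather than my real values:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("stream-batch-join").getOrCreate()
import spark.implicits._

// Batch side B: parquet in Azure Blob Storage, filtered and persisted before the join.
val b = spark.read
  .parquet("wasbs://container@account.blob.core.windows.net/path") // placeholder path
  .filter($"someColumn" === "someValue")                           // placeholder filter
  .persist()

// Streaming side S (called `a` in the snippets): events read from Azure Event Hubs.
val eventHubsOptions: Map[String, String] = ???                    // placeholder connector config
val a = spark.readStream
  .format("eventhubs")                                             // azure-eventhubs-spark connector
  .options(eventHubsOptions)
  .load()

// Stream-batch join with no explicit broadcast hint; Spark decides whether to broadcast.
val conditions = a("key") === b("key")                             // placeholder join condition
val joined = a.join(b, conditions)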

My automatic broadcast join threshold (spark.sql.autoBroadcastJoinThreshold) is 26 MB, and driver memory is 1 GB.

B is 26 KB in size.

S arrives continuously from Event Hubs.
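
For completeness, those settings map onto configuration roughly as in this illustrative sketch; note that driver memory has to be given at launch (e.g. spark-submit --driver-memory 1g), since it cannot be changed after the driver JVM has started:

import org.apache.spark.sql.SparkSession

// Illustrative only: driver memory must be supplied when the application is launched;
// setting it on an already-running driver has no effect.
val spark = SparkSession.builder
  .appName("stream-batch-join")
  .config("spark.driver.memory", "1g")                                // 1 GB driver memory
  .config("spark.sql.autoBroadcastJoinThreshold", 26L * 1024 * 1024)  // ~26 MB broadcast threshold
  .getOrCreate()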

Given this, I expected Spark to either:

  • Use a broadcast join by broadcasting B, or
  • Use some join other than a broadcast join.

Instead I get an out-of-memory error while it tries to broadcast. As far as I understand, it broadcasts on every streaming micro-batch, and once driver memory fills up it throws this error:

Exception in thread "stream execution thread for _ [id = _ , runId = _]" java.lang.OutOfMemoryError: Not enough memory to build and broadcast the table to all worker nodes. As a workaround, you can either disable broadcast by setting spark.sql.autoBroadcastJoinThreshold to -1 or increase the spark driver memory by setting spark.driver.memory to a higher value
    at org.apache.spark.sql.execution.exchange.BroadcastExchangeExec$$anonfun$relationFuture$1$$anonfun$apply$1.apply(BroadcastExchangeExec.scala:122)
    at org.apache.spark.sql.execution.exchange.BroadcastExchangeExec$$anonfun$relationFuture$1$$anonfun$apply$1.apply(BroadcastExchangeExec.scala:76)
    at org.apache.spark.sql.execution.SQLExecution$$anonfun$withExecutionId$1.apply(SQLExecution.scala:101)
    at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
    at org.apache.spark.sql.execution.SQLExecution$.withExecutionId(SQLExecution.scala:98)
    at org.apache.spark.sql.execution.exchange.BroadcastExchangeExec$$anonfun$relationFuture$1.apply(BroadcastExchangeExec.scala:75)
    at org.apache.spark.sql.execution.exchange.BroadcastExchangeExec$$anonfun$relationFuture$1.apply(BroadcastExchangeExec.scala:75)
    at scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
    at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

How does Spark decide which side of a join to broadcast?
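
One thing I can do is inspect the physical plan to see what Spark actually chose: the broadcast side appears under a BroadcastExchange feeding a BroadcastHashJoin. A small sketch (the size-estimate line assumes Spark 2.3+):

// Print the physical plan of the join; a BroadcastExchange node marks the broadcast side
// feeding a BroadcastHashJoin.
a.join(b, conditions).explain()

// For a running structured-streaming query, the plan of the most recent micro-batch:
// query.explain()

// Spark compares its size estimate of each side against spark.sql.autoBroadcastJoinThreshold;
// the estimate used for b can be inspected like this (Spark 2.3+):
println(b.queryExecution.optimizedPlan.stats.sizeInBytes)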

I can disable broadcasting by setting

spark.sql.autoBroadcastJoinThreshold = -1 
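or, equivalently, at runtime from the session (sketch):

// Disabling auto-broadcast makes Spark plan a sort-merge (or shuffled hash) join
// instead of broadcasting either side.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")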

I can force B to be broadcast with

import org.apache.spark.sql.functions.broadcast

a.join(
    broadcast(b),
    conditions
)

But that will always broadcast B. Can I give a hint along the lines of: if Spark needs a broadcast, broadcast B; otherwise do not broadcast at all (and never broadcast A)?

0 Answers:

There are no answers yet.