I am new to Spark and am using it to join a streaming dataset S (coming from Event Hub) with a batch dataset B (Parquet files in Azure Blob Storage). Before performing the join I filter B and persist it with B.persist(). I do not give any explicit hint to broadcast B.
a.join(
  b,
  conditions
)
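For context, here is a rough sketch of the setup; the blob path, Event Hub options, filter, and join key below are placeholders rather than my real values:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("stream-batch-join").getOrCreate()

// Batch side B: parquet in Azure Blob Storage, filtered and persisted before the join
// (the path and the filter are placeholders)
val b = spark.read
  .parquet("wasbs://container@account.blob.core.windows.net/path/to/parquet")
  .filter("some_column = 'some_value'")
  .persist()

// Streaming side S: events read from Event Hub
// (eventHubsOptions is a placeholder for the real connection settings)
val eventHubsOptions: Map[String, String] = Map("eventhubs.connectionString" -> "...")
val a = spark.readStream
  .format("eventhubs")
  .options(eventHubsOptions)
  .load()

// The join; "key" stands in for my real join condition
val joined = a.join(b, a("key") === b("key"))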
My autoBroadcastJoinThreshold is 26 MB. The driver memory is 1 GB.
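(These are set roughly as follows; the snippet is only to show which settings I mean, not my exact code:)

// Illustrative: the broadcast threshold is a runtime SQL conf, driver memory is fixed at submit time
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", (26L * 1024 * 1024).toString)  // 26 MB in bytes
// spark-submit ... --driver-memory 1g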
The size of B is 26 KB.
S arrives continuously from Event Hub.
Keeping this in mind, I expect Spark to:
Hit an out-of-memory error while trying to broadcast. From what I understand, it broadcasts on every streaming micro-batch, and once the driver memory is full it will throw this error:
Exception in thread "stream execution thread for _ [id = _ , runId = _]" java.lang.OutOfMemoryError: Not enough memory to build and broadcast the table to all worker nodes. As a workaround, you can either disable broadcast by setting spark.sql.autoBroadcastJoinThreshold to -1 or increase the spark driver memory by setting spark.driver.memory to a higher value
at org.apache.spark.sql.execution.exchange.BroadcastExchangeExec$$anonfun$relationFuture$1$$anonfun$apply$1.apply(BroadcastExchangeExec.scala:122)
at org.apache.spark.sql.execution.exchange.BroadcastExchangeExec$$anonfun$relationFuture$1$$anonfun$apply$1.apply(BroadcastExchangeExec.scala:76)
at org.apache.spark.sql.execution.SQLExecution$$anonfun$withExecutionId$1.apply(SQLExecution.scala:101)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
at org.apache.spark.sql.execution.SQLExecution$.withExecutionId(SQLExecution.scala:98)
at org.apache.spark.sql.execution.exchange.BroadcastExchangeExec$$anonfun$relationFuture$1.apply(BroadcastExchangeExec.scala:75)
at org.apache.spark.sql.execution.exchange.BroadcastExchangeExec$$anonfun$relationFuture$1.apply(BroadcastExchangeExec.scala:75)
at scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
How does Spark decide which side of the join to broadcast?
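(To illustrate the question: I assume the chosen side shows up as the build side of the BroadcastHashJoin in the physical plan, e.g. via something like this once the query is running; "query" below is just a stand-in for the started StreamingQuery.)

// Print the physical plan of the most recent micro-batch; the build side
// (BuildLeft / BuildRight) of the BroadcastHashJoin shows which side was broadcast.
query.explain()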
I can disable broadcasting by setting
spark.sql.autoBroadcastJoinThreshold = -1
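(i.e. something along these lines:)

// Turn off automatic broadcast joins for the session
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")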
Or I can force B to be broadcast with:

import org.apache.spark.sql.functions.broadcast

a.join(
  broadcast(b),
  conditions
)
But this will always broadcast B. Can I instead hint something like: broadcast B only if Spark needs a broadcast at all, otherwise broadcast nothing (and never broadcast A)?
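To make the intent concrete, something like the sketch below is what I am after; the size estimate and the threshold here are just things I made up to illustrate it, not an API I know to be the right way:

import org.apache.spark.sql.functions.broadcast

// Purely illustrative: hint the broadcast of B only when B looks small enough,
// otherwise fall back to a plain join (and never broadcast A).
// stats.sizeInBytes is Spark's internal size estimate, used here only as a stand-in.
val thresholdBytes = BigInt(26L * 1024 * 1024)
val bEstimatedSize = b.queryExecution.optimizedPlan.stats.sizeInBytes

val joined =
  if (bEstimatedSize <= thresholdBytes) a.join(broadcast(b), conditions)
  else a.join(b, conditions)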