I am new to Spark and am using it to join a streaming dataset S (coming from Event Hub) with a batch dataset B (Parquet files in Azure Blob Storage). Before performing the join I filter B and persist it with B.persist(). I do not give any explicit hint to broadcast B.
a.join(
  b,
  conditions
)
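For context, here is a rough sketch of the setup; the blob path, Event Hub options, filter, and join key below are placeholders rather than my real values:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("stream-batch-join").getOrCreate()

// Batch side B: parquet in Azure Blob Storage, filtered and persisted before the join
// (the path and the filter are placeholders)
val b = spark.read
  .parquet("wasbs://container@account.blob.core.windows.net/path/to/parquet")
  .filter("some_column = 'some_value'")
  .persist()

// Streaming side S: events read from Event Hub
// (eventHubsOptions is a placeholder for the real connection settings)
val eventHubsOptions: Map[String, String] = Map("eventhubs.connectionString" -> "...")
val a = spark.readStream
  .format("eventhubs")
  .options(eventHubsOptions)
  .load()

// The join; "key" stands in for my real join condition
val joined = a.join(b, a("key") === b("key"))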
My autoBroadcastJoinThreshold is 26 MB. The driver memory is 1 GB.
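(These are set roughly as follows; the snippet is only to show which settings I mean, not my exact code:)

// Illustrative: the broadcast threshold is a runtime SQL conf, driver memory is fixed at submit time
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", (26L * 1024 * 1024).toString)  // 26 MB in bytes
// spark-submit ... --driver-memory 1g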
The size of B is 26 KB.
S arrives continuously from Event Hub.
Keeping this in mind, I expect Spark to:
Hit an out-of-memory error while trying to broadcast. From what I understand, it broadcasts on every streaming micro-batch, and once the driver memory is full it will throw this error:
Exception in thread "stream execution thread for _ [id = _ , runId = _]" java.lang.OutOfMemoryError: Not enough memory to build and broadcast the table to all worker nodes. As a workaround, you can either disable broadcast by setting spark.sql.autoBroadcastJoinThreshold to -1 or increase the spark driver memory by setting spark.driver.memory to a higher value
at org.apache.spark.sql.execution.exchange.BroadcastExchangeExec$$anonfun$relationFuture$1$$anonfun$apply$1.apply(BroadcastExchangeExec.scala:122)
at org.apache.spark.sql.execution.exchange.BroadcastExchangeExec$$anonfun$relationFuture$1$$anonfun$apply$1.apply(BroadcastExchangeExec.scala:76)
at org.apache.spark.sql.execution.SQLExecution$$anonfun$withExecutionId$1.apply(SQLExecution.scala:101)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
at org.apache.spark.sql.execution.SQLExecution$.withExecutionId(SQLExecution.scala:98)
at org.apache.spark.sql.execution.exchange.BroadcastExchangeExec$$anonfun$relationFuture$1.apply(BroadcastExchangeExec.scala:75)
at org.apache.spark.sql.execution.exchange.BroadcastExchangeExec$$anonfun$relationFuture$1.apply(BroadcastExchangeExec.scala:75)
at scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
How does Spark decide which side of the join to broadcast?
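(To illustrate the question: I assume the chosen side shows up as the build side of the BroadcastHashJoin in the physical plan, e.g. via something like this once the query is running; "query" below is just a stand-in for the started StreamingQuery.)

// Print the physical plan of the most recent micro-batch; the build side
// (BuildLeft / BuildRight) of the BroadcastHashJoin shows which side was broadcast.
query.explain()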
I can disable broadcasting by setting
spark.sql.autoBroadcastJoinThreshold = -1
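(i.e. something along these lines:)

// Turn off automatic broadcast joins for the session
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")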
Or I can force B to be broadcast with:

import org.apache.spark.sql.functions.broadcast

a.join(
  broadcast(b),
  conditions
)
But this will always broadcast B. Can I instead hint something like: broadcast B only if Spark needs a broadcast at all, otherwise broadcast nothing (and never broadcast A)?
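To make the intent concrete, something like the sketch below is what I am after; the size estimate and the threshold here are just things I made up to illustrate it, not an API I know to be the right way:

import org.apache.spark.sql.functions.broadcast

// Purely illustrative: hint the broadcast of B only when B looks small enough,
// otherwise fall back to a plain join (and never broadcast A).
// stats.sizeInBytes is Spark's internal size estimate, used here only as a stand-in.
val thresholdBytes = BigInt(26L * 1024 * 1024)
val bEstimatedSize = b.queryExecution.optimizedPlan.stats.sizeInBytes

val joined =
  if (bEstimatedSize <= thresholdBytes) a.join(broadcast(b), conditions)
  else a.join(b, conditions)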