Question

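I know that Spark can run multiple executors on each worker node, and that each executor runs in its own JVM, so I'm wondering whether (and how) Spark optimizes network traffic for broadcast variables. Ideally it would download the data once per worker node and then hand the already-serialized data to the executors on that node. The alternative is that every executor downloads the broadcast data whenever it needs it, so the same data would be downloaded several times on a given node.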

Answer 1

Yes, Spark does optimize broadcasts, using torrent broadcast (TorrentBroadcast). Quoting the source:

* A BitTorrent-like implementation of [[org.apache.spark.broadcast.Broadcast]].
*
* The mechanism is as follows:
*
* The driver divides the serialized object into small chunks and
* stores those chunks in the BlockManager of the driver.
*
* On each executor, the executor first attempts to fetch the object from its BlockManager. If
* it does not exist, it then uses remote fetches to fetch the small chunks from the driver and/or
* other executors if available. Once it gets the chunks, it puts the chunks in its own
* BlockManager, ready for other executors to fetch from.
*
* This prevents the driver from being the bottleneck in sending out multiple copies of the
* broadcast data (one per executor).
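
The mechanism quoted above can be illustrated with a toy model. This is only a sketch in plain Python, not Spark code: `BlockManager`, `driver_publish`, `executor_read`, `simulate` and `CHUNK_SIZE` are hypothetical names made up for this example, and the "prefer another executor, fall back to the driver" policy is a simplifying assumption rather than Spark's actual source-selection logic.

```python
import math

CHUNK_SIZE = 4  # bytes per chunk; tiny on purpose, purely illustrative


class BlockManager:
    """Toy stand-in for a per-process chunk store (hypothetical, not Spark's class)."""

    def __init__(self, name):
        self.name = name
        self.chunks = {}  # chunk index -> bytes


def driver_publish(driver, payload):
    """Split the serialized payload into chunks and store them on the driver."""
    num_chunks = math.ceil(len(payload) / CHUNK_SIZE)
    for i in range(num_chunks):
        driver.chunks[i] = payload[i * CHUNK_SIZE:(i + 1) * CHUNK_SIZE]
    return num_chunks


def executor_read(executor, peers, driver, num_chunks, served):
    """Read the broadcast value on one executor, fetching only missing chunks."""
    for i in range(num_chunks):
        if i in executor.chunks:
            continue  # already in the local store, no network traffic
        # Assumed policy: take the chunk from any other executor that has it,
        # otherwise fall back to the driver.
        source = next(
            (p for p in peers if p is not executor and i in p.chunks),
            driver,
        )
        executor.chunks[i] = source.chunks[i]  # now available to later readers
        served[source.name] = served.get(source.name, 0) + 1
    return b"".join(executor.chunks[i] for i in range(num_chunks))


def simulate(payload, num_executors):
    """Let each executor read the value in turn; count chunks served per source."""
    driver = BlockManager("driver")
    num_chunks = driver_publish(driver, payload)
    executors = [BlockManager(f"executor-{n}") for n in range(1, num_executors + 1)]
    served = {}
    values = [
        executor_read(e, executors, driver, num_chunks, served) for e in executors
    ]
    return served, values


if __name__ == "__main__":
    served, values = simulate(b"broadcast-payload", 3)
    print(served)
```

In this toy run only the first executor pulls chunks from the driver; the later ones are served by an executor that already holds them. Note that, as in the quoted comment, the unit that ends up holding a copy is the executor, not the worker node.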

In the past there was another broadcast implementation (HTTP broadcast), but it was removed entirely in 2.0.

Does Spark optimize network traffic for broadcast variables?
