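Given that Spark runs multiple executors on each worker node, and that each executor runs in its own JVM, I would like to know whether (and how) Spark optimizes network traffic for broadcast variables. Ideally it would download the data once per worker node and then hand the already-serialized data to the executors on that node. The alternative is that every executor downloads the broadcast data whenever it needs it, so the same data would be downloaded several times on a given node.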
Answer 0 (score: 1)
Yes, Spark does optimize broadcasts, using torrent broadcast. Quoting the source:
 * A BitTorrent-like implementation of [[org.apache.spark.broadcast.Broadcast]].
 *
 * The mechanism is as follows:
 *
 * The driver divides the serialized object into small chunks and
 * stores those chunks in the BlockManager of the driver.
 *
 * On each executor, the executor first attempts to fetch the object from its BlockManager. If
 * it does not exist, it then uses remote fetches to fetch the small chunks from the driver and/or
 * other executors if available. Once it gets the chunks, it puts the chunks in its own
 * BlockManager, ready for other executors to fetch from.
 *
 * This prevents the driver from being the bottleneck in sending out multiple copies of the
 * broadcast data (one per executor).
In the past there was another broadcast implementation (HTTP broadcast), but it was removed completely in 2.0.
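To make the quoted mechanism concrete, here is a toy model in plain Python. It is only a sketch and not Spark's code: `BlockStore`, `publish`, `read_broadcast`, and the tiny `CHUNK_SIZE` are hypothetical names made up for illustration, and the peer order is fixed here purely so the result is easy to follow. Note that, as in the quoted comment, each executor keeps its own store, so the copy is per executor rather than per node.

```python
import pickle

CHUNK_SIZE = 4  # bytes; deliberately tiny so the demo produces several chunks


class BlockStore:
    """Toy stand-in for one process's block store (hypothetical, not a Spark class)."""

    def __init__(self, name):
        self.name = name
        self.chunks = {}   # chunk index -> bytes
        self.value = None  # deserialized object, cached after the first read


def publish(driver, obj):
    """Driver side: serialize obj, split it into chunks, keep them in the driver's store."""
    data = pickle.dumps(obj)
    pieces = [data[i:i + CHUNK_SIZE] for i in range(0, len(data), CHUNK_SIZE)]
    for index, piece in enumerate(pieces):
        driver.chunks[index] = piece
    return len(pieces)


def read_broadcast(executor, peers, num_chunks, fetch_log):
    """Executor side: use the local copy if present, else fetch missing chunks from peers."""
    if executor.value is not None:
        return executor.value  # already local, no network traffic
    for index in range(num_chunks):
        if index not in executor.chunks:
            # Take the chunk from the first peer (driver or executor) that holds it.
            source = next(p for p in peers if index in p.chunks)
            executor.chunks[index] = source.chunks[index]
            fetch_log.append((executor.name, source.name, index))
    data = b"".join(executor.chunks[i] for i in range(num_chunks))
    executor.value = pickle.loads(data)
    return executor.value


if __name__ == "__main__":
    driver = BlockStore("driver")
    n = publish(driver, {"lookup": list(range(10))})
    e1, e2 = BlockStore("executor-1"), BlockStore("executor-2")
    log = []
    read_broadcast(e1, [driver], n, log)      # first reader can only ask the driver
    read_broadcast(e2, [e1, driver], n, log)  # later reader is served by executor-1
    for fetcher, source, index in log:
        print(f"{fetcher} fetched chunk {index} from {source}")
```

In this model the second executor gets every chunk from the first one, which is the effect the comment describes: the driver stops being the only source once any executor holds the chunks.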