How to reference a Spark broadcast variable outside of the scope in which it was created

Date: 2016-04-25 19:14:41

Tags: scala apache-spark

All the examples of Spark broadcast variables I have seen define them inside the scope of the functions that use them (map(), join(), etc.). I would like to use both a map() function and a mapPartitions() function that reference a broadcast variable, but I would like to modularize them so that I can unit-test the same functions.

  • How can I do this?

My thought is to curry the function, so that a reference to the broadcast variable is passed in when calling it with map or mapPartitions.

  • Is there any performance impact to passing around a reference to the broadcast variable, compared with defining the function in the original scope?

I had something like this in mind (pseudocode):

// firstFile.scala
// ---------------

def mapper(bcast: Broadcast)(row: SomeRow): Int = {
  bcast.value(row._1)
}

def mapMyPartition(bcast: Broadcast)(iter: Iterator): Iterator = {
  val broadcastVariable = bcast.value

  for {
    i <- iter
  } yield broadcastVariable(i)
}


// secondFile.scala
// ----------------

import firstFile.{mapMyPartition, mapper}

val bcastVariable = sc.broadcast(Map(0 -> 1, 1 -> 2, 2 -> 3))

rdd
 .map(mapper(bcastVariable))
 .mapPartitions(mapMyPartition(bcastVariable))
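Since the stated goal is unit testing, the curried shape above can be exercised without any SparkContext by substituting a minimal stand-in for the broadcast handle. The following is a sketch under assumptions: `FakeBroadcast` is a hypothetical test double (not a Spark API), and `SomeRow` is assumed to be an `(Int, String)` pair for illustration.

```scala
// Hypothetical FakeBroadcast: a minimal stand-in for
// org.apache.spark.broadcast.Broadcast, just enough to unit-test
// the curried functions in a plain JVM test, no cluster needed.
class FakeBroadcast[T](v: T) extends Serializable { def value: T = v }

// Mirrors of the functions in firstFile.scala, written against the fake.
def mapper(bcast: FakeBroadcast[Map[Int, Int]])(row: (Int, String)): Int =
  bcast.value(row._1)

def mapMyPartition(bcast: FakeBroadcast[Map[Int, Int]])(iter: Iterator[(Int, String)]): Iterator[Int] = {
  val broadcastVariable = bcast.value // look up the value once per partition
  iter.map(row => broadcastVariable(row._1))
}

// A unit test without Spark:
val bcast = new FakeBroadcast(Map(0 -> 1, 1 -> 2, 2 -> 3))
assert(mapper(bcast)((1, "a")) == 2)
assert(mapMyPartition(bcast)(Iterator((0, "x"), (2, "y"))).toList == List(1, 3))
```

In an integration test you would instead build a real broadcast with `sc.broadcast(...)` against a local-mode SparkContext; the fake only covers the function logic.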

2 Answers:

Answer 0 (score: 2)

Your solution should work fine. In both cases, the function passed to map{Partitions} will, when serialized, contain a reference to the broadcast variable itself, but not to its value; bcast.value is only called when the function is evaluated on a worker node.

What needs to be avoided is calling bcast.value outside of the function that is shipped to the workers, e.g.:

def mapper(bcast: Broadcast): SomeRow => Int = {
  // bcast.value is evaluated on the driver, so the returned closure
  // captures the full value, and the whole value is serialized with it
  val value = bcast.value
  row => value(row._1)
}

Answer 1 (score: 2)

You are doing this correctly. You just have to remember to pass the broadcast reference around, not the value itself. Using your example, the difference can be illustrated as follows:

a) The efficient way:

// the whole Map[Int, Int] is serialized and sent to every worker
val bcastVariable = sc.broadcast(Map(0 -> 1, 1 -> 2, 2 -> 3))

rdd
 .map(mapper(bcastVariable)) // only the reference to the Map[Int, Int] is serialized and sent to every worker
 .mapPartitions(mapMyPartition(bcastVariable)) // only the reference to the Map[Int, Int] is serialized and sent to every worker

b) The inefficient way:

// the whole Map[Int, Int] is serialized and sent to every worker
val bcastVariable = sc.broadcast(Map(0 -> 1, 1 -> 2, 2 -> 3))

rdd
 .map(mapper(bcastVariable.value)) // the whole Map[Int, Int] is serialized and sent to every worker
 .mapPartitions(mapMyPartition(bcastVariable.value)) // the whole Map[Int, Int] is serialized and sent to every worker

Of course, in the second example the signatures of mapper and mapMyPartition would be slightly different.
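For illustration, the value-taking signatures hinted at in that last sentence might look roughly like this (a sketch, not code from the answer; SomeRow is again assumed to be an (Int, String) pair):

```scala
// Value-taking variants: they receive the Map itself, so when used with
// map/mapPartitions the full Map ends up inside every serialized closure,
// which is exactly what makes this the inefficient version.
def mapper(m: Map[Int, Int])(row: (Int, String)): Int =
  m(row._1)

def mapMyPartition(m: Map[Int, Int])(iter: Iterator[(Int, String)]): Iterator[Int] =
  iter.map(row => m(row._1))

assert(mapper(Map(0 -> 1))((0, "a")) == 1)
assert(mapMyPartition(Map(0 -> 1, 2 -> 3))(Iterator((2, "b"))).toList == List(3))
```

The function bodies are identical to the broadcast-taking versions; only the parameter type changes, which is why the inefficiency is easy to introduce by accident.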