I tried an example I found at http://homepage.cs.latrobe.edu.au/zhe/ZhenHeSparkRDDAPIExamples.html:
val z = sc.parallelize(List("12","23","345","4567"),2)
z.aggregate("")((x,y) => math.min(x.length, y.length).toString, (x,y) => x + y)
res142: String = 11
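Why is the minimum length 1? The first partition contains ["12", "23"] and the second contains ["345", "4567"]. Comparing the minimum of either partition with the initial value "", the minimum should be 0, so I would expect the result to be 00.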
val z = sc.parallelize(List("12","23","345",""),2)
z.aggregate("")((x,y) => math.min(x.length, y.length).toString, (x,y) => x + y)
res143: String = 10
My reasoning is the same for this one: the final result should be 00.
Thanks in advance.
Answer 0 (score: 3)
First, let's see how parallelize splits the data between partitions:
val x = sc.parallelize(List("12","23","345","4567"), 2)
x.glom.collect
// Array[Array[String]] = Array(Array(12, 23), Array(345, 4567))
val y = sc.parallelize(List("12","23","345",""), 2)
y.glom.collect
// Array[Array[String]] = Array(Array(12, 23), Array(345, ""))
and define two helpers:
def seqOp(x: String, y: String) = math.min(x.length, y.length).toString
def combOp(x: String, y: String) = x + y
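To follow along without a cluster, here is a minimal sketch of what aggregate does with these two helpers, modeled on plain Scala collections. The name localAggregate is hypothetical and not part of the Spark API. The sketch assumes the per-partition results are merged in partition order, which may not hold on a real cluster.

```scala
object AggregateSketch {
  // Same helpers as in the answer.
  def seqOp(x: String, y: String): String = math.min(x.length, y.length).toString
  def combOp(x: String, y: String): String = x + y

  // Hypothetical local stand-in for RDD.aggregate (an assumption, not Spark code):
  // fold every partition with seqOp starting from zero, then merge the
  // per-partition results, in order, with combOp starting from zero again.
  def localAggregate(partitions: Seq[Seq[String]], zero: String): String =
    partitions.map(_.foldLeft(zero)(seqOp)).foldLeft(zero)(combOp)

  // Partition layouts copied from the glom output above.
  val x: Seq[Seq[String]] = Seq(Seq("12", "23"), Seq("345", "4567"))
  val y: Seq[Seq[String]] = Seq(Seq("12", "23"), Seq("345", ""))

  def main(args: Array[String]): Unit = {
    println(localAggregate(x, "")) // 11
    println(localAggregate(y, "")) // 10
  }
}
```

The important detail is that the accumulator passed to seqOp is the string produced by the previous step ("0", then "1"), not the original zero value.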
Now let's trace the execution for x. Ignoring parallelism, it can be represented as follows:
(combOp (seqOp (seqOp "" "12") "23") (seqOp (seqOp "" "345") "4567"))
(combOp (seqOp "0" "23") (seqOp (seqOp "" "345") "4567"))
(combOp "1" (seqOp (seqOp "" "345") "4567"))
(combOp "1" (seqOp "0" "4567"))
(combOp "1" "1")
"11"
The same for y:
(combOp (seqOp (seqOp "" "12") "23") (seqOp (seqOp "" "345") ""))
(combOp (seqOp "0" "23") (seqOp (seqOp "" "345") ""))
(combOp "1" (seqOp (seqOp "" "345") ""))
(combOp "1" (seqOp "0" ""))
(combOp "1" "0")
"10"
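That being said, you shouldn't use aggregate here. The operations you apply are not associative, so the result depends on how the data happens to be partitioned, and the whole idea is simply wrong.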
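To make the associativity problem concrete, the sketch below runs the same operations over three different partitionings of the same list and gets three different answers. It uses plain Scala collections rather than Spark. The helper names localAggregate and minLength are hypothetical, and the per-partition results are assumed to be merged in order. The last helper shows one possible alternative that carries the minimum as an Int, which is associative and therefore independent of the partitioning.

```scala
object PartitionSensitivity {
  def seqOp(x: String, y: String): String = math.min(x.length, y.length).toString
  def combOp(x: String, y: String): String = x + y

  // Hypothetical local model of aggregate("")(seqOp, combOp); not a Spark API.
  def localAggregate(partitions: Seq[Seq[String]]): String =
    partitions.map(_.foldLeft("")(seqOp)).foldLeft("")(combOp)

  // Possible alternative: keep the running minimum length as an Int.
  // Taking the minimum is associative, so the partitioning no longer matters.
  def minLength(partitions: Seq[Seq[String]]): Int =
    partitions
      .map(_.foldLeft(Int.MaxValue)((acc, s) => math.min(acc, s.length)))
      .foldLeft(Int.MaxValue)((a, b) => math.min(a, b))

  val data: Seq[String] = Seq("12", "23", "345", "4567")
  val onePartition: Seq[Seq[String]] = Seq(data)
  val twoPartitions: Seq[Seq[String]] = Seq(data.take(2), data.drop(2))
  val fourPartitions: Seq[Seq[String]] = data.map(s => Seq(s))

  def main(args: Array[String]): Unit = {
    println(localAggregate(onePartition))   // 1
    println(localAggregate(twoPartitions))  // 11
    println(localAggregate(fourPartitions)) // 0000
    println(minLength(onePartition))        // 2
    println(minLength(fourPartitions))      // 2
  }
}
```

On an actual RDD the same idea would presumably be written as z.aggregate(Int.MaxValue)((acc, s) => math.min(acc, s.length), (a, b) => math.min(a, b)), or more simply as z.map(_.length).min().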