Can anyone explain in simple terms what CoGroupedRDD does? The code below performs a join between two RDDs.
Answer 0 (score: 0)
In its simplest form, cogroup has the following signature:

def cogroup[W](other: RDD[(K, W)]): RDD[(K, (Iterable[V], Iterable[W]))]

where "self" is an RDD[(K, V)]. Simply put, it takes two RDDs of key-value pairs and groups the values by key, keeping the values from each source logically separate:
val rdd1 = sc.parallelize(Seq((1, "foo"), (1, "bar"), (2, "foobar")))
val rdd2 = sc.parallelize(Seq((1, 1), (1, 2), (3, 3)))
rdd1.cogroup(rdd2).collect.foreach(println)
(1,(CompactBuffer(foo, bar),CompactBuffer(1, 2)))
(2,(CompactBuffer(foobar),CompactBuffer()))
(3,(CompactBuffer(),CompactBuffer(3)))
This mechanism is used to implement joins. Once the data has been cogrouped, you can flatten it:

for { lv <- lvs; rv <- rvs } yield (key, (lv, rv))

to obtain an inner join. Outer joins follow the same process, with small adjustments for empty sequences.
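To make the flattening step concrete, here is a minimal sketch in plain Scala, using an ordinary Seq in the cogrouped shape (key, (leftValues, rightValues)) to stand in for the cogrouped RDD; the data mirrors the rdd1/rdd2 example above, and the object and method names are illustrative, not part of any Spark API:

```scala
object CogroupJoinSketch {
  // Cogrouped shape: key -> (values from the left source, values from the right source),
  // matching the cogroup output shown above.
  val cogrouped: Seq[(Int, (Iterable[String], Iterable[Int]))] = Seq(
    (1, (Seq("foo", "bar"), Seq(1, 2))),
    (2, (Seq("foobar"), Seq())),
    (3, (Seq(), Seq(3)))
  )

  // Inner join: the nested for-comprehension produces the per-key cross product;
  // keys where either side is empty contribute nothing.
  def innerJoin: Seq[(Int, (String, Int))] =
    cogrouped.flatMap { case (key, (lvs, rvs)) =>
      for { lv <- lvs; rv <- rvs } yield (key, (lv, rv))
    }

  // Left outer join: the "small adjustment" is substituting None
  // when the right side's sequence is empty.
  def leftOuterJoin: Seq[(Int, (String, Option[Int]))] =
    cogrouped.flatMap { case (key, (lvs, rvs)) =>
      if (rvs.isEmpty) lvs.map(lv => (key, (lv, None)))
      else for { lv <- lvs; rv <- rvs } yield (key, (lv, Some(rv)))
    }

  def main(args: Array[String]): Unit = {
    innerJoin.foreach(println)      // only key 1 survives the inner join
    leftOuterJoin.foreach(println)  // key 2 is kept with None on the right
  }
}
```

On an actual RDD the same logic would be a flatMap over rdd1.cogroup(rdd2); Spark's own join and leftOuterJoin are built this way.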