Question

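I have an RDD sample_rdd of type RDD[(String, String, Int)] with 3 columns: id, item, count. Sample data:

id1|item1|1
id1|item2|3
id1|item3|4
id2|item1|3
id2|item4|2

I want to join each id against lookup_rdd, which is:

For id1, the outer join with the lookup table should give me:

Similarly, for id2 I should get:

Finally, the output for each id should contain all the counts for that id:

id1,1,3,4,0,0
id2,3,0,0,2,0

Important: this output should always be ordered according to the order of the lookup.

This is what I have tried: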

val line = rdd_sample
  .map { case (id, item, count) => (id, (item, count)) }
  .map(row => (row._1, row._2))
  .groupByKey()

get(line)
  .map(l => (l._1, l._2))
  .mapValues(item_count => lookup_rdd.leftOuterJoin(item_count))

def get(line: RDD[(String, Iterable[(String, Int)])]) = {
  for {
    (id, item_cnt) <- line
    i = item_cnt.map(tuple => (tuple._1, tuple._2))
  } yield (id, i)
}

Answer 1

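Try the following. Run each step in a local console to see exactly what it does.

The idea is to use zipWithIndex to number the items in lookup_rdd and the distinct ids: (i1,0),(i2,1)..(i5,4) and (id1,0),(id2,1)

Index of final result wanted = [delta (length of the lookup_rdd seq) * index of id1..id2] + index of i1...i5

So the generated base sequence will be (0,(i1,id1)),(1,(i2,id1))...(8,(i4,id2)),(9,(i5,id2))

Then reduce by the key (i1,id1) and compute the counts.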
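To make the index arithmetic concrete, here is a minimal sketch that uses plain Scala collections in place of RDDs. It is not the answer's code. The lookup order item1..item5 is an assumption inferred from the expected output in the question (the lookup contents are not shown there), and the names IndexSketch, lookup, sample, slots and result are made up for illustration.

```scala
// Plain Scala collections stand in for the RDDs; no Spark is needed to run this.
object IndexSketch {
  // ASSUMPTION: lookup order inferred from the expected output in the question.
  val lookup: Seq[String] = Seq("item1", "item2", "item3", "item4", "item5")

  // Sample data from the question: (id, item, count).
  val sample: Seq[(String, String, Int)] = Seq(
    ("id1", "item1", 1), ("id1", "item2", 3), ("id1", "item3", 4),
    ("id2", "item1", 3), ("id2", "item4", 2))

  val delta: Int = lookup.length
  val ids: Seq[String] = sample.map(_._1).distinct

  // Sum the counts per (id, item) pair.
  val counts: Map[(String, String), Int] =
    sample.groupBy { case (id, item, _) => (id, item) }
      .map { case (key, rows) => key -> rows.map(_._3).sum }

  // One slot per (id, item) combination, numbered delta * idIndex + itemIndex.
  // Combinations missing from the sample get a count of 0.
  val slots: Seq[(Int, String, Int)] =
    for {
      (id, idIdx) <- ids.zipWithIndex
      (item, itemIdx) <- lookup.zipWithIndex
    } yield (delta * idIdx + itemIdx, id, counts.getOrElse((id, item), 0))

  // Collect the counts of each id in slot order.
  val result: Seq[(String, Seq[Int])] =
    ids.map(id => id -> slots.filter(_._2 == id).sortBy(_._1).map(_._3))

  def main(args: Array[String]): Unit =
    result.foreach { case (id, cs) =>
      println((id +: cs.map(_.toString)).mkString(","))
    }
}
```

With these assumed inputs, the slots are numbered 0 to 9 and each id ends up with one count per lookup item, in lookup order.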

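val res2 = sc.parallelize(arr) // sample_rdd: (id, item, count)
val res3 = sc.parallelize(cart) // lookup_rdd: the first field of each element is the item
val delta = res3.count // number of entries in the lookup

// ((item, id), (target index, 0)) for every item/id combination
val res83 = res3.map(_._1).zipWithIndex
  .cartesian(res2.map(_._1).distinct.zipWithIndex)
  .map(x => ((x._1._1, x._2._1), ((delta * x._2._2) + x._1._2, 0)))

// ((item, id), summed count) for the combinations present in the sample
val res86 = res2.map(x => ((x._2, x._1), x._3)).reduceByKey(_ + _)

// keep every item/id combination; the count is None where the sample has no row
val res88 = res83.leftOuterJoin(res86)

// (target index, ((item, id), count)), with 0 for missing combinations
val res91 = res88.map(x => {
  x._2._2 match {
    case Some(x1) => (x._2._1._1, (x._1, x._2._1._2 + x1))
    case None => (x._2._1._1, (x._1, x._2._1._2))
  }
})

// sort by target index, then concatenate the counts per id
val res97 = res91.sortByKey(true)
  .map(x => (x._2._1._2, List(x._2._2)))
  .reduceByKey(_ ++ _)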

res97.collect

// SOLUTION: Array((id1,List(1,3,4,0,0)),(id2,List(3,0,0,2,0)))
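As far as I know, Spark does not guarantee the order in which reduceByKey combines values after a shuffle, so the lists built with `_++_` may not always come back in lookup order. The following is a hedged variant of the same idea, not the answer's original code: it carries each item's lookup position along with the count and sorts inside each group. The lookup contents (item1..item5), the object name OrderedCounts and the helper compute are assumptions made for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object OrderedCounts {
  def compute(sc: SparkContext): Array[(String, Seq[Int])] = {
    // ASSUMPTION: lookup order inferred from the expected output in the question.
    val lookup = sc.parallelize(Seq("item1", "item2", "item3", "item4", "item5"))

    // Sample data from the question: (id, item, count).
    val sample = sc.parallelize(Seq(
      ("id1", "item1", 1), ("id1", "item2", 3), ("id1", "item3", 4),
      ("id2", "item1", 3), ("id2", "item4", 2)))

    // ((id, item), position of the item in the lookup) for every combination.
    val slots = sample.map(_._1).distinct()
      .cartesian(lookup.zipWithIndex())
      .map { case (id, (item, pos)) => ((id, item), pos) }

    // ((id, item), summed count) for combinations present in the sample.
    val counts = sample
      .map { case (id, item, n) => ((id, item), n) }
      .reduceByKey(_ + _)

    slots.leftOuterJoin(counts)
      // Missing combinations get a count of 0.
      .map { case ((id, _), (pos, n)) => (id, (pos, n.getOrElse(0))) }
      .groupByKey()
      // Sort inside each group by lookup position, so the order does not
      // depend on how the shuffle delivered the values.
      .mapValues(_.toSeq.sortBy(_._1).map(_._2))
      .sortByKey()
      .collect()
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("ordered-counts").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      compute(sc).foreach { case (id, cs) =>
        println((id +: cs.map(_.toString)).mkString(","))
      }
    } finally {
      sc.stop()
    }
  }
}
```

Sorting within each group keeps the per-id list in lookup order at the cost of holding one list per id in memory, which should be fine as long as the lookup is small.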

Spark: join an RDD according to the order of another RDD
