I have a dataset from which I created a pair RDD[K, V]
(V = the number of data points under each key):
val loadInfoRDD = inputRDD.map(a => (a._1.substring(0,variabelLength),a._2)).reduceByKey(_+_)
(dr5n,108)
(dr5r4,67)
(dr5r5,1163)
(dr5r6,121)
(dr5r7,1103)
(dr5rb,93)
(dr5re8,11)
(dr5re9,190)
(dr5reb,26)
(dr5rec,38088)
(dr5red,36713)
(dr5ree,47316)
(dr5ref,131353)
(dr5reg,121227)
(dr5reh,264)
(dr5rej,163)
(dr5rek,163)
(dr5rem,229)
I need to assign each key to an RDD partition. After this stage, I zipWithIndex
the keys of this RDD:
val partitioner = loadTree.coalesce(1).sortByKey().keys.zipWithIndex
(dr5n,0)
(dr5r4,1)
(dr5r5,2)
(dr5r6,3)
(dr5r7,4)
(dr5rb,5)
(dr5re8,6)
(dr5re9,7)
(dr5reb,8)
(dr5rec,9)
(dr5red,10)
(dr5ree,11)
(dr5ref,12)
(dr5reg,13)
(dr5reh,14)
(dr5rej,15)
(dr5rek,16)
(dr5rem,17)
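But to get a better load distribution across partitions, I need to walk through the values starting from the first key (in sorted order) and compute a running sum of the values. All keys are given the same value (here, a partition number starting at 0) until adding the next key would push the running sum past a threshold, at which point the next partition number begins.
Say the threshold = 10000, then: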
(dr5n,0)
(dr5r4,0)
(dr5r5,0)
(dr5r6,0)
(dr5r7,0)
(dr5rb,0)
(dr5re8,0)
(dr5re9,0)
(dr5reb,0)
(dr5rec,1)
(dr5red,2)
(dr5ree,3)
(dr5ref,4)
(dr5reg,5)
(dr5reh,6)
(dr5rej,6)
(dr5rek,6)
(dr5rem,6)
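To make the intended assignment concrete, here is a minimal sketch in plain Scala (no Spark), assuming the sorted per-key counts are small enough to collect to the driver. The names `RunningSumPartitions` and `assignPartitions` are made up for illustration, and the rule it uses (start a new partition when adding the next count would exceed the threshold, with a single oversized key getting a partition of its own) is my reading of the example above, not a confirmed specification.

```scala
object RunningSumPartitions {
  // Hypothetical helper: takes (key, count) pairs already sorted by key and
  // returns (key, partitionId) pairs. A new partition starts whenever adding
  // the next count would push the running sum past the threshold.
  def assignPartitions(sortedCounts: Seq[(String, Long)], threshold: Long): Seq[(String, Int)] = {
    // State carried through the scan: (key, partition id, running sum of that partition)
    val seed = ("", 0, 0L)
    sortedCounts
      .scanLeft(seed) { case ((_, part, sum), (key, count)) =>
        if (sum > 0 && sum + count > threshold) (key, part + 1, count) // open a new partition
        else (key, part, sum + count)                                  // stay in the current one
      }
      .tail // drop the seed state
      .map { case (key, part, _) => (key, part) }
  }

  def main(args: Array[String]): Unit = {
    // Counts copied from the loadInfoRDD output shown earlier
    val counts = Seq(
      "dr5n" -> 108L, "dr5r4" -> 67L, "dr5r5" -> 1163L, "dr5r6" -> 121L,
      "dr5r7" -> 1103L, "dr5rb" -> 93L, "dr5re8" -> 11L, "dr5re9" -> 190L,
      "dr5reb" -> 26L, "dr5rec" -> 38088L, "dr5red" -> 36713L, "dr5ree" -> 47316L,
      "dr5ref" -> 131353L, "dr5reg" -> 121227L, "dr5reh" -> 264L, "dr5rej" -> 163L,
      "dr5rek" -> 163L, "dr5rem" -> 229L
    )
    assignPartitions(counts, 10000L).foreach(println)
  }
}
```

With Spark, the input could presumably come from something like `loadInfoRDD.sortByKey().collect()`, assuming the counts are (or are converted to) `Long`. The resulting key-to-partition pairs could then be turned into a map and used inside a custom `Partitioner`.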
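I tried creating a new map: I build the groups of keys that can share a partition and insert them into the new map.
Is there a more expert way to achieve the same thing? Thanks!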