I am working with this dataset:
(apple,1)
(banana,4)
(orange,3)
(grape,2)
(watermelon,2)
and the other dataset is:
(apple,Map(Bob -> 1))
(banana,Map(Chris -> 1))
(orange,Map(John -> 1))
(grape,Map(Smith -> 1))
(watermelon,Map(Phil -> 1))
My goal is to merge the two sets to get:
(apple,1,Map(Bob -> 1))
(banana,4,Map(Chris -> 1))
(orange,3,Map(John -> 1))
(grape,2,Map(Smith -> 1))
(watermelon,2,Map(Phil -> 1))
The code I have is:
...
val counts_firstDataset = words.map(word =>
(word.firstWord, 1)).reduceByKey{case (x, y) => x + y}
and for the second dataset:
...
val counts_secondDataset = secondSet.map(x => (x._1,
x._2.toList.groupBy(identity).mapValues(_.size)))
I tried using the join method, `val joined_data = counts_firstDataset.join(counts_secondDataset)`, but it did not work, because join requires [K,V] pairs. How can I solve this?
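For reference, here is a plain-Scala sketch of the merge I am after (ordinary Maps stand in for the RDDs, and the two-row data is just an excerpt; no Spark involved):

```scala
object MergeSketch extends App {
  // Two collections keyed by fruit name, mirroring the two datasets above.
  val counts = Map("apple" -> 1, "banana" -> 4)
  val nameMaps = Map("apple" -> Map("Bob" -> 1), "banana" -> Map("Chris" -> 1))

  // Merge on the shared key, producing (fruit, count, nameMap) triples.
  val merged = counts.keys.toList.sorted.collect {
    case k if nameMaps.contains(k) => (k, counts(k), nameMaps(k))
  }

  println(merged) // List((apple,1,Map(Bob -> 1)), (banana,4,Map(Chris -> 1)))
}
```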
Answer 0 (score: 1)
The easiest way is to convert both to DataFrames and then join:
import spark.implicits._
val counts_firstDataset = words
.map(word => (word.firstWord, 1))
.reduceByKey{case (x, y) => x + y}
.toDF("type", "value")
val counts_secondDataset = secondSet
.map(x => (x._1,x._2.toList.groupBy(identity).mapValues(_.size)))
.toDF("type_2","map")
counts_firstDataset
.join(counts_secondDataset, 'type === 'type_2)
  .drop("type_2")
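A more complete, self-contained version of the same idea (a sketch only: it assumes Spark is on the classpath, uses a local SparkSession, and substitutes small Seq literals for the computed RDDs):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("merge-example")
  .getOrCreate()
import spark.implicits._

// Seq literals standing in for counts_firstDataset / counts_secondDataset.
val first = Seq(("apple", 1), ("banana", 4)).toDF("type", "value")
val second = Seq(("apple", Map("Bob" -> 1)), ("banana", Map("Chris" -> 1)))
  .toDF("type_2", "map")

// Inner join on the fruit name, then drop the duplicate key column.
first.join(second, $"type" === $"type_2")
  .drop("type_2")
  .show(false)
```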
Answer 1 (score: 1)
Since the first elements (the fruit names) of the two lists appear in the same order, you can use zip to combine the two lists of tuples, and then map to flatten the nested pairs into triples:
counts_firstDataset.zip(counts_secondDataset)
.map(vk => (vk._1._1, vk._1._2, vk._2._2))
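As a plain-Scala illustration of that step (Lists stand in for the RDDs, using a two-row subset of the example data):

```scala
object ZipMergeSketch extends App {
  // Two "datasets" kept in the same order, as in the question.
  val counts = List(("apple", 1), ("banana", 4))
  val nameMaps = List(("apple", Map("Bob" -> 1)), ("banana", Map("Chris" -> 1)))

  // zip pairs elements positionally; the pattern match flattens
  // ((fruit, count), (fruit, map)) into (fruit, count, map).
  val merged = counts.zip(nameMaps).map { case ((fruit, n), (_, m)) => (fruit, n, m) }

  println(merged) // List((apple,1,Map(Bob -> 1)), (banana,4,Map(Chris -> 1)))
}
```

Note that on RDDs, zip additionally requires both RDDs to have the same number of partitions and the same number of elements in each partition; when that is not guaranteed, a key-based join is the safer choice.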