Joining key-value pairs with key-Map pairs

Asked: 2017-09-26 10:04:12

Tags: scala apache-spark

I am working with the following dataset:

(apple,1)
(banana,4)
(orange,3)
(grape,2)
(watermelon,2)

The other dataset is:

(apple,Map(Bob -> 1))
(banana,Map(Chris -> 1))
(orange,Map(John -> 1))
(grape,Map(Smith -> 1))
(watermelon,Map(Phil -> 1))

My goal is to merge the two datasets by key to get:

(apple,1,Map(Bob -> 1))
(banana,4,Map(Chris -> 1))
(orange,3,Map(John -> 1))
(grape,2,Map(Smith -> 1))
(watermelon,2,Map(Phil -> 1))

This is the code I have for the first dataset:

...
val counts_firstDataset = words
  .map(word => (word.firstWord, 1))
  .reduceByKey { case (x, y) => x + y }

And for the second dataset:

...
val counts_secondDataset = secondSet
  .map(x => (x._1, x._2.toList.groupBy(identity).mapValues(_.size)))

I tried the join method, val joined_data = counts_firstDataset.join(counts_secondDataset), but it did not work, because join requires [K, V] pairs. How can I solve this?

2 Answers:

Answer 0: (score: 1)

The simplest way is to convert both datasets to DataFrames and then join them:

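// Needed for .toDF and for the 'column symbol syntax
import spark.implicits._

// (fruit, count) pairs as a two-column DataFrame
val counts_firstDataset = words
  .map(word => (word.firstWord, 1))
  .reduceByKey { case (x, y) => x + y }
  .toDF("type", "value")

// (fruit, Map) pairs; the key column gets a distinct name to avoid ambiguity
val counts_secondDataset = secondSet
  .map(x => (x._1, x._2.toList.groupBy(identity).mapValues(_.size)))
  .toDF("type_2", "map")

// Join on the fruit name, then drop the duplicate key column
counts_firstDataset
  .join(counts_secondDataset, 'type === 'type_2)
  .drop('type_2)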
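
For comparison, here is a minimal sketch that stays with RDDs rather than DataFrames. It relies on the fact that join on two pair RDDs returns (key, (leftValue, rightValue)), which a map can then flatten into the triple the question asks for. The object name JoinPairsSketch, the helper joinFlat, and the sample data are hypothetical, and the local SparkSession is only there to make the example self-contained.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

object JoinPairsSketch {
  // Hypothetical helper: joins by key, then flattens (k, (n, m)) into (k, n, m).
  def joinFlat(
      counts: RDD[(String, Int)],
      owners: RDD[(String, Map[String, Int])]
  ): RDD[(String, Int, Map[String, Int])] =
    counts
      .join(owners) // RDD[(String, (Int, Map[String, Int]))]
      .map { case (fruit, (count, byName)) => (fruit, count, byName) }

  def main(args: Array[String]): Unit = {
    // Local session purely for illustration.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("join-pairs-sketch")
      .getOrCreate()
    val sc = spark.sparkContext

    // Stand-ins for counts_firstDataset and counts_secondDataset.
    val counts = sc.parallelize(Seq(("apple", 1), ("banana", 4), ("orange", 3)))
    val owners = sc.parallelize(Seq(
      ("apple", Map("Bob" -> 1)),
      ("banana", Map("Chris" -> 1)),
      ("orange", Map("John" -> 1))
    ))

    // Row order after a join is not guaranteed, so sort before printing.
    joinFlat(counts, owners).collect().sortBy(_._1).foreach(println)

    spark.stop()
  }
}
```

If the join in the real job fails with a serialization error rather than a type error, one thing worth checking (an assumption about the cause, not something the question confirms) is the Map produced by mapValues: in Scala 2.11 and 2.12 it is a lazy view that is not serializable, and replacing it with .map { case (k, v) => (k, v.size) } builds a regular Map instead.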

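Answer 1: (score: 1)

Since the first elements (the fruit names) appear in the same order in both lists, you can use zip to pair up the two lists of tuples, then use map to flatten each pair into a single tuple, as follows: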

counts_firstDataset.zip(counts_secondDataset)
  .map(vk => (vk._1._1, vk._1._2, vk._2._2))
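
The zip approach depends entirely on position, which the following sketch on plain Scala collections tries to illustrate. The object name ZipSketch, the helper zipFlat, and the sample data are hypothetical. The second call shows what happens when the two inputs are not in the same order: zip still pairs elements by index, so the keys no longer match.

```scala
object ZipSketch {
  // Hypothetical helper: pairs elements by position, then flattens
  // ((k, n), (k2, m)) into (k, n, m). The second key is ignored.
  def zipFlat(
      first: Seq[(String, Int)],
      second: Seq[(String, Map[String, Int])]
  ): Seq[(String, Int, Map[String, Int])] =
    first.zip(second).map { case ((fruit, count), (_, byName)) =>
      (fruit, count, byName)
    }

  def main(args: Array[String]): Unit = {
    val first = Seq(("apple", 1), ("banana", 4), ("orange", 3))
    val second = Seq(
      ("apple", Map("Bob" -> 1)),
      ("banana", Map("Chris" -> 1)),
      ("orange", Map("John" -> 1))
    )

    // Same order on both sides: every fruit gets its own map.
    zipFlat(first, second).foreach(println)

    // Different order: apple is silently paired with orange's map.
    println(zipFlat(first, second.reverse).head) // (apple,1,Map(John -> 1))
  }
}
```

On RDDs the requirement is stricter still: RDD.zip expects both RDDs to have the same number of partitions and the same number of elements in each partition, and the output order of reduceByKey is not guaranteed. A key-based join is therefore likely the safer choice unless the ordering is known to line up.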