Flattening an RDD to get distinct pairs of values in Spark

Asked: 2016-03-30 02:09:59

Tags: apache-spark

Consider the schema of my DataFrame in Scala:

root
 |-- phonetic: string (nullable = true)
 |-- sigID: long (nullable = true)
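
(For reference, a minimal DataFrame matching this schema could be built as below; the SparkSession and the sample rows are assumptions of this sketch, invented to mirror the example. A 2016-era Spark 1.x setup would use a SQLContext instead.)

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("pairs").getOrCreate()
import spark.implicits._

// Hypothetical sample data mirroring the question's example.
val features = Seq(
  ("abc", 1L), ("abc", 2L), ("abc", 3L),
  ("def", 9L), ("def", 8L)
).toDF("phonetic", "sigID")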

I am basically grouping by phonetic:

features.rdd.groupBy(x => x.apply(0))

This gives me an RDD like:

(abc,([1],[2],[3]))
(def,([9],[8]))

How do I flatten this to get the Cartesian product as (key, (value-a, value-b)) pairs:

abc,1,2
abc,1,3
abc,2,3
def,9,8
....

Thanks

2 answers:

Answer 0 (score: 1):

You can keep it as a DataFrame and do this:

// Assumes the $"column" syntax is in scope, e.g. via import spark.implicits._
val df: DataFrame = ...

// Self-join on phonetic; =!= (inequality on Column, written !== in Spark 1.x,
// deprecated since 2.0) excludes pairing a row with itself.
df.as("df1").join(
  df.as("df2"),
  ($"df1.phonetic" === $"df2.phonetic") && ($"df1.sigID" =!= $"df2.sigID")
).select($"df1.phonetic", $"df1.sigID", $"df2.sigID").show
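
Note that this join returns each pair in both orders, (1,2) as well as (2,1). If you want each unordered pair exactly once, as in the question's sample output, replacing the inequality with < is a minimal tweak (a sketch, same assumptions as above):

// Keep only one ordering per pair: the smaller sigID on the left.
df.as("df1").join(
  df.as("df2"),
  ($"df1.phonetic" === $"df2.phonetic") && ($"df1.sigID" < $"df2.sigID")
).select($"df1.phonetic", $"df1.sigID", $"df2.sigID").show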

Answer 1 (score: 1):

BTW, to answer the original question, you can flatten the grouped data like this:

// For each phonetic key, pair every sigID in the group with every other one.
df.rdd.groupBy(x => x.apply(0)).flatMap { case (key, rows) =>
  val longs = rows.map(_.getLong(1)).toArray
  longs.flatMap(l => longs.flatMap { l2 =>
    if (l != l2) Seq((key, l, l2)) else Seq()
  })
}.collect

res35: Array[(Any, Long, Long)] = Array((def,9,8), (def,8,9), (abc,1,2), (abc,1,3), (abc,2,1), (abc,2,3), (abc,3,1), (abc,3,2))
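
The output above keeps both orderings of every pair. If, as in the question's sample, each unordered pair should appear only once, combinations(2) on the grouped values is a compact alternative (a sketch under the same schema assumptions):

// Emit each unordered pair of sigIDs exactly once per phonetic key.
df.rdd.groupBy(x => x.apply(0)).flatMap { case (key, rows) =>
  rows.map(_.getLong(1)).toSeq.combinations(2).map {
    case Seq(a, b) => (key, a, b)
  }
}.collect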