Consider the following schema of my DataFrame in Scala:
root
|-- phonetic: string (nullable = true)
|-- sigID: long (nullable = true)
I am basically grouping by phonetic:
featuers.rdd.groupBy(x => x.apply(0))
This gives me an RDD like:
(abc,([1],[2],[3]))
(def,([9],[8]))
How can I flatten this to get, for each key, the Cartesian pairs of its values, i.e. (key, (value-a, value-b))?
abc,1,2
abc,1,3
abc,2,3
def,9,8
....
Thanks.
Answer 0 (score: 1)
You can keep it as a DataFrame and do a self-join on phonetic, keeping only rows where the two sigID values differ:
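// The $"..." column syntax needs the SQL implicits in scope (e.g. import spark.implicits._ on Spark 2.x+).
val df: DataFrame = ...
df.as("df1").join(
  df.as("df2"),
  // Same phonetic, different sigID. On Spark 2.0+ `=!=` replaces the deprecated `!==`.
  ($"df2.phonetic" === $"df1.phonetic") && ($"df1.sigID" !== $"df2.sigID")
).select($"df1.phonetic", $"df1.sigID", $"df2.sigID").show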
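Note that an inequality condition returns every pair in both orders, e.g. (abc,1,2) and (abc,2,1), whereas the question lists each pair only once. A minimal sketch of one way to keep a single ordering is to compare with `<` instead. It assumes Spark 2.x or later and that sigID values are distinct within a group; the object name `UniquePairsJoin`, the helper `uniquePairs`, and the output column names `sigID_a` and `sigID_b` are made up for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object UniquePairsJoin {
  // Hypothetical helper: self-join on phonetic, keeping only rows where the
  // left sigID is smaller than the right one, so each pair appears once.
  def uniquePairs(df: DataFrame): DataFrame =
    df.as("df1")
      .join(
        df.as("df2"),
        (col("df1.phonetic") === col("df2.phonetic")) &&
          (col("df1.sigID") < col("df2.sigID")))
      .select(
        col("df1.phonetic"),
        col("df1.sigID").as("sigID_a"),
        col("df2.sigID").as("sigID_b"))

  def main(args: Array[String]): Unit = {
    // Assumption: a local session, purely for trying the sketch out.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("unique-pairs")
      .getOrCreate()
    import spark.implicits._

    // Sample rows mirroring the schema in the question.
    val df = Seq(("abc", 1L), ("abc", 2L), ("abc", 3L), ("def", 9L), ("def", 8L))
      .toDF("phonetic", "sigID")

    uniquePairs(df).show()
    spark.stop()
  }
}
```

Because the pairs are ordered by value, the def group comes out as (def,8,9) rather than (def,9,8) as written in the question.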
Answer 1 (score: 1)
By the way, to answer the original question, you can flatten the grouped data like this:
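df.rdd.groupBy(x => x.apply(0)).flatMap(t => {
  // t._1 is the phonetic key, t._2 the rows of that group; column 1 is sigID
  val longs = t._2.toArray.map(r => r.getLong(1))
  // emit every ordered pair of distinct sigIDs within the group
  longs.flatMap(l => longs.flatMap(l2 => {
    if (l != l2) Seq((t._1, l, l2))
    else Seq()
  }))
}).collect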
res35: Array[(Any, Long, Long)] = Array((def,9,8), (def,8,9), (abc,1,2), (abc,1,3), (abc,2,1), (abc,2,3), (abc,3,1), (abc,3,2))
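The result above contains both orderings of every pair. If only one ordering per pair is wanted, as in the question's expected output, the per-group step could use `combinations(2)` from the Scala collections library instead. The sketch below uses plain Scala sequences to stand in for the grouped values; `GroupPairs` and `uniquePairs` are hypothetical names, and it assumes the sigID values within a group are distinct.

```scala
object GroupPairs {
  // Hypothetical helper: all unordered pairs of one key's values,
  // in the order the values appear in the group.
  def uniquePairs[K](key: K, values: Seq[Long]): List[(K, Long, Long)] =
    values.combinations(2).map { case Seq(a, b) => (key, a, b) }.toList

  def main(args: Array[String]): Unit = {
    // Stand-in for the grouped RDD contents shown in the question.
    val grouped = Seq("abc" -> Seq(1L, 2L, 3L), "def" -> Seq(9L, 8L))

    // Prints (abc,1,2), (abc,1,3), (abc,2,3), (def,9,8), one per line.
    grouped.flatMap { case (k, vs) => uniquePairs(k, vs) }.foreach(println)
  }
}
```

Inside the `flatMap` over the grouped RDD, the same helper could presumably be applied to `t._2.map(_.getLong(1)).toSeq` in place of the nested `flatMap` calls.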