Grouping the rows of a Dataset by array intersection

Time: 2019-07-24 15:46:36

Tags: scala dataframe apache-spark dataset

I want to group the rows of a Dataset so that two rows end up in the same group whenever their array fields intersect.

case class Test(id: String, keys: Array[String])
val testDataset: Dataset[Test] // has the following data
//("id1", ["key1", "key2", "key3"])
//("id2", ["key1", "key4", "key5"])
//("id3", ["key5", "key7", "key8"])
//("id4", ["key9"])

I want the output to be:
//Group1
[("id1", ["key1", "key2", "key3"]), ("id2", ["key1", "key4", "key5"]), ("id3", ["key5", "key7", "key8"])],
//Group2
[("id4", ["key9"])]
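
To be precise about the semantics: id1 and id3 share no key directly, but both intersect with id2, so the grouping is transitive. In other words, the groups are the connected components of the "shares at least one key" relation. Below is a minimal local sketch of that behaviour on a plain Seq using union-find, assuming the data fits in memory. It is not a Spark solution, and the names `GroupByIntersection` and `groupByIntersection` are made up for illustration.

```scala
import scala.collection.mutable

case class Test(id: String, keys: Array[String])

object GroupByIntersection {
  // Hypothetical helper: groups rows into connected components,
  // where two rows are linked if they share at least one key.
  def groupByIntersection(rows: Seq[Test]): Seq[Seq[Test]] = {
    val indexed = rows.toIndexedSeq
    // parent(i) points towards the representative row index of i's group
    val parent = Array.tabulate(indexed.length)(identity)

    // Find the representative of i, shortening paths along the way
    def find(i: Int): Int = {
      var x = i
      while (parent(x) != x) {
        parent(x) = parent(parent(x))
        x = parent(x)
      }
      x
    }

    // Merge two groups; the smaller root index becomes the representative
    def union(a: Int, b: Int): Unit = {
      val (ra, rb) = (find(a), find(b))
      if (ra != rb) parent(math.max(ra, rb)) = math.min(ra, rb)
    }

    // Remember the first row seen for each key; later rows with the
    // same key are merged into that row's group.
    val firstRowForKey = mutable.Map.empty[String, Int]
    for ((row, i) <- indexed.zipWithIndex; key <- row.keys) {
      firstRowForKey.get(key) match {
        case Some(j) => union(i, j)
        case None    => firstRowForKey(key) = i
      }
    }

    // Collect rows by representative, ordered by first appearance
    indexed.indices
      .groupBy(find)
      .toSeq
      .sortBy(_._1)
      .map { case (_, members) => members.sorted.map(indexed) }
  }
}
```

Called on the four sample rows, this should put id1, id2 and id3 in one group and id4 in another. On a real Dataset the same idea would presumably need a distributed connected-components step (GraphX, for instance, offers `connectedComponents`), and that is the part I do not know how to do efficiently.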

What is an efficient way to do this?

0 answers:

No answers yet