I have this data structure in Spark:
val df = Seq(
  ("Package 1", Seq("address1", "address2", "address3")),
  ("Package 2", Seq("address3", "address4", "address5", "address6")),
  ("Package 3", Seq("address7", "address8")),
  ("Package 4", Seq("address9")),
  ("Package 5", Seq("address9", "address1")),
  ("Package 6", Seq("address10")),
  ("Package 7", Seq("address8"))
).toDF("Package", "Destinations")

df.show(20, false)
I need to find all addresses that were seen together in the various packages, i.e. merge packages whose address lists overlap (transitively) into groups. I can't seem to find an efficient way to do this; I have tried grouping, mapping, and so on. Ideally, given the df above, the result would be:
+----+------------------------------------------------------------------------+
| Id | Addresses |
+----+------------------------------------------------------------------------+
| 1 | [address1, address2, address3, address4, address5, address6, address9] |
| 2 | [address7, address8] |
| 3 | [address10] |
+----+------------------------------------------------------------------------+
Answer 0 (score: 2)
Use treeReduce:
https://spark.apache.org/docs/2.1.0/api/java/org/apache/spark/rdd/RDD.html#treeReduce(scala.Function2,%20int)
For the sequential (seqOp) operation, you can maintain a collection of disjoint sets: for each new array of elements, e.g. [address7, address8], iterate over the existing sets and check whether the intersection with the new set is non-empty. If it is, add the new elements to that set (and merge every set that overlaps the new one into a single set); if nothing overlaps, start a new set. A sketch follows.
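A minimal sketch of that seqOp in Scala (the helper name mergeInto is illustrative, not from the answer):

def mergeInto(acc: List[Set[String]], next: Set[String]): List[Set[String]] = {
  // Split the accumulated sets into those sharing an address with `next`
  // and those disjoint from it.
  val (overlapping, disjoint) = acc.partition(s => (s & next).nonEmpty)
  // Union all overlapping sets with the new one: `next` acts as the bridge
  // that joins them into a single component. If nothing overlaps,
  // `next` simply becomes a new set.
  overlapping.foldLeft(next)(_ ++ _) :: disjoint
}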
For the combine (combOp) operation: merge the two partial collections of sets the same way, unioning any sets whose intersection is non-empty (see the sketch below).
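A matching combOp sketch, reusing the mergeInto helper above (the name combine is likewise illustrative):

def combine(a: List[Set[String]], b: List[Set[String]]): List[Set[String]] =
  // Each set on one side may bridge several sets on the other, so fold
  // them in one at a time with the same merging logic as the seqOp.
  b.foldLeft(a)(mergeInto)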
Note: treeReduce and treeAggregate are distinct RDD methods. treeReduce takes a single reduce function, while treeAggregate additionally takes a zero value with separate seqOp and combOp arguments, which is what the two operations above map onto.
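Putting it together on the question's df (a sketch, assuming the mergeInto/combine helpers above and an existing SparkSession named spark; treeAggregate collects the merged components on the driver, so this assumes the number of components is modest):

import spark.implicits._

val components: List[Set[String]] = df
  .select($"Destinations").as[Seq[String]]  // one address list per package
  .rdd
  .map(_.toSet)
  .treeAggregate(List.empty[Set[String]])(mergeInto, combine)

components.zipWithIndex.foreach { case (addrs, i) =>
  println(s"${i + 1} | ${addrs.toList.sorted.mkString("[", ", ", "]")}")
}
// Expected components (row order may vary):
// 1 | [address1, address2, address3, address4, address5, address6, address9]
// 2 | [address7, address8]
// 3 | [address10]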