My DataFrame looks like this:
val df = sc.parallelize(Seq((100, 1, 1), (100, 1,2), (100, 2,3), (200, 1,1), (200, 2,3), (200, 2, 2), (200, 3, 1), (200, 3,2), (300, 1,1), (300,1,2), (300, 2,5), (400, 1, 6))).toDF("_c0", "_c1", "_c2")
+---+---+---+
|_c0|_c1|_c2|
+---+---+---+
|100|  1|  1|
|100|  1|  2|
|100|  2|  3|
|200|  1|  1|
|200|  2|  3|
|200|  2|  2|
|200|  3|  1|
|200|  3|  2|
|300|  1|  1|
|300|  1|  2|
|300|  2|  5|
|400|  1|  6|
+---+---+---+
I need to groupBy _c0 and _c1 and get an RDD like this:
res9: Array[Array[Array[Int]]] = Array(Array(Array(1, 2), Array(3)), Array(Array(1), Array(3, 2), Array(1, 2)), Array(Array(1, 2), Array(5)), Array(Array(6)))
It is an array of arrays of arrays. I am new to Scala, so any help is appreciated.
Answer 0 (score: 2)
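You can first groupBy _c0 and _c1, collecting the _c2 values of each pair, and then groupBy _c0 to nest those lists and get the desired result. The code below does this.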
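import org.apache.spark.sql.functions.collect_list

// first group by "_c0" and "_c1", collecting the "_c2" values of each pair
val res = df.groupBy("_c0", "_c1").agg(collect_list("_c2").as("_c2"))
  // then group by "_c0", collecting the per-"_c1" lists
  .groupBy("_c0").agg(collect_list("_c2").as("_c2"))
  .select("_c2")
res.show(false)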
//output
//+---------------------------------------------------------+
//|_c2 |
//+---------------------------------------------------------+
//|[WrappedArray(1, 2), WrappedArray(5)] |
//|[WrappedArray(1, 2), WrappedArray(3)] |
//|[WrappedArray(6)] |
//|[WrappedArray(3, 2), WrappedArray(1, 2), WrappedArray(1)]|
//+---------------------------------------------------------+
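To make the two-level grouping easier to check without a cluster, here is a minimal sketch of the same idea on plain Scala collections. It is not the Spark code above: the object name `NestedGroupingSketch` and the function `nested` are made up for illustration, and the keys are sorted explicitly so the result order is deterministic, which the Spark output above does not promise.

```scala
object NestedGroupingSketch {
  // Same rows as the question's DataFrame, as plain tuples (_c0, _c1, _c2).
  val rows: Seq[(Int, Int, Int)] = Seq(
    (100, 1, 1), (100, 1, 2), (100, 2, 3),
    (200, 1, 1), (200, 2, 3), (200, 2, 2), (200, 3, 1), (200, 3, 2),
    (300, 1, 1), (300, 1, 2), (300, 2, 5),
    (400, 1, 6)
  )

  // Group by _c0, then by _c1 inside each group, keeping only the _c2 values.
  // Keys are sorted so the outer and middle levels come out in a fixed order.
  def nested(data: Seq[(Int, Int, Int)]): Array[Array[Array[Int]]] =
    data.groupBy(_._1).toSeq.sortBy(_._1).map { case (_, byC0) =>
      byC0.groupBy(_._2).toSeq.sortBy(_._1).map { case (_, byC1) =>
        byC1.map(_._3).toArray
      }.toArray
    }.toArray

  def main(args: Array[String]): Unit =
    // Print one line per _c0 group, e.g. [[1,2],[3]] for _c0 = 100.
    nested(rows).foreach { group =>
      println(group.map(_.mkString("[", ",", "]")).mkString("[", ",", "]"))
    }
}
```

For the question's data this yields the same nesting as the `res9` value the question asks for.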
To convert this into an RDD, call .rdd on the resulting DataFrame.
import scala.collection.mutable.WrappedArray
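// read the single column of each row and copy the nested WrappedArrays into plain arrays
val rdd = res.rdd.map(row => row.get(0)
  .asInstanceOf[WrappedArray[WrappedArray[Int]]].array.map(inner => inner.toArray))
// collect the contents of the RDD to the driver (avoid this if the data is too big)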
rdd.collect()
//output
//Array(Array(Array(1, 2), Array(5)), Array(Array(1, 2), Array(3)), Array(Array(6)), Array(Array(3, 2), Array(1, 2), Array(1)))
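As a possible alternative to the cast above, the conversion can be sketched with `Row.getSeq`, which avoids naming `WrappedArray` directly. The helper `toNested` and the object `RowToNestedArray` are hypothetical names, not part of the answer, and the sample `Row` is built by hand only to stand in for one row of `res`. This assumes spark-sql is on the classpath.

```scala
import org.apache.spark.sql.Row

object RowToNestedArray {
  // Hypothetical helper: reads column 0 as a Seq of Seqs and copies it
  // into plain arrays. scala.collection.Seq is spelled out so the element
  // type does not depend on which Seq the Scala version aliases by default.
  def toNested(row: Row): Array[Array[Int]] =
    row.getSeq[scala.collection.Seq[Int]](0).map(_.toArray).toArray

  def main(args: Array[String]): Unit = {
    // A hand-built Row standing in for one row of `res`.
    val sample = Row(Seq(Seq(1, 2), Seq(3)))
    println(toNested(sample).map(_.mkString("[", ",", "]")).mkString("[", ",", "]"))
  }
}
```

With the `res` DataFrame from the answer, this would be used as `res.rdd.map(RowToNestedArray.toNested)`.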