I currently have a DataFrame like this:
+------------+----+---+
|mac         |time|s  |
+------------+----+---+
|aaaaaaaaaaaa|11  |a  |
|aaaaaaaaaaaa|44  |c  |
|bbbbbbbbbbbb|22  |b  |
|aaaaaaaaaaaa|33  |a  |
+------------+----+---+
I want to use the .rdd method, group by the "mac" column, and sort by the "time" column. This is the result I am after:
res5: Array[(Any, Iterable[(Any, Any)])] = Array((aaaaaaaaaaaa,CompactBuffer((11,a),(33,a),(44,c))), (bbbbbbbbbbbb,CompactBuffer((22,b))))
I can already group by the "mac" column, but I still cannot sort by "time":
df.rdd.map(x=>(x(0),(x(1),x(2)))).groupByKey()
How can I do this?
Answer 0 (score: 0)
You can do the following:
scala> val df = Seq(
     |   ("aaaaaaaaaaaa", 11, "a"), ("aaaaaaaaaaaa", 44, "c"), ("bbbbbbbbbbbb", 22, "b"), ("aaaaaaaaaaaa", 33, "a")
     | ).toDF("mac", "time", "s")

scala> df.rdd.sortBy(_.getAs[Int]("time")).groupBy(_.apply(0)).collect
res38: Array[(Any, Iterable[org.apache.spark.sql.Row])] = Array((aaaaaaaaaaaa,CompactBuffer([aaaaaaaaaaaa,11,a], [aaaaaaaaaaaa,33,a], [aaaaaaaaaaaa,44,c])), (bbbbbbbbbbbb,CompactBuffer([bbbbbbbbbbbb,22,b])))

Note: sorting on `.apply(1).toString` would compare the times lexicographically (so 100 would sort before 9); `getAs[Int]("time")` sorts them numerically.
Thanks
Answer 1 (score: 0)
df.rdd.map(x => (x(0), (x(1), x(2)))).groupByKey()
  .mapValues(_.toSeq.sortBy(_._1.asInstanceOf[Int])) // sort each group's (time, s) pairs numerically by time
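For reference, here is a minimal sketch of the same group-then-sort logic on a plain Scala collection, so it runs without a Spark cluster. The data is the sample from the question; the variable names (`rows`, `grouped`) are illustrative, not part of the original answers.

```scala
// Sample rows matching the question's DataFrame: (mac, time, s)
val rows = Seq(
  ("aaaaaaaaaaaa", 11, "a"),
  ("aaaaaaaaaaaa", 44, "c"),
  ("bbbbbbbbbbbb", 22, "b"),
  ("aaaaaaaaaaaa", 33, "a")
)

// Group by mac, then sort each group's (time, s) pairs by time,
// mirroring groupByKey().mapValues(_.toSeq.sortBy(...)) on an RDD.
val grouped: Map[String, Seq[(Int, String)]] =
  rows
    .groupBy(_._1)                                  // key on the mac column
    .map { case (mac, rs) =>
      mac -> rs.map(r => (r._2, r._3)).sortBy(_._1) // sort by time
    }

println(grouped("aaaaaaaaaaaa")) // List((11,a), (33,a), (44,c))
```

The same shape carries over to the RDD version: `groupByKey` produces the per-mac `Iterable`, and `mapValues` applies the in-memory sort to each group.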