Applying groupBy and orderBy on a DataFrame in Scala

Asked: 2019-10-11 07:14:17

Tags: scala dataframe apache-spark apache-spark-sql

I need to sort the values of one column and group by another column in a DataFrame.

The data in the DataFrame looks like this:

+------------+---------+-----+
|      NUM_ID|    TIME |SIG_V|
+------------+---------+-----+
|XXXXX01     |167499000|55   |
|XXXXX02     |167499000|     |
|XXXXX01     |167503000|     |
|XXXXX02     |179810000| 81.0|
|XXXXX02     |179811000| 81.0|
|XXXXX01     |179833000|     |
|XXXXX02     |179833000|     |
|XXXXX02     |179841000| 81.0|
|XXXXX01     |179841000|     |
|XXXXX02     |179842000| 81.0|
|XXXXX03     |179843000| 87.0|
|XXXXX02     |179849000|     |
|XXXXX02     |179850000|     |
|XXXXX01     |179850000| 88.0|
|XXXXX01     |179857000|     |
|XXXXX01     |179858000|     |
|XXXXX01     |179865000|     |
|XXXXX03     |179865000|     |
|XXXXX02     |179870000|     |
|XXXXX02     |179871000| 11  |
+------------+---------+-----+

The data above is sorted by the TIME column.

My requirement is to group it by the NUM_ID column, as shown below:

+------------+---------+-----+
|      NUM_ID|    TIME |SIG_V|
+------------+---------+-----+
|XXXXX01     |167499000|55   |
|XXXXX01     |167503000|     |
|XXXXX01     |179833000|     |
|XXXXX01     |179841000|     |
|XXXXX01     |179850000| 88.0|
|XXXXX01     |179857000|     |
|XXXXX01     |179858000|     |
|XXXXX01     |179865000|     |
|XXXXX02     |167499000|     |
|XXXXX02     |179810000| 81.0|
|XXXXX02     |179811000| 81.0|
|XXXXX02     |179833000|     |
|XXXXX02     |179841000| 81.0|
|XXXXX02     |179842000| 81.0|
|XXXXX02     |179849000|     |
|XXXXX02     |179850000|     |
|XXXXX02     |179870000|     |
|XXXXX02     |179871000| 11  |
|XXXXX03     |179843000| 87.0|
|XXXXX03     |179865000|     |
+------------+---------+-----+

The NUM_ID column is now grouped, and within each NUM_ID the TIME column is sorted.

I tried applying groupBy and orderBy to the DataFrame, but it does not work:

val df2 =  df1.withColumn("SIG_V", col("SIG")).orderBy("TIME").groupBy("NUM_ID")

and got the following error when calling df2.show:

error: value orderBy is not a member of org.apache.spark.sql.RelationalGroupedDataset

Can anyone help me achieve this requirement?

1 Answer:

Answer 0 (score: 5)

You don't need groupBy here; just put both columns in orderBy:

scala> df.show()
+---+---+
| _1| _2|
+---+---+
|  1|  3|
|  2|  2|
|  1|  4|
|  1|  1|
|  2|  0|
|  1| 10|
|  2|  5|
+---+---+


scala> df.orderBy('_1,'_2).show()
+---+---+
| _1| _2|
+---+---+
|  1|  1|
|  1|  3|
|  1|  4|
|  1| 10|
|  2|  0|
|  2|  2|
|  2|  5|
+---+---+
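Applied to the asker's DataFrame, the fix would be a single call with the column names from the question, e.g. `df1.orderBy(col("NUM_ID"), col("TIME"))`. The two-key lexicographic ordering this relies on can be sketched without a Spark cluster using plain Scala collections (a minimal sketch with sample rows taken from the question; the `Option` for the nullable SIG_V column is an assumption):

```scala
// Rows as (NUM_ID, TIME, SIG_V) tuples -- a small sample of the question's data.
val rows = Seq(
  ("XXXXX01", 167499000L, Some(55.0)),
  ("XXXXX02", 167499000L, None),
  ("XXXXX01", 167503000L, None),
  ("XXXXX02", 179810000L, Some(81.0))
)

// Sort lexicographically: first by NUM_ID, then by TIME within each NUM_ID.
// This mirrors df.orderBy("NUM_ID", "TIME") in Spark SQL.
val sorted = rows.sortBy { case (numId, time, _) => (numId, time) }

sorted.foreach(println)
```

Note that in Spark itself `orderBy` on both columns triggers a total sort across partitions; if you only need rows ordered by TIME within each NUM_ID group, repartitioning by NUM_ID followed by `sortWithinPartitions` is typically cheaper.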