I need to sort on one column's values and group by another column in a DataFrame.
The data in the DataFrame looks like this:
+------------+---------+-----+
| NUM_ID| TIME |SIG_V|
+------------+---------+-----+
|XXXXX01 |167499000|55 |
|XXXXX02 |167499000| |
|XXXXX01 |167503000| |
|XXXXX02 |179810000| 81.0|
|XXXXX02 |179811000| 81.0|
|XXXXX01 |179833000| |
|XXXXX02 |179833000| |
|XXXXX02 |179841000| 81.0|
|XXXXX01 |179841000| |
|XXXXX02 |179842000| 81.0|
|XXXXX03 |179843000| 87.0|
|XXXXX02 |179849000| |
|XXXXX02 |179850000| |
|XXXXX01 |179850000| 88.0|
|XXXXX01 |179857000| |
|XXXXX01 |179858000| |
|XXXXX01 |179865000| |
|XXXXX03 |179865000| |
|XXXXX02 |179870000| |
|XXXXX02 |179871000| 11 |
+------------+---------+-----+
The data above is already sorted by the TIME column.
My requirement is to group the rows by the NUM_ID column, as shown below:
+------------+---------+-----+
| NUM_ID| TIME |SIG_V|
+------------+---------+-----+
|XXXXX01 |167499000|55 |
|XXXXX01 |167503000| |
|XXXXX01 |179833000| |
|XXXXX01 |179841000| |
|XXXXX01 |179850000| 88.0|
|XXXXX01 |179857000| |
|XXXXX01 |179858000| |
|XXXXX01 |179865000| |
|XXXXX02 |167499000| |
|XXXXX02 |179810000| 81.0|
|XXXXX02 |179811000| 81.0|
|XXXXX02 |179833000| |
|XXXXX02 |179841000| 81.0|
|XXXXX02 |179842000| 81.0|
|XXXXX02 |179849000|     |
|XXXXX02 |179850000|     |
|XXXXX02 |179870000| |
|XXXXX02 |179871000| 11 |
|XXXXX03 |179843000| 87.0|
|XXXXX03 |179865000| |
+------------+---------+-----+
The NUM_ID column is now grouped, and within each NUM_ID the TIME column is sorted.
I tried applying groupBy and orderBy to the DataFrame, but it did not work:
val df2 = df1.withColumn("SIG_V", col("SIG")).orderBy("TIME").groupBy("NUM_ID")
and got this error when calling df2.show:
error: value orderBy is not a member of org.apache.spark.sql.RelationalGroupedDataset
Can anyone help me achieve this?
Answer 0 (score: 5)
You don't need groupBy; just put both columns in orderBy:
scala> df.show()
+---+---+
| _1| _2|
+---+---+
| 1| 3|
| 2| 2|
| 1| 4|
| 1| 1|
| 2| 0|
| 1| 10|
| 2| 5|
+---+---+
scala> df.orderBy('_1,'_2).show()
+---+---+
| _1| _2|
+---+---+
| 1| 1|
| 1| 3|
| 1| 4|
| 1| 10|
| 2| 0|
| 2| 2|
| 2| 5|
+---+---+
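Why the original attempt fails: groupBy returns a RelationalGroupedDataset, which only becomes a DataFrame again after an aggregation such as agg or count, so it has no orderBy or show method. Applied to the question's own schema, here is a minimal self-contained sketch (the sample rows are abbreviated from the question's data, and the local SparkSession setup is an assumption added only so the snippet runs on its own):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// Local session only so the sketch runs standalone (assumption, not from the question).
val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// Abbreviated sample of the question's data; SIG is nullable.
val df1 = Seq(
  ("XXXXX01", 167499000L, Some(55.0)),
  ("XXXXX02", 167499000L, None),
  ("XXXXX02", 179810000L, Some(81.0)),
  ("XXXXX01", 179833000L, None),
  ("XXXXX03", 179843000L, Some(87.0))
).toDF("NUM_ID", "TIME", "SIG")

// No groupBy needed: sorting on both columns orders by NUM_ID first,
// then by TIME within each NUM_ID.
val df2 = df1.withColumn("SIG_V", col("SIG")).orderBy("NUM_ID", "TIME")
df2.show()

df2.show() then prints the rows grouped by NUM_ID with TIME ascending within each group, which matches the expected output in the question.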