I need to add a sequence ID within each group of my group-by columns. The original dataset looks like this:
+------------+-----------------+-------+
| appId| rpc|elapsed|
+------------+-----------------+-------+
| account|/rpc1 | 7|
| service|/rpc4 | 0|
| service|/rpc5 | 0|
| account|/rpc1 | 78|
| api|/rpc2 | 87|
| api|/rpc2 | 52|
| service|/rpc4 | 0|
| api|/rpc3 | 52|
| service|/rpc4 | 1|
| service|/rpc4 | 0|
| service|/rpc5 | 0|
+------------+-----------------+-------+
After running dataset.select("appId", "rpc", "elapsed").orderBy("appId", "rpc", "elapsed").show, I get:
+------------+-----------------+-------+
| appId| rpc|elapsed|
+------------+-----------------+-------+
| account|/rpc1 | 7|
| account|/rpc1 | 78|
| api|/rpc2 | 87|
| api|/rpc2 | 52|
| api|/rpc3 | 52|
| service|/rpc4 | 0|
| service|/rpc4 | 1|
| service|/rpc4 | 0|
| service|/rpc4 | 0|
| service|/rpc5 | 0|
| service|/rpc5 | 0|
+------------+-----------------+-------+
I want to add an id column to the grouped result, numbered from 1 within each (appId, rpc) group, like this:
+------------+-----------------+-------+---+
| appId| rpc|elapsed| id|
+------------+-----------------+-------+---+
| account|/rpc1 | 7| 1|
| account|/rpc1 | 78| 2|
| api|/rpc2 | 87| 1|
| api|/rpc2 | 52| 2|
| api|/rpc3 | 52| 1|
| service|/rpc4 | 0| 1|
| service|/rpc4 | 1| 2|
| service|/rpc4 | 0| 3|
| service|/rpc4 | 0| 4|
| service|/rpc5 | 0| 1|
| service|/rpc5 | 0| 2|
+------------+-----------------+-------+---+
How can I achieve this?
Answer 0 (score: 1)
You can use a window function to create an ID like this:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number
dataset
  // number the rows 1, 2, 3, ... within each (appId, rpc) partition, ordered by elapsed
  .withColumn("id", row_number().over(Window.partitionBy("appId", "rpc").orderBy("elapsed")))
  .show
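To make the semantics of the window concrete, here is a minimal plain-Scala sketch (no Spark) that mimics row_number() over partitionBy("appId", "rpc").orderBy("elapsed") on an in-memory list. The names SequenceIdSketch, Rpc, RpcWithId and addSequenceId are hypothetical and exist only for illustration; this is not how Spark implements it internally.

```scala
object SequenceIdSketch {
  // Hypothetical row types mirroring the columns in the question.
  final case class Rpc(appId: String, rpc: String, elapsed: Int)
  final case class RpcWithId(appId: String, rpc: String, elapsed: Int, id: Int)

  // Emulates row_number() over a window partitioned by (appId, rpc)
  // and ordered by elapsed, using ordinary collections.
  def addSequenceId(rows: Seq[Rpc]): Seq[RpcWithId] =
    rows
      .groupBy(r => (r.appId, r.rpc))        // one group per (appId, rpc) partition
      .toSeq
      .sortBy { case (key, _) => key }       // fixed group order, for display only
      .flatMap { case (_, group) =>
        group
          .sortBy(_.elapsed)                 // ordering inside the partition
          .zipWithIndex                      // 0-based position within the group
          .map { case (r, i) => RpcWithId(r.appId, r.rpc, r.elapsed, i + 1) } // ids start at 1
      }

  def main(args: Array[String]): Unit = {
    // A few rows taken from the question's dataset.
    val sample = Seq(
      Rpc("account", "/rpc1", 7),
      Rpc("service", "/rpc4", 0),
      Rpc("account", "/rpc1", 78),
      Rpc("api", "/rpc2", 87),
      Rpc("api", "/rpc2", 52),
      Rpc("service", "/rpc4", 1)
    )
    addSequenceId(sample).foreach(println)
  }
}
```

Note that ordering by elapsed gives id 1 to the smallest elapsed value in each group, so for api /rpc2 the row with 52 is numbered before the row with 87. Rows with equal elapsed values have no defined relative order in the Spark window, so add another column to orderBy if you need a reproducible tie-break.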