Input DataFrame:
+----+-------+
|  Id|  value|
+----+-------+
|1622| 139685|
|1622| 182118|
|1622| 127955|
|3837|3224815|
|1622| 727761|
|1622| 155875|
|3837|1504923|
|1622| 139684|
|1453| 536111|
+----+-------+
Expected output:
+----+-------------------------------------------+
|Id  |value                                      |
+----+-------------------------------------------+
|1622|[139685,182118,127955,727761,155875,139684]|
|1453|536111                                     |
|3837|[3224815,1504923]                          |
+----+-------------------------------------------+
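When a particular id has multiple values, they should be collected into an array; otherwise the single value should be kept as is, without the square brackets [].
I tried a linked solution, but could not handle the if-else condition on the DataFrame.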
Answer 0 (score: 2)
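Use a window function to count the rows per id, then handle the single-value and multi-value ids separately and union the results: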
scala> import org.apache.spark.sql.expressions.Window
scala> val df = Seq((1622, 139685),(1622, 182118),(1622, 127955),(3837,3224815),(1622, 727761),(1622, 155875),(3837,1504923),(1622, 139684),(1453, 536111)).toDF("id","value")
scala> df.show()
+----+-------+
| id| value|
+----+-------+
|1622| 139685|
|1622| 182118|
|1622| 127955|
|3837|3224815|
|1622| 727761|
|1622| 155875|
|3837|1504923|
|1622| 139684|
|1453| 536111|
+----+-------+
scala> // r = number of rows that share the same id
scala> val df1 = df.withColumn("r", count($"id").over(Window.partitionBy("id")).cast("int"))
scala> df1.show()
+----+-------+---+
| id| value| r|
+----+-------+---+
|1453| 536111| 1|
|1622| 139685| 6|
|1622| 182118| 6|
|1622| 127955| 6|
|1622| 727761| 6|
|1622| 155875| 6|
|1622| 139684| 6|
|3837|3224815| 2|
|3837|1504923| 2|
+----+-------+---+
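scala> // ids with one row keep the bare value (cast to string); the rest are collected into a list rendered as a string
scala> val df2 = df1.filter($"r" === 1).drop("r").withColumn("value", $"value".cast("string")).union(df1.filter($"r" =!= 1).groupBy("id").agg(collect_list($"value").cast("string").as("value")))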
scala> df2.show(false)
+----+------------------------------------------------+
|id |value |
+----+------------------------------------------------+
|1453|536111 |
|1622|[139685, 182118, 127955, 727761, 155875, 139684]|
|3837|[3224815, 1504923] |
+----+------------------------------------------------+
scala> df2.printSchema
root
|-- id: integer (nullable = false)
|-- value: string (nullable = true)
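Note that the value column is a string, because a single column cannot hold both integers and arrays. If the window and the union feel heavy, the same if-else can probably be expressed in one aggregation with when/otherwise on the size of the collected list. This is only a minimal sketch, assuming Spark 2.4 or later (where casting an array to string gives the bracketed form shown above); the object name CollectOrSingle and the helper collapse are hypothetical names chosen for illustration, and the order of elements inside each list is not guaranteed.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, collect_list, size, when}

// Hypothetical helper object, not part of Spark.
object CollectOrSingle {

  // Collapse each id to one row. Ids with a single value keep the bare value;
  // ids with several values get the collected list rendered as a string.
  // Expects columns "id" and "value".
  def collapse(df: DataFrame): DataFrame =
    df.groupBy("id")
      .agg(collect_list(col("value")).as("values"))
      .select(
        col("id"),
        when(size(col("values")) === 1, col("values").getItem(0).cast("string"))
          .otherwise(col("values").cast("string"))
          .as("value")
      )

  def main(args: Array[String]): Unit = {
    // Local session for demonstration only.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("collect-or-single")
      .getOrCreate()
    import spark.implicits._

    val df = Seq(
      (1622, 139685), (1622, 182118), (1622, 127955),
      (3837, 3224815), (1622, 727761), (1622, 155875),
      (3837, 1504923), (1622, 139684), (1453, 536111)
    ).toDF("id", "value")

    collapse(df).show(false)
    spark.stop()
  }
}
```

In spark-shell, calling CollectOrSingle.collapse(df) on the df defined above should give one row per id, with value again typed as string.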
Let me know if you have any questions about this.