Spark DataFrame: aggregate values by key

Date: 2019-10-31 09:26:40

Tags: scala dataframe apache-spark apache-spark-sql

Input DataFrame:

+-----------------+-------+
|Id               | value |
+-----------------+-------+
|             1622| 139685|
|             1622| 182118|
|             1622| 127955|
|             3837|3224815|
|             1622| 727761|
|             1622| 155875|
|             3837|1504923|
|             1622| 139684|
|             1453| 536111|
+-----------------+-------+

Expected output:

+-----------------+--------------------------------------------+
|Id               | value                                      |
+-----------------+--------------------------------------------+
|             1622|[139685,182118,127955,727761,155875,139684] |
|             1453| 536111                                     |
|             3837|[3224815,1504923]                           |
+-----------------+--------------------------------------------+


When a given id has multiple values, they should be collected into an array; otherwise the single value should be kept as-is, without the square brackets [].

I tried the solution from the link below, but could not handle the if-else condition on the DataFrame.

Link: Spark DataFrame aggregate column values by key into List

1 answer:

Answer 0 (score: 2)


Use a window function:

scala> import org.apache.spark.sql.expressions.Window
scala> val df = Seq((1622, 139685),(1622, 182118),(1622, 127955),(3837,3224815),(1622, 727761),(1622, 155875),(3837,1504923),(1622, 139684),(1453, 536111)).toDF("id","value")

scala> df.show()
+----+-------+
|  id|  value|
+----+-------+
|1622| 139685|
|1622| 182118|
|1622| 127955|
|3837|3224815|
|1622| 727761|
|1622| 155875|
|3837|1504923|
|1622| 139684|
|1453| 536111|
+----+-------+

scala> val df1 = df.withColumn("r", count($"id").over(Window.partitionBy("id")).cast("int"))

scala> df1.show()
+----+-------+---+
|  id|  value|  r|
+----+-------+---+
|1453| 536111|  1|
|1622| 139685|  6|
|1622| 182118|  6|
|1622| 127955|  6|
|1622| 727761|  6|
|1622| 155875|  6|
|1622| 139684|  6|
|3837|3224815|  2|
|3837|1504923|  2|
+----+-------+---+
scala> val df2 = df1.filter('r === 1).drop("r").union(df1.filter('r =!= 1).groupBy("id").agg(collect_list($"value").cast("string").as("value")))


scala> df2.show(false)
+----+------------------------------------------------+
|id  |value                                           |
+----+------------------------------------------------+
|1453|536111                                          |
|1622|[139685, 182118, 127955, 727761, 155875, 139684]|
|3837|[3224815, 1504923]                              |
+----+------------------------------------------------+
scala> df2.printSchema
root
 |-- id: integer (nullable = false)
 |-- value: string (nullable = true)
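As an alternative to the window-plus-union approach above, a single `groupBy` can produce the same result: collect every group into an array, then use `when`/`size` to render one-element groups as a bare value and larger groups as an array string. This is a minimal sketch, not from the original answer; the local `SparkSession` setup and the intermediate column name `values` are illustrative assumptions:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().master("local[1]").appName("agg-demo").getOrCreate()
import spark.implicits._

val df = Seq((1622, 139685), (1622, 182118), (1622, 127955), (3837, 3224815),
  (1622, 727761), (1622, 155875), (3837, 1504923), (1622, 139684), (1453, 536111))
  .toDF("id", "value")

// One pass: collect all values per id, then render single-element
// groups without brackets and multi-element groups as an array string.
val result = df.groupBy("id")
  .agg(collect_list($"value").as("values"))
  .withColumn("value",
    when(size($"values") === 1, $"values"(0).cast("string"))
      .otherwise($"values".cast("string")))
  .drop("values")

result.show(false)
```

Note that `collect_list` does not guarantee element order within a group, so the array ordering may differ from the input order.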

Let me know if you have any questions about this.