Group by to concatenate strings without collect_list / collect_set - Spark

Date: 2018-03-28 09:18:35

Tags: scala apache-spark apache-spark-sql spark-dataframe

I have the following dataframe:

+------------------------------------+------------------------------+
|MeteVarID                           |Conc                          |
+------------------------------------+------------------------------+
|9d71445e-ee5d-4d37-bfb7-02f6e6eacd9d|Friday 0 0.9604490986400536   |
|9d71445e-ee5d-4d37-bfb7-02f6e6eacd9d|Friday 1 0.8109076852795446   |
|9d71445e-ee5d-4d37-bfb7-02f6e6eacd9d|Friday 2 0.7282039568471731   |
|9d71445e-ee5d-4d37-bfb7-02f6e6eacd9d|Friday 3 0.5335418350493728   |
+------------------------------------+------------------------------+

I want to group by MeteVarID and concatenate the strings. The final dataframe should be:

9d71445e-ee5d-4d37-bfb7-02f6e6eacd9d | Friday 0 0.9604490986400536, Friday 1 0.8109076852795446, etc.
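
For reference, here is a minimal sketch of how such a dataframe could be built (the SparkSession value named spark is an assumption for illustration, not part of the question):

import spark.implicits._  // assumes a SparkSession named spark

val df = Seq(
  ("9d71445e-ee5d-4d37-bfb7-02f6e6eacd9d", "Friday 0 0.9604490986400536"),
  ("9d71445e-ee5d-4d37-bfb7-02f6e6eacd9d", "Friday 1 0.8109076852795446"),
  ("9d71445e-ee5d-4d37-bfb7-02f6e6eacd9d", "Friday 2 0.7282039568471731"),
  ("9d71445e-ee5d-4d37-bfb7-02f6e6eacd9d", "Friday 3 0.5335418350493728")
).toDF("MeteVarID", "Conc")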

1 Answer:

Answer 0 (score: -1)

You can use the plain RDD API and then switch back to a dataframe (note that .toDF on an RDD of tuples requires import spark.implicits._):

// import spark.implicits._ is needed for .toDF on an RDD of tuples
df.rdd
  .map(c => (c.getAs[String]("MeteVarID"), c.getAs[String]("Conc"))) // key by MeteVarID
  .reduceByKey(_ + ", " + _)  // concatenate the Conc values per key
  .toDF("MeteVarID", "Conc")
  .show(false)

+------------------------------------+------------------------------------------------------------------------------------------------------------------+
|MeteVarID                           |Conc                                                                                                              |
+------------------------------------+------------------------------------------------------------------------------------------------------------------+
|9d71445e-ee5d-4d37-bfb7-02f6e6eacd9d|Friday 0 0.9604490986400536, Friday 1 0.8109076852795446, Friday 2 0.7282039568471731, Friday 3 0.5335418350493728|
+------------------------------------+------------------------------------------------------------------------------------------------------------------+
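
One caveat with this approach: reduceByKey makes no guarantee about the order in which values for a key are combined, so the concatenated string may not preserve the Friday 0, 1, 2, 3 order shown above. As an alternative sketch (assuming the same df and import spark.implicits._ as above), the typed Dataset API can do the same aggregation while sorting the values explicitly before joining them:

df.as[(String, String)]  // both columns are strings
  .groupByKey(_._1)      // group by MeteVarID
  .mapGroups { (key, rows) =>
    // sort the Conc values, then join them with ", "
    (key, rows.map(_._2).toSeq.sorted.mkString(", "))
  }
  .toDF("MeteVarID", "Conc")
  .show(false)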