Spark SQL: DataFrame aggregate functions on multiple columns with different operations

Date: 2017-03-31 08:57:05

Tags: dataframe rdd pyspark-sql

When doing a group-by, is there any way to apply an aggregate function that concatenates, or appends to, lists of tuples in a DataFrame column?

My DataFrame looks like this:

+--------+-----+-------------+----------------+
|WindowID|State|         City|         Details|
+--------+-----+-------------+----------------+
|       1|   IA|         Ames|   [(524292, 2)]|
|       6|   PA|  Bala Cynwyd|       [(6, 48)]|
|       7|   AL|   Birmingham|  [(1048584, 6)]|
|       1|   FL|      Orlando|      [(18, 27)]|
|       7|   TN|    Nashville|  [(1048608, 9)]|
+--------+-----+-------------+----------------+

My goal is to group the rows that share the same value in 'WindowID', merging the contents of the 'State' and 'City' columns into lists of strings and the contents of the 'Details' column into a list of tuples.

The result should look like this:

+--------+---------+------------------------+-----------------------------+
|WindowID|    State|                    City|                      Details|
+--------+---------+------------------------+-----------------------------+
|       1| [IA, FL]|         [Ames, Orlando]|      [(524292, 2), (18, 27)]|
|       6|     [PA]|           [Bala Cynwyd]|                    [(6, 48)]|
|       7| [AL, TN]| [Birmingham, Nashville]| [(1048584, 6), (1048608, 9)]|
+--------+---------+------------------------+-----------------------------+

My code is:

sqlc = SQLContext(sc)
df = sqlc.createDataFrame(rdd, ['WindowID', 'State', 'City', 'Details'])
df1 = df.groupBy('WindowID').agg(...)  # Here I want to do the merge operation.

How can I do this in Python with a Spark SQL DataFrame?

1 Answer:

Answer 0 (score: 3):

Create data for the input DataFrame:

data = [(1, 'IA', 'Ames', (524292, 2)),
        (6, 'PA', 'Bala Cynwyd', (6, 48)),
        (7, 'AL', 'Birmingham', (1048584, 6)),
        (1, 'FL', 'Orlando', (18, 27)),
        (7, 'TN', 'Nashville', (1048608, 9))]
table = sqlContext.createDataFrame(data, ['WindowId', 'State', 'City', 'Details'])
table.show()
+--------+-----+-----------+-----------+
|WindowId|State|       City|    Details|
+--------+-----+-----------+-----------+
|       1|   IA|       Ames| [524292,2]|
|       6|   PA|Bala Cynwyd|     [6,48]|
|       7|   AL| Birmingham|[1048584,6]|
|       1|   FL|    Orlando|    [18,27]|
|       7|   TN|  Nashville|[1048608,9]|
+--------+-----+-----------+-----------+
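
Note that the Python tuples in 'Details' are inferred as Spark struct columns, which is why they render as [524292,2] above. A quick sanity check; the commented schema below is what Spark's type inference should produce for this data:

table.printSchema()
# root
#  |-- WindowId: long (nullable = true)
#  |-- State: string (nullable = true)
#  |-- City: string (nullable = true)
#  |-- Details: struct (nullable = true)
#  |    |-- _1: long (nullable = true)
#  |    |-- _2: long (nullable = true)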

Use the collect_list aggregate function:

from pyspark.sql.functions import collect_list
table.groupby('WindowId').agg(collect_list('State').alias('State'),
                              collect_list('City').alias('City'),
                              collect_list('Details').alias('Details')).show()
+--------+--------+--------------------+--------------------+
|WindowId|   State|                City|             Details|
+--------+--------+--------------------+--------------------+
|       1|[FL, IA]|     [Orlando, Ames]|[[18,27], [524292...|
|       6|    [PA]|       [Bala Cynwyd]|            [[6,48]]|
|       7|[AL, TN]|[Birmingham, Nash...|[[1048584,6], [10...|
+--------+--------+--------------------+--------------------+
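
One caveat: collect_list makes no ordering guarantee, which is why the output shows [FL, IA] rather than the [IA, FL] in the expected result. The three lists are filled from the same row stream, so corresponding positions should still line up, but if you want that pairing made explicit, a sketch (not part of the original answer) is to collect all three columns as a single struct per input row:

from pyspark.sql.functions import collect_list, struct

# One struct per input row, so State, City and Details stay paired by construction.
grouped = table.groupby('WindowId').agg(
    collect_list(struct('State', 'City', 'Details')).alias('Rows'))

Back on the driver, each collected struct comes out as a Row, so tuple(...) recovers plain Python tuples from the 'Details' entries if needed.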