I have a PySpark DataFrame in the following format:
+------------+---------+
| date | query |
+------------+---------+
| 2011-08-11 | Query 1 |
| 2011-08-11 | Query 1 |
| 2011-08-11 | Query 2 |
| 2011-08-12 | Query 3 |
| 2011-08-12 | Query 3 |
| 2011-08-13 | Query 1 |
+------------+---------+
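For reference, the sample data above can be reproduced with something like this (a minimal sketch; the SparkSession variable spark is an assumption, and date is kept as a string for simplicity):

df = spark.createDataFrame(
    [('2011-08-11', 'Query 1'),
     ('2011-08-11', 'Query 1'),
     ('2011-08-11', 'Query 2'),
     ('2011-08-12', 'Query 3'),
     ('2011-08-12', 'Query 3'),
     ('2011-08-13', 'Query 1')],
    ['date', 'query'])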
I need to transform it so that each unique query becomes its own column, grouped by date, with the count of each query filled into the rows. I'd like the output to look like this:
+------------+---------+---------+---------+
| date | Query 1 | Query 2 | Query 3 |
+------------+---------+---------+---------+
| 2011-08-11 | 2 | 1 | 0 |
| 2011-08-12 | 0 | 0 | 2 |
| 2011-08-13 | 1 | 0 | 0 |
+------------+---------+---------+---------+
I tried using this answer as an example, but I don't really understand the code, especially the return statement in the make_row function.
Is there a way to count the queries while transforming the DataFrame? Maybe something like:
import pyspark.sql.functions as func

grouped = (df
           .map(lambda row: (row.date, (row.query, func.count(row.query))))  # Just an example. Not sure how to do this.
           .groupByKey())
The DataFrame may contain hundreds of thousands of rows and queries, so I would prefer an RDD/DataFrame-based approach over using .collect().
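To show the direction I was thinking of, here is a rough sketch of counting per (date, query) pair directly on the RDD (just to illustrate the intent; I still don't know how to turn the counts into columns):

counts = (df.rdd
          .map(lambda row: ((row.date, row.query), 1))   # key by (date, query)
          .reduceByKey(lambda a, b: a + b))              # ((date, query), count) pairs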
Thanks!
Answer 0 (score: 2)
You can use groupBy.pivot with count as the aggregation function:
from pyspark.sql.functions import count

# pivot creates one column per distinct query, count fills in the values, and na.fill(0) replaces the nulls for missing date/query combinations
df.groupBy('date').pivot('query').agg(count('query')).na.fill(0).orderBy('date').show()
+--------------------+-------+-------+-------+
| date|Query 1|Query 2|Query 3|
+--------------------+-------+-------+-------+
|2011-08-11 00:00:...| 2| 1| 0|
|2011-08-12 00:00:...| 0| 0| 2|
|2011-08-13 00:00:...| 1| 0| 0|
+--------------------+-------+-------+-------+
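Note that the date column above prints as a timestamp (2011-08-11 00:00:...). If you want plain dates as in your desired output, one option (a sketch, assuming the column is a timestamp or a date-like string) is to truncate it with to_date before pivoting:

from pyspark.sql.functions import count, to_date

(df.withColumn('date', to_date('date'))   # keep only the calendar date
   .groupBy('date')
   .pivot('query')
   .agg(count('query'))
   .na.fill(0)
   .orderBy('date')
   .show())

Also, since you mention a very large number of queries: pivot can take an explicit list of values, e.g. pivot('query', queries), which skips the extra job Spark otherwise runs to discover the distinct values.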