I have a PySpark dataframe, a small portion of which is shown below:
+------+-----+-------------------+-----+
| name| type| timestamp|score|
+------+-----+-------------------+-----+
| name1|type1|2012-01-10 00:00:00| 11|
| name1|type1|2012-01-10 00:00:10| 14|
| name1|type1|2012-01-10 00:00:20| 2|
| name1|type1|2012-01-10 00:00:30| 3|
| name1|type1|2012-01-10 00:00:40| 55|
| name1|type1|2012-01-10 00:00:50| 10|
| name5|type1|2012-01-10 00:01:00| 5|
| name2|type2|2012-01-10 00:01:10| 8|
| name5|type1|2012-01-10 00:01:20| 1|
|name10|type1|2012-01-10 00:01:30| 12|
|name11|type3|2012-01-10 00:01:40| 512|
+------+-----+-------------------+-----+
For a selected time window (for example, a 1-week window), I want to find out how many score values (num_values_week) each name has. That is: how many score values does name1 have between 2012-01-10 and 2012-01-16, then between 2012-01-16 and 2012-01-23, and so on (and likewise for every other name, such as name2, etc.)?
I want to cast this information into a new PySpark dataframe with the columns name, type and num_values_week. How can I do that?
The PySpark dataframe given above can be created using a snippet like the following:
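# A minimal reconstruction of the creation snippet, inferred from the sample
# rows shown in the table above (column names and values are taken from it).
from pyspark.sql import SparkSession
from pyspark.sql.functions import to_timestamp

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [
        ("name1", "type1", "2012-01-10 00:00:00", 11),
        ("name1", "type1", "2012-01-10 00:00:10", 14),
        ("name1", "type1", "2012-01-10 00:00:20", 2),
        ("name1", "type1", "2012-01-10 00:00:30", 3),
        ("name1", "type1", "2012-01-10 00:00:40", 55),
        ("name1", "type1", "2012-01-10 00:00:50", 10),
        ("name5", "type1", "2012-01-10 00:01:00", 5),
        ("name2", "type2", "2012-01-10 00:01:10", 8),
        ("name5", "type1", "2012-01-10 00:01:20", 1),
        ("name10", "type1", "2012-01-10 00:01:30", 12),
        ("name11", "type3", "2012-01-10 00:01:40", 512),
    ],
    ["name", "type", "timestamp", "score"],
).withColumn("timestamp", to_timestamp("timestamp"))  # parse the string column into a real timestamp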
Answer 0 (score: 1)
Something like this:
from pyspark.sql.functions import weekofyear, count

df = df.withColumn("week_nr", weekofyear(df.timestamp))  # derive the calendar week number from the timestamp
# count the score rows per (week, name); the alias gives the column name asked for in the question
result = df.groupBy(["week_nr", "name"]).agg(count("score").alias("num_values_week"))
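Note that weekofyear buckets rows by ISO calendar week, so the weeks will not start exactly at 2012-01-10. If the weeks must be anchored at an arbitrary start date, a tumbling time window is closer to what was asked. The sketch below is one way to do that (it assumes epoch-aligned windows are acceptable; window() also takes a startTime argument to shift the alignment), and it carries type through to the result as the question requested:

from pyspark.sql.functions import window, count

weekly = (
    df.groupBy(window("timestamp", "1 week"), "name", "type")  # 7-day tumbling windows
      .agg(count("score").alias("num_values_week"))            # rows per window and name
)
weekly.show(truncate=False)  # the window column is a struct with start and end timestamps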