Finding the number of rows for a given week in PySpark

Time: 2019-11-06 13:47:13

Tags: python pandas pyspark pyspark-sql pyspark-dataframes

I have a PySpark dataframe, a small portion of which looks like this:

+------+-----+-------------------+-----+
|  name| type|          timestamp|score|
+------+-----+-------------------+-----+
| name1|type1|2012-01-10 00:00:00|   11|
| name1|type1|2012-01-10 00:00:10|   14|
| name1|type1|2012-01-10 00:00:20|    2|
| name1|type1|2012-01-10 00:00:30|    3|
| name1|type1|2012-01-10 00:00:40|   55|
| name1|type1|2012-01-10 00:00:50|   10|
| name5|type1|2012-01-10 00:01:00|    5|
| name2|type2|2012-01-10 00:01:10|    8|
| name5|type1|2012-01-10 00:01:20|    1|
|name10|type1|2012-01-10 00:01:30|   12|
|name11|type3|2012-01-10 00:01:40|  512|
+------+-----+-------------------+-----+

For a selected time window (for example, a window of 1 week), I want to find out how many score values (call this num_values_week) there are for each name. That is, how many score values are there for name1 between 2012-01-10 - 2012-01-16, then between 2012-01-16 - 2012-01-23, and so on (and likewise for all the other names, such as name2, etc.).

I would like to transform this information into a new PySpark dataframe with the columns name, type, and num_values_week. How can I do that?

The PySpark dataframe given above can be created using the following snippet:

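A minimal snippet that reproduces the sample dataframe, assuming a local SparkSession; the rows are copied from the table shown above:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Rows copied from the sample table above
data = [
    ("name1",  "type1", "2012-01-10 00:00:00", 11),
    ("name1",  "type1", "2012-01-10 00:00:10", 14),
    ("name1",  "type1", "2012-01-10 00:00:20", 2),
    ("name1",  "type1", "2012-01-10 00:00:30", 3),
    ("name1",  "type1", "2012-01-10 00:00:40", 55),
    ("name1",  "type1", "2012-01-10 00:00:50", 10),
    ("name5",  "type1", "2012-01-10 00:01:00", 5),
    ("name2",  "type2", "2012-01-10 00:01:10", 8),
    ("name5",  "type1", "2012-01-10 00:01:20", 1),
    ("name10", "type1", "2012-01-10 00:01:30", 12),
    ("name11", "type3", "2012-01-10 00:01:40", 512),
]

df = spark.createDataFrame(data, ["name", "type", "timestamp", "score"])
df = df.withColumn("timestamp", df.timestamp.cast("timestamp"))  # parse the string into a timestamp column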

1 Answer:

Answer 0: (Score: 1)

Something like this:

from pyspark.sql.functions import weekofyear, count

df = df.withColumn("week_nr", weekofyear(df.timestamp))    # create the week number first
result = df.groupBy("week_nr", "name").agg(count("score")) # for every week, see how many rows there are per name
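The column produced by count("score") will be named count(score), and the question also asks for the type column and a column called num_values_week. A possible variant covering those details, which uses the built-in window function for 7-day windows anchored on 2012-01-10 rather than ISO calendar weeks (the startTime="5 days" offset is my own calculation and an assumption), could look like this:

from pyspark.sql.functions import window, count

# 7-day tumbling windows; startTime shifts the window boundaries so that
# they fall on 2012-01-10, 2012-01-17, ... instead of the default epoch alignment
result = (
    df.groupBy(window("timestamp", "7 days", startTime="5 days"), "name", "type")
      .agg(count("score").alias("num_values_week"))
)
result.show(truncate=False)

The window column is a struct with start and end fields, so if a plain column is preferred the window start can be selected with something like result.select("window.start", "name", "type", "num_values_week").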