I have a dataset consisting of a timestamp column and a dollars column. I would like to find the average number of dollars per row, over the period ending at each row's timestamp. I initially looked at the pyspark.sql.functions.window function, but it bins the data by week.
Here's an example:
%pyspark
import datetime
from pyspark.sql import functions as F

df1 = sc.parallelize([(17, "2017-03-11T15:27:18+00:00"),
                      (13, "2017-03-11T12:27:18+00:00"),
                      (21, "2017-03-17T11:27:18+00:00")]).toDF(["dollars", "datestring"])
df2 = df1.withColumn('timestampGMT', df1.datestring.cast('timestamp'))

w = df2.groupBy(F.window("timestampGMT", "7 days")).agg(F.avg("dollars").alias('avg'))
w.select(w.window.start.cast("string").alias("start"), w.window.end.cast("string").alias("end"), "avg").collect()
This results in two records:
| start                 | end                   | avg  |
|-----------------------|-----------------------|------|
| '2017-03-16 00:00:00' | '2017-03-23 00:00:00' | 21.0 |
| '2017-03-09 00:00:00' | '2017-03-16 00:00:00' | 15.0 |
The window function binned the time series data into weekly buckets rather than performing a rolling average.
Is there a way to perform a rolling average so that I get back a weekly average for each row, with the time period ending at that row's timestampGMT?
EDIT
Zhang's answer below is close to what I want, but not exactly what I'd like to see.
Here's a better example to show what I'm trying to get at:
%pyspark
from pyspark.sql import functions as F
from pyspark.sql.window import Window

df = spark.createDataFrame([(17, "2017-03-10T15:27:18+00:00"),
                            (13, "2017-03-15T12:27:18+00:00"),
                            (25, "2017-03-18T11:27:18+00:00")],
                           ["dollars", "timestampGMT"])
df = df.withColumn('timestampGMT', df.timestampGMT.cast('timestamp'))
df = df.withColumn('rolling_average', F.avg("dollars").over(Window.partitionBy(F.window("timestampGMT", "7 days"))))
This results in the following dataframe:
dollars timestampGMT rolling_average
25 2017-03-18 11:27:18.0 25
17 2017-03-10 15:27:18.0 15
13 2017-03-15 12:27:18.0 15
I want the average to be over the week preceding the date in the timestampGMT column, which would give this result:
dollars timestampGMT rolling_average
17 2017-03-10 15:27:18.0 17
13 2017-03-15 12:27:18.0 15
25 2017-03-18 11:27:18.0 19
In the above result, the rolling_average for 2017-03-10 is 17, since there are no preceding records. The rolling_average for 2017-03-15 is 15, since it averages the 13 from 2017-03-15 and the 17 from 2017-03-10, which falls within the preceding 7-day window. The rolling_average for 2017-03-18 is 19, since it averages the 25 from 2017-03-18 and the 13 from 2017-03-15, which falls within the preceding 7-day window, and it does not include the 17 from 2017-03-10 because that does not fall within the preceding 7-day window.
Is there a way to do this, rather than the binning approach where the weekly windows don't overlap?
Answer 0 (score: 17)
I figured out the correct way to calculate a moving/rolling average using this Stack Overflow post:
Spark Window Functions - rangeBetween dates
The basic idea is to convert your timestamp column to seconds, and then you can use the rangeBetween function in the pyspark.sql.Window class to include the correct rows in your window.
Here's the solved example:
%pyspark
from pyspark.sql import functions as F
from pyspark.sql.window import Window

#function to calculate number of seconds from number of days
days = lambda i: i * 86400

df = spark.createDataFrame([(17, "2017-03-10T15:27:18+00:00"),
                            (13, "2017-03-15T12:27:18+00:00"),
                            (25, "2017-03-18T11:27:18+00:00")],
                           ["dollars", "timestampGMT"])
df = df.withColumn('timestampGMT', df.timestampGMT.cast('timestamp'))

#create window by casting timestamp to long (number of seconds)
w = (Window.orderBy(F.col("timestampGMT").cast('long')).rangeBetween(-days(7), 0))

df = df.withColumn('rolling_average', F.avg("dollars").over(w))
This results in the exact column of rolling averages that I was looking for:
dollars timestampGMT rolling_average
17 2017-03-10 15:27:18.0 17.0
13 2017-03-15 12:27:18.0 15.0
25 2017-03-18 11:27:18.0 19.0
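As a side note, the same window should also be expressible with F.unix_timestamp, which returns the number of seconds since the epoch for a timestamp column, instead of the explicit cast to long. A small sketch reusing the df, the days helper, and the imports from the example above:
# equivalent window: unix_timestamp() yields seconds, just like casting the timestamp column to long
w_alt = Window.orderBy(F.unix_timestamp("timestampGMT")).rangeBetween(-days(7), 0)
df.withColumn('rolling_average', F.avg("dollars").over(w_alt)).show(truncate=False)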
Answer 1 (score: 2)
I'll add a variation that I personally found very useful. I hope someone else finds it useful too:
If you want to group, and then calculate the moving average within the respective groups:
Example dataframe:
from pyspark.sql.window import Window
from pyspark.sql import functions as func
df = spark.createDataFrame([("tshilidzi", 17.00, "2018-03-10T15:27:18+00:00"),
("tshilidzi", 13.00, "2018-03-11T12:27:18+00:00"),
("tshilidzi", 25.00, "2018-03-12T11:27:18+00:00"),
("thabo", 20.00, "2018-03-13T15:27:18+00:00"),
("thabo", 56.00, "2018-03-14T12:27:18+00:00"),
("thabo", 99.00, "2018-03-15T11:27:18+00:00"),
("tshilidzi", 156.00, "2019-03-22T11:27:18+00:00"),
("thabo", 122.00, "2018-03-31T11:27:18+00:00"),
("tshilidzi", 7000.00, "2019-04-15T11:27:18+00:00"),
("ash", 9999.00, "2018-04-16T11:27:18+00:00")
],
["name", "dollars", "timestampGMT"])
# cast timestampGMT to an actual timestamp; the Window below converts it to seconds
df = df.withColumn('timestampGMT', df.timestampGMT.cast('timestamp'))
df.show(10000, False)
Output:
+---------+-------+---------------------+
|name |dollars|timestampGMT |
+---------+-------+---------------------+
|tshilidzi|17.0 |2018-03-10 17:27:18.0|
|tshilidzi|13.0 |2018-03-11 14:27:18.0|
|tshilidzi|25.0 |2018-03-12 13:27:18.0|
|thabo |20.0 |2018-03-13 17:27:18.0|
|thabo |56.0 |2018-03-14 14:27:18.0|
|thabo |99.0 |2018-03-15 13:27:18.0|
|tshilidzi|156.0 |2019-03-22 13:27:18.0|
|thabo |122.0 |2018-03-31 13:27:18.0|
|tshilidzi|7000.0 |2019-04-15 13:27:18.0|
|ash |9999.0 |2018-04-16 13:27:18.0|
+---------+-------+---------------------+
To calculate the moving average based on name while still keeping all the rows:
#function to calculate number of seconds from number of days
days = lambda i: i * 86400

#create window by casting timestamp to long (number of seconds)
w = (Window()
     .partitionBy(func.col("name"))
     .orderBy(func.col("timestampGMT").cast('long'))
     .rangeBetween(-days(7), 0))

df2 = df.withColumn('rolling_average', func.avg("dollars").over(w))
df2.show(100, False)
Output:
+---------+-------+---------------------+------------------+
|name |dollars|timestampGMT |rolling_average |
+---------+-------+---------------------+------------------+
|ash |9999.0 |2018-04-16 13:27:18.0|9999.0 |
|tshilidzi|17.0 |2018-03-10 17:27:18.0|17.0 |
|tshilidzi|13.0 |2018-03-11 14:27:18.0|15.0 |
|tshilidzi|25.0 |2018-03-12 13:27:18.0|18.333333333333332|
|tshilidzi|156.0 |2019-03-22 13:27:18.0|156.0 |
|tshilidzi|7000.0 |2019-04-15 13:27:18.0|7000.0 |
|thabo |20.0 |2018-03-13 17:27:18.0|20.0 |
|thabo |56.0 |2018-03-14 14:27:18.0|38.0 |
|thabo |99.0 |2018-03-15 13:27:18.0|58.333333333333336|
|thabo |122.0 |2018-03-31 13:27:18.0|122.0 |
+---------+-------+---------------------+------------------+
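To sanity-check a single group, you can filter the result above for one of the example names (thabo here) and order by time:
# show the rolling averages for a single name, in time order
df2.filter(func.col("name") == "thabo").orderBy("timestampGMT").show(truncate=False)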
Answer 2 (score: 1)
Do you mean this:
from pyspark.sql import functions as f
from pyspark.sql.window import Window

df = spark.createDataFrame([(17, "2017-03-11T15:27:18+00:00"),
                            (13, "2017-03-11T12:27:18+00:00"),
                            (21, "2017-03-17T11:27:18+00:00")],
                           ["dollars", "timestampGMT"])
df = df.withColumn('timestampGMT', df.timestampGMT.cast('timestamp'))
df = df.withColumn('rolling_average', f.avg("dollars").over(Window.partitionBy(f.window("timestampGMT", "7 days"))))
Output:
+-------+-------------------+---------------+
|dollars|timestampGMT |rolling_average|
+-------+-------------------+---------------+
|21 |2017-03-17 19:27:18|21.0 |
|17 |2017-03-11 23:27:18|15.0 |
|13 |2017-03-11 20:27:18|15.0 |
+-------+-------------------+---------------+
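To see why this bins rather than rolls, you can surface the 7-day bucket each row is assigned to. A quick check reusing the df from the snippet above:
# show the non-overlapping 7-day bucket that f.window() assigns to each row
df.select("dollars", "timestampGMT", f.window("timestampGMT", "7 days").alias("bucket")).show(truncate=False)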
Answer 3 (score: 0)
Worth noting that if you don't care about the exact dates, but want to have the average of the last 30 days available, you can use the rowsBetween function as follows:
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# current row plus the 7 preceding rows, regardless of how far apart their dates are
w = Window.orderBy('timestampGMT').rowsBetween(-7, 0)
df = eurPrices.withColumn('rolling_average', F.avg('dollars').over(w))
Since you order by date, it will take the last 7 occurrences. You save all the casting.
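A self-contained sketch of the same idea, run against the question's example data instead of the answerer's eurPrices dataframe (a substitution on my part). Note that rowsBetween counts rows, not days, so it only matches the 7-day rangeBetween answer when every day has at most one row:
from pyspark.sql import functions as F
from pyspark.sql.window import Window

df = spark.createDataFrame([(17, "2017-03-10T15:27:18+00:00"),
                            (13, "2017-03-15T12:27:18+00:00"),
                            (25, "2017-03-18T11:27:18+00:00")],
                           ["dollars", "timestampGMT"])
df = df.withColumn('timestampGMT', df.timestampGMT.cast('timestamp'))

# average over the current row and (up to) the 7 preceding rows, ordered by time
w = Window.orderBy('timestampGMT').rowsBetween(-7, 0)
df.withColumn('rolling_average', F.avg('dollars').over(w)).show(truncate=False)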