Consider the following PySpark dataframe,
df = sqlContext.createDataFrame(
    [
        ('2019-05-08 11:00:00', 'a'),
        ('2019-05-08 11:02:12', 'b'),
        ('2019-05-08 11:04:24', 'a'),
        ('2019-05-08 11:06:36', 'c'),
        ('2019-05-08 11:08:48', 'c'),
        ('2019-05-08 11:11:00', 'a'),
        ('2019-05-08 11:13:12', 'v'),
        ('2019-05-08 11:23:34', 'd'),
        ('2019-05-08 11:26:24', 'e'),
        ('2019-05-08 11:28:36', 'c'),
        ('2019-05-08 11:30:48', 'b'),
        ('2019-05-08 11:35:12', 'b'),
        ('2019-05-08 11:37:24', 'b'),
        ('2019-05-08 11:44:00', 'a'),
        ('2019-05-08 11:48:24', 'x'),
        ('2019-05-08 11:50:36', 'k'),
        ('2019-05-08 11:55:00', 'b'),
        ('2019-05-08 12:01:36', 'c')
    ],
    ('datetime', 'value')
)
What I am trying to do (efficiently) is to compute, for 30-minute windows opening every 5 minutes, the rate at which value changes over time. So essentially I need to find the rate within each time range (countDistinct(value) / (datetime.max() - datetime.min())), giving a result like:
...and so on.
I did try using window functions, and I had some success with the distinct count (countDistinct is not supported over a window, so I went with F.size(F.collect_set('value').over(w))), but I could not make it work for the custom rate function. I also tried a UDF, but had no luck with that either.
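To make the target metric concrete, here is a plain-Python sketch (no Spark) that computes it for a single 30-minute window of the sample data; the 11:00-11:30 boundaries are chosen only for illustration:

```python
from datetime import datetime

rows = [
    ('2019-05-08 11:00:00', 'a'), ('2019-05-08 11:02:12', 'b'),
    ('2019-05-08 11:04:24', 'a'), ('2019-05-08 11:06:36', 'c'),
    ('2019-05-08 11:08:48', 'c'), ('2019-05-08 11:11:00', 'a'),
    ('2019-05-08 11:13:12', 'v'), ('2019-05-08 11:23:34', 'd'),
    ('2019-05-08 11:26:24', 'e'), ('2019-05-08 11:28:36', 'c'),
    ('2019-05-08 11:30:48', 'b'), ('2019-05-08 11:35:12', 'b'),
    ('2019-05-08 11:37:24', 'b'), ('2019-05-08 11:44:00', 'a'),
    ('2019-05-08 11:48:24', 'x'), ('2019-05-08 11:50:36', 'k'),
    ('2019-05-08 11:55:00', 'b'), ('2019-05-08 12:01:36', 'c'),
]
fmt = '%Y-%m-%d %H:%M:%S'
start, end = datetime(2019, 5, 8, 11, 0), datetime(2019, 5, 8, 11, 30)

# Events falling inside the half-open 30-minute window [start, end).
in_window = [(datetime.strptime(t, fmt), v) for t, v in rows
             if start <= datetime.strptime(t, fmt) < end]

times = [t for t, _ in in_window]
distinct = len({v for _, v in in_window})           # countDistinct(value)
span = (max(times) - min(times)).total_seconds()    # datetime.max() - datetime.min()
rate = distinct / span
print(distinct, span, rate)  # 6 distinct values over 1716.0 seconds
```

This reproduces the first row of the expected output: 6 / 1716 ≈ 0.0034965.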
Answer 0 (score: 2)
I am not sure this is the most optimized approach, but here is one solution:
from pyspark.sql import functions as F, Window

# The frame is expressed in seconds over the "start" column: (0, 1799)
# spans the current 5-minute bucket's start plus the following 29 min 59 s,
# i.e. a 30-minute range.
w30 = Window.partitionBy().orderBy("start").rangeBetween(0, 1799)

df = df.withColumn("window", F.window("datetime", "5 minutes"))
df = df.withColumn("start", F.unix_timestamp(F.col("window.start")))
df = df.withColumn("cnt", F.size(F.collect_set("value").over(w30)))
df = df.withColumn("end", F.unix_timestamp(F.max("datetime").over(w30)))
df = df.withColumn("start", F.unix_timestamp(F.min("datetime").over(w30)))

df.select(
    F.col("window.start").alias("range_start"),
    (F.unix_timestamp(F.col("window.start")) + 1800).cast("timestamp").alias("range_end"),
    (F.col("cnt") / (F.col("end") - F.col("start"))).alias("ratio"),
).distinct().show()
+-------------------+-------------------+--------------------+
| range_start| range_end| ratio|
+-------------------+-------------------+--------------------+
|2019-05-08 11:00:00|2019-05-08 11:30:00|0.003496503496503...|
|2019-05-08 11:05:00|2019-05-08 11:35:00|0.004132231404958678|
|2019-05-08 11:10:00|2019-05-08 11:40:00|0.003787878787878788|
|2019-05-08 11:20:00|2019-05-08 11:50:00|0.004026845637583893|
|2019-05-08 11:25:00|2019-05-08 11:55:00|0.004132231404958678|
|2019-05-08 11:30:00|2019-05-08 12:00:00|0.002754820936639...|
|2019-05-08 11:35:00|2019-05-08 12:05:00|0.003156565656565...|
|2019-05-08 11:40:00|2019-05-08 12:10:00|0.004734848484848485|
|2019-05-08 11:45:00|2019-05-08 12:15:00|0.005050505050505051|
|2019-05-08 11:50:00|2019-05-08 12:20:00|0.004545454545454545|
|2019-05-08 11:55:00|2019-05-08 12:25:00|0.005050505050505051|
|2019-05-08 12:00:00|2019-05-08 12:30:00| null|
+-------------------+-------------------+--------------------+
Here is another version that I find more coherent:
df_range = df.select(F.window("datetime", "5 minutes").getItem("start").alias("range_start"))
df_range = df_range.select(
    "range_start",
    (F.unix_timestamp(F.col("range_start")) + 1800).cast("timestamp").alias("range_end"),
).distinct()

df_ratio = df.join(
    df_range,
    how="inner",
    on=(df.datetime >= df_range.range_start) & (df.datetime < df_range.range_end),
)
df_ratio = df_ratio.groupBy(
    "range_start",
    "range_end",
).agg(
    F.max("datetime").alias("max_datetime"),
    F.min("datetime").alias("min_datetime"),
    F.size(F.collect_set("value")).alias("nb"),
)
df_ratio.select(
    "range_start",
    "range_end",
    (F.col("nb") / (F.unix_timestamp("max_datetime") - F.unix_timestamp("min_datetime"))).alias("ratio"),
).show()
+-------------------+-------------------+--------------------+
| range_start| range_end| ratio|
+-------------------+-------------------+--------------------+
|2019-05-08 11:00:00|2019-05-08 11:30:00|0.003496503496503...|
|2019-05-08 11:05:00|2019-05-08 11:35:00|0.004132231404958678|
|2019-05-08 11:10:00|2019-05-08 11:40:00|0.003787878787878788|
|2019-05-08 11:20:00|2019-05-08 11:50:00|0.004026845637583893|
|2019-05-08 11:25:00|2019-05-08 11:55:00|0.004132231404958678|
|2019-05-08 11:30:00|2019-05-08 12:00:00|0.002754820936639...|
|2019-05-08 11:35:00|2019-05-08 12:05:00|0.003156565656565...|
|2019-05-08 11:40:00|2019-05-08 12:10:00|0.004734848484848485|
|2019-05-08 11:45:00|2019-05-08 12:15:00|0.005050505050505051|
|2019-05-08 11:50:00|2019-05-08 12:20:00|0.004545454545454545|
|2019-05-08 11:55:00|2019-05-08 12:25:00|0.005050505050505051|
|2019-05-08 12:00:00|2019-05-08 12:30:00| null|
+-------------------+-------------------+--------------------+