I am working with PySpark and I want to run a spark.sql
query to compute the hourly average of certain values.
I have a table like the following:
ID timestamp val
A 2020-01-19 03:03:00 5
A 2020-01-19 03:33:00 3
A 2020-01-19 03:55:00 7
A 2020-01-20 05:44:00 6
A 2020-01-20 05:54:00 4
B 2020-01-19 02:15:00 1
B 2020-01-19 02:22:00 0
B 2020-01-19 06:15:00 9
B 2020-01-19 06:44:00 2
I want a result table like the following:
ID time avgval
A 2020-01-19 03:00:00 5
A 2020-01-20 05:00:00 5
B 2020-01-19 02:00:00 1
B 2020-01-19 06:00:00 5.5
Answer 0 (score: 0)
This can be achieved with a simple query that uses
date_format together with GROUP BY:
spark.sql(
"""
SELECT ID
, date_format(timestamp, 'yyyy-MM-dd HH:00:00') as time
, mean(val) as avgval
FROM table
GROUP BY ID
, date_format(timestamp, 'yyyy-MM-dd HH:00:00')
ORDER BY ID
, date_format(timestamp, 'yyyy-MM-dd HH:00:00')
""") \
.show(20, False)
The result is:
+---+-------------------+------+
|ID |time |avgval|
+---+-------------------+------+
|A |2020-01-19 03:00:00|5.0 |
|A |2020-01-20 05:00:00|5.0 |
|B |2020-01-19 02:00:00|0.5 |
|B |2020-01-19 06:00:00|5.5 |
+---+-------------------+------+
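The same hourly-bucketing logic can be cross-checked outside Spark. A minimal sketch, assuming stdlib sqlite3 stands in for Spark (sqlite's strftime plays the role of date_format; the table and column names here are for illustration only):

```python
import sqlite3

# In-memory table mirroring the sample data from the question.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id TEXT, ts TEXT, val REAL)")
rows = [
    ("A", "2020-01-19 03:03:00", 5), ("A", "2020-01-19 03:33:00", 3),
    ("A", "2020-01-19 03:55:00", 7), ("A", "2020-01-20 05:44:00", 6),
    ("A", "2020-01-20 05:54:00", 4), ("B", "2020-01-19 02:15:00", 1),
    ("B", "2020-01-19 02:22:00", 0), ("B", "2020-01-19 06:15:00", 9),
    ("B", "2020-01-19 06:44:00", 2),
]
conn.executemany("INSERT INTO t VALUES (?, ?, ?)", rows)

# Truncate each timestamp to the hour, then average per (id, hour).
result = conn.execute(
    """
    SELECT id,
           strftime('%Y-%m-%d %H:00:00', ts) AS time,
           AVG(val) AS avgval
    FROM t
    GROUP BY id, time
    ORDER BY id, time
    """
).fetchall()
print(result)
# → [('A', '2020-01-19 03:00:00', 5.0), ('A', '2020-01-20 05:00:00', 5.0),
#    ('B', '2020-01-19 02:00:00', 0.5), ('B', '2020-01-19 06:00:00', 5.5)]
```

Note that the B/02:00 bucket averages to 0.5 (mean of 1 and 0), matching the Spark output above.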
Answer 1 (score: -1)
I would suggest using date_trunc():
select id, date_trunc('hour', timestamp) as time,
       avg(val) as avgval
from t
group by id, date_trunc('hour', timestamp);
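What date_trunc('hour', ...) does can be sketched in plain Python: zero out every field below the hour. A minimal illustration (assumption: this helper, trunc_hour, is hypothetical and only mimics the truncation step, not Spark itself):

```python
from datetime import datetime

def trunc_hour(ts: str) -> str:
    # Parse the timestamp, then drop minutes and seconds,
    # mirroring date_trunc('hour', ts) in Spark SQL.
    dt = datetime.strptime(ts, "%Y-%m-%d %H:%M:%S")
    return dt.replace(minute=0, second=0).strftime("%Y-%m-%d %H:%M:%S")

print(trunc_hour("2020-01-19 03:33:00"))  # → 2020-01-19 03:00:00
```

Unlike date_format with a 'yyyy-MM-dd HH:00:00' pattern, date_trunc keeps a timestamp type rather than producing a string, which matters if you want to sort or do further date arithmetic on the bucket.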