Max value with a window function

Time: 2018-06-27 00:07:15

Tags: apache-spark apache-spark-sql

Input DF:

id   sub_id   id_created   id_last_modified   sub_id_created   lead_
1    10       12:00        7:00               12:00            1:00
1    20       12:00        7:00               1:00             2:30
1    30       12:00        7:00               2:30             7:00
1    40       12:00        7:05               7:00             null

Use case: I am trying to create a new column "time" where:

1. For the row with (id, max(sub_id)): id_last_modified - sub_id_created
2. Otherwise: sub_id_created - lead(sub_id_created)
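To make the two rules concrete, here is a minimal plain-Python sketch (no Spark) over the sample rows, assuming the HH:MM strings are clock times on one day and the rows are already sorted by sub_id within each id; the `minutes` helper is hypothetical, and note that rule 2 taken literally can go negative when the clock wraps:

```python
def minutes(hhmm):
    # Parse an "H:MM" / "HH:MM" string into minutes since midnight.
    h, m = hhmm.split(":")
    return int(h) * 60 + int(m)

# (sub_id, id_last_modified, sub_id_created) for id = 1, sorted by sub_id
rows = [
    (10, "7:00", "12:00"),
    (20, "7:00", "1:00"),
    (30, "7:00", "2:30"),
    (40, "7:05", "7:00"),
]

times = []
for i, (sub_id, last_mod, created) in enumerate(rows):
    if i == len(rows) - 1:
        # Rule 1: the max(sub_id) row uses id_last_modified - sub_id_created.
        times.append(minutes(last_mod) - minutes(created))
    else:
        # Rule 2: sub_id_created - lead(sub_id_created), i.e. the next row's value.
        lead = rows[i + 1][2]
        times.append(minutes(created) - minutes(lead))

print(times)  # → [660, -90, -270, 5] (minutes)
```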

Code:

window = Window.partitionBy("id").orderBy("sub_id")

I get the expected result for all rows except the following combination:

(id, max(sub_id))

For that row I get null.

Any suggestions on where I am going wrong would be helpful. Thanks.

2 Answers:

Answer 0: (score: 1)

I think this might work:

import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window

// max() needs a window without orderBy: once a window is ordered, the
// default frame only covers rows up to the current one, so a running
// max() would just return each row's own sub_id.
val idWindow = Window.partitionBy("id")

df = df.withColumn("time",
  when($"sub_id" === max($"sub_id").over(idWindow),
    (unix_timestamp($"id_last_modified") - unix_timestamp($"sub_id_created")) / 3600.0)
  .otherwise((unix_timestamp($"sub_id_created") -
    unix_timestamp(lead($"sub_id_created", 1).over(window))) / 3600.0))
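One caution about the window here: in Spark, a window with an orderBy clause defaults to a running frame (unbounded preceding to current row), so max() over such a window is a cumulative max rather than the partition max. A plain-Python sketch of the difference, using the sample sub_id values:

```python
# sub_id values within one id partition, already sorted (as orderBy would sort them)
sub_ids = [10, 20, 30, 40]

# max() over an ORDER BY window: cumulative max up to the current row,
# which on a sorted column is always the current row's own value.
running_max = []
cur = float("-inf")
for s in sub_ids:
    cur = max(cur, s)
    running_max.append(cur)

# max() over an unordered window: one value for the whole partition.
full_max = max(sub_ids)

print(running_max)  # → [10, 20, 30, 40]
print(full_max)     # → 40
```

This is why a condition like `sub_id === max(sub_id).over(window)` can behave unexpectedly if `window` carries an orderBy.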

Answer 1: (score: 0)

import pandas_datareader as web
import datetime
start = datetime.datetime(2018, 5, 1)
end = datetime.datetime(2019, 5, 31)
df = web.DataReader("goog", 'yahoo', start, end)