Question

Input DF:

id   sub_id   id_created   id_last_modified   sub_id_created   lead_
1    10       12:00        7:00               12:00            1:00
1    20       12:00        7:00               1:00             2:30
1    30       12:00        7:00               2:30             7:00
1    40       12:00        7:05               7:00             null

Use case: I am trying to create a new column "time", where:

1. For (id, max(sub_id)): id_last_modified - sub_id_created
2. Otherwise: sub_id_created - lead_
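
For reference, the two rules can be sketched in plain Python on the sample rows, with no Spark involved. This is only an illustration under assumptions: `to_minutes` and `add_time` are hypothetical helper names, the times are parsed naively as H:MM with no AM/PM or date, and `lead_` is taken to be the next row's sub_id_created within the same id. The resulting numbers (in minutes) only show which rule each row falls under.

```python
def to_minutes(hhmm):
    """Naive H:MM parser (assumption: no AM/PM or date information)."""
    hours, minutes = hhmm.split(":")
    return int(hours) * 60 + int(minutes)


def add_time(rows):
    """Add 'lead_' and 'time' (in minutes) to each row, per id, ordered by sub_id."""
    ordered = sorted(rows, key=lambda r: (r["id"], r["sub_id"]))

    # Largest sub_id per id, computed over the whole group.
    max_sub = {}
    for r in ordered:
        max_sub[r["id"]] = max(max_sub.get(r["id"], r["sub_id"]), r["sub_id"])

    out = []
    for i, r in enumerate(ordered):
        nxt = ordered[i + 1] if i + 1 < len(ordered) else None
        same_id = nxt is not None and nxt["id"] == r["id"]
        # Like lead(): None on the last row of each id.
        lead = nxt["sub_id_created"] if same_id else None

        if r["sub_id"] == max_sub[r["id"]]:
            # Rule 1: last sub_id of the id.
            time = to_minutes(r["id_last_modified"]) - to_minutes(r["sub_id_created"])
        elif lead is not None:
            # Rule 2: every other row.
            time = to_minutes(r["sub_id_created"]) - to_minutes(lead)
        else:
            time = None
        out.append({**r, "lead_": lead, "time": time})
    return out


# Sample rows for id = 1, copied from the table above.
rows = [
    {"id": 1, "sub_id": 10, "id_last_modified": "7:00", "sub_id_created": "12:00"},
    {"id": 1, "sub_id": 20, "id_last_modified": "7:00", "sub_id_created": "1:00"},
    {"id": 1, "sub_id": 30, "id_last_modified": "7:00", "sub_id_created": "2:30"},
    {"id": 1, "sub_id": 40, "id_last_modified": "7:05", "sub_id_created": "7:00"},
]
result = add_time(rows)
```

Note that `lead_` is empty on the row with the largest sub_id, so rule 2 alone would give null there; that row has to be caught by rule 1.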

Code:

window = Window.partitionBy("id").orderBy("sub_id")

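I get the expected result for every row except the combination:

(id, max(sub_id))

For that row I get null.

Any suggestions on where I am going wrong would be helpful. Thanks.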

Answer 1

I think this might work:

df = df.withColumn("time",
  // max() over the ordered `window` is a running max, so compare against the max of the whole id partition
  when($"sub_id" === max($"sub_id").over(Window.partitionBy("id")),
    (unix_timestamp($"id_last_modified") - unix_timestamp($"sub_id_created")) / 3600.0)
  .otherwise(
    (unix_timestamp($"sub_id_created") - unix_timestamp(lead($"sub_id_created", 1).over(window))) / 3600.0))
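
One detail worth checking with the max(...) comparison: when a Spark window has an orderBy and no explicit frame, the default frame runs from the start of the partition up to the current row, so max over it behaves as a running max rather than the max of the whole id. The plain-Python sketch below only simulates that behaviour (it is not Spark code); the function names are made up for illustration and it assumes sub_id values are distinct within an id.

```python
def max_over_ordered_window(values):
    """Simulates max() with frame 'unbounded preceding .. current row' (running max)."""
    out, current = [], None
    for v in values:
        current = v if current is None else max(current, v)
        out.append(current)
    return out


def max_over_whole_partition(values):
    """Simulates max() over a window with partitionBy only (no orderBy)."""
    m = max(values)
    return [m] * len(values)


sub_ids = [10, 20, 30, 40]  # sub_id values for id = 1, already ordered

running = max_over_ordered_window(sub_ids)
whole = max_over_whole_partition(sub_ids)

# Which rows would be treated as "the max sub_id row" under each window?
is_max_running = [s == m for s, m in zip(sub_ids, running)]
is_max_whole = [s == m for s, m in zip(sub_ids, whole)]
```

With the running max every row compares equal to its own sub_id, so every row would take the first branch; with the whole-partition max only the sub_id 40 row does.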

Answer 2

import pandas_datareader as web
import datetime
start = datetime.datetime(2018, 5, 1)
end = datetime.datetime(2019, 5, 31)
df = web.DataReader("goog", 'yahoo', start, end)

Window function max
