Input DF:
id  sub_id  id_created  id_last_modified  sub_id_created  lead_
1   10      12:00       7:00              12:00           1:00
1   20      12:00       7:00              1:00            2:30
1   30      12:00       7:00              2:30            7:00
1   40      12:00       7:05              7:00            null
Use case: I am trying to create a new column, "time", where:
1. For (id, max(sub_id)): id_last_modified - sub_id_created
2. Otherwise: sub_id_created - lead_
Code:
window = Window.partitionBy("id").orderBy("sub_id")
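I get the expected result for every row except the combination:
(id, max(sub_id))
For that row I get null.
Any suggestions on where I am going wrong would be helpful. Thanks.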
Answer 0: (score: 1)
I think this should work:
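import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._  // the $"..." syntax also needs spark.implicits._

// No orderBy here: an ordered window's default frame stops at the current row,
// so max("sub_id") over it would equal every row's own sub_id.
val byId = Window.partitionBy("id")
val result = df.withColumn("time",
  when($"sub_id" === max($"sub_id").over(byId),
    (unix_timestamp($"id_last_modified") -
      unix_timestamp($"sub_id_created")) / 3600.0).otherwise(
    (unix_timestamp($"sub_id_created") -
      unix_timestamp(lead($"sub_id_created", 1).over(window))) / 3600.0))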
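One thing worth checking in a comparison such as sub_id === max(sub_id).over(...) is which frame the window uses. The sketch below is a plain-Python emulation, not Spark code: it assumes a single id partition holding the sub_id values from the sample input, and contrasts a running maximum (the default frame of a window that has an orderBy) with a partition-wide maximum (a window with partitionBy only). The variable names are illustrative.

```python
# Plain-Python emulation of two window frames over one "id" partition.
# The sub_id values come from the sample input; nothing here calls Spark.
sub_ids = [10, 20, 30, 40]  # already ordered by sub_id

# Ordered window, default frame: from the partition start up to the current row.
running_max = [max(sub_ids[: i + 1]) for i in range(len(sub_ids))]

# Partition-only window: the frame is the whole partition for every row.
partition_max = [max(sub_ids)] * len(sub_ids)

# "Is this the row with the largest sub_id?" under each frame.
is_last_running = [s == m for s, m in zip(sub_ids, running_max)]
is_last_partition = [s == m for s, m in zip(sub_ids, partition_max)]

print(is_last_running)    # [True, True, True, True]
print(is_last_partition)  # [False, False, False, True]
```

With the running maximum, every row looks like the last one, so only the first branch of the when(...) would ever be taken. With the partition-wide maximum, only sub_id 40 matches, which is the row where lead returns null.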
Answer 1: (score: 0)
import pandas_datareader as web
import datetime
start = datetime.datetime(2018, 5, 1)
end = datetime.datetime(2019, 5, 31)
df = web.DataReader("goog", 'yahoo', start, end)