I am trying to convert the attached T-SQL code into a PySpark script:
CASE
WHEN min(t.timestamp_start) >= dateadd(hour, (8 - 2), t.date) AND min(t.timestamp_start) < dateadd(hour, (20 - 2), t.date) THEN 'Day'
WHEN min(t.timestamp_start) >= dateadd(hour, (18 - 2), t.date) OR min(t.timestamp_start) < dateadd(hour, (8 - 2 + 24), t.date) THEN 'Night'
WHEN max(t.timestamp_end) >= dateadd(hour, (8 + 4), t.date) AND max(t.timestamp_end) < dateadd(hour, (20 + 4), t.date) THEN 'Day'
WHEN max(t.timestamp_end) >= dateadd(hour, 24, t.date) OR max(t.timestamp_end) < dateadd(hour, (12 + 24), t.date) THEN 'Night'
ELSE NULL
END AS shifttype
FROM Demo_Dev.t
WHERE t.date > '2016-11-15'
GROUP BY t."individual id", t.date;
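For reference, the boundary arithmetic in the CASE expression can be sketched in plain Python (a hypothetical `shift_type` helper, assuming naive datetimes; hour offsets are taken from the `dateadd` calls above, and the branches are checked in the same order):

```python
from datetime import datetime, timedelta

def shift_type(date, min_start, max_end):
    """Classify a shift the way the T-SQL CASE does.

    `date` is midnight of the shift day; intervals are half-open,
    and branches are evaluated top to bottom as in the CASE.
    """
    def h(hours):
        # midnight of `date` plus an hour offset, like dateadd(hour, ...)
        return date + timedelta(hours=hours)

    if h(8 - 2) <= min_start < h(20 - 2):            # 06:00 .. 18:00
        return "Day"
    if min_start >= h(18 - 2) or min_start < h(8 - 2 + 24):  # 16:00 / 30:00
        return "Night"
    if h(8 + 4) <= max_end < h(20 + 4):              # 12:00 .. 24:00
        return "Day"
    if max_end >= h(24) or max_end < h(12 + 24):     # 24:00 / 36:00
        return "Night"
    return None

d = datetime(2016, 11, 16)
print(shift_type(d, d + timedelta(hours=9), d + timedelta(hours=17)))
```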
Usually I use .groupBy().agg(max/min/...) to get the minimum and maximum and store them in a DataFrame, e.g.:
Tgt_df_tos_a = Tgt_df_tos_f.groupBy(col('barcode'), col('eventdate')).agg(F.max("next_eventdate"))
But doing it that way for each aggregate would be tedious. How can I do this more efficiently? Thanks in advance!
This is what I have implemented in PySpark:
agg_df = (Tgt_df_time_on_site_open_union
          .groupBy("individual id", "date")
          .agg(F.min("timestamp_start").alias("min_start"),
               F.max("timestamp_end").alias("max_end")))

day0 = F.unix_timestamp("date")  # seconds at midnight of the shift day
result = agg_df.withColumn(
    "shifttype",
    F.when((F.unix_timestamp("min_start") >= day0 + 6 * 3600) &
           (F.unix_timestamp("min_start") < day0 + 18 * 3600), "Day")
     .when((F.unix_timestamp("min_start") >= day0 + 16 * 3600) |
           (F.unix_timestamp("min_start") < day0 + 30 * 3600), "Night")
     # the T-SQL compares timestamp_end, not timestamp_start, in these branches
     .when((F.unix_timestamp("max_end") >= day0 + 12 * 3600) &
           (F.unix_timestamp("max_end") < day0 + 24 * 3600), "Day")
     .when((F.unix_timestamp("max_end") >= day0 + 24 * 3600) |
           (F.unix_timestamp("max_end") < day0 + 36 * 3600), "Night")
     .otherwise(F.lit(None)))  # a real null, not the string "NULL"