I have a table (joined_df) that looks like this:
+------------------------------------+----------+--------------+---------------------+---------------------+
|gaid |event |date_stamp_ist|first_app_access_date|first_app_viewed_date|
+------------------------------------+----------+--------------+---------------------+---------------------+
|001f2ecf-bf0f-47dc-a2b2-b526b5b3292e|App Opened|2018-10-06 |2018-09-03 |null |
|001f2ecf-bf0f-47dc-a2b2-b526b5b3292e|App Access|2018-10-06 |2018-09-03 |null |
|001f2ecf-bf0f-47dc-a2b2-b526b5b3292e|App Opened|2018-10-06 |2018-09-03 |null |
|001f2ecf-bf0f-47dc-a2b2-b526b5b3292e|App Access|2018-10-06 |2018-09-03 |null |
|001f2ecf-bf0f-47dc-a2b2-b526b5b3292e|App Access|2018-10-06 |2018-09-03 |null |
+------------------------------------+----------+--------------+---------------------+---------------------+
From this, I create a new dataframe with the following logic:
spark.sql("""
    SELECT gaid,
           MIN(CASE WHEN upper(event) IN ('APP ACCESS', 'APP OPENED', 'APP LAUNCHED')
                    THEN date_stamp_ist END) AS first_app_access_date,
           MIN(CASE WHEN upper(event) IN ('MEDIAREADY', 'MEDIA READY')
                    THEN date_stamp_ist END) AS first_app_viewed_date
    FROM joined_df
    GROUP BY gaid
""")
The problem is that for the records shown in the extract above, first_app_access_date has already been computed. The query above recomputes it from date_stamp_ist and overwrites the existing value with a later, incorrect date.
I want to add a check to the query above:

If joined_df.first_app_access_date is not null, keep first_app_access_date as it is; but if joined_df.first_app_access_date is null, compute:

MIN(CASE WHEN upper(event) IN ('APP ACCESS', 'APP OPENED', 'APP LAUNCHED') THEN date_stamp_ist END) AS first_app_access_date

Likewise, if joined_df.first_app_viewed_date is not null, keep first_app_viewed_date as it is; but if it is null, compute:

MIN(CASE WHEN upper(event) IN ('MEDIAREADY', 'MEDIA READY') THEN date_stamp_ist END) AS first_app_viewed_date
Both checks need to go into that initial query. I'm not sure what the best way to do this is.
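One possible approach (a sketch, not something settled in the question) is to wrap each aggregate in COALESCE: the already-computed date wins when it is present, and the MIN(CASE ...) fallback fires only when it is null. The demo below uses an in-memory SQLite table as a stand-in for the Spark table (the second group, gaid 'g2', and its dates are made up for illustration); the SELECT itself uses only syntax that Spark SQL also supports, so the same query should be usable inside spark.sql().

```python
import sqlite3

# In-memory SQLite stand-in for the Spark table, only to illustrate the pattern.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE joined_df (
    gaid TEXT, event TEXT, date_stamp_ist TEXT,
    first_app_access_date TEXT, first_app_viewed_date TEXT
);
INSERT INTO joined_df VALUES
  -- Group with first_app_access_date already computed (from the question's extract):
  ('001f2ecf-bf0f-47dc-a2b2-b526b5b3292e', 'App Opened', '2018-10-06', '2018-09-03', NULL),
  ('001f2ecf-bf0f-47dc-a2b2-b526b5b3292e', 'App Access', '2018-10-06', '2018-09-03', NULL),
  -- Hypothetical group with no precomputed dates, so the fallback must fire:
  ('g2', 'App Opened', '2018-11-01', NULL, NULL),
  ('g2', 'MediaReady', '2018-11-02', NULL, NULL);
""")

# COALESCE keeps the existing date when present; otherwise the MIN(CASE ...)
# fallback computes it. MIN(first_app_access_date) is a safe way to carry the
# column through GROUP BY because it is constant within each gaid group.
query = """
SELECT gaid,
       COALESCE(MIN(first_app_access_date),
                MIN(CASE WHEN upper(event) IN ('APP ACCESS', 'APP OPENED', 'APP LAUNCHED')
                         THEN date_stamp_ist END)) AS first_app_access_date,
       COALESCE(MIN(first_app_viewed_date),
                MIN(CASE WHEN upper(event) IN ('MEDIAREADY', 'MEDIA READY')
                         THEN date_stamp_ist END)) AS first_app_viewed_date
FROM joined_df
GROUP BY gaid
"""
for row in conn.execute(query):
    print(row)
```

For the first group this keeps the existing 2018-09-03 instead of recomputing 2018-10-06, and leaves first_app_viewed_date null since no viewing event exists; for the hypothetical 'g2' group both dates are computed from the events.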