Nested CASE in Hive - Spark

Date: 2018-12-20 11:54:03

Tags: sql apache-spark dataframe case-when

I have a table (joined_df) that looks like this:

+------------------------------------+----------+--------------+---------------------+---------------------+
|gaid                                |event     |date_stamp_ist|first_app_access_date|first_app_viewed_date|
+------------------------------------+----------+--------------+---------------------+---------------------+
|001f2ecf-bf0f-47dc-a2b2-b526b5b3292e|App Opened|2018-10-06    |2018-09-03           |null                 |
|001f2ecf-bf0f-47dc-a2b2-b526b5b3292e|App Access|2018-10-06    |2018-09-03           |null                 |
|001f2ecf-bf0f-47dc-a2b2-b526b5b3292e|App Opened|2018-10-06    |2018-09-03           |null                 |
|001f2ecf-bf0f-47dc-a2b2-b526b5b3292e|App Access|2018-10-06    |2018-09-03           |null                 |
|001f2ecf-bf0f-47dc-a2b2-b526b5b3292e|App Access|2018-10-06    |2018-09-03           |null                 |
+------------------------------------+----------+--------------+---------------------+---------------------+

From this, I want to create a new DataFrame using the following logic:

spark.sql("SELECT gaid,MIN(CASE WHEN upper(event) in ('APP ACCESS', 'APP 
OPENED', 'APP LAUNCHED') THEN date_stamp_ist END) as 
first_app_access_date,MIN(CASE WHEN upper(event) in ('MEDIAREADY', 'MEDIA 
READY') THEN date_stamp_ist END) as first_app_viewed_date FROM joined_df 
GROUP BY gaid"

The problem is that for the records shown in the snippet above, first_app_access_date has already been computed (2018-09-03). The query above recomputes it from date_stamp_ist and overwrites the existing value with an incorrect later date; for the sample gaid it would return 2018-10-06 instead of 2018-09-03.

I want to add a check to the query above:

  1. If joined_df.first_app_access_date != null, keep first_app_access_date as it is. But if joined_df.first_app_access_date == null, then compute:

    MIN(CASE WHEN upper(event) IN ('APP ACCESS', 'APP OPENED', 'APP LAUNCHED') THEN date_stamp_ist END) AS first_app_access_date

  2. Similarly for first_app_viewed_date: if joined_df.first_app_viewed_date != null, keep first_app_viewed_date as it is. But if joined_df.first_app_viewed_date == null, then compute:

    MIN(CASE WHEN upper(event) IN ('MEDIAREADY', 'MEDIA READY') THEN date_stamp_ist END) AS first_app_viewed_date

I need both of these checks inside the CASE expressions of the initial query, and I am not sure what the best way to do this is.
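For reference, here is a rough sketch of one way both checks could be folded into the original aggregation. It assumes first_app_access_date and first_app_viewed_date are constant within each gaid group (as the snippet above suggests) and relies on MIN ignoring nulls, so MIN(first_app_access_date) yields the existing value when one exists and null otherwise:

    spark.sql("""
        SELECT gaid,
               -- keep the existing date if present, otherwise fall back to the event-based MIN
               COALESCE(MIN(first_app_access_date),
                        MIN(CASE WHEN upper(event) IN ('APP ACCESS', 'APP OPENED', 'APP LAUNCHED')
                                 THEN date_stamp_ist END)) AS first_app_access_date,
               COALESCE(MIN(first_app_viewed_date),
                        MIN(CASE WHEN upper(event) IN ('MEDIAREADY', 'MEDIA READY')
                                 THEN date_stamp_ist END)) AS first_app_viewed_date
        FROM joined_df
        GROUP BY gaid
    """)

The same logic could instead be written as a nested CASE inside each MIN (CASE WHEN first_app_access_date IS NOT NULL THEN first_app_access_date WHEN upper(event) IN (...) THEN date_stamp_ist END), but COALESCE over the two aggregates keeps the query flatter.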

0 Answers