Table is empty in Hive because of CASE

Time: 2014-10-23 09:00:58

Tags: sql hiveql

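I recently started learning SQL and have no prior coding experience, so this may just be a silly mistake (in which case, sorry for the long post :)). It would be great if you could help me with my current problem.

I have a table that looks like this:

id / n (name of a specific event) / utc (timestamp) / json_data (a JSON string containing multiple parameters).

My goal is simple: I am trying to get the sum of the value parameter found in json_data, grouped by n. Unfortunately, a few issues make this more complicated.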

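  1. We have a spam problem that causes the same event to be sent hundreds or thousands of times, and these duplicates need to be filtered out. I usually solve this by putting utc (the timestamp) in the GROUP BY clause together with the other selected columns, which gives me one instance of each distinct event.

  2. Some events return a negative value in the value field, and these need to be ignored in all counts and sums.

  3. Because things would be too simple otherwise, the name of the value field in the json_data column is always different, depending on the type of event sent. I work around this with the various string operations you can see in the query.

Here is what I have so far: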

    SELECT
    b.Event_Name as Event_Name
    , COUNT(b.Event_Name) as event_occurrences
    , SUM(b.item_value) as user_spendings
    FROM
        (SELECT
            a.id as Player_ID
            , a.n as Event_Name
            , a.utc as timing
            , CASE 
                WHEN 
                    MAX( a.ALPHA_Value
                    + a.BETA_Value
                    + a.GAMMA_Value
                    + a.DELTA_Value
                    + a.EPSILON_Value
                    + a.BETAUPGRADE_Value
                    + a.ZETA_Value
                    + a.ALPHASKIN_Value
                    + a.UPGRADEALPHA_Value) <= 0 
                THEN 0 
                ELSE 
                    MAX(a.ALPHA_Value
                    + a.BETA_Value
                    + a.GAMMA_Value
                    + a.DELTA_Value
                    + a.EPSILON_Value
                    + a.BETAUPGRADE_Value
                    + a.ZETA_Value
                    + a.ALPHASKIN_Value
                    + a.UPGRADEALPHA_Value) END as item_value
            FROM
                (SELECT
                    id
                    , n
                    , utc
                    , MAX(TRIM(get_json_object(json_data, '$. ALPHA_Value '))) as ALPHA_Value
                    , MAX(TRIM(get_json_object(json_data, '$. BETA_Value '))) as BETA_Value
                    , MAX(SUBSTR
                        (TRIM(get_json_object(json_data, '$. GAMMA_Value ')), 6, 
                            (LOCATE(' resource 2', 
                                SUBSTR
                                    (TRIM(get_json_object(json_data, '$. GAMMA_Value ')), 6))-1))) as GAMMA_Value
                    , MAX(SUBSTR
                        (TRIM(get_json_object(json_data, '$. DELTA_Value ')), 6)) as DELTA_Value
                    , MAX(SUBSTR
                        (TRIM(get_json_object(json_data, '$. EPSILON_Value ')), 6)) as EPSILON_Value
                    , MAX(SUBSTR
                        (TRIM(get_json_object(json_data, '$. BETAUPGRADE_Value ')), 6)) as BETAUPGRADE_Value
                    , MAX(SUBSTR
                        (TRIM(get_json_object(json_data, '$. ZETA_Value ')), 6)) as ZETA_Value
                    , MAX(SUBSTR
                        (TRIM(get_json_object(json_data, '$. ALPHASKIN_Value ')), 6)) as ALPHASKIN_Value
                    , MAX(SUBSTR
                        (TRIM(get_json_object(json_data, '$. UPGRADEALPHA_Value ')), 6, 
                            (LOCATE(' resource 2', 
                                SUBSTR
                                    (TRIM(get_json_object(json_data, '$. UPGRADEALPHA_Value ')), 6))-1))) as UPGRADEALPHA_Value
                    FROM application_events
                    WHERE
                        month = 201409
                        AND FROM_UNIXTIME(utc_timestamp) > '2014-09-04 12:00:00'
                    GROUP BY id, n, utc
                    ORDER BY id, n
                ) a
            GROUP by a.id, a.n, a.utc
            ORDER by timing, Event_Name
        ) b
    WHERE b.item_value > 0
    GROUP by b.Event_Name
    ORDER by user_spendings
    

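My reasoning is as follows:

  1. I extract the values from json_data while removing the spam with GROUP BY id, n, utc. I use MAX on get_json_object so that I can group by the preceding columns. Since the combination of id, name and timestamp is unique (apart from the spam, of course), MAX returns that same value. Since each event has only one value field (with a different name depending on the event type), I end up with all the columns, but only one of them holds a value (the others are empty).

  2. I get rid of the negative values. Since I can't put a sum in the WHERE clause, the only way I could think of was to create another table (b) that checks whether the sum of all the value columns in a is negative (as I said, all but one of them are empty, so if there is a negative value, the sum is negative as well) and returns the sum otherwise (aliased as item_value).

  3. The third table finally counts the events and sums the values.

My current problem is in step 2. When I run subquery a on its own, it looks fine and I get results. When I run it inside the original query (counting events and summing values), I also get results. So I guess something is wrong with my condition, because the full query gives me an empty table.

I tried putting the sum in the WHERE clause, but that didn't work. Any ideas are welcome, especially if you know a simpler way.

Thanks a lot, guys.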

1 answer:

Answer 0 (score: 0)

Your query looks correct. I removed some extra parts (although that isn't required):

    SELECT
    b.Event_Name as Event_Name
    , COUNT(b.Event_Name) as event_occurrences
    , SUM(b.item_value) as user_spendings
    FROM (SELECT
            a.id as Player_ID
            , a.n as Event_Name
            , a.utc as timing
            -- treat a missing (NULL) column as 0 so it does not turn the whole sum into NULL
            , COALESCE(a.ALPHA_Value, CAST(0 AS BIGINT))
            + COALESCE(a.BETA_Value, CAST(0 AS BIGINT))
            + COALESCE(a.GAMMA_Value, CAST(0 AS BIGINT))
            + COALESCE(a.DELTA_Value, CAST(0 AS BIGINT))
            + COALESCE(a.EPSILON_Value, CAST(0 AS BIGINT))
            + COALESCE(a.BETAUPGRADE_Value, CAST(0 AS BIGINT))
            + COALESCE(a.ZETA_Value, CAST(0 AS BIGINT))
            + COALESCE(a.ALPHASKIN_Value, CAST(0 AS BIGINT))
            + COALESCE(a.UPGRADEALPHA_Value, CAST(0 AS BIGINT)) as item_value
            FROM (SELECT
                    id
                    , n
                    , utc
                    , MAX(TRIM(get_json_object(json_data, '$. ALPHA_Value '))) as ALPHA_Value
                    , MAX(TRIM(get_json_object(json_data, '$. BETA_Value '))) as BETA_Value
                    , MAX(SUBSTR
                        (TRIM(get_json_object(json_data, '$. GAMMA_Value ')), 6, 
                            (LOCATE(' resource 2', 
                                SUBSTR
                                    (TRIM(get_json_object(json_data, '$. GAMMA_Value ')), 6))-1))) as GAMMA_Value
                    , MAX(SUBSTR
                        (TRIM(get_json_object(json_data, '$. DELTA_Value ')), 6)) as DELTA_Value
                    , MAX(SUBSTR
                        (TRIM(get_json_object(json_data, '$. EPSILON_Value ')), 6)) as EPSILON_Value
                    , MAX(SUBSTR
                        (TRIM(get_json_object(json_data, '$. BETAUPGRADE_Value ')), 6)) as BETAUPGRADE_Value
                    , MAX(SUBSTR
                        (TRIM(get_json_object(json_data, '$. ZETA_Value ')), 6)) as ZETA_Value
                    , MAX(SUBSTR
                        (TRIM(get_json_object(json_data, '$. ALPHASKIN_Value ')), 6)) as ALPHASKIN_Value
                    , MAX(SUBSTR
                        (TRIM(get_json_object(json_data, '$. UPGRADEALPHA_Value ')), 6, 
                            (LOCATE(' resource 2', 
                                SUBSTR
                                    (TRIM(get_json_object(json_data, '$. UPGRADEALPHA_Value ')), 6))-1))) as UPGRADEALPHA_Value
                    FROM application_events
                    WHERE
                        month = 201409
                        AND FROM_UNIXTIME(utc_timestamp) > '2014-09-04 12:00:00'
                    -- one row per (id, n, utc): collapses the duplicated events
                    GROUP BY id, n, utc
                ) a
        ) b
    WHERE b.item_value > 0
    GROUP by b.Event_Name
    ORDER by user_spendings

I think some of the values you are trying to sum are NULL, which makes the whole sum NULL, so I added COALESCE.
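
As a rough illustration of that point, the following standalone statement reproduces the effect with made-up literals instead of the real columns. It assumes a Hive version that accepts SELECT without a FROM clause (0.13 or later, as far as I know); on older versions the same expressions can be selected from any one-row table.

```sql
-- Illustrative literals only: 5 stands in for the one populated value column,
-- CAST(NULL AS BIGINT) for any of the columns that the event did not send.
SELECT
    -- arithmetic with NULL yields NULL
    5 + CAST(NULL AS BIGINT) AS raw_sum,
    -- NULL <= 0 is not true, so the ELSE branch runs and returns the NULL sum
    CASE WHEN 5 + CAST(NULL AS BIGINT) <= 0 THEN 0
         ELSE 5 + CAST(NULL AS BIGINT) END AS case_result,
    -- replacing the NULL with 0 keeps the populated value
    5 + COALESCE(CAST(NULL AS BIGINT), CAST(0 AS BIGINT)) AS fixed_sum;
```

The first two columns come back as NULL and the third as 5. A NULL item_value never satisfies `b.item_value > 0`, so with the original CASE every row was filtered out before the final GROUP BY.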

P.S. You don't need subquery "b"; you could do the same thing in subquery "a", but I left it untouched for better readability.
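
A possible shape for that merged version is sketched below; it is not tested against the real table. It swaps the sum of COALESCEs for a single COALESCE across the columns, which relies on the statement in the question that each event carries exactly one value field. Only two of the nine fields are shown. The JSON paths are written without the spaces that appear in the post, on the assumption that those are a formatting artifact, and the BIGINT cast simply follows the cast used above.

```sql
-- Sketch under the assumptions stated above; the other seven value fields
-- (with their SUBSTR/LOCATE clean-up) would be added as further COALESCE arguments.
SELECT
    a.n AS Event_Name
    , COUNT(a.n) AS event_occurrences
    , SUM(a.item_value) AS user_spendings
FROM (
    SELECT
        id
        , n
        , utc
        -- COALESCE returns the first non-NULL argument, i.e. the one field this event has;
        -- get_json_object returns a string, hence the explicit cast
        , MAX(CAST(COALESCE(
              TRIM(get_json_object(json_data, '$.ALPHA_Value'))
            , TRIM(get_json_object(json_data, '$.BETA_Value'))
          ) AS BIGINT)) AS item_value
    FROM application_events
    WHERE
        month = 201409
        AND FROM_UNIXTIME(utc_timestamp) > '2014-09-04 12:00:00'
    GROUP BY id, n, utc    -- one row per event, duplicates collapsed
) a
WHERE a.item_value > 0     -- drops negative, zero and NULL values
GROUP BY a.n
ORDER BY user_spendings;
```

If an event could ever carry more than one value field, this version would only pick up the first one, and the sum of COALESCEs from the answer would be the safer choice.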