Correlated subquery problem in Hive / Spark SQL for comparing aggregates

Date: 2017-06-18 00:40:24

Tags: sql apache-spark hive apache-spark-sql hiveql

I have a table of roughly a billion gaming interactions, where each record describes a gamer's attributes on a particular date (gamer ID, date played, attribute 1, [attribute 2, ...]):

+---------+
| TABLE_1 |
+-------+-----------+-------+-------+-------+
|gmr_id | played_dt | attr1 | attr2 | attr3 |
+-------+-----------+-------+-------+-------+
|1      | 2017-01-01| 1     | 2     | txt   |
|1      | 2017-01-02| 3     | 2     | txt   |
|2      | 2017-01-02| 1     | 2     | txt   |
+-------+-----------+-------+-------+-------+

I have another table, with millions of records, that logs the gamer's moves for each game played:

+---------+
| TABLE_2 |
+-------+-----------+---------+---------+---------+
|gmr_id | played_dt | finish  | attacks | deaths  |
+-------+-----------+---------+---------+---------+
|1      | 2017-01-01| 10      | 1       | 9       |
|1      | 2017-01-03| 12      | 10      | 2       |
|2      | 2017-01-02| 1       | 0       | 0       |
|4      | 2017-01-03| 1       | 0       | 1       |
|1      | 2017-01-04| 3       | 1       | 2       |
+-------+-----------+---------+---------+---------+

For each record in TABLE_1 - specifically, for each gmr_id and played_dt - I am trying to compare the sum of moves over the two days following played_dt with the sum over the five days before it (1 if greater, else 0), and to join the result back to TABLE_1 on gmr_id and played_dt, that is:

  1. For each gmr_id, the sum of finish, attacks and deaths from played_dt through two days later, i.e. SUM(finish), SUM(attacks), etc. over t2.played_dt BETWEEN played_dt AND DATE_ADD(played_dt, 2)
  2. For each gmr_id, the sum of finish, attacks and deaths from five days before played_dt through one day before it, i.e. SUM(finish), SUM(attacks), etc. over t2.played_dt BETWEEN DATE_SUB(played_dt, 5) AND DATE_SUB(played_dt, 1)
  3. Compare the two and set flags, i.e. get rows such as: gmr_id, played_dt, future_finish_gt_past_finish (1 if the finish total in the days after played_dt is greater than in the days before, else 0), future_attacks_gt_past_attacks (1 if the attacks total in the days after played_dt is greater than in the days before, else 0), and so on.
  4. Join these rows (gmr_id, played_dt, finish_f, attack_f) with the TABLE_1 rows (gmr_id, played_dt, attr1, attr2, attr3) on gmr_id, played_dt (see the sketch right after this list).
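
A minimal sketch of how the window sums in steps 1 and 2 could be computed without a correlated subquery, assuming the table and column names above, played_dt stored as a yyyy-MM-dd date/string, and inclusive windows (only the finish sums are shown; attacks and deaths follow the same pattern):

    -- Sketch: sums per (gmr_id, played_dt) over the two date windows
    SELECT
      t1.gmr_id,
      t1.played_dt,
      SUM(CASE WHEN t2.played_dt BETWEEN t1.played_dt
                                     AND DATE_ADD(t1.played_dt, 2)
               THEN t2.finish ELSE 0 END) `future_finish`,
      SUM(CASE WHEN t2.played_dt BETWEEN DATE_SUB(t1.played_dt, 5)
                                     AND DATE_SUB(t1.played_dt, 1)
               THEN t2.finish ELSE 0 END) `past_finish`
    FROM TABLE_1 t1
    LEFT JOIN TABLE_2 t2
      ON t2.gmr_id = t1.gmr_id  -- equi-join only; the date filters stay in the CASEs
    GROUP BY t1.gmr_id, t1.played_dt;

The date predicates are kept inside the CASE expressions rather than in the ON clause because Hive 1.2 generally rejects non-equality join conditions between the two tables.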

I have tried to write a correlated subquery, but it does not work:

    SELECT
      t1.gmr_id,
      t1.played_dt,    
      (SELECT
        t2.gmr_id,
        SUM(t2.finish) `future_finish`,
        SUM(t2.attacks) `future_attacks`
      FROM TABLE_2 t2 WHERE t2.played_dt BETWEEN t1.played_dt AND DATE_ADD(t1.played_dt, 2)
      GROUP BY t2.gmr_id), 
      (SELECT
        t2.gmr_id,
        SUM(t2.finish) `past_finish`,
        SUM(t2.attacks) `past_attacks`
      FROM TABLE_2 t2 WHERE t2.played_dt BETWEEN DATE_SUB(t1.played_dt, 5) AND DATE_SUB(t1.played_dt, 1)
      GROUP BY t2.gmr_id),
    
    CASE WHEN future_finish > past_finish THEN 1 ELSE 0 END `finish_f`,
    
    CASE WHEN future_attacks > past_attacks THEN 1 ELSE 0 END `attack_f`
    
    FROM
      TABLE_1 t1;
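
As written, this cannot run: Hive 1.2 does not allow subqueries in the SELECT list, each subquery returns several columns and (potentially) several rows rather than a single scalar, and the aliases future_finish / past_finish defined inside the subqueries are not visible to the outer CASE expressions. A minimal sketch of an equivalent formulation that avoids correlated subqueries altogether, again assuming the names and inclusive windows described above:

    SELECT
      agg.gmr_id,
      agg.played_dt,
      agg.attr1,
      agg.attr2,
      agg.attr3,
      CASE WHEN agg.future_finish  > agg.past_finish  THEN 1 ELSE 0 END `finish_f`,
      CASE WHEN agg.future_attacks > agg.past_attacks THEN 1 ELSE 0 END `attack_f`
    FROM (
      SELECT
        t1.gmr_id,
        t1.played_dt,
        t1.attr1,
        t1.attr2,
        t1.attr3,
        -- "future" window: played_dt .. played_dt + 2 days
        SUM(CASE WHEN t2.played_dt BETWEEN t1.played_dt
                                       AND DATE_ADD(t1.played_dt, 2)
                 THEN t2.finish  ELSE 0 END) AS future_finish,
        SUM(CASE WHEN t2.played_dt BETWEEN t1.played_dt
                                       AND DATE_ADD(t1.played_dt, 2)
                 THEN t2.attacks ELSE 0 END) AS future_attacks,
        -- "past" window: played_dt - 5 days .. played_dt - 1 day
        SUM(CASE WHEN t2.played_dt BETWEEN DATE_SUB(t1.played_dt, 5)
                                       AND DATE_SUB(t1.played_dt, 1)
                 THEN t2.finish  ELSE 0 END) AS past_finish,
        SUM(CASE WHEN t2.played_dt BETWEEN DATE_SUB(t1.played_dt, 5)
                                       AND DATE_SUB(t1.played_dt, 1)
                 THEN t2.attacks ELSE 0 END) AS past_attacks
      FROM TABLE_1 t1
      LEFT JOIN TABLE_2 t2
        ON t2.gmr_id = t1.gmr_id
      GROUP BY t1.gmr_id, t1.played_dt, t1.attr1, t1.attr2, t1.attr3
    ) agg;

The same statement should also be accepted by a HiveContext in Spark 1.5. Note that the gmr_id-only join fans each TABLE_1 row out to all of that gamer's TABLE_2 rows before the date filters in the CASE expressions apply, so at this scale it may help to bucket both tables on gmr_id.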
    

The expected output is as follows:

    +---------+
    | TABLE_1 |
    +-------+-----------+-------+-------+-------+-----------+-----------+
    |gmr_id | played_dt | attr1 | attr2 | attr3 | finish_f  | attack_f  |
    +-------+-----------+-------+-------+-------+-----------+-----------+
    |1      | 2017-01-01| 1     | 2     | txt   |     1     |     0     |
    |1      | 2017-01-02| 3     | 2     | txt   |     1     |     1     |
    |2      | 2017-01-02| 1     | 2     | txt   |     0     |     1     |
    +-------+-----------+-------+-------+-------+-----------+-----------+
    

I am using Hive 1.2 (or can use Spark 1.5) for this, but so far I have not been able to get it to work. What is the best way to achieve this? I would greatly appreciate your help.

0 answers:

No answers