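I have a table of roughly one billion game interactions, where each record describes a gamer's attributes on a particular date (gamer ID, date played, attribute 1, [attribute 2, ...]):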
+---------+
| TABLE_1 |
+--------+------------+-------+-------+-------+
| gmr_id | played_dt  | attr1 | attr2 | attr3 |
+--------+------------+-------+-------+-------+
| 1      | 2017-01-01 | 1     | 2     | txt   |
| 1      | 2017-01-02 | 3     | 2     | txt   |
| 2      | 2017-01-02 | 1     | 2     | txt   |
+--------+------------+-------+-------+-------+
I have another table with several million records that logs the gamer's moves for each game played:
+---------+
| TABLE_2 |
+--------+------------+--------+---------+--------+
| gmr_id | played_dt  | finish | attacks | deaths |
+--------+------------+--------+---------+--------+
| 1      | 2017-01-01 | 10     | 1       | 9      |
| 1      | 2017-01-03 | 12     | 10      | 2      |
| 2      | 2017-01-02 | 1      | 0       | 0      |
| 4      | 2017-01-03 | 1      | 0       | 1      |
| 1      | 2017-01-04 | 3      | 1       | 2      |
+--------+------------+--------+---------+--------+
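For each record in TABLE_1, that is, for each gmr_id and played_dt, I want to check whether the sum of moves over the two days from played_dt onward is greater than the sum over the previous five days (1 if true, 0 otherwise), and then join the result back to TABLE_1 on gmr_id and played_dt, i.e.: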
BETWEEN played_dt AND DATE_ADD(played_dt, 2)
BETWEEN DATE_SUB(played_dt, 5) AND DATE_SUB(played_dt, 1)
Join rows: gmr_id, played_dt, finish_f, attack_f
with TABLE_1 rows: gmr_id, played_dt, attr1, attr2, attr3
on gmr_id, played_dt
I tried writing correlated subqueries, but this does not work:
SELECT
    t1.gmr_id,
    t1.played_dt,
    (SELECT
        t2.gmr_id,
        SUM(t2.finish) `future_finish`,
        SUM(t2.attacks) `future_attacks`
     FROM TABLE_2 t2
     WHERE t2.played_dt BETWEEN played_dt AND DATE_ADD(played_dt, 2)
     GROUP BY t2.gmr_id),
    (SELECT
        t2.gmr_id,
        SUM(t2.finish) `past_finish`,
        SUM(t2.attacks) `past_attacks`
     FROM TABLE_2 t2
     WHERE t2.played_dt BETWEEN DATE_SUB(played_dt, 5) AND DATE_SUB(played_dt, 1)
     GROUP BY t2.gmr_id),
    CASE WHEN future_finish > past_finish THEN 1 ELSE 0 END `finish_f`,
    CASE WHEN future_attacks > past_attacks THEN 1 ELSE 0 END `attack_f`
FROM
    TABLE_1 t1;
The expected output looks like this:
+---------+
| TABLE_1 |
+--------+------------+-------+-------+-------+----------+----------+
| gmr_id | played_dt  | attr1 | attr2 | attr3 | finish_f | attack_f |
+--------+------------+-------+-------+-------+----------+----------+
| 1      | 2017-01-01 | 1     | 2     | txt   | 1        | 0        |
| 1      | 2017-01-02 | 3     | 2     | txt   | 1        | 1        |
| 2      | 2017-01-02 | 1     | 2     | txt   | 0        | 1        |
+--------+------------+-------+-------+-------+----------+----------+
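I am using Hive 1.2 (Spark 1.5 is also an option) for this, but so far I have not managed to get it working. What is the best way to achieve this? I would greatly appreciate your help.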
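One possible direction, offered as an untested sketch rather than a confirmed solution: as far as I know, Hive 1.2 accepts subqueries only in the FROM clause and in WHERE (IN / EXISTS), and only equality conditions in JOIN ... ON. The two windows can therefore be written as conditional sums over a plain equality join on gmr_id. The sketch assumes that played_dt is a 'yyyy-MM-dd' string or a DATE in both tables, that (gmr_id, played_dt) is unique in TABLE_1, and that a window with no games should count as 0. The alias `s` is only illustrative.

```sql
-- Untested sketch for Hive 1.2.
-- Assumptions: played_dt is a 'yyyy-MM-dd' string or DATE in both tables,
-- and (gmr_id, played_dt) is unique in TABLE_1.
SELECT
  s.gmr_id,
  s.played_dt,
  s.attr1,
  s.attr2,
  s.attr3,
  CASE WHEN s.future_finish  > s.past_finish  THEN 1 ELSE 0 END AS finish_f,
  CASE WHEN s.future_attacks > s.past_attacks THEN 1 ELSE 0 END AS attack_f
FROM (
  SELECT
    t1.gmr_id,
    t1.played_dt,
    t1.attr1,
    t1.attr2,
    t1.attr3,
    -- Rows outside a window add 0, so a window with no games sums to 0.
    SUM(CASE WHEN t2.played_dt BETWEEN t1.played_dt AND DATE_ADD(t1.played_dt, 2)
             THEN t2.finish ELSE 0 END) AS future_finish,
    SUM(CASE WHEN t2.played_dt BETWEEN DATE_SUB(t1.played_dt, 5) AND DATE_SUB(t1.played_dt, 1)
             THEN t2.finish ELSE 0 END) AS past_finish,
    SUM(CASE WHEN t2.played_dt BETWEEN t1.played_dt AND DATE_ADD(t1.played_dt, 2)
             THEN t2.attacks ELSE 0 END) AS future_attacks,
    SUM(CASE WHEN t2.played_dt BETWEEN DATE_SUB(t1.played_dt, 5) AND DATE_SUB(t1.played_dt, 1)
             THEN t2.attacks ELSE 0 END) AS past_attacks
  FROM TABLE_1 t1
  -- LEFT OUTER JOIN keeps TABLE_1 rows for gamers with no rows in TABLE_2.
  LEFT OUTER JOIN TABLE_2 t2
    ON t1.gmr_id = t2.gmr_id
  GROUP BY t1.gmr_id, t1.played_dt, t1.attr1, t1.attr2, t1.attr3
) s;
```

Because the join is on gmr_id alone, each TABLE_1 row is paired with every TABLE_2 row for that gamer before aggregation, so the intermediate result may be large at this scale. The date windows are applied inside the CASE expressions rather than in a WHERE clause so that TABLE_1 rows with no games in either window are still returned.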