I have 2 tables, like so:
TABLE_ONE (a small table with 500 records)
+--------+----------+------------+
| E_CODE | V_NUMBER | DATE       |
+--------+----------+------------+
| E1     | 1        | 2016-01-08 |
| E2     | 1        | 2016-01-05 |
| E3     | 1        | 2016-01-06 |
| E4     | 1        | 2016-01-03 |
| E5     | 1        | 2016-01-09 |
+--------+----------+------------+
TABLE_TWO (a huge table with more than 100,000,000 records)
+----------+-----+
|DATE      |VALUE|
+----------+-----+
|2016-01-01|1    |
|2016-01-02|2    |
|2016-01-03|3    |
|2016-01-04|4    |
|2016-01-05|5    |
|2016-01-06|6    |
|2016-01-07|7    |
|2016-01-08|8    |
|2016-01-09|9    |
|2016-01-10|10   |
+----------+-----+
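I want to generate a result table holding data aggregated over a predefined range, for example the sum of VALUE over the 5 days before each row's DATE.
Result table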
+--------+----------+------------+-------+
| E_CODE | V_NUMBER | DATE       | VALUE |
+--------+----------+------------+-------+
| E1     | 1        | 2016-01-08 | 25    |
| E2     | 1        | 2016-01-05 | 10    |
| E3     | 1        | 2016-01-06 | 15    |
| E4     | 1        | 2016-01-03 | 3     |
| E5     | 1        | 2016-01-09 | 30    |
+--------+----------+------------+-------+
Approach 1: Create a range table, RANGE_TABLE, from TABLE_TWO, something like:
+----------+----------+-----+
|START_DATE|END_DATE  |VALUE|
+----------+----------+-----+
|2016-01-01|2016-01-01|1    |
|2016-01-01|2016-01-02|3    |
|2016-01-01|2016-01-03|6    |
|2016-01-01|2016-01-04|10   |
|2016-01-01|2016-01-05|15   |
|2016-01-02|2016-01-06|20   |
|2016-01-03|2016-01-07|25   |
|2016-01-04|2016-01-08|30   |
|2016-01-05|2016-01-09|35   |
|2016-01-06|2016-01-10|40   |
|2016-01-07|2016-01-10|34   |
|2016-01-08|2016-01-10|27   |
|2016-01-09|2016-01-10|19   |
|2016-01-10|2016-01-10|10   |
+----------+----------+-----+
Then join it with TABLE_ONE on t2_range.END_DATE = date_sub(t1.date, 1). I have not yet found a way to create the range table.
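For illustration, one possible way to build such a range table is a windowed sum over per-day totals. This is only a sketch under assumptions that are not stated above: `range_table` is a hypothetical table name, the date columns are `yyyy-MM-dd` strings or DATE values, TABLE_TWO may hold several rows per day (hence the inner GROUP BY), and every calendar day is present, since `ROWS BETWEEN 4 PRECEDING` counts rows rather than days. The backticks guard against `date` being treated as a keyword in newer Hive versions.

```sql
-- Sketch: trailing 5-day sums, one row per END_DATE.
-- Assumes no missing calendar days in table_two; range_table is a hypothetical name.
CREATE TABLE range_table AS
SELECT
  MIN(d.`date`) OVER (ORDER BY d.`date`
                      ROWS BETWEEN 4 PRECEDING AND CURRENT ROW) AS start_date,
  d.`date` AS end_date,
  SUM(d.value) OVER (ORDER BY d.`date`
                     ROWS BETWEEN 4 PRECEDING AND CURRENT ROW) AS value
FROM (
  -- Collapse the huge table to one row per day before windowing.
  SELECT `date`, SUM(value) AS value
  FROM table_two
  GROUP BY `date`
) d;

-- Equi-join: the window must end on the day before the TABLE_ONE date.
SELECT
  t1.e_code,
  t1.v_number,
  t1.`date`,
  r.value
FROM table_one t1
JOIN range_table r
  ON r.end_date = DATE_SUB(t1.`date`, 1);
```

This yields only the rows keyed by END_DATE (the shortened windows at the start of the data, but not the ones starting on 2016-01-07 and later), which is all the join needs. A window without PARTITION BY is likely to run in a single reducer, but here it only sees one row per day rather than the full 100,000,000 records.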
Approach 2: Do an inequality join and get the required data:
select
    t1.e_code,
    t1.v_number,
    min(t1.date),
    sum(t2.value) as value
from
    table_one t1
    join table_two t2 on true
where
    -- the 5 days before t1.date (an offset of 6 would cover 6 days)
    t2.date between DATE_SUB(t1.date, 5) and DATE_SUB(t1.date, 1)
group by
    t1.e_code,
    t1.v_number
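Because this is an inequality join, the query runs with only 1 reducer, so it is far too slow. Even on a small dataset it takes a long time to return results.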
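One direction that might avoid the single-reducer cross join is to turn the range condition into an equality: expand each TABLE_ONE row into the 5 dates it needs, then equi-join on the date. The sketch below assumes the date columns are `yyyy-MM-dd` strings or DATE values that compare equal to the output of `DATE_SUB`; the names `lookup_date`, `days_back` and `offsets` are made up for illustration.

```sql
-- Sketch: equi-join instead of a range join.
SELECT
  t1.e_code,
  t1.v_number,
  t1.`date`,
  SUM(t2.value) AS value
FROM (
  -- Expand each table_one row into the 5 days preceding its date
  -- (500 rows become 2,500 rows).
  SELECT
    e_code,
    v_number,
    `date`,
    DATE_SUB(`date`, days_back) AS lookup_date
  FROM table_one
  LATERAL VIEW explode(array(1, 2, 3, 4, 5)) offsets AS days_back
) t1
JOIN table_two t2
  ON t2.`date` = t1.lookup_date
GROUP BY
  t1.e_code,
  t1.v_number,
  t1.`date`;
```

Because the expanded side stays tiny, Hive may be able to run this as a map-side join when `hive.auto.convert.join` is enabled, so TABLE_TWO is scanned in parallel rather than funnelled through one reducer. As with the original query, a TABLE_ONE row with no matching days in TABLE_TWO drops out of the result.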
Question: How can I solve this?