Generate a hierarchy sequence for the following Hive join using the Spark GraphX library

Time: 2018-06-26 07:09:02

Tags: mysql hadoop apache-spark-sql hiveql spark-graphx

Below is a sample dataset of transactions in which `t_id` and `parent_id` have a dependency relationship.

t_id    first_name  parent_id   amount  department_id   sal       datetime_updated

1       Jared       None        1000    5       4088908   13/10/2017
2       Jared       1           -5000   1       8033313   17/10/2018
3       Jared       2           1000    5       17373148  23/07/2018
4       Tucker      None        10000   3       16320817  08/09/2018
5       Tucker      4           -10000  2       5094970   24/08/2017
6       Tucker      5           5000    1       7435169   09/11/2018
7       Tucker      5           -2500   5       7859621   21/12/2018
8       Tucker      4           3000    2       5639934   14/07/2018

The query used:

select 
t1.t_id ,
t1.first_name,
t1.amount,
t1.parent_id,
t2.t_id ,
t2.first_name,
t2.amount,
t2.parent_id,
t3.t_id ,
t3.first_name,
t3.amount,
t3.parent_id,
t4.t_id ,
t4.first_name,
t4.amount,
t4.parent_id
from Transactions t1
left join Transactions t2
on t1.parent_id = t2.t_id
left join Transactions t3
on t2.parent_id = t3.t_id
left join Transactions t4
on t3.parent_id = t4.t_id;
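
To make the intent of the SQL concrete: the four-way left self-join flattens each transaction's parent chain into at most four fixed column groups (row, parent, grandparent, great-grandparent), padding with NULLs where the chain is shorter. A minimal plain-Python sketch of the same logic, using the sample rows from the question (the helper name and dict layout are mine, for illustration only):

```python
# Sample data from the question: t_id -> (first_name, parent_id, amount).
rows = {
    1: ("Jared", None, 1000),
    2: ("Jared", 1, -5000),
    3: ("Jared", 2, 1000),
    4: ("Tucker", None, 10000),
    5: ("Tucker", 4, -10000),
    6: ("Tucker", 5, 5000),
    7: ("Tucker", 5, -2500),
    8: ("Tucker", 4, 3000),
}

def flatten_to_depth(t_id, depth=4):
    """Follow parent_id up to `depth` levels, padding with None
    the way the chained LEFT JOINs pad with NULL."""
    out = []
    current = t_id
    for _ in range(depth):
        if current is None:
            out.append(None)
        else:
            name, parent, amount = rows[current]
            out.append((current, name, amount, parent))
            current = parent
    return out

# t_id 6 has the chain 6 -> 5 -> 4 -> (root), so the fourth slot is None.
print(flatten_to_depth(6))
```

Note the hard limit this exposes: a chain deeper than four levels is silently truncated, and every extra level costs another full self-join.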

Output of the above query:

+------+------------+--------+-----------+------+------------+--------+-----------+------+------------+--------+-----------+------+------------+--------+-----------+
| t_id | first_name | amount | parent_id | t_id | first_name | amount | parent_id | t_id | first_name | amount | parent_id | t_id | first_name | amount | parent_id |
+------+------------+--------+-----------+------+------------+--------+-----------+------+------------+--------+-----------+------+------------+--------+-----------+
|    1 | Jared      |   1000 |         0 | NULL | NULL       |   NULL |      NULL | NULL | NULL       |   NULL |      NULL | NULL | NULL       |   NULL |      NULL |
|    2 | Jared      |  -5000 |         1 |    1 | Jared      |   1000 |         0 | NULL | NULL       |   NULL |      NULL | NULL | NULL       |   NULL |      NULL |
|    3 | Jared      |   1000 |         2 |    2 | Jared      |  -5000 |         1 |    1 | Jared      |   1000 |         0 | NULL | NULL       |   NULL |      NULL |
|    4 | Tucker     |  10000 |         0 | NULL | NULL       |   NULL |      NULL | NULL | NULL       |   NULL |      NULL | NULL | NULL       |   NULL |      NULL |
|    5 | Tucker     | -10000 |         4 |    4 | Tucker     |  10000 |         0 | NULL | NULL       |   NULL |      NULL | NULL | NULL       |   NULL |      NULL |
|    6 | Tucker     |   5000 |         5 |    5 | Tucker     | -10000 |         4 |    4 | Tucker     |  10000 |         0 | NULL | NULL       |   NULL |      NULL |
|    7 | Tucker     |  -2500 |         5 |    5 | Tucker     | -10000 |         4 |    4 | Tucker     |  10000 |         0 | NULL | NULL       |   NULL |      NULL |
|    8 | Thane      |   3000 |         4 |    4 | Tucker     |  10000 |         0 | NULL | NULL       |   NULL |      NULL | NULL | NULL       |   NULL |      NULL |
|    9 | Nicholas   |   1000 |         0 | NULL | NULL       |   NULL |      NULL | NULL | NULL       |   NULL |      NULL | NULL | NULL       |   NULL |      NULL |
|   10 | Mason      |   2000 |         0 | NULL | NULL       |   NULL |      NULL | NULL | NULL       |   NULL |      NULL | NULL | NULL       |   NULL |      NULL |
|   11 | Noah       |   5000 |         0 | NULL | NULL       |   NULL |      NULL | NULL | NULL       |   NULL |      NULL | NULL | NULL       |   NULL |      NULL |
+------+------------+--------+-----------+------+------------+--------+-----------+------+------------+--------+-----------+------+------------+--------+-----------+

Question

I want to generate the same output as shown above, but I cannot use the above join approach,
as it fails over larger data sets when running on Spark SQL.

Is there any other way I can optimise the above query, or otherwise generate the same
kind of data?
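
One common alternative (not stated in the question, so treat this as an assumption) is to drop the fixed number of self-joins and instead walk each transaction's parent chain iteratively until it reaches a root. This is effectively the traversal a GraphX Pregel program or a repeatedly self-joined DataFrame performs, and it handles arbitrary depth. A plain-Python sketch of the idea on the sample rows (the function name and data layout are mine):

```python
# Sample data from the question: t_id -> (first_name, parent_id, amount).
rows = {
    1: ("Jared", None, 1000),
    2: ("Jared", 1, -5000),
    3: ("Jared", 2, 1000),
    4: ("Tucker", None, 10000),
    5: ("Tucker", 4, -10000),
    6: ("Tucker", 5, 5000),
    7: ("Tucker", 5, -2500),
    8: ("Tucker", 4, 3000),
}

def ancestor_chain(t_id):
    """Return the list of t_ids from the given row up to its root,
    following parent_id links until None is reached."""
    chain = []
    current = t_id
    while current is not None:
        chain.append(current)
        current = rows[current][1]  # next hop is this row's parent_id
    return chain

print(ancestor_chain(3))  # Jared's chain: 3 -> 2 -> 1
print(ancestor_chain(7))  # Tucker's chain: 7 -> 5 -> 4
```

In Spark, the same per-vertex "propagate my id to my children until nothing changes" loop can be expressed with GraphX's Pregel API (edges built from `parent_id -> t_id`), which scales the traversal across the cluster instead of materialising one shuffle-heavy join per level.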

0 Answers:

No answers yet.