How to generate data equivalent to a self-join in Spark GraphX

Time: 2018-06-27 19:34:49

Tags: apache-spark apache-spark-sql spark-graphx

Below is the input data:

Input

t_id    first_name  parent_id   amount  dept_id sal       datetime_updated
1       Jared       None        1000    5       4088908   13/10/2017
2       Jared       1           -5000   1       8033313   17/10/2018
3       Jared       2           1000    5       17373148  23/07/2018
4       Tucker      None        10000   3       16320817  08/09/2018
5       Tucker      4           -10000  2       5094970   24/08/2017
6       Tucker      5           5000    1       7435169   09/11/2018
7       Tucker      5           -2500   5       7859621   21/12/2018
8       Tucker      4           3000    2       5639934   14/07/2018

Output

+------+------------+--------+-----------+------+------------+--------+-----------+------+------------+--------+-----------+------+------------+--------+-----------+
| t_id | first_name | amount | parent_id | t_id | first_name | amount | parent_id | t_id | first_name | amount | parent_id | t_id | first_name | amount | parent_id |
+------+------------+--------+-----------+------+------------+--------+-----------+------+------------+--------+-----------+------+------------+--------+-----------+
|    1 | Jared      |   1000 |         0 | NULL | NULL       |   NULL |      NULL | NULL | NULL       |   NULL |      NULL | NULL | NULL       |   NULL |      NULL |
|    2 | Jared      |  -5000 |         1 |    1 | Jared      |   1000 |         0 | NULL | NULL       |   NULL |      NULL | NULL | NULL       |   NULL |      NULL |
|    3 | Jared      |   1000 |         2 |    2 | Jared      |  -5000 |         1 |    1 | Jared      |   1000 |         0 | NULL | NULL       |   NULL |      NULL |
|    4 | Tucker     |  10000 |         0 | NULL | NULL       |   NULL |      NULL | NULL | NULL       |   NULL |      NULL | NULL | NULL       |   NULL |      NULL |
|    5 | Tucker     | -10000 |         4 |    4 | Tucker     |  10000 |         0 | NULL | NULL       |   NULL |      NULL | NULL | NULL       |   NULL |      NULL |
|    6 | Tucker     |   5000 |         5 |    5 | Tucker     | -10000 |         4 |    4 | Tucker     |  10000 |         0 | NULL | NULL       |   NULL |      NULL |
|    7 | Tucker     |  -2500 |         5 |    5 | Tucker     | -10000 |         4 |    4 | Tucker     |  10000 |         0 | NULL | NULL       |   NULL |      NULL |
|    8 | Thane      |   3000 |         4 |    4 | Tucker     |  10000 |         0 | NULL | NULL       |   NULL |      NULL | NULL | NULL       |   NULL |      NULL |
|    9 | Nicholas   |   1000 |         0 | NULL | NULL       |   NULL |      NULL | NULL | NULL       |   NULL |      NULL | NULL | NULL       |   NULL |      NULL |
|   10 | Mason      |   2000 |         0 | NULL | NULL       |   NULL |      NULL | NULL | NULL       |   NULL |      NULL | NULL | NULL       |   NULL |      NULL |
|   11 | Noah       |   5000 |         0 | NULL | NULL       |   NULL |      NULL | NULL | NULL       |   NULL |      NULL | NULL | NULL       |   NULL |      NULL |
+------+------------+--------+-----------+------+------------+--------+-----------+------+------------+--------+-----------+------+------------+--------+-----------+

How can I generate the output above using the Spark GraphX library? In the given input data, parent_id and t_id have a parent-child relationship, and based on that relationship the maximum hierarchy depth I have to traverse is 4 levels. I can generate the output with a SQL self-join query, but I am not sure how to achieve the same result with the Spark GraphX library.
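One possible direction, offered as a minimal sketch rather than a definitive answer: treat each row as a vertex, add an edge from every parent to its child, and let the GraphX Pregel operator push each parent's ancestor chain down to its children. Each vertex then ends up holding its own record plus up to three ancestor records, which is the same information the four column groups of the expected output carry. The names `Txn`, `AncestorChains`, `sample`, `ancestors` and `maxAncestors` are hypothetical and only serve the illustration. The sketch assumes that parent_id 0 means "no parent", that every row has at most one parent, and that the rows are the ones shown in the first four columns of the expected output.

```scala
import org.apache.spark.graphx.{Edge, Graph}
import org.apache.spark.sql.SparkSession

// Hypothetical record type: the four columns repeated in the expected output.
case class Txn(tId: Long, firstName: String, amount: Int, parentId: Long)

object AncestorChains {
  // Rows copied from the first column group of the expected output.
  // Assumption: parent_id 0 marks a root row.
  val sample: Seq[Txn] = Seq(
    Txn(1, "Jared", 1000, 0), Txn(2, "Jared", -5000, 1), Txn(3, "Jared", 1000, 2),
    Txn(4, "Tucker", 10000, 0), Txn(5, "Tucker", -10000, 4), Txn(6, "Tucker", 5000, 5),
    Txn(7, "Tucker", -2500, 5), Txn(8, "Thane", 3000, 4), Txn(9, "Nicholas", 1000, 0),
    Txn(10, "Mason", 2000, 0), Txn(11, "Noah", 5000, 0)
  )

  // Returns, per t_id, the ancestor records ordered nearest parent first.
  def ancestors(spark: SparkSession, rows: Seq[Txn], maxAncestors: Int): Map[Long, List[Txn]] = {
    val sc = spark.sparkContext
    val vertices = sc.parallelize(rows.map(t => (t.tId, t)))
    // One edge per non-root row, pointing from parent to child; the Int attribute is a dummy.
    val edges = sc.parallelize(rows.filter(_.parentId != 0).map(t => Edge(t.parentId, t.tId, 1)))

    // Vertex state: (own record, ancestors found so far).
    val start = Graph(vertices, edges).mapVertices((_, txn) => (txn, List.empty[Txn]))

    val result = start.pregel(List.empty[Txn], maxIterations = maxAncestors)(
      // Vertex program: adopt the chain received from the parent, if any.
      (_, attr, msg) => if (msg.isEmpty) attr else (attr._1, msg),
      // Send: offer "parent + parent's ancestors" to the child when it differs from what the child has.
      triplet => {
        val chain = (triplet.srcAttr._1 :: triplet.srcAttr._2).take(maxAncestors)
        if (chain != triplet.dstAttr._2) Iterator((triplet.dstId, chain)) else Iterator.empty
      },
      // Merge: a row has a single parent here, so this is only a tie-breaker.
      (a, b) => if (a.length >= b.length) a else b
    )

    result.vertices.collect().map { case (id, (_, chain)) => (id, chain) }.toMap
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("AncestorChains").getOrCreate()
    val maxAncestors = 3 // 4 levels in total: the row itself plus up to 3 ancestors
    val chains = ancestors(spark, sample, maxAncestors)

    // Print one wide row per t_id, padding missing levels with NULL groups.
    sample.sortBy(_.tId).foreach { txn =>
      val cells = (txn :: chains(txn.tId)).map(t => s"${t.tId},${t.firstName},${t.amount},${t.parentId}")
      println(cells.padTo(maxAncestors + 1, "NULL,NULL,NULL,NULL").mkString(" | "))
    }
    spark.stop()
  }
}
```

The sketch collects the result to the driver only to print it; for real data the `result.vertices` RDD could be kept distributed and converted to a DataFrame instead. Capping the chain with `take(maxAncestors)` mirrors the four-level limit of the self-join.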

0 answers:

No answers