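I have a dataset of connections in which each row records that A is connected to B, in the form A B. A direct link between A and B appears only once, either as A B or as B A. I want to find all connections that are at most one hop apart: A and C are at most one hop apart if A and C are directly connected, or if A is connected to C through some B.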
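For example, I have the following direct connection data:
1 2
2 4
3 7
4 5
The result I want should then also contain the pairs that are one hop apart, for example 1 4 (through 2) and 2 5 (through 4).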
Can anybody help me find a way to do this as efficiently as possible? Thanks.
Answer 0 (score: 1)
You can do something like this:
myudf.py
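@outputSchema('bagofnums: {(num:int)}')
def merge_distinct(b1, b2):
    # Each bag holds two-field tuples; only the second field (the link) is kept.
    # Note: despite the name, repeated links are not removed.
    out = []
    for ignore, n in b1:
        out.append(n)
    for ignore, n in b2:
        out.append(n)
    return out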
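If the data contains a triangle such as 1 2, 2 3, 1 3, the same link can show up both as a direct link and as a one-hop link, so the merged bag would contain repeats. The following is only a sketch of an order-preserving variant in plain Python; the name merge_unique is hypothetical and not part of the original answer, and to use it from Pig it would presumably need the same @outputSchema decorator as merge_distinct.

```python
def merge_unique(b1, b2):
    """Merge the second field of every tuple in two bags, dropping repeats."""
    seen = set()
    out = []
    for bag in (b1, b2):
        for _, n in bag:
            # Keep only the first occurrence of each link.
            if n not in seen:
                seen.add(n)
                out.append(n)
    return out


# Node 1 in the triangle 1-2, 2-3, 1-3: direct links first, then one-hop links.
direct = [(1, 2), (1, 3)]
hopped = [(1, 3), (1, 2)]
print(merge_unique(direct, hopped))
```

Keeping a list next to the set preserves the order in which links were first seen, so direct links still come before links that are one hop away.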
script.pig
register 'myudf.py' using jython as myudf ;
A = LOAD 'foo.in' USING PigStorage(' ') AS (num: int, link: int) ;
-- Essentially flips A
B = FOREACH A GENERATE link AS num, num AS link ;
-- We need to union the flipped A with A so that we will know:
-- 3 links to 7
-- 7 links to 3
-- Instead of just:
-- 3 links to 7
C = UNION A, B ;
-- C is in the form (num, link)
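-- You can't do JOIN C BY link, C BY num ;
-- So T is just a copy of C
T = FOREACH C GENERATE * ;
D = JOIN C BY link, T BY num ;
-- D has the fields (C::num, C::link, T::num, T::link), so $0 is the start
-- node and $3 is the node reached through C::link. Drop the paths that hop
-- straight back to the start node.
E = FOREACH (FILTER D BY $0 != $3) GENERATE $0 AS num, $3 AS link_hopped ;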
-- The output of E is (num, link) pairs where the link is one hop away. E.g. given:
-- 1 links to 2
-- 2 links to 4
-- 3 links to 7
-- the output will include:
-- 1 links to 4
F = COGROUP C BY num, E BY num ;
-- I use a UDF here to merge the bags together. Otherwise you will end
-- up with a bag for C (direct links) and E (links one hop away).
G = FOREACH F GENERATE group AS num, myudf.merge_distinct(C, E) ;
Schema and output of G using your example input:
G: {num: int,bagofnums: {(num: int)}}
(1,{(2),(4)})
(2,{(4),(1),(5)})
(3,{(7)})
(4,{(5),(2),(1)})
(5,{(4),(2)})
(7,{(3)})
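To sanity-check the Pig output on a small input, the same "at most one hop" definition can be computed in memory. This is only a sketch in plain Python, not part of the Pig solution; the function name within_one_hop is hypothetical, and it assumes the whole edge list fits in memory.

```python
from collections import defaultdict


def within_one_hop(edges):
    """Map each node to the set of nodes at most one hop away from it."""
    # Store every link in both directions, like the UNION of A and flipped A.
    direct = defaultdict(set)
    for a, b in edges:
        direct[a].add(b)
        direct[b].add(a)

    result = {}
    for node, neighbours in list(direct.items()):
        reachable = set(neighbours)
        for mid in neighbours:
            # Everything linked to a direct neighbour is one hop away.
            reachable |= direct[mid]
        # A node is not linked to itself.
        reachable.discard(node)
        result[node] = reachable
    return result


edges = [(1, 2), (2, 4), (3, 7), (4, 5)]
for node, linked in sorted(within_one_hop(edges).items()):
    print(node, sorted(linked))
```

For the example input this gives the same groups as G above, for instance 2 is linked to 1, 4 and 5.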