I have the following requirement.
I have a file A with the following records:
content X Y
c1 A1 A2
c2 null / empty A2
c3 A1 null / empty
c4 B1 null / empty
c5 null / empty B2
c6 B1 B2
c7 D1 D2
c8 F1 null / empty
c9 G1 null / empty
I have another small file B with the following content:
X Y
A1 A2
B1 B2
Now I need a set-difference (A - B) style join that gives the following result:
content X Y
c7 D1 D2
c8 F1 null / empty
c9 G1 null / empty
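I am currently using a replicated join because file B fits in memory. However, I don't know how to join on X, on Y, or on both here. I am not very comfortable with DB queries.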
Regards, Aditya
Answer 0 (score: 0)
I think SUBTRACT can help here. Give it a try!
https://www.tutorialspoint.com/apache_pig/apache_pig_subtract.htm https://pig.apache.org/docs/r0.12.0/func.html#subtract
Answer 1 (score: 0)
Below are the intermediate and final results of two replicated JOINs, each followed by a FILTER:
cat ids_test.json
{"A":"a1","B":"a2"}
cat part-test
{"content":"both_A_a1_B_a2","meta":{"A":"a1","B":"a2"}}
{"content":"only_B_a2","meta":{"A":"","B":"a2"}}
{"content":"only_A_a1","meta":{"A":"a1","B":""}}
{"content":"both_A_b1_B_b2","meta":{"A":"b1","B":"b2"}}
{"content":"only_A_c1","meta":{"A":"c1","B":""}}
cat /tmp/j1/part-m-00000
{"user_data::json":{"meta":"{B=a2, A=a1}","content":"both_A_a1_B_a2"},"ids::json":{"B":"a2","A":"a1"}}
{"user_data::json":{"meta":"{B=a2, A=}","content":"only_B_a2"},"ids::json":null}
{"user_data::json":{"meta":"{B=, A=a1}","content":"only_A_a1"},"ids::json":{"B":"a2","A":"a1"}}
{"user_data::json":{"meta":"{B=b2, A=b1}","content":"both_A_b1_B_b2"},"ids::json":null}
{"user_data::json":{"meta":"{B=, A=c1}","content":"only_A_c1"},"ids::json":null}
cat /tmp/j1_filter/part-m-00000
{"user_data::json":{"meta":"{B=a2, A=}","content":"only_B_a2"},"ids::json":null}
{"user_data::json":{"meta":"{B=b2, A=b1}","content":"both_A_b1_B_b2"},"ids::json":null}
{"user_data::json":{"meta":"{B=, A=c1}","content":"only_A_c1"},"ids::json":null}
cat /tmp/j2/part-m-00000
{"J1_FILTER::user_data::json":{"meta":"{B=a2, A=}","content":"only_B_a2"},"J1_FILTER::ids::json":null,"ids::json":{"B":"a2","A":"a1"}}
{"J1_FILTER::user_data::json":{"meta":"{B=b2, A=b1}","content":"both_A_b1_B_b2"},"J1_FILTER::ids::json":null,"ids::json":null}
{"J1_FILTER::user_data::json":{"meta":"{B=, A=c1}","content":"only_A_c1"},"J1_FILTER::ids::json":null,"ids::json":null}
cat /tmp/results/part-m-00000
{"J1_FILTER::user_data::json":{"meta":"{B=b2, A=b1}","content":"both_A_b1_B_b2"}}
{"J1_FILTER::user_data::json":{"meta":"{B=, A=c1}","content":"only_A_c1"}}
Here is the code:
-- Load the records (file A) and the small id list (file B) as nested JSON maps.
user_data = LOAD 'part-test' USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad') as (json:map[]);
ids = LOAD 'ids_test.json' USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad') as (json:map[]);
-- First pass: left outer replicated join on meta.A, then keep only the rows with no match.
J1 = JOIN user_data BY json#'meta'#'A' LEFT OUTER, ids BY json#'A' USING 'replicated';
rmf /tmp/j1
store J1 into '/tmp/j1' USING JsonStorage;
J1_FILTER = FILTER J1 BY ids::json is null;
rmf /tmp/j1_filter
store J1_FILTER into '/tmp/j1_filter' USING JsonStorage;
-- Second pass: repeat on meta.B for the rows that survived the first pass.
J2 = JOIN J1_FILTER BY user_data::json#'meta'#'B' left outer, ids BY json#'B' USING 'replicated';
rmf /tmp/j2
store J2 into '/tmp/j2' USING JsonStorage;
J2_FILTER = FILTER J2 BY ids::json is null;
-- Keep only the original record from the surviving rows.
RESULTS = FOREACH J2_FILTER GENERATE J1_FILTER::user_data::json;
--filtered_ids = FOREACH user_data_MINUS_ids GENERATE user_data AS data;
--DUMP filtered_ids;
rmf /tmp/results
store RESULTS into '/tmp/results' USING JsonStorage;
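The two join-then-filter passes above amount to two anti-joins, one per key. The following is a minimal Python sketch of that logic only, not a Pig replacement; the helper name `anti_join` is made up for illustration, and the data is copied from `ids_test.json` and `part-test` above.

```python
# Illustrative Python sketch (not Pig): mirrors the two join-then-filter passes.
ids = [{"A": "a1", "B": "a2"}]
user_data = [
    {"content": "both_A_a1_B_a2", "meta": {"A": "a1", "B": "a2"}},
    {"content": "only_B_a2", "meta": {"A": "", "B": "a2"}},
    {"content": "only_A_a1", "meta": {"A": "a1", "B": ""}},
    {"content": "both_A_b1_B_b2", "meta": {"A": "b1", "B": "b2"}},
    {"content": "only_A_c1", "meta": {"A": "c1", "B": ""}},
]


def anti_join(rows, id_rows, field):
    """Keep rows whose meta[field] matches no id.

    Hypothetical helper: plays the role of LEFT OUTER JOIN followed by
    FILTER ... IS NULL in the Pig script.
    """
    known = {entry[field] for entry in id_rows}
    return [row for row in rows if row["meta"][field] not in known]


j1_filter = anti_join(user_data, ids, "A")  # drops rows that match on A
results = anti_join(j1_filter, ids, "B")    # drops rows that match on B
print([row["content"] for row in results])
```

The surviving records are the same two that appear in /tmp/results above.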
Answer 2 (score: 0)
The idea is to build a single field that holds the values of both X and Y from b.txt, so that one replicated join is enough.
a.txt
c1 A1 A2
c2 A2
c3 A1
c4 B1
c5 B2
c6 B1 B2
c7 D1 D2
c8 F1
c9 G1
b.txt
A1 A2
B1 B2
Pig snippet
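adataset = LOAD 'a.txt' USING PigStorage(' ') AS (content:chararray,key1:chararray,key2:chararray);
-- Load each line of b.txt as one field, then split it so every X and Y value becomes its own row.
bdataset = LOAD 'b.txt' USING PigStorage(',') AS (key:chararray);
keys = FOREACH bdataset GENERATE FLATTEN(TOKENIZE(key,' ')) AS key;
-- Set aside rows where both keys are null; they have nothing to join on.
SPLIT adataset INTO null_keys IF(key1 IS NULL AND key2 IS NULL),
not_null_keys IF NOT(key1 IS NULL AND key2 IS NULL);
-- Join on key1, falling back to key2 when key1 is null. Only one key per row is compared.
joined_data = JOIN not_null_keys BY (key1 IS NULL ? key2 : key1) LEFT, keys BY key USING 'replicated';
-- Rows with no match in keys are the ones to keep.
req_data = FILTER joined_data BY keys::key IS NULL;
DUMP req_data;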
Output
(c7,D1,D2,)
(c8,F1,,)
(c9,G1,,)
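The single-key-set idea can be sketched in Python as follows. This only illustrates the logic and is not Pig: `None` stands in for a null / empty field, the data is taken from the question, and the helper `keep` is a made-up name. Unlike the Pig snippet, this sketch tests both keys of each row rather than one, and it assumes that X and Y values in file B never collide with each other, since both go into the same set.

```python
# Illustrative Python sketch (not Pig). None stands in for a null / empty field.
a_rows = [
    ("c1", "A1", "A2"),
    ("c2", None, "A2"),
    ("c3", "A1", None),
    ("c4", "B1", None),
    ("c5", None, "B2"),
    ("c6", "B1", "B2"),
    ("c7", "D1", "D2"),
    ("c8", "F1", None),
    ("c9", "G1", None),
]
b_rows = [("A1", "A2"), ("B1", "B2")]

# Flatten every X and Y value of B into one lookup set,
# similar in spirit to FLATTEN(TOKENIZE(...)) in the Pig snippet.
keys = {value for row in b_rows for value in row}


def keep(row):
    """Return True when neither key of the row appears among B's values."""
    _, key1, key2 = row
    return key1 not in keys and key2 not in keys


req_data = [row for row in a_rows if keep(row)]
print(req_data)
```

Only the c7, c8 and c9 rows remain, which is the result asked for in the question.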