I have three kinds of data:
1) base data 2) data_dict_1 3) data_dict_2
The base data is nicely formatted JSON, for example:
{"id1":"foo", "id2":"bar" ,type:"type1"}
{"id1":"foo", "id2":"bar" ,type:"type2"}
data_dict_1
1 foo
2 bar
3 foobar
....
data_dict_2
-1 foo
-2 bar
-3 foobar
... and so on
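Now, what I want is this: if a record is type1, look up id1 in data_dict_1 and id2 in data_dict_2 and assign the corresponding integer id. If a record is type2, look up id1 in data_dict_2 and id2 in data_dict_1 and assign the corresponding id. For example: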
{"id1":1, "id2":2 ,type:"type1"}
{"id1":-1, "id2":-2 ,type:"type2"}
And so on. How do I do this in Pig?
Answer 0 (score: 1)
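Note: the sample in the question is not valid JSON, because the type key is not quoted.
Assuming Pig 0.10 or later, which has a built-in JsonLoader, you can pass it a schema and load the data:
data = LOAD 'loljson' USING JsonLoader('id1:chararray,id2:chararray,type:chararray');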
Then load the dicts:
dict_1 = LOAD 'data_dict_1' USING PigStorage(' ') AS (id:int, key:chararray);
dict_2 = LOAD 'data_dict_2' USING PigStorage(' ') AS (id:int, key:chararray);
Then SPLIT the data on the value of type:
SPLIT data INTO type1 IF type == 'type1', type2 IF type == 'type2';
JOIN each branch against the appropriate dict:
type1_joined = JOIN type1 BY id1, dict_1 BY key;
type1_joined = FOREACH type1_joined GENERATE type1::id1 AS id1, type1::id2 AS id2, type1::type AS type, dict_1::id AS id;
type2_joined = JOIN type2 BY id2, dict_2 BY key;
type2_joined = FOREACH type2_joined GENERATE type2::id1 AS id1, type2::id2 AS id2, type2::type AS type, dict_2::id AS id;
Since the two schemas are identical, UNION them together:
final_data = UNION type1_joined, type2_joined;
This produces:
DUMP final_data;
(foo,bar,type2,-2)
(foo,bar,type1,1)
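The joins above attach one looked-up id per record. If the goal is to replace both id1 and id2, as the question describes, one possible approach is to chain two joins per branch. The following is only a sketch: it assumes the aliases type1, type2, dict_1 and dict_2 defined above, and the t1_*, t2_*, id2_key and mapped names are made up for illustration.

```pig
-- Sketch only: replace both id1 and id2 with integer ids.
-- Reuses type1, type2, dict_1, dict_2 from the statements above.

-- type1: id1 is looked up in dict_1, id2 in dict_2
t1_a   = JOIN type1 BY id1, dict_1 BY key;
t1_b   = FOREACH t1_a GENERATE dict_1::id AS id1, type1::id2 AS id2_key, type1::type AS type;
t1_c   = JOIN t1_b BY id2_key, dict_2 BY key;
t1_out = FOREACH t1_c GENERATE t1_b::id1 AS id1, dict_2::id AS id2, t1_b::type AS type;

-- type2: id1 is looked up in dict_2, id2 in dict_1
t2_a   = JOIN type2 BY id1, dict_2 BY key;
t2_b   = FOREACH t2_a GENERATE dict_2::id AS id1, type2::id2 AS id2_key, type2::type AS type;
t2_c   = JOIN t2_b BY id2_key, dict_1 BY key;
t2_out = FOREACH t2_c GENERATE t2_b::id1 AS id1, dict_1::id AS id2, t2_b::type AS type;

-- Both branches now share the schema (id1:int, id2:int, type:chararray)
mapped = UNION t1_out, t2_out;
DUMP mapped;
```

Following the rule as written in the question, the two sample records should map to (1,-2,type1) and (-1,2,type2), in no guaranteed order. These are inner joins, so a record whose key is missing from a dict is dropped; a LEFT OUTER join would keep such records with a null id instead.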