I have a line like the following:
evnt=redeem&lid=1030023&upt=1679&pid=000000000001076056,000000000001072654,000000000001067925&ppt=996,246,366&qty=1,2,3
I want to extract lid, pid, ppt and qty from the line, and create a tuple for each entry in pid, ppt and qty. Note that the rule is:
lid=4&pid=1,2&qty=2,3&ppt=123,232
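means that for lid=4 and pid=1, qty=2 and ppt=123; and for lid=4 and pid=2, ppt=232 and qty=3. I have been able to do this for the lid and pid fields with the following:
logs = foreach logs generate
    REGEX_EXTRACT(original_path, 'lid=([^&]+)', 1) as login_id,
    FLATTEN(TOKENIZE(REPLACE(REGEX_EXTRACT(original_path, '.*pid=([^&]+)', 1), ',', ' '))) as pid;
This gives me:
1030023 000000000001076056
1030023 000000000001072654
1030023 000000000001067925
However, I would also like to do this for the other two fields (keeping the result as a 3-tuple), and multiple flattens in the same foreach statement do not give me what I want.
I am guessing this will need a UDF, but I would like to know whether there is another way that uses only the functions provided in Pig.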
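To make the pairing rule concrete, here is a minimal sketch in plain Python (not Pig). The helper names `split_fields`, `pair_by_position` and `cross_product` are hypothetical and exist only for this illustration. `pair_by_position` produces the rows the question asks for, while `cross_product` mimics what flattening several bags in one foreach tends to produce in Pig, namely every combination of the entries rather than the i-th entries matched together.

```python
from itertools import product


def split_fields(line):
    """Split a 'key=value&key=value' line into lid plus three lists."""
    fields = dict(pair.split("=", 1) for pair in line.split("&"))
    return (
        fields["lid"],
        fields["pid"].split(","),
        fields["ppt"].split(","),
        fields["qty"].split(","),
    )


def pair_by_position(line):
    """Desired result: the i-th pid goes with the i-th ppt and the i-th qty."""
    lid, pids, ppts, qtys = split_fields(line)
    return [(lid, pid, ppt, qty) for pid, ppt, qty in zip(pids, ppts, qtys)]


def cross_product(line):
    """Every pid combined with every ppt and every qty (not what is wanted)."""
    lid, pids, ppts, qtys = split_fields(line)
    return [(lid, pid, ppt, qty) for pid, ppt, qty in product(pids, ppts, qtys)]


if __name__ == "__main__":
    rule = "lid=4&pid=1,2&qty=2,3&ppt=123,232"
    for row in pair_by_position(rule):
        print(row)
    print(len(cross_product(rule)), "rows in the cross product")
```

For the rule line above, positional pairing yields two rows, one per pid, whereas the cross product yields eight.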
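Answer 0 (score: 1)
I am not entirely sure what you want the output to look like, but this is how you can do it in pure Pig.
In my opinion, tuples are a bit awkward in Pig when you do not know the number of fields in advance. So if the order of the numbers does not matter, I would suggest using bags. In this case, TOKENIZE produces its output as a bag, and STRSPLIT produces its output as a tuple.
This code: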
A = LOAD 'logs' AS (total:chararray);
B = FOREACH A {
    -- In this case a nested foreach makes the code much easier to read.
    lid = REGEX_EXTRACT(total, 'lid=([^&]+)', 1) ;
    -- TOKENIZE splits on ',' creating a bag.
    pid = TOKENIZE(REGEX_EXTRACT(total, '.*pid=([^&]+)', 1), ',') ;
    -- STRSPLIT splits on ',' creating a tuple.
    ppt = STRSPLIT(REGEX_EXTRACT(total, '.*ppt=([^&]+)', 1), ',') ;
    qty = STRSPLIT(REGEX_EXTRACT(total, '.*qty=([^&]+)', 1), ',') ;
    GENERATE lid as lid, FLATTEN(pid) as pid, ppt as ppts, qty as qtys ;
}
produces this schema and output:
B: {lid: chararray,pid: chararray,ppts: (),qtys: ()}
(1030023,000000000001076056,(996,246,366),(1,1,1))
(1030023,000000000001072654,(996,246,366),(1,1,1))
(1030023,000000000001067925,(996,246,366),(1,1,1))
Using TOKENIZE to build bags for ppt and qty instead of tuples produces this output:
B: {lid: chararray,pid: chararray,ppts: {tuple_of_tokens: (token: chararray)},qtys: {tuple_of_tokens: (token: chararray)}}
(1030023,000000000001076056,{(996),(246),(366)},{(1),(1),(1)})
(1030023,000000000001072654,{(996),(246),(366)},{(1),(1),(1)})
(1030023,000000000001067925,{(996),(246),(366)},{(1),(1),(1)})
If you want pid to be a tuple as well, just change these two lines:
pid = TOKENIZE(REGEX_EXTRACT(total, '.*pid=([^&]+)', 1), ',') ;
GENERATE lid as lid, FLATTEN(pid) as pid, ppt as ppts, qty as qtys ;
to:
pid = STRSPLIT(REGEX_EXTRACT(total, '.*pid=([^&]+)', 1), ',') ;
GENERATE lid as lid, pid as pid, ppt as ppts, qty as qtys ;