Filtering an inner bag in Pig

Time: 2017-06-15 23:25:40

Tags: hadoop apache-pig

The data looks like this:

22678, {(112),(110),(2)}      
656565, {(110), (109)}      
6676, {(2),(112)}    

This is the schema:

(id:chararray, event_list:{innertuple:(innerfield:chararray)})

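I want to keep only the rows whose event_list contains 2. My first thought was to flatten the data and then filter the rows that have 2. Somehow flatten does not work on this dataset.

Can anyone help?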

2 answers:

Answer 0: (score: 0)

There may be a simpler way to do this, such as some kind of bag lookup. Otherwise, one way to do it in basic Pig is:

data = load 'data.txt'  AS (id:chararray, event_list:bag{});

-- flatten bag, in order to transpose each element to a separate row.
flattened = foreach data generate id, flatten(event_list);

-- keep only those rows where the value is 2.
filtered = filter flattened by (int) $1 == 2;

-- keep only distinct ids.
dist = distinct (foreach filtered generate $0 as (id:chararray));

-- join distinct ids back to the original relation (the alias is data, not a)
jnd = join data by id, dist by id;

-- remove extra fields, keep original fields.
result = foreach jnd generate data::id, data::event_list;
dump result;

(22678,{(112),(110),(2)})
(6676,{(2),(112)})
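
The same flatten-and-join steps can also be written with the bag's inner schema declared at load time, so fields are referenced by name rather than by position. This is only a sketch: it assumes data.txt is tab-delimited with bags in Pig's text format, it reuses the field names from the question's schema, and the intermediate aliases (twos, ids, matching_ids, joined) are illustrative.

```pig
-- Assumption: data.txt is tab-delimited; schema names come from the question.
data = LOAD 'data.txt' AS (id:chararray, event_list:bag{innertuple:tuple(innerfield:chararray)});

-- One row per (id, innerfield) pair.
flattened = FOREACH data GENERATE id, FLATTEN(event_list) AS innerfield;

-- Keep the pairs whose value is '2', then reduce to distinct ids.
twos = FILTER flattened BY innerfield == '2';
ids = FOREACH twos GENERATE id;
matching_ids = DISTINCT ids;

-- Join back to recover the full bag for each matching id.
joined = JOIN data BY id, matching_ids BY id;
result = FOREACH joined GENERATE data::id AS id, data::event_list AS event_list;
DUMP result;
```

Running DESCRIBE on flattened is a quick way to check that the declared schema survived the FLATTEN.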

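Answer 1: (score: 0)

You can filter the bag and project a boolean that indicates whether 2 is present in the bag. Then keep only the rows where that projection is true.

So..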

events = LOAD 'data.txt' AS (id:chararray, event_list:bag{t:tuple(val_0:chararray)});
events_flagged = FOREACH events {
    -- keep only the inner tuples whose value is exactly '2'
    bag_filter = FILTER event_list BY (val_0 matches '2');
    GENERATE
        id,
        event_list,
        (IsEmpty(bag_filter) ? false : true) AS is_2_present:boolean;
};
result = FILTER events_flagged BY is_2_present;
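
A variant of the same nested-FILTER idea counts the matching inner tuples instead of projecting a boolean, then drops the helper column. It is a sketch under the same assumptions: tab-delimited data.txt, field names taken from the question's schema, and illustrative aliases (events, counted, with_two, n_twos).

```pig
-- Assumption: data.txt is tab-delimited; schema names come from the question.
events = LOAD 'data.txt' AS (id:chararray, event_list:bag{innertuple:tuple(innerfield:chararray)});

counted = FOREACH events {
    -- inner tuples of this row's bag whose value is '2'
    twos = FILTER event_list BY innerfield == '2';
    GENERATE id, event_list, COUNT(twos) AS n_twos;
};

-- Keep rows with at least one match, then project away the helper count.
with_two = FILTER counted BY n_twos > 0;
result = FOREACH with_two GENERATE id, event_list;
DUMP result;
```

Unlike the flatten approach, this needs no DISTINCT or JOIN, because each original row is evaluated once inside the nested block.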