The data looks like this:
22678, {(112),(110),(2)}
656565, {(110), (109)}
6676, {(2),(112)}
This is the data structure:
(id:chararray, event_list:{innertuple:(innerfield:chararray)})
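I want to keep only the rows whose event_list contains 2. My first idea was to flatten the data and then filter for the rows that have 2, but for some reason FLATTEN does not work on this dataset.
Can anyone help?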
Answer 0 (score: 0)
There may be a simpler way to do this, such as a bag lookup. Otherwise, one way to do it in basic Pig is:
data = load 'data.txt' AS (id:chararray, event_list:bag{});
-- flatten bag, in order to transpose each element to a separate row.
flattened = foreach data generate id, flatten(event_list);
-- keep only those rows where the value is 2.
filtered = filter flattened by (int) $1 == 2;
-- keep only distinct ids.
dist = distinct (foreach filtered generate $0 as (id:chararray));
-- join distinct ids to the original relation
jnd = join data by id, dist by id;
-- remove extra fields, keep original fields.
result = foreach jnd generate data::id, data::event_list;
dump result;
(22678,{(112),(110),(2)})
(6676,{(2),(112)})
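The same pipeline can also be written with a declared inner schema, so the flattened field has a name and the positional `$1` and the cast are not needed. This is only a sketch: the names `t` and `event_id` are made up for illustration, and it assumes `data.txt` is tab-delimited (the default for `PigStorage`). The comments trace what each relation should hold for the sample data; row order is not guaranteed.

```pig
-- Assumption: data.txt is tab-delimited, e.g. 22678<TAB>{(112),(110),(2)}
data = LOAD 'data.txt' AS (id:chararray, event_list:bag{t:tuple(event_id:chararray)});

-- One row per (id, event_id) pair:
-- (22678,112) (22678,110) (22678,2) (656565,110) (656565,109) (6676,2) (6676,112)
flattened = FOREACH data GENERATE id, FLATTEN(event_list) AS event_id;

-- Rows whose event is exactly '2': (22678,2) (6676,2)
filtered = FILTER flattened BY event_id == '2';

-- Distinct matching ids: (22678) (6676)
ids = FOREACH filtered GENERATE id;
dist = DISTINCT ids;

-- Join back to recover the full original rows for those ids.
jnd = JOIN data BY id, dist BY id;
result = FOREACH jnd GENERATE data::id AS id, data::event_list AS event_list;
DUMP result;
```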
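Answer 1 (score: 0)
You can filter the bag inside a nested FOREACH and project a boolean that indicates whether 2 is present in the bag. Then keep only the rows where that projected flag is true.
So: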
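-- name the inner field so the nested FILTER can refer to val_0
events = LOAD 'data.txt' AS (id:chararray, event_list:bag{t:tuple(val_0:chararray)});
events_flagged = FOREACH events {
    -- keep only the bag entries equal to '2' (matches tests the whole string)
    bag_filter = FILTER event_list BY (val_0 matches '2');
    GENERATE
        id,
        event_list,
        (IsEmpty(bag_filter) ? false : true) AS is_2_present:boolean;
};
-- keep the rows whose bag contained a 2
result = FILTER events_flagged BY is_2_present;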
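A variation on the same idea, as a rough sketch: count the matching entries instead of projecting a boolean, then drop the helper column so the result has the original two fields. The names `t`, `event_id`, `twos`, and `n_twos` are made up for illustration, and the file is again assumed to be tab-delimited.

```pig
-- Assumption: data.txt is tab-delimited, e.g. 22678<TAB>{(112),(110),(2)}
events = LOAD 'data.txt' AS (id:chararray, event_list:bag{t:tuple(event_id:chararray)});

-- For each row, count the bag entries that are exactly '2'.
flagged = FOREACH events {
    twos = FILTER event_list BY event_id == '2';
    GENERATE id, event_list, COUNT(twos) AS n_twos;
};

-- Keep rows with at least one match, then project the original fields only.
matching = FILTER flagged BY n_twos > 0;
result = FOREACH matching GENERATE id, event_list;
DUMP result;
```

Because the filtering happens inside the nested FOREACH, this avoids the flatten and join steps of the first answer.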