Pig: REGEX_EXTRACT_ALL, then filter the resulting unnamed regex tuple

Time: 2016-03-29 13:56:21

Tags: regex apache-pig

I have a string:

[["structure\/","structure\/home_page\/","structure\/home_page\/headline_list\/","structure\/home_page\/latest\/","topic\/","topic\/location\/","topic\/location\/united_states\/","topic\/location\/united_states\/ohio\/","topic\/location\/united_states\/ohio\/franklin\/","topic\/news\/","topic\/news\/politics\/","topic\/news\/politics\/elections\/,topic\/news\/politics\/elections\/primary\/"]]

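I want REGEX_EXTRACT_ALL to turn this string into the elements of a tuple, split on ",", so that each element can be handled separately. I then need to filter the elements on structure and location, keeping only those that contain neither. However, I get an error saying that the regex result type cannot be filtered. Any ideas? By the way, the end goal is to parse out the longest hierarchy, such as (topic|news|politics|elections|primary).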
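To make the goal concrete, here is a minimal sketch of the intended transformation in plain Python rather than Pig. It assumes the value is valid JSON (the `\/` escapes are legal there) and uses a shortened copy of the sample string. The names `RAW`, `EXCLUDED` and `longest_hierarchy` are hypothetical, introduced only for illustration.

```python
import json

# Shortened copy of the sample value from the question (assumed to be valid JSON).
RAW = (
    r'[["structure\/","structure\/home_page\/","topic\/",'
    r'"topic\/location\/united_states\/ohio\/franklin\/","topic\/news\/politics\/",'
    r'"topic\/news\/politics\/elections\/,topic\/news\/politics\/elections\/primary\/"]]'
)

EXCLUDED = ("structure", "location")  # segments that disqualify a path


def longest_hierarchy(raw):
    """Return the deepest path without an excluded segment, joined by '|'."""
    paths = []
    for group in json.loads(raw):          # outer list
        for item in group:                 # inner list of strings
            paths.extend(item.split(","))  # the last item holds two paths joined by ","
    candidates = []
    for path in paths:
        segments = [s for s in path.split("/") if s]
        if segments and not any(s in EXCLUDED for s in segments):
            candidates.append(segments)
    if not candidates:
        return None
    return "|".join(max(candidates, key=len))


print(longest_hierarchy(RAW))  # topic|news|politics|elections|primary
```

The same three steps (split, drop paths with an excluded segment, keep the deepest one) are what the Pig script would need to reproduce, possibly through a UDF.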

Updated script:

data = load '/web/visit_log/20160303'
            USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad') as json:map[];
a = foreach data generate json#'section' as sec_type;
-- act_flt and act_type are not defined in the part of the script shown here
b = foreach act_flt GENERATE ..host, REGEX_EXTRACT_ALL(act_type, 'topic..(?!location)(.*?)"') as extr;
store b into '/user/tad/sec_hir';

1 Answer:

Answer 0: (score: 0)

The syntax of the filter's matches expression looks incorrect. The data does not appear to contain any ().

c = filter b by not extr matches '(structure|location)';

Try changing it to

 c = filter b by not (extr matches 'structure|location');
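One caveat worth checking: Pig's `matches` operator uses Java regular expressions and succeeds only when the pattern matches the entire string, so `'structure|location'` will not match a longer path that merely contains one of those words. The sketch below illustrates the difference in Python, using `re.fullmatch` as a stand-in for Pig's `matches`. The helper `pig_matches` and the sample paths are assumptions for illustration, not Pig APIs.

```python
import re


def pig_matches(value, pattern):
    """Stand-in for Pig's `matches`: true only if the whole string matches."""
    return re.fullmatch(pattern, value) is not None


paths = [
    "structure/home_page/",
    "topic/location/united_states/",
    "topic/news/politics/",
]

# Pattern from the answer: only a string that is exactly "structure"
# or exactly "location" matches, so nothing is filtered out here.
exact = [p for p in paths if not pig_matches(p, r"structure|location")]

# Wrapped in .* on both sides, the pattern tests for containment instead.
contains = [p for p in paths if not pig_matches(p, r".*(structure|location).*")]

print(exact)     # all three paths survive
print(contains)  # only 'topic/news/politics/' survives
```

In Pig the containment form would be a pattern such as `'.*(structure|location).*'`. It would also presumably need to be applied to a chararray field rather than to the tuple that REGEX_EXTRACT_ALL returns, which may be the source of the type error described in the question.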