Question

I have a string:

[["structure\/","structure\/home_page\/","structure\/home_page\/headline_list\/","structure\/home_page\/latest\/","topic\/","topic\/location\/","topic\/location\/united_states\/","topic\/location\/united_states\/ohio\/","topic\/location\/united_states\/ohio\/franklin\/","topic\/news\/","topic\/news\/politics\/","topic\/news\/politics\/elections\/,topic\/news\/politics\/elections\/primary\/"]]

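I want REGEX_EXTRACT_ALL to turn it into the elements of a tuple, split on ",", so that each one can be processed separately. Then I need to filter out the ones that don't contain structure and location. However, I get an error saying the regex type can't be filtered. Any ideas? By the way, the end goal is to parse out the longest hierarchy, such as (topic|news|politics|elections|primary).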

Updated script:

data = load '/web/visit_log/20160303'
            USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad') as json:map[];
a = foreach data generate json#'section' as sec_type;
b = foreach act_flt GENERATE ..host, REGEX_EXTRACT_ALL(act_type, 'topic..(?!location)(.*?)"') as extr;
store b into '/user/tad/sec_hir';

Answer 1

The syntax of the matches expression in the filter looks incorrect. The data does not appear to contain ().

c = filter b by not extr matches '(structure|location)';

Try changing it to

 c = filter b by not (extr matches 'structure|location');
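
Two further points may matter here. REGEX_EXTRACT_ALL returns a tuple, while matches works on a chararray, which would be consistent with the type error in the question. Also, matches follows Java regex semantics and has to match the whole string, so a bare alternation only matches the exact words structure or location. A minimal sketch of a substring-style filter is below; the relation name paths and the field name path are hypothetical and assume the list has already been split into one chararray path per record.

```pig
-- Sketch only: "paths" and "path" are hypothetical names, assuming one
-- chararray path per record (not the tuple returned by REGEX_EXTRACT_ALL).
-- matches must cover the entire string, so the alternation is wrapped in .*
kept = filter paths by not (path matches '.*(structure|location).*');
```

If extr has to stay a tuple, the same expression could be applied to one of its positional fields (for example $0 inside the tuple) instead of the tuple as a whole.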

PIG regex extract, then filter the unnamed regex tuple
