I have a string:
[["structure\/","structure\/home_page\/","structure\/home_page\/headline_list\/","structure\/home_page\/latest\/","topic\/","topic\/location\/","topic\/location\/united_states\/","topic\/location\/united_states\/ohio\/","topic\/location\/united_states\/ohio\/franklin\/","topic\/news\/","topic\/news\/politics\/","topic\/news\/politics\/elections\/,topic\/news\/politics\/elections\/primary\/"]]
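I want to use REGEX_EXTRACT_ALL to turn it into the elements of a tuple, split on ",", so that each element can be processed separately. Then I need to keep only the elements that contain neither structure nor location.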
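However, I get an error saying that the regex type cannot be filtered. Any ideas?
By the way, the end goal is to parse out the longest hierarchy, e.g. (topic|news|politics|elections|primary).
Updated script: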
data = load '/web/visit_log/20160303'
    USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad') as (json:map[]);
a = foreach data generate json#'section' as sec_type;
-- act_flt and act_type are not defined in this excerpt
b = foreach act_flt GENERATE ..host, REGEX_EXTRACT_ALL(act_type, 'topic..(?!location)(.*?)"') as extr;
store b into '/user/tad/sec_hir';
Answer 0 (score: 0)
The syntax of the filter match looks incorrect. The data does not appear to contain ().
c = filter b by not extr matches '(structure|location)';
Try changing this to
c = filter b by not (extr matches 'structure|location');
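Two more things may be worth checking. MATCHES in Pig tests the whole value rather than a substring, so 'structure|location' only hits a value that is exactly structure or exactly location. REGEX_EXTRACT_ALL also returns a tuple, while MATCHES expects a chararray, which may be where the type error comes from. Below is a rough, untested sketch of one way around both. It assumes sec_type is a plain chararray holding the bracketed list from the question (with -nestedLoad it could be a bag instead, in which case the cast would fail). The relation and field names cleaned, joined, paths, path, topics, sized, ordered, longest, result and hierarchy are made up for illustration.

```pig
-- Assumption: sec_type is a chararray containing the bracketed list from the question.
-- All relation and field names introduced below are hypothetical.

-- Keep only lowercase letters, underscores, slashes and commas; this drops
-- the brackets, quotes and backslashes in one pass.
cleaned = FOREACH a GENERATE REPLACE((chararray)sec_type, '[^a-z_/,]', '') AS joined:chararray;

-- Split on commas and emit one path per row.
paths = FOREACH cleaned GENERATE FLATTEN(TOKENIZE(joined, ',')) AS path:chararray;

-- MATCHES must cover the whole value, hence the leading and trailing .*
topics = FILTER paths BY NOT (path MATCHES '.*(structure|location).*');

-- Take the longest remaining path, measured in characters.
sized = FOREACH topics GENERATE path, SIZE(path) AS len;
ordered = ORDER sized BY len DESC;
longest = LIMIT ordered 1;

-- Drop the trailing slash, then turn the remaining slashes into pipes.
result = FOREACH longest GENERATE REPLACE(REPLACE(path, '/$', ''), '/', '|') AS hierarchy;
```

Note that this picks a single longest path across the whole relation and measures length in characters rather than in hierarchy levels. Picking one per record would need a record key and a GROUP, which the script in the question does not show.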