Pig script/command to filter tweets by specific strings

Time: 2016-09-01 07:13:40

Tags: json hadoop twitter apache-pig tweets

I am trying to write a Hadoop Pig script that takes two files and filters one by the strings listed in the other, i.e.

words.txt

google 
facebook 
twitter 
linkedin

tweets.json

{"created_time": "18:47:31 ", "text": "RT @Joey7Barton: ..give a facebook about whether the americans wins a Ryder cup. I mean surely he has slightly more important matters. #fami ...", "user_id": 450990391, "id": 252479809098223616, "created_date": "Sun Sep 30 2012"}

SCRIPT

twitter = LOAD 'Twitter.json' USING JsonLoader('created_time:chararray, text:chararray, user_id:chararray, id:chararray, created_date:chararray');
filtered = FILTER twitter BY (text MATCHES '.*facebook.*');
extracted = FOREACH filtered GENERATE 'facebook' AS pattern, id, user_id, created_time, created_date, text;
final = GROUP extracted BY pattern;
DUMP final;

Output

(facebook,{(facebook,252545104890449921,291041644,23:06:59 ,Sun Sep 30 2012,RT @Joey7Barton: ..give a facebook about whether the americans wins a Ryder cup. I mean surely he has slightly more important matters. #fami ...)})

The output I get comes from filtering the tweets directly with a hard-coded word; words.txt is never loaded.

The output I need is

(facebook)(complete tweet of that facebook word contained)

That is, the script should read words.txt and, for each word it reads, fetch all tweets containing that word from tweets.json.

Any help is appreciated.

Mohan.V

1 answer:

Answer 0 (score: 0)

You could consider running multiple statements inside a FOREACH block. Something like this:

final = FOREACH words {
            -- untested sketch: build the pattern '.*<word>.*' from the current word
            a = CONCAT(CONCAT('.*', words.$0), '.*');
            filtered = FILTER twitter BY (text MATCHES a);
            GENERATE a, FLATTEN(filtered) AS output;
        }

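Note that this is only meant to suggest an idea; I have not tested it. I will try to test it as soon as I have access to a Pig environment, but it should get you started.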
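In case the nested FOREACH does not work (as far as I know, a nested block can only operate on bags inside the current record, not on a separate relation such as twitter), another way to pair each word with the tweets is CROSS followed by FILTER. The following is a minimal, untested sketch. It assumes words.txt holds one keyword per line, reuses the file name and schema from the question's script, and relies on MATCHES accepting a non-constant pattern on its right-hand side, which is worth verifying on your Pig version. The aliases cleaned, pairs, matched, result and grouped are made up for illustration.

```pig
-- Assumption: words.txt has one keyword per line, possibly with trailing spaces.
words   = LOAD 'words.txt' AS (word:chararray);
cleaned = FOREACH words GENERATE TRIM(word) AS word;

-- Same file name and schema as in the question's script.
twitter = LOAD 'Twitter.json' USING JsonLoader('created_time:chararray, text:chararray, user_id:chararray, id:chararray, created_date:chararray');

-- Pair every keyword with every tweet, then keep only the pairs
-- where the tweet text contains the keyword.
pairs   = CROSS cleaned, twitter;
matched = FILTER pairs BY twitter::text MATCHES CONCAT('.*', CONCAT(cleaned::word, '.*'));

-- One group per keyword, holding every tweet that mentions it.
result  = FOREACH matched GENERATE cleaned::word AS pattern, twitter::id, twitter::user_id, twitter::created_time, twitter::created_date, twitter::text;
grouped = GROUP result BY pattern;
DUMP grouped;
```

A few caveats: CROSS produces one row per (word, tweet) pair before filtering, so it is only reasonable while the word list stays small. The match is case-sensitive, and each word is used as a regular expression, so a keyword containing characters such as '.' or '+' would need escaping. An empty line in words.txt would yield the pattern '.*.*', which matches almost every tweet, so blank lines should be filtered out first.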