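Get all tweets based on a SPECIFIC word, in a SINGLE BAG

Time: 2016-08-31 08:28:37

Tags: json hadoop apache-pig hadoop-streaming

I am trying to process sample tweets and store them according to a filter criterion.

For example,

Sample tweet: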

{"created_time": "18:47:31 ", "text": "RT @Joey7Barton: ..give a word about whether the americans wins a Ryder cup. I mean surely he has slightly more important matters. #fami ...", "user_id": 450990391, "id": 252479809098223616, "created_date": "Sun Sep 30 2012"}

twitter = LOAD 'Tweet.json' USING JsonLoader('created_time:chararray, text:chararray, user_id:chararray, id:chararray, created_date:chararray');
grouped = GROUP twitter BY (text,id);
filtered =FOREACH grouped { row = FILTER $1 BY (text MATCHES '.*word.*'); GENERATE FLATTEN(row);}

This returns the complete tweets that match the word.

But I need output like the following:

(word)(all tweets that contain that word)

How can I achieve this?

Any help is appreciated.

Mohan.V

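1 Answer:

Answer 0: (score: 0)

After filtering, add the word as a constant field, "pattern", to the filtered relation, then group by that field. This gives you the word and a bag of the tweets that contain it.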

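twitter = LOAD 'Tweet.json' USING JsonLoader('created_time:chararray, text:chararray, user_id:chararray, id:chararray, created_date:chararray');
-- Filter the loaded relation directly; no GROUP is needed before this step.
filtered = FILTER twitter BY (text MATCHES '.*word.*');
-- Tag every matching tweet with the word as a constant field named pattern.
newfiltered = FOREACH filtered GENERATE 'word' AS pattern, text;
-- Grouping by the constant collects all matching tweets into a single bag.
final = GROUP newfiltered BY pattern;
DUMP final;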
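
Grouping this way leaves the pattern field repeated inside every tuple of the bag. If only the tweet text is wanted next to the word, one more projection can drop it. The following is a minimal sketch of the same filter, tag, and group steps with that extra projection; the aliases hits, tagged, by_word, and word_with_tweets and the field names word and tweets are illustrative names, not part of the original script, and it assumes Tweet.json is laid out like the sample tweet in the question.

```pig
-- Sketch only: hits, tagged, by_word and word_with_tweets are illustrative aliases.
twitter = LOAD 'Tweet.json' USING JsonLoader('created_time:chararray, text:chararray, user_id:chararray, id:chararray, created_date:chararray');
-- Keep the tweets whose text contains the word.
hits = FILTER twitter BY text MATCHES '.*word.*';
-- Attach the word as a constant field so there is something to group on.
tagged = FOREACH hits GENERATE 'word' AS pattern, text;
by_word = GROUP tagged BY pattern;
-- group holds the word; tagged.text projects only the text column out of the bag.
word_with_tweets = FOREACH by_word GENERATE group AS word, tagged.text AS tweets;
DUMP word_with_tweets;
```

Each output row should then have the shape (word, {(text), (text), ...}), that is, the word followed by a single bag holding one tuple per matching tweet.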