Recursion in Pig Latin

Time: 2015-12-05 10:13:38

Tags: recursion apache-pig

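I am a Pig Latin beginner and have an assignment in which I need to identify troll posts. Such posts are scored by the ratio #likes/#replies of the post. For each post it is therefore necessary to (1) determine its replies and (2) recursively determine all replies to each of those replies.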
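To make the two steps concrete, here is a minimal sketch in plain Python (not Pig Latin). The dictionaries `children` and `likes`, the function names, and the sample IDs are all assumptions made up for illustration; they are not part of the assignment data.

```python
def all_replies(post, children):
    """Return every direct and nested reply of `post`, depth-first."""
    found = []
    for reply in children.get(post, []):
        found.append(reply)                          # step (1): a direct reply
        found.extend(all_replies(reply, children))   # step (2): replies to that reply
    return found


def like_reply_ratio(post, likes, children):
    """Return #likes / #replies, or None when the post has no replies."""
    replies = all_replies(post, children)
    if not replies:
        return None  # avoid dividing by zero
    return likes.get(post, 0) / len(replies)


# Hypothetical sample: post1 has two direct replies, and c1 has one nested reply.
children = {"post1": ["c1", "c2"], "c1": ["c3"]}
likes = {"post1": 6}

print(all_replies("post1", children))              # ['c1', 'c3', 'c2']
print(like_reply_ratio("post1", likes, children))  # 2.0
```

The recursion assumes the reply structure is a tree, that is, no reply chain loops back on itself.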

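The assignment states that MapReduce, Pig Latin, or Hive may be used. Since I did not know how to implement recursion in pure Pig Latin, I solved it with embedded Pig, using Java for the recursive part. So my question is: is it possible to implement such a recursive task using only Pig Latin? If so, could someone show me a small example that applies recursion?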

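The "test input" is a small static social network containing users with posts, likes, and so on. Each line is a subject-predicate-object triple, for example Bill - likes - anyPost. Every post can be related to replies, and a reply can in turn have replies of its own (meaning someone replied to a reply). My Pig Latin code below tries to output the posts together with their ratio. The problem is that it does not use recursion to get all replies to each reply.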

REGISTER RDFStorage.jar ;

indata = LOAD '$input_file' USING RDFStorage()
    AS (s:chararray, p:chararray, o:chararray);

likes = FILTER indata BY p == 'sib:like';
likes_group = GROUP likes BY o;
likes_grouped_count = FOREACH likes_group GENERATE group AS object,
    COUNT(likes) AS amount;

comments = FILTER indata BY STARTSWITH(s,'sibpo:') AND p == 'sioc:container_of';
comments_grouped = GROUP comments BY s;
comments_grouped_count = FOREACH comments_grouped GENERATE group AS subject,
    COUNT(comments) AS amount;

--Get creation date for all posts
posts_dates_t = FILTER indata BY STARTSWITH(s,'sibpo:') AND p == 'dc:created';
posts_dates = FOREACH posts_dates_t GENERATE s, p, REGEX_EXTRACT(o, '\\"([0-9]{4}-[0-9]{2}-[0-9]{2})T', 1) AS o;

--Get creation date for all comments
comments_dates_t =  FILTER indata BY  STARTSWITH(s,'sibc:') AND p == 'dc:created';
comments_dates =  FOREACH comments_dates_t GENERATE s, p, REGEX_EXTRACT(o, '\\"([0-9]{4}-[0-9]{2}-[0-9]{2})T', 1) AS o;

--Associate each comment to its corresponding post that has a creation date
posts_comments = JOIN posts_dates BY s, comments BY s;

--Join to get creation dates for all comments
posts_comments_with_dates = JOIN posts_comments BY comments::o, comments_dates BY s;

--calculate the days between a post and each one of its comments 
all_dates = FOREACH posts_comments_with_dates GENERATE $0 AS post,ABS (DaysBetween(ToDate( $2 ),ToDate($8))) as lifetime_1 ;

--GROUP by post
all_dates_grouped = GROUP all_dates BY post; 

--Get the life time of a post, which is the maximum difference between a post and its comments
posts_lifetime = FOREACH all_dates_grouped GENERATE group as post, MAX (all_dates.lifetime_1) as lifetime_2;

combined = JOIN likes_grouped_count BY object, comments_grouped_count BY subject;
combined_dates= JOIN combined BY comments_grouped_count::subject,  posts_lifetime BY post;
combined_ratio = FOREACH combined_dates GENERATE likes_grouped_count::object AS post, (float)likes_grouped_count::amount/(float)comments_grouped_count::amount AS ratio, posts_lifetime::lifetime_2 AS lifetime;

--Sort posts by ratio first, then by lifetime, both ascending
combined_ratio_sorted = ORDER combined_ratio BY ratio ASC, lifetime ASC;
outdata = LIMIT combined_ratio_sorted 50;

STORE outdata INTO '$output_file' USING PigStorage(',') ;
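The step missing from the script above, collecting replies to replies, can also be written without recursion as a join that is repeated until it produces no new pairs. The sketch below shows that idea in plain Python (not Pig Latin) on in-memory triples. The predicate name `reply_of`, its direction (reply as subject, parent as object), and the sample IDs are assumptions for illustration; the real data may name and orient this relation differently.

```python
def descendant_pairs(triples, reply_predicate="reply_of"):
    """Return all (ancestor, reply) pairs, direct and nested."""
    # Direct edges as (parent, reply); the triple reads "reply reply_of parent".
    edges = {(o, s) for s, p, o in triples if p == reply_predicate}
    closure = set(edges)
    frontier = set(edges)
    while frontier:
        # Join step: extend each newly found (root, node) pair by one edge (node, child).
        step = {(root, child)
                for root, node in frontier
                for parent, child in edges
                if parent == node}
        frontier = step - closure   # keep only pairs not seen before
        closure |= frontier
    return closure


# Hypothetical sample: a chain p1 <- c1 <- c2 <- c3 plus one like.
triples = [
    ("bill", "sib:like", "sibpo:p1"),
    ("sibc:c1", "reply_of", "sibpo:p1"),
    ("sibc:c2", "reply_of", "sibc:c1"),
    ("sibc:c3", "reply_of", "sibc:c2"),
]

pairs = descendant_pairs(triples)
# Count direct and nested replies per post (roots with the 'sibpo:' prefix).
reply_counts = {}
for root, _ in pairs:
    if root.startswith("sibpo:"):
        reply_counts[root] = reply_counts.get(root, 0) + 1

print(reply_counts)  # {'sibpo:p1': 3}
```

The loop stops once an iteration adds nothing new, so the number of iterations is bounded by the depth of the deepest reply chain. Driving such a loop needs some host language around the join, which is what the embedded Java part does in my solution.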

Thank you for reading and for your time.

0 answers:

No answers