Group-wise min/max and filter without a join in Pig

Time: 2015-07-28 17:59:03

Tags: hadoop apache-pig hadoop2

I am trying to find (max + min) / 2 for each group. Here is my schema:

UrlXpathsCount: {url: chararray,leafpathstr: chararray,urlpath_count: long}

I group it by the url field:

byUrl = GROUP UrlXpathsCount by url;

I try to find (max + min) / 2 in the following way:

midRangeByUrl = FOREACH byUrl{
    urls_desc = order UrlXpathsCount by urlpath_count desc;
    urls_max = limit urls_desc 1;
    urls_asc = order UrlXpathsCount by urlpath_count asc;
    urls_min = limit urls_asc 1;

    GENERATE FLATTEN(urls_max),FLATTEN(urls_min);
};

Here is the schema of midRangeByUrl:

midRangeByUrl: {urls_max::url: chararray,urls_max::leafpathstr: chararray,urls_max::urlpath_count: long,urls_min::url: chararray,urls_min::leafpathstr: chararray,urls_min::urlpath_count: long}

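The problem I am facing now is that generating FLATTEN(group), FLATTEN(urls_max), FLATTEN(urls_min) gives me a lot of combinations that I don't want.

I want to get (max + min) / 2 for each group.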

To do that, I project the urlpath_count of the max and min rows as follows:

computeMidRange = FOREACH midRangeByUrl generate urls_max::url as mid_url,((DOUBLE)urls_max::urlpath_count+(DOUBLE) urls_min::urlpath_count)/2 as midRange;

I then join the two relations as follows:

/* Join computeMidRange  and UrlXpathsCount */
midRangeJoin = join UrlXpathsCount by url , computeMidRange by mid_url using 'replicated';
midRangeOut = FOREACH midRangeJoin GENERATE UrlXpathsCount::url as url,UrlXpathsCount::leafpathstr as leafpathstr,
    UrlXpathsCount::urlpath_count as urlpath_count,computeMidRange::midRange as midRange;

Then I apply the filter:

templates = FILTER midRangeOut by urlpath_count > midRange;

I want to avoid midRangeJoin by somehow computing midRangeByUrl and projecting the fields url, urlpath_count, leafpathstr, and (min + max) / 2 without a join.

Please help me with this. Thanks.

1 answer:

Answer 0 (score: 2)

You can use the built-in MAX and MIN UDFs:

UrlXpathsCount = load 'your_data' using PigStorage(',') as (url: chararray,leafpathstr: chararray,urlpath_count: long);
B = GROUP UrlXpathsCount by url;
C = foreach B generate group as url, MAX(UrlXpathsCount.urlpath_count) as max_count, MIN(UrlXpathsCount.urlpath_count) as min_count;
D = foreach C generate url, ((double)max_count + (double)min_count)/2 as val;

This does exactly what you want, with no nested foreach or join. I split the computation into C and D to avoid extremely long lines, but you could also do it in a single statement. Just remember to cast the values to double: urlpath_count is a long, so without the cast the division is an integer division and you won't get any decimals.
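
The answer's result holds only the url and the mid-range, while the question also asks for leafpathstr and urlpath_count on every row. One way this might be done without the join is to FLATTEN the grouped bag in the same FOREACH that computes MAX and MIN. The following is a minimal sketch and not part of the original answer: the aliases byUrl, withMid and templates follow the question's naming, 'your_data' is a placeholder path, and the comma delimiter is an assumption about the input file.

```pig
-- Load with the schema from the question; 'your_data' is a placeholder path.
UrlXpathsCount = LOAD 'your_data' USING PigStorage(',')
    AS (url: chararray, leafpathstr: chararray, urlpath_count: long);

byUrl = GROUP UrlXpathsCount BY url;

-- FLATTEN emits one output row per tuple in the group's bag. The mid-range
-- expression is computed per group and repeated on each of those rows.
withMid = FOREACH byUrl GENERATE
    FLATTEN(UrlXpathsCount) AS (url, leafpathstr, urlpath_count),
    ((double)MAX(UrlXpathsCount.urlpath_count)
      + (double)MIN(UrlXpathsCount.urlpath_count)) / 2 AS midRange;

-- Keep only the rows whose count is above their group's mid-range.
templates = FILTER withMid BY urlpath_count > midRange;
```

For a url whose rows have counts 2, 4 and 10, the mid-range is (10 + 2) / 2 = 6.0, so only the row with count 10 passes the filter. Because a single FLATTEN is combined with scalar expressions, this avoids the unwanted combinations that appear when several bags are flattened in one GENERATE.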