Pig: group by, average, and order by

Time: 2015-06-15 20:09:38

Tags: sorting group-by apache-pig average

I am new to Pig. I have a text file in which each line is one record, with fields in the following format:

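name, year, count, uniquecount

For example:

Zverkov winced_VERB 2004    8   8
Zverkov winced_VERB 2008    4   4
Zverkov winced_VERB 2009    1   1
zvlastni _ADV_  1913    1   1
zvlastni _ADV_  1928    2   2
zvlastni _ADV_  1929    3   2

I would like to group all records by their unique name, then compute count / uniquecount for each unique name (total count divided by total uniquecount within the group), and finally sort the output by this computed value.

Here is what I have been trying: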

1 Answer:

Answer 0: (score: 0)

It seems my original code did produce the desired output, with just one small change:

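-- Declaring count and books as float keeps the division below from truncating to an integer.
bigrams = LOAD 'input/bigram/zv.gz' AS (bigram:chararray, year:int, count:float, books:float);
-- One group per distinct bigram.
group_bigrams = GROUP bigrams BY bigram;
-- Per group: total count divided by total books.
average_bigrams = FOREACH group_bigrams GENERATE group, SUM(bigrams.count)/SUM(bigrams.books) AS average;
-- Highest average first; ties broken alphabetically by bigram.
sorted_bigrams = ORDER average_bigrams BY average DESC, group ASC;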
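
As a sanity check on what the script above should produce, here is a minimal plain-Python sketch (not Pig) that redoes the same group / sum-ratio / sort on the six sample rows from the question. It assumes the fields are tab-separated, so that "Zverkov winced_VERB" is a single bigram field. The names `rows` and `average_per_bigram` are made up for this illustration.

```python
from collections import defaultdict

# The six sample rows from the question: (bigram, year, count, books).
# Assumption: fields are tab-separated, so the bigram itself contains a space.
rows = [
    ("Zverkov winced_VERB", 2004, 8.0, 8.0),
    ("Zverkov winced_VERB", 2008, 4.0, 4.0),
    ("Zverkov winced_VERB", 2009, 1.0, 1.0),
    ("zvlastni _ADV_", 1913, 1.0, 1.0),
    ("zvlastni _ADV_", 1928, 2.0, 2.0),
    ("zvlastni _ADV_", 1929, 3.0, 2.0),
]


def average_per_bigram(records):
    """Return (bigram, sum(count) / sum(books)) pairs,
    sorted by average descending, then bigram ascending."""
    totals = defaultdict(lambda: [0.0, 0.0])  # bigram -> [count sum, books sum]
    for bigram, _year, count, books in records:
        totals[bigram][0] += count
        totals[bigram][1] += books
    averages = [(bigram, c / b) for bigram, (c, b) in totals.items()]
    return sorted(averages, key=lambda pair: (-pair[1], pair[0]))


result = average_per_bigram(rows)
for bigram, average in result:
    print(bigram, average)
```

For the sample rows this gives 6 / 5 = 1.2 for "zvlastni _ADV_" and 13 / 13 = 1.0 for "Zverkov winced_VERB", so "zvlastni _ADV_" should come first in the sorted output.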