I have an alias A with the following schema:
{cookie: chararray,
keywords: {tuple_of_tokens: (token: chararray)},
weight: double}
where the 2nd and 3rd fields are defined as
keywords = TOKENIZE((chararray)$5,',');
weight = 1.0/(double)SIZE(keywords);
Now I want to do
foreach (group A by cookie) generate
group.cookie as cookie,
???? as keywords;
where keywords should be a map from each keyword to the sum of its weights. For example,
1 k1,k2,k3
1 k2,k4
should become (k2 occurs in both rows, so its weight is 1/3 + 1/2 = 5/6)
1 {k1:1/3, k2:5/6, k3:1/3, k4:1/2}
I am already using datafu, but I am open to any alternative...
Answer 0 (score: 0)
I would do
A_counts = foreach A generate cookie,flatten(keywords) as keyword,1.0/SIZE(keywords) as weight;
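then (grouping A_counts, since A itself has no keyword field)
A_counts_gr = group A_counts by (cookie,keyword);
and (the built-in is the upper-case SUM, and the grouped bag is named after the grouped alias, A_counts)
result = foreach A_counts_gr generate flatten(group) as (cookie,keyword), SUM(A_counts.weight) as weight;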
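After that, you can group by cookie once more to get the bag you want. Grouping by cookie again gives you a bag of (keyword, weight) pairs, which you can then turn into a map with datafu.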
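
To sanity-check the arithmetic, here is a minimal Python sketch of the same three steps (flatten, group by (cookie, keyword) and sum, then group by cookie). It is not Pig and is not part of the answer's method, only a stand-in for checking the expected output. The function name keyword_weights and the (cookie, comma-separated keywords) input shape are assumptions made for illustration. It also assumes every row has at least one non-empty keyword and no stray whitespace. Fractions are used so the result can be compared with the 1/3, 5/6, 1/3, 1/2 from the question exactly.

```python
from collections import defaultdict
from fractions import Fraction


def keyword_weights(rows):
    """rows: iterable of (cookie, comma-separated keywords) pairs (hypothetical input shape)."""
    # Step 1 (flatten): one (cookie, keyword, weight) record per token,
    # where weight = 1 / number of tokens in that row.
    records = []
    for cookie, line in rows:
        tokens = line.split(",")
        weight = Fraction(1, len(tokens))
        records.extend((cookie, token, weight) for token in tokens)

    # Step 2: group by (cookie, keyword) and sum the weights.
    sums = defaultdict(Fraction)
    for cookie, token, weight in records:
        sums[(cookie, token)] += weight

    # Step 3: group by cookie and collect each cookie's sums into a map.
    result = defaultdict(dict)
    for (cookie, token), total in sums.items():
        result[cookie][token] = total
    return dict(result)


rows = [("1", "k1,k2,k3"), ("1", "k2,k4")]
weights = keyword_weights(rows)
print({k: str(v) for k, v in weights["1"].items()})
# prints {'k1': '1/3', 'k2': '5/6', 'k3': '1/3', 'k4': '1/2'}
```

The intermediate sums dictionary plays the role of the result relation above, and the last loop corresponds to the final group by cookie.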