How to split cells into separate rows and find the minimum summed value

Time: 2016-08-10 13:48:45

Tags: hadoop apache-pig

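I have the following dataset:

Movies : moviename, genre1, genre2, genre3 ..... genre19  

(Each genre column is 0 or 1; 1 means the movie belongs to that genre.)
Now I want to find which movie has the fewest genres.

I tried the following Pig script: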

items = load 'path' using PigStorage('|') as (mName:chararray,g1:int,g2:int,g3:int,g4:int,g5:int,g6:int,g7:int,g8:int,g9:int,g10:int,g11:int,g12:int,g13:int,g14:int,g15:int,g16:int,g17:int,g18:int,g19:int);

sumGenre = foreach items generate mName, g1+g2+g3+g4+g5+g6+g7+g8+g9+g10+g11+g12+g13+g14+g15+g16+g17+g18+g19 as sumOfGenres;

groupAll = group sumGenre All;

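Using MIN(sumGenre.sumOfGenres) in the next step gives me the minimum genre count, but what I am looking for is the name of the movie with the fewest genres, together with that movie's genre count.

Can anyone help?
1. Is there a simpler way to get the sum g1 + g2 + ... + g19?
2. Output: which movie has the fewest genres?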

2 Answers:

Answer 0: (score: 1)

After groupAll, compute the minimum genre count:

minGenre = foreach groupAll generate MIN(sumGenre.sumOfGenres) as minG;

Then do a left outer join between minGenre (on minG) and sumGenre (on sumOfGenres), storing the result as r1, to get the list of movies with the fewest genres.

Hope this helps.

To sum the fields of a row dynamically, you can use a UDF like this:

import java.io.IOException;
import java.util.List;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

public class DynRowSum extends EvalFunc<Integer> {
    @Override
    public Integer exec(Tuple v) throws IOException {
        List<Object> olist = v.getAll();
        int sum = 0;
        int cnt = 0;
        for (Object o : olist) {
            cnt++;
            // Skip the first field (the movie name); sum the remaining genre flags.
            if (cnt != 1) {
                sum += (Integer) o;
            }
        }
        return sum;
    }
}

Then update the Pig script like this:

grunt> sumGenre = foreach items generate mName, DynRowSum(*) as sumOfGenres;

The advantage is that the code stays the same if the number of genres increases or decreases.
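Putting the non-UDF steps of this answer together, the whole pipeline might look like the sketch below. The join statement is written out from the prose description and is an assumption about what the answer intended; `result` is a hypothetical alias, and `'path'` is the placeholder from the question.

```pig
-- 'path' is the placeholder used in the question; replace it with the real input location.
items = LOAD 'path' USING PigStorage('|') AS (mName:chararray,
    g1:int, g2:int, g3:int, g4:int, g5:int, g6:int, g7:int, g8:int, g9:int, g10:int,
    g11:int, g12:int, g13:int, g14:int, g15:int, g16:int, g17:int, g18:int, g19:int);

-- Number of genres per movie.
sumGenre = FOREACH items GENERATE mName,
    g1+g2+g3+g4+g5+g6+g7+g8+g9+g10+g11+g12+g13+g14+g15+g16+g17+g18+g19 AS sumOfGenres;

-- Single-row relation holding the smallest genre count.
groupAll = GROUP sumGenre ALL;
minGenre = FOREACH groupAll GENERATE MIN(sumGenre.sumOfGenres) AS minG;

-- Assumed form of the left outer join described above: keep every movie whose count equals minG.
r1 = JOIN minGenre BY minG LEFT OUTER, sumGenre BY sumOfGenres;

-- 'result' is a hypothetical alias: movie name plus its genre count.
result = FOREACH r1 GENERATE sumGenre::mName AS mName, sumGenre::sumOfGenres AS sumOfGenres;
DUMP result;
```

Because the join matches on the count rather than taking a single row, every movie tied for the minimum is returned. To use the DynRowSum variant instead, the jar containing the compiled class would first have to be made available to the script with a REGISTER statement.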

Answer 1: (score: 0)

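-- Load each line as a single field (assumes the lines contain no tab characters).
a = LOAD 'path';
-- Split the line on '|' into separate fields.
b = FOREACH a generate FLATTEN(STRSPLIT($0, '\\|'));
-- Emit one (movie, value) row per field.
c = FOREACH b generate $0 as movie, FLATTEN(TOBAG(*)) as genre;
-- Drop the row where the value is the movie name itself.
d = FILTER c BY movie!=genre;
e = GROUP d BY $0;
-- Sum the genre flags per movie (project the genre column for SUM).
f = FOREACH e GENERATE group, SUM(d.genre);
i = ORDER f BY $1;
-- Keep the movie with the smallest sum.
j = LIMIT i 1;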