Apache Pig word count program

Time: 2016-07-24 12:16:33

Tags: apache-pig

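In a word count program in Pig, how do I find the word that occurs most often and the word that occurs least often? How can I use the MAX function here?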

The output I see:

(Naveen,3) (is,5)

The output I need here is "is".

3 answers:

Answer 0 (score: 0)

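You can use ORDER BY and LIMIT:

-- Least frequent word: sort ascending (the default) and keep the first row
A = LOAD 'file' USING PigStorage() AS (name:chararray, count:int);
B = ORDER A BY count;
C = LIMIT B 1;
D = FOREACH C GENERATE name;
DUMP D;

-- Most frequent word: sort descending and keep the first row
B = ORDER A BY count DESC;
C = LIMIT B 1;
D = FOREACH C GENERATE name;
DUMP D;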
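The question also asks about MAX. A minimal sketch, assuming the same (name, count) input as in this answer: group the whole relation with GROUP ... ALL, compute MAX and MIN over the counts, then filter on those values. The aliases G, M, most and least are made up for illustration, and referencing M.max_count as a scalar assumes a Pig version that allows a single-row relation to be used that way.

```pig
-- Assumed input: 'file' holds (name, count) pairs, as in the answer above
A = LOAD 'file' USING PigStorage() AS (name:chararray, count:int);

-- Put every row into a single group so the aggregates see the whole relation
G = GROUP A ALL;
M = FOREACH G GENERATE MAX(A.count) AS max_count, MIN(A.count) AS min_count;

-- M has exactly one row, so its fields can be referenced as scalars
most  = FILTER A BY count == M.max_count;
least = FILTER A BY count == M.min_count;

DUMP most;
DUMP least;
```

Unlike ORDER plus LIMIT 1, this should keep every word that ties for the highest or lowest count.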

Answer 1 (score: 0)

The following example shows how to get the top 5:

infiles = load '/hdfs/bhavesh/Youtube_POC/Youtube/0222/{0,1,2,3,4}.txt' using PigStorage('\t') as 
 (videoid:chararray,uploader:chararray,age:int,category:chararray,length:int,views:int,rate:int,rating:int,comments:int,related_id:chararray);
files = FILTER infiles BY category is not null;
grpn_for_catagories = group files by category;
cnt_for_catagories = foreach grpn_for_catagories generate group, COUNT(files.videoid) as counting;
sorted_for_catagories_desc = order cnt_for_catagories by counting desc;
top5_for_catagories = limit sorted_for_catagories_desc 5;

A detailed explanation is available at:

http://ybhavesh.blogspot.in/2015/08/proof-of-concept-or-poc-on-youtube-data.html

Hope this helps!

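Answer 2 (score: 0)

-- Sorts ascending (the default), so this returns the least frequent word
A = LOAD 'file' USING PigStorage() AS (name:chararray, count:int);
B = ORDER A BY count;
C = LIMIT B 1;
D = FOREACH C GENERATE name;
DUMP D;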
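For the word-count case in the question, the whole pipeline from raw text to the most and least frequent word might look roughly like the sketch below. The path 'input.txt' and all alias names are hypothetical, and the input is assumed to be plain text with one line per record.

```pig
-- Hypothetical input path; each record is one line of text
lines = LOAD 'input.txt' AS (line:chararray);

-- Split each line into words and flatten the resulting bag into one row per word
words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;

-- Count occurrences of each distinct word
grouped = GROUP words BY word;
counts  = FOREACH grouped GENERATE group AS word, COUNT(words) AS cnt;

-- Most frequent word: highest count first
sorted_desc = ORDER counts BY cnt DESC;
most        = LIMIT sorted_desc 1;

-- Least frequent word: lowest count first
sorted_asc = ORDER counts BY cnt ASC;
least      = LIMIT sorted_asc 1;

DUMP most;
DUMP least;
```

With LIMIT 1, only one word is returned even when several words share the same count, so ties are not reported.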