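Find the most repeated value in a field using Hive or Pig

Date: 2016-04-11 07:46:31

Tags: hive apache-pig bigdata

How can I find the most frequently repeated value in a field using Hive or Pig? The values in the database have the following format: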

cricket,Football,Basketball,Volleyball 
cricket,Football,Basketball
Running cricket,Football
Basketball,Volleyball Football,Basketball,Volleyball,Baseball,Cycling
Running Shooting,Football,Running

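I want to find the most common game in the list.

2 answers:

Answer 0: (score: 1)

Do a word count on the data, then select the word (or words) with the maximum count.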

lines = LOAD 'test4.txt' as (line:chararray);
sports = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) as sport;
groupedsport = GROUP sports BY sport;
sportcount = FOREACH groupedsport GENERATE group as sport, COUNT(sports) as total;
groupedsportcount = GROUP sportcount ALL;
maxvalue = FOREACH groupedsportcount  GENERATE MAX(sportcount.total);
maxsportcount = FILTER sportcount BY (total == maxvalue.$0);
DUMP maxsportcount;

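The same result can also be obtained by sorting the counts in descending order and limiting the output to 1. However, if several words share the maximum count, this will not return all of them.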

lines = LOAD 'test4.txt' as (line:chararray);
sports = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) as sport;
groupedsport = GROUP sports BY sport;
sportcount = FOREACH groupedsport GENERATE group as sport, COUNT(sports) as total;
orderedsportcount = ORDER sportcount BY total DESC;
maxsportcount= LIMIT orderedsportcount 1;
DUMP maxsportcount;

Output

Answer 1: (score: 1)

I copied the text into a file named m.txt and did the following to get the desired output.

TOKENIZE

We will use the TOKENIZE function to split a string of words (all words in a single tuple) into a bag of words (each word in its own tuple).

tokens = foreach str generate TOKENIZE(str);
dump tokens;

The output is in the form of bags:

({(cricket),(Football),(Basketball),(Volleyball)})
({(cricket),(Football),(Basketball)})
({(Running),(cricket),(Football)})
({(Basketball),(Volleyball),(Football),(Basketball),(Volleyball),(Baseball),(Cycling)})
({(Running),(Shooting),(Football),(Running)})
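TOKENIZE handles this data without extra arguments because, according to the Pig documentation, its default delimiters include both the space and the comma (along with the double quote, parentheses and asterisk). A minimal sketch of the difference, assuming the same str relation as above and a Pig version whose TOKENIZE accepts an optional delimiter argument; the alias comma_only is illustrative and not part of the original answer:

```pig
-- Default delimiters: the line "Running cricket,Football" becomes
-- {(Running),(cricket),(Football)}.
tokens = foreach str generate TOKENIZE(str);

-- Comma as the only delimiter: the space no longer splits, so the same
-- line would yield {(Running cricket),(Football)} and skew the counts.
comma_only = foreach str generate TOKENIZE(str, ',');
dump comma_only;
```

For input that mixes spaces and commas, as here, the default delimiters are the safer choice.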

FLATTEN

FLATTEN un-nests tuples and bags. For a tuple, FLATTEN substitutes the fields of the tuple in place of the tuple. When we un-nest a bag, we create new tuples.

tokens = foreach str generate FLATTEN(TOKENIZE(str));
dump tokens;

(cricket)
(Football)
(Basketball)
(Volleyball)
(cricket)
(Football)
(Basketball)
(Running)
(cricket)
(Football)
(Basketball)
(Volleyball)
(Football)
(Basketball)
(Volleyball)
(Baseball)
(Cycling)
(Running)
(Shooting)
(Football)
(Running)

LOWER

For more accurate results, convert all the strings/words to a single case so that words differing only in capitalization are counted together. Use LOWER to convert them to lowercase:

tokens = foreach str generate FLATTEN(TOKENIZE(LOWER(str)));

You can also use UPPER to convert them to uppercase.

The output will be:

(cricket)
(football)
(basketball)
(volleyball)
(cricket)
(football)
(basketball)
(running)
(cricket)
(football)
(basketball)
(volleyball)
(football)
(basketball)
(volleyball)
(baseball)
(cycling)
(running)
(shooting)
(football)
(running)

GROUP

GROUP groups the data in one or more relations.

grps = group tokens by $0;
dump grps;

Here group created 2 fields, one at $0 and the other at $1. $0 is the key (i.e. the key field) and $1 is the bag of tuples that share the same group key.

The output shows the fields grouped by key:

(Cycling,{(Cycling)})
(Running,{(Running),(Running),(Running)})
(cricket,{(cricket),(cricket),(cricket)})
(Baseball,{(Baseball)})
(Football,{(Football),(Football),(Football),(Football),(Football)})
(Shooting,{(Shooting)})
(Basketball,{(Basketball),(Basketball),(Basketball),(Basketball)})
(Volleyball,{(Volleyball),(Volleyball),(Volleyball)})

COUNT

The COUNT function computes the number of tuples ($1) for each key field ($0).

cnt = foreach grps generate $0, COUNT($1);
dump cnt;

The output shows the count of each word:

(Cycling,1)
(Running,3)
(cricket,3)
(Baseball,1)
(Football,5)
(Shooting,1)
(Basketball,4)
(Volleyball,3)
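Positional references such as $0 and $1 are compact, but named fields can make the later steps easier to follow. A minimal sketch of the same count, assuming the grps relation was built by grouping tokens as shown earlier; the field names word and total are illustrative and not part of the original answer:

```pig
-- GROUP names its key field "group" and names the bag after the
-- grouped relation ("tokens"), so both can be referenced by name.
cnt = foreach grps generate group as word, COUNT(tokens) as total;

-- Print the schema to confirm the field names.
describe cnt;
```

Statements that refer to $1 keep working, since positional references do not depend on the field names.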

ORDER is used to sort the tuples in descending order, so the word with the highest count comes first.

ord = order cnt by $1 desc;
dump ord;

Output after ordering:

(Football,5)
(Basketball,4)
(Running,3)
(cricket,3)
(Volleyball,3)
(Cycling,1)
(Baseball,1)
(Shooting,1)

LIMIT: limits the number of output tuples to the specified number.

maxWord = limit ord 1;
dump maxWord;

The final output is:

(Football,5)
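Putting the steps together, the whole pipeline fits in one short script. The sketch below is a variant under stated assumptions rather than the answer's exact code: it adds a LOAD statement (the answer does not show one, so the aliases and field names here are illustrative), applies LOWER before counting, and replaces LIMIT 1 with RANK, which should be available in Pig 0.11 or later, so that every word tied for first place is kept.

```pig
-- Assumption: m.txt holds one record per line, as in the question.
lines = LOAD 'm.txt' AS (line:chararray);

-- Lowercase first so that "Cricket" and "cricket" count as one word.
words = FOREACH lines GENERATE FLATTEN(TOKENIZE(LOWER(line))) AS word;

grouped = GROUP words BY word;
counts = FOREACH grouped GENERATE group AS word, COUNT(words) AS total;

-- RANK prepends a rank column; words with equal totals share a rank,
-- so filtering on rank 1 keeps all words tied for the highest count.
ranked = RANK counts BY total DESC;
top_words = FILTER ranked BY $0 == 1;
DUMP top_words;
```

With the sample data only football reaches a count of 5, so a single tuple is expected; if two words shared the top count, both would be returned.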