How can I find the most frequently repeated value in a field using Hive or Pig? The values in the database are in the following format:
cricket,Football,Basketball,Volleyball
cricket,Football,Basketball
Running cricket,Football
Basketball,Volleyball Football,Basketball,Volleyball,Baseball,Cycling
Running Shooting,Football,Running
I want to find the most common game in the list.
Answer 0 (score: 1)
Do a word count on the data, then select the word(s) with the maximum count.
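-- Load each line of the file as a single string.
lines = LOAD 'test4.txt' as (line:chararray);
-- Split each line into words, one word per tuple.
sports = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) as sport;
-- Count the occurrences of each word.
groupedsport = GROUP sports BY sport;
sportcount = FOREACH groupedsport GENERATE group as sport, COUNT(sports) as total;
-- Find the maximum count across all words.
groupedsportcount = GROUP sportcount ALL;
maxvalue = FOREACH groupedsportcount GENERATE MAX(sportcount.total);
-- Keep every word whose count equals the maximum.
maxsportcount = FILTER sportcount BY (total == maxvalue.$0);
DUMP maxsportcount;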
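Alternatively, you can sort the counts in descending order and limit the output to 1. However, if several words share the maximum count, this will not return all of them.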
lines = LOAD 'test4.txt' as (line:chararray);
sports = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) as sport;
groupedsport = GROUP sports BY sport;
sportcount = FOREACH groupedsport GENERATE group as sport, COUNT(sports) as total;
orderedsportcount = ORDER sportcount BY total DESC;
maxsportcount = LIMIT orderedsportcount 1;
DUMP maxsportcount;
Output:
(Football,5)
Answer 1 (score: 1)
I copied the text into a file named m.txt and did the following to get the desired output.
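The statements that follow read from a relation named str, but the load step itself is not shown. A minimal sketch of it, assuming m.txt is in the current working directory and that the single field is also named str (both inferred from the later statements, not stated in the answer):

```pig
-- Assumption: each line of m.txt is loaded as one chararray field.
-- The relation and the field are both called str here only because the
-- later statements use "foreach str generate TOKENIZE(str)".
str = LOAD 'm.txt' AS (str:chararray);

-- Print the schema to confirm that the field was loaded as expected.
DESCRIBE str;
```

If sharing one name between the relation and the field is confusing, the field can also be referenced by position as $0.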
TOKENIZE: this function splits a string of words (all words in a single tuple) into a bag of words (each word in its own tuple).
tokens = foreach str generate TOKENIZE(str);
dump tokens;
The output is in the form of bags:
({(cricket),(Football),(Basketball),(Volleyball)})
({(cricket),(Football),(Basketball)})
({(Running),(cricket),(Football)})
({(Basketball),(Volleyball),(Football),(Basketball),(Volleyball),(Baseball),(Cycling)})
({(Running),(Shooting),(Football),(Running)})
FLATTEN: it un-nests tuples as well as bags. For a tuple, FLATTEN substitutes the fields of the tuple in place of the tuple. When we un-nest a bag, we create new tuples.
tokens = foreach str generate FLATTEN(TOKENIZE(str));
dump tokens;
The output will be:
(cricket)
(Football)
(Basketball)
(Volleyball)
(cricket)
(Football)
(Basketball)
(Running)
(cricket)
(Football)
(Basketball)
(Volleyball)
(Football)
(Basketball)
(Volleyball)
(Baseball)
(Cycling)
(Running)
(Shooting)
(Football)
(Running)
LOWER: for more accurate results, convert all the words to a single case so that the counts come out right (for example, so that cricket and Cricket are counted as the same word). So use LOWER; you can also use UPPER.
tokens = foreach str generate FLATTEN(TOKENIZE(LOWER(str)));
The output will be:
(cricket)
(football)
(basketball)
(volleyball)
(cricket)
(football)
(basketball)
(running)
(cricket)
(football)
(basketball)
(volleyball)
(football)
(basketball)
(volleyball)
(baseball)
(cycling)
(running)
(shooting)
(football)
(running)
Group: groups the data in one or more relations.
grps = group tokens by $0;
dump grps;
Here, group creates two fields, one at $0 and the other at $1. $0 holds the key, and $1 holds the tuples that share the same group key (i.e. $0 is the key field).
The output shows the fields grouped by key:
(Cycling,{(Cycling)})
(Running,{(Running),(Running),(Running)})
(cricket,{(cricket),(cricket),(cricket)})
(Baseball,{(Baseball)})
(Football,{(Football),(Football),(Football),(Football),(Football)})
(Shooting,{(Shooting)})
(Basketball,{(Basketball),(Basketball),(Basketball),(Basketball)})
(Volleyball,{(Volleyball),(Volleyball),(Volleyball)})
COUNT: this function counts the number of tuples ($1) for each key field ($0).
cnt = foreach grps generate $0, COUNT($1);
dump cnt;
The output shows the count of each word:
(Cycling,1)
(Running,3)
(cricket,3)
(Baseball,1)
(Football,5)
(Shooting,1)
(Basketball,4)
(Volleyball,3)
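The positional references $0 and $1 can also be given names, which tends to make longer scripts easier to read. A sketch of the same steps with named fields; the aliases lines and words and the field names word and total are illustrative choices, not part of the answer above:

```pig
-- Assumption: m.txt holds the sample data, one record per line.
lines = LOAD 'm.txt' AS (line:chararray);

-- One word per tuple, with the field named "word" instead of $0.
words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;

-- GROUP yields two fields: "group" (the key) and a bag that takes
-- the name of the grouped relation ("words").
grps = GROUP words BY word;

-- Same as "generate $0, COUNT($1)", but with readable names.
cnt = FOREACH grps GENERATE group AS word, COUNT(words) AS total;
DUMP cnt;
```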
ORDER: used to arrange the tuples in descending order of count, so the highest count comes first.
ord = order cnt by $1 desc;
dump ord;
Output after ordering:
(Football,5)
(Basketball,4)
(Running,3)
(cricket,3)
(Volleyball,3)
(Cycling,1)
(Baseball,1)
(Shooting,1)
LIMIT: it limits the number of output tuples to the specified number.
maxWord = limit ord 1;
dump maxWord;
The final output is:
(Football,5)
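Because LIMIT 1 keeps only one tuple, a tie for first place would be cut down to a single, arbitrarily chosen word. A sketch that combines the steps above into one script, lowercases the words, and keeps every word that reaches the maximum count; the aliases and field names are illustrative, and the file name m.txt is taken from this answer:

```pig
-- Assumption: m.txt holds the sample data, one record per line.
lines = LOAD 'm.txt' AS (line:chararray);

-- Lowercase each line, split it into words (TOKENIZE treats spaces and
-- commas as separators by default), and emit one word per tuple.
words = FOREACH lines GENERATE FLATTEN(TOKENIZE(LOWER(line))) AS word;

-- Count the occurrences of each word.
grouped = GROUP words BY word;
counts = FOREACH grouped GENERATE group AS word, COUNT(words) AS total;

-- Compute the single highest count across all words.
all_counts = GROUP counts ALL;
max_count = FOREACH all_counts GENERATE MAX(counts.total) AS top;

-- Keep every word whose count equals the maximum, so ties survive.
-- max_count has exactly one row, so it can be used as a scalar.
winners = FILTER counts BY total == max_count.top;
DUMP winners;
```

With the sample data there is no tie, so this should print the single tuple (football,5), lowercased because of LOWER.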