I have a table in Hive with the following structure:
create table if not exists cdp_compl_status
(
EmpNo INT,
RoleCapability STRING,
EmpPUCode STRING,
SBUCode STRING,
CertificationCode STRING,
CertificationTitle STRING,
Competency STRING,
Certification_Type STRING,
Certification_Group STRING,
Contact_Based_Program_Y_N STRING,
ExamDate DATE,
Onsite_Offshore STRING,
AttendedStatus STRING,
Marks INT,
Result STRING,
Status STRING,
txtPlanCategory STRING,
SkillID1 INT,
Complexity STRING
)
CLUSTERED BY (Marks) INTO 5 BUCKETS
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
TBLPROPERTIES('created on' = '12 Aug');
Now I want to query MAX(Marks) for each bucket of the table. If I run:
SELECT MAX(MARKS) from cdp_compl_status;
it returns the maximum marks over the entire table. Is there a way to find MAX(Marks) within each bucket?
Answer 0 (score: 2)
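Since you have clustered the table into 5 buckets, the rows are distributed across the buckets by a hash of the clustering column modulo the number of buckets. For an INT column such as Marks, assuming the hash is the integer value itself, that works out to:
marks%5==0
goes into the first bucket
marks%5==1
goes into the second bucket
marks%5==2
goes into the third bucket
marks%5==3
goes into the fourth bucket
marks%5==4
goes into the fifth bucket
So you need to write 5 queries like this: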
Select max(marks) from cdp_compl_status where marks%5=0;
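- returns the maximum marks in the first bucket; change the remainder to 1 through 4 for the other four buckets.
I think that should do it.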
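The five separate queries can probably be collapsed into one by grouping on the same modulo expression. This is only a sketch: it assumes Marks is never negative and that bucket assignment really is Marks % 5 (newer Hive bucketing versions may hash differently, in which case the groups would not line up with the physical buckets). The aliases bucket_no and max_marks are illustrative names, not part of the original table.

```sql
-- Assumption: rows land in bucket (Marks % 5) and Marks >= 0.
-- bucket_no is numbered 1..5 so that it reads as "first bucket" .. "fifth bucket".
SELECT (Marks % 5) + 1 AS bucket_no,
       MAX(Marks)      AS max_marks
FROM cdp_compl_status
WHERE Marks IS NOT NULL          -- NULL % 5 is NULL and would form its own group
GROUP BY (Marks % 5) + 1
ORDER BY bucket_no;
```

This returns at most five rows, one per non-empty bucket, instead of requiring one query per remainder.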
Answer 1 (score: 1)
Use TABLESAMPLE, with one query per bucket:
select max(marks),min(marks),avg(marks) from cert_comp_status_buck
tablesample(bucket 1 out of 5 on marks);
select max(marks),min(marks),avg(marks) from cert_comp_status_buck
tablesample(bucket 2 out of 5 on marks);
select max(marks),min(marks),avg(marks) from cert_comp_status_buck
tablesample(bucket 3 out of 5 on marks);
select max(marks),min(marks),avg(marks) from cert_comp_status_buck
tablesample(bucket 4 out of 5 on marks);
select max(marks),min(marks),avg(marks) from cert_comp_status_buck
tablesample(bucket 5 out of 5 on marks);
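If running five statements is inconvenient, one alternative is to group by Hive's INPUT__FILE__NAME virtual column, which exposes the file each row was read from. This is a sketch under an assumption: the table was populated by a single INSERT with bucketing enforced, so that each bucket is stored as exactly one file. If the data was loaded another way, or appended several times, the files may not map one-to-one onto buckets. The alias bucket_file is illustrative.

```sql
-- Assumption: one file per bucket, so grouping by file equals grouping by bucket.
SELECT INPUT__FILE__NAME AS bucket_file,
       MAX(marks)        AS max_marks,
       MIN(marks)        AS min_marks,
       AVG(marks)        AS avg_marks
FROM cert_comp_status_buck
GROUP BY INPUT__FILE__NAME;
```

Unlike the modulo approach, this does not depend on knowing how Hive hashes the clustering column, only on how the files were written.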