I have a table (data_table) with multiple partition columns: year/month/monthkey.
The directories look like year=2017/month=08/monthkey=2017-08/files.parquet
Which of the following queries will be faster?
select count(*) from data_table where monthkey='2017-08'
or
select count(*) from data_table where monthkey='2017-08' and year = '2017' and month = '08'
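I think Hadoop will spend more time up front locating the required directories in the first case, but I'd like to confirm.
Answer 0 (score: 3)
Finding the relevant partitions is a Metastore operation, not a file system operation. It is done by querying the metadata, not by scanning directories. The Metastore query for the first use case will most likely be faster than the one for the second, because it joins PARTITION_KEY_VALS once instead of three times, but either way we are talking about fractions of a second.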
create external table t100k(i int)
partitioned by (x int,y int,xy string)
;
explain dependency select count(*) from t100k where xy='100-1000';
Query issued against the Metastore:
select "PARTITIONS"."PART_ID"
from "PARTITIONS"
inner join "TBLS" on "PARTITIONS"."TBL_ID" = "TBLS"."TBL_ID" and "TBLS"."TBL_NAME" = 't100k'
inner join "DBS" on "TBLS"."DB_ID" = "DBS"."DB_ID" and "DBS"."NAME" = 'local_db'
inner join "PARTITION_KEY_VALS" "FILTER2" on "FILTER2"."PART_ID" = "PARTITIONS"."PART_ID" and "FILTER2"."INTEGER_IDX" = 2
where (("FILTER2"."PART_KEY_VAL" = '100-1000'))
explain dependency select count(*) from t100k where x=100 and y=1000 and xy='100-1000';
Query issued against the Metastore:
select "PARTITIONS"."PART_ID"
from "PARTITIONS"
inner join "TBLS" on "PARTITIONS"."TBL_ID" = "TBLS"."TBL_ID" and "TBLS"."TBL_NAME" = 't100k'
inner join "DBS" on "TBLS"."DB_ID" = "DBS"."DB_ID" and "DBS"."NAME" = 'local_db'
inner join "PARTITION_KEY_VALS" "FILTER0" on "FILTER0"."PART_ID" = "PARTITIONS"."PART_ID" and "FILTER0"."INTEGER_IDX" = 0
inner join "PARTITION_KEY_VALS" "FILTER1" on "FILTER1"."PART_ID" = "PARTITIONS"."PART_ID" and "FILTER1"."INTEGER_IDX" = 1
inner join "PARTITION_KEY_VALS" "FILTER2" on "FILTER2"."PART_ID" = "PARTITIONS"."PART_ID" and "FILTER2"."INTEGER_IDX" = 2
where ( ( (((case when "FILTER0"."PART_KEY_VAL" <> '__HIVE_DEFAULT_PARTITION__' then cast("FILTER0"."PART_KEY_VAL" as decimal(21,0)) else null end) = 100)
and ((case when "FILTER1"."PART_KEY_VAL" <> '__HIVE_DEFAULT_PARTITION__' then cast("FILTER1"."PART_KEY_VAL" as decimal(21,0)) else null end) = 1000))
and ("FILTER2"."PART_KEY_VAL" = '100-1000')) )
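The difference between the two lookups can be reproduced on a toy scale. The following is only a sketch, assuming a heavily simplified stand-in for the Metastore: an in-memory SQLite database with cut-down versions of PARTITIONS and PARTITION_KEY_VALS, three hypothetical partitions, and a plain integer cast in place of the decimal(21,0) cast and the __HIVE_DEFAULT_PARTITION__ check. The helper names (build_metastore, partitions_by_xy, partitions_by_all_keys) are made up for illustration. It shows that both predicates resolve to the same partition IDs, and that the second one simply needs more joins to get there.

```python
import sqlite3


def build_metastore():
    """Create an in-memory, simplified stand-in for two Metastore tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE PARTITIONS (PART_ID INTEGER PRIMARY KEY, PART_NAME TEXT);
        CREATE TABLE PARTITION_KEY_VALS (
            PART_ID INTEGER, PART_KEY_VAL TEXT, INTEGER_IDX INTEGER);
    """)
    # Hypothetical partitions of t100k, as (x, y, xy) tuples.
    parts = [(100, 1000, "100-1000"), (100, 2000, "100-2000"), (200, 1000, "200-1000")]
    for part_id, (x, y, xy) in enumerate(parts, start=1):
        conn.execute("INSERT INTO PARTITIONS VALUES (?, ?)",
                     (part_id, f"x={x}/y={y}/xy={xy}"))
        # One row per partition key; INTEGER_IDX is the key's position (0, 1, 2).
        for idx, val in enumerate((x, y, xy)):
            conn.execute("INSERT INTO PARTITION_KEY_VALS VALUES (?, ?, ?)",
                         (part_id, str(val), idx))
    return conn


def partitions_by_xy(conn, xy):
    """Filter on the third partition key only: a single join."""
    rows = conn.execute("""
        SELECT P.PART_ID FROM PARTITIONS P
        JOIN PARTITION_KEY_VALS F2
          ON F2.PART_ID = P.PART_ID AND F2.INTEGER_IDX = 2
        WHERE F2.PART_KEY_VAL = ?
    """, (xy,)).fetchall()
    return sorted(r[0] for r in rows)


def partitions_by_all_keys(conn, x, y, xy):
    """Filter on all three partition keys: three joins plus casts."""
    rows = conn.execute("""
        SELECT P.PART_ID FROM PARTITIONS P
        JOIN PARTITION_KEY_VALS F0
          ON F0.PART_ID = P.PART_ID AND F0.INTEGER_IDX = 0
        JOIN PARTITION_KEY_VALS F1
          ON F1.PART_ID = P.PART_ID AND F1.INTEGER_IDX = 1
        JOIN PARTITION_KEY_VALS F2
          ON F2.PART_ID = P.PART_ID AND F2.INTEGER_IDX = 2
        WHERE CAST(F0.PART_KEY_VAL AS INTEGER) = ?
          AND CAST(F1.PART_KEY_VAL AS INTEGER) = ?
          AND F2.PART_KEY_VAL = ?
    """, (x, y, xy)).fetchall()
    return sorted(r[0] for r in rows)


if __name__ == "__main__":
    conn = build_metastore()
    print(partitions_by_xy(conn, "100-1000"))
    print(partitions_by_all_keys(conn, 100, 1000, "100-1000"))
```

Since xy already identifies the partition uniquely, the extra x and y predicates do not narrow the result any further; they only add work to the metadata lookup.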
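Answer 1 (score: 0)
Posting this as an answer because a comment would mangle the formatting. Please accept @Dudu's answer. Run the following command against the Metastore DB (MySQL in my case):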
mysql> select part_id, location, tbl_id, part_name from PARTITIONS as P inner join SDS as S on P.SD_ID = S.SD_ID where P.TBL_ID = 472;
+---------+-------------------------------------------------------------------------+--------+--------------------------------------+
| part_id | location | tbl_id | part_name |
+---------+-------------------------------------------------------------------------+--------+--------------------------------------+
| 7 | hdfs://hostname:8020/tmp/multi_part/2011/01/2011-01 | 472 | year=2011/month=1/year_month=2011-01 |
| 9 | hdfs://hostname:8020/tmp/multi_part/2012/01/2012-01 | 472 | year=2012/month=1/year_month=2012-01 |
+---------+-------------------------------------------------------------------------+--------+--------------------------------------+
2 rows in set (0.00 sec)
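Each partition has a single location, so both queries will read their data from the same HDFS directory. The only speed difference comes from the Metastore database query, as already explained in Dudu's answer.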
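To illustrate why the data location is the same however the partition was selected, here is a minimal sketch that again uses an in-memory SQLite database in place of the real Metastore. The PARTITIONS and SDS tables are cut down to the columns used in the query above, the two rows mirror the output shown, and the SD_ID values (70 and 90) and the helper names are hypothetical.

```python
import sqlite3


def build_location_tables():
    """Create simplified PARTITIONS and SDS tables with two sample partitions."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE SDS (SD_ID INTEGER PRIMARY KEY, LOCATION TEXT);
        CREATE TABLE PARTITIONS (
            PART_ID INTEGER PRIMARY KEY, TBL_ID INTEGER,
            SD_ID INTEGER, PART_NAME TEXT);
    """)
    # (part_id, tbl_id, sd_id, part_name, location); sd_id values are made up.
    rows = [
        (7, 472, 70, "year=2011/month=1/year_month=2011-01",
         "hdfs://hostname:8020/tmp/multi_part/2011/01/2011-01"),
        (9, 472, 90, "year=2012/month=1/year_month=2012-01",
         "hdfs://hostname:8020/tmp/multi_part/2012/01/2012-01"),
    ]
    for part_id, tbl_id, sd_id, name, location in rows:
        conn.execute("INSERT INTO SDS VALUES (?, ?)", (sd_id, location))
        conn.execute("INSERT INTO PARTITIONS VALUES (?, ?, ?, ?)",
                     (part_id, tbl_id, sd_id, name))
    return conn


def partition_location(conn, part_id):
    """Return the storage location recorded for a partition, or None."""
    row = conn.execute("""
        SELECT S.LOCATION
        FROM PARTITIONS P
        JOIN SDS S ON P.SD_ID = S.SD_ID
        WHERE P.PART_ID = ?
    """, (part_id,)).fetchone()
    return row[0] if row else None


if __name__ == "__main__":
    conn = build_location_tables()
    print(partition_location(conn, 7))
```

Once a predicate has been resolved to a PART_ID, the location comes from that single row, so it does not matter which partition columns appeared in the WHERE clause.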