Question

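I partitioned my data in Hive by a column value (the date), so each date has its own directory under /warehouse. I now have about 240 dates and 70 million records in total, distributed evenly across the dates.

I also created another table that holds the same data without partitioning.

When I run the same query against both tables, the partitioned table does not always outperform the non-partitioned one. More specifically, the partitioned table is slower when the query uses group by.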

select count(*) from not_partitioned_table where date > '2018-07-27' and date < '2018-08-27';

This took 22.146 seconds and returned a count of 7427366.

select count(*) from partitioned_table where date > '2018-07-27' and date < '2018-08-27';

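This took 22.723 seconds and also returned a count of 7427366.

However, once group by is added, the partitioned table performs worse than the non-partitioned table.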

select count(*) from not_partitioned_table where date > '2018-07-27' and date < '2018-08-27' group by col_name;

This took 39.733 seconds and returned 26,724 rows.

select count(*) from partitioned_table where date > '2018-07-27' and date < '2018-08-27' group by col_name;

This took 76.648 seconds and returned 26,724 rows.

Why is the partitioned table slower in this case?
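
One way to narrow this down, sketched here rather than actually run, would be to compare the plans Hive produces for the two grouped queries using EXPLAIN. The table and column names are the placeholders used in the queries above. The backquotes around date are an addition for this sketch, since date is also a type name in HiveQL.

```sql
-- Plan for the grouped query on the non-partitioned table.
EXPLAIN
SELECT count(*)
FROM not_partitioned_table
WHERE `date` > '2018-07-27' AND `date` < '2018-08-27'
GROUP BY col_name;

-- Plan for the same query on the partitioned table.
EXPLAIN
SELECT count(*)
FROM partitioned_table
WHERE `date` > '2018-07-27' AND `date` < '2018-08-27'
GROUP BY col_name;
```

Differences between the two plans, such as the number of stages or how the scan is described, may indicate where the extra time goes.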

Edit

This is how I created the partitioned table:

CREATE TABLE all_ads_from_csv_partitioned3(
id STRING,
...
)
PARTITIONED BY(datedecoded STRING)
STORED AS ORC;

Under /warehouse/tablespace/managed/hive/partitioned_table/ (listing timestamp 2018-10-08 15:34) there are 240 directories (240 partitions), each named in the form /warehouse/tablespace/managed/hive/partitioned_table/dated='the partitioned date', and each partition contains about 10 buckets.
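
The layout described above can also be inspected from within Hive. The following is a minimal sketch. It assumes the partition column is named dated, as in the directory names, and the date value is a hypothetical example. The arithmetic in the comments assumes the roughly 10 "buckets" per partition are data files.

```sql
-- List the partitions; the layout above implies 240 of them.
SHOW PARTITIONS partitioned_table;

-- Inspect one partition (hypothetical date value). When statistics are
-- available, the partition parameters include numFiles, numRows and totalSize.
DESCRIBE FORMATTED partitioned_table PARTITION (dated='2018-08-01');

-- Rough arithmetic under the assumption stated above:
--   240 partitions * 10 files     = about 2,400 files
--   70,000,000 rows / 2,400 files = about 29,000 rows per file
```

If the files really are that small, the per-file overhead of reading many small ORC files may be worth checking as a possible contributor.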

Hive group by is slower on a partitioned table

0 answers: