Discrepancy between statistics collected by the ANALYZE TABLE command in Hive and actual table results

Time: 2019-09-12 20:14:40

Tags: hive statistics hiveql

I am trying to use the ANALYZE TABLE table_name PARTITION(partition-spec =) COMPUTE STATISTICS FOR COLUMNS command.

I can't understand why the statistics it reports differ from what a SELECT statement computes for the same table, column and partition.

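For example, for one specific partition, I ran COMPUTE STATISTICS FOR COLUMNS and then used DESCRIBE FORMATTED table_name.var_name PARTITION(partition-spec =), which showed the following result: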

+-------------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+--+
|        col_name         |       data_type       |          min          |          max          |       num_nulls       |    distinct_count     |      avg_col_len      |      max_col_len      |       num_trues       |      num_falses       |        comment        |
+-------------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+--+
| # col_name              | data_type             | min                   | max                   | num_nulls             | distinct_count        | avg_col_len           | max_col_len           | num_trues             | num_falses            | comment               |
|                         | NULL                  | NULL                  | NULL                  | NULL                  | NULL                  | NULL                  | NULL                  | NULL                  | NULL                  | NULL                  |
| sub_id                  | bigint                | 100000000003631773    | 112330000086219636    | 0                     | 403024                |                       |                       |                       |                       | from deserializer     |
+-------------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+--+

But running SELECT COUNT(DISTINCT SUB_ID) FROM table_name WHERE partition = yyyymmdd, I get the following result:

+---------+--+
|   qid   |
+---------+--+
| 465001  |
+---------+--+
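
For reference, the three steps above can be sketched roughly as follows. The partition column name `part_col` and the value `'<yyyymmdd>'` are placeholders, since the actual partition spec is not shown here, and the value may need to be unquoted if the partition column is numeric. The `table_name.column` form of DESCRIBE FORMATTED is the one used above; the accepted syntax may differ between Hive versions.

```sql
-- Placeholders (assumptions, not from the original post):
--   part_col      hypothetical partition column name
--   '<yyyymmdd>'  the partition value being analyzed

-- 1. Collect column-level statistics for sub_id in one partition.
ANALYZE TABLE table_name PARTITION (part_col = '<yyyymmdd>')
COMPUTE STATISTICS FOR COLUMNS sub_id;

-- 2. Show the stored statistics (min, max, num_nulls, distinct_count, ...)
--    for that column in the same partition.
DESCRIBE FORMATTED table_name.sub_id PARTITION (part_col = '<yyyymmdd>');

-- 3. Count the distinct values directly, for comparison with distinct_count.
SELECT COUNT(DISTINCT sub_id)
FROM table_name
WHERE part_col = '<yyyymmdd>';
```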

Does anyone know why distinct_count in the statistics (403024) differs from the COUNT(DISTINCT) result (465001)?

Thanks!

0 answers:

No answers yet