Result set inconsistency between hive and hive-llap

Time: 2020-07-30 17:51:45

Tags: hive azure-hdinsight qubole spark-hive

We are using Hive 3.1.x clusters on HDI 4.0: one is an LLAP cluster and the other is plain Hive.

We created a managed table on both clusters, with a row count of 272409.

Before the merge, on both clusters:

+---------------------+------------+---------------------+------------------------+------------------------+
| order_created_date  | col_count  | col_distinct_count  |        min_lmd         |        max_lmd         |
+---------------------+------------+---------------------+------------------------+------------------------+
| 20200615            | 272409     | 272409              | 2020-06-15 00:00:12.0  | 2020-07-26 23:42:17.0  |
+---------------------+------------+---------------------+------------------------+------------------------+
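A summary like the one above could come from an aggregate query along the following lines. This is only a sketch: the table name `orders` and the column names `col` and `lmd` are assumptions inferred from the result headers, not names given in the question.

```sql
-- Hypothetical names: orders (table), col (counted column), lmd (last-modified timestamp).
-- Adjust to the real schema before running.
SELECT order_created_date,
       COUNT(col)          AS col_count,           -- total non-null values
       COUNT(DISTINCT col) AS col_distinct_count,  -- unique values
       MIN(lmd)            AS min_lmd,
       MAX(lmd)            AS max_lmd
FROM orders
GROUP BY order_created_date;
```

Running the same statement on both clusters makes the counts directly comparable.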

Based on the delta, we then performed a merge operation, which updated 17 rows.
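For readers unfamiliar with the operation, a Hive `MERGE` that only updates matched rows has roughly the shape below. Every name here (`orders`, `orders_delta`, `col`, `lmd`) is hypothetical; the actual statement used in the question is not shown.

```sql
-- Sketch only: table and column names are assumptions, not taken from the question.
-- The target must be a transactional (ACID) table, which Hive 3 managed tables are by default.
MERGE INTO orders AS t
USING orders_delta AS s
ON t.col = s.col
WHEN MATCHED THEN
  UPDATE SET lmd = s.lmd;  -- an update on an ACID table is written as delta files until compaction
```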

After the merge on the hive-llap cluster (before compaction):

+---------------------+------------+---------------------+------------------------+------------------------+
| order_created_date  | col_count  | col_distinct_count  |        min_lmd         |        max_lmd         |
+---------------------+------------+---------------------+------------------------+------------------------+
| 20200615            | 272409     | 272392              | 2020-06-15 00:00:12.0  | 2020-07-27 22:52:34.0  |
+---------------------+------------+---------------------+------------------------+------------------------+

After the merge on the hive-llap cluster (after compaction):

+---------------------+------------+---------------------+------------------------+------------------------+
| order_created_date  | col_count  | col_distinct_count  |        min_lmd         |        max_lmd         |
+---------------------+------------+---------------------+------------------------+------------------------+
| 20200615            | 272409     | 272409              | 2020-06-15 00:00:12.0  | 2020-07-27 22:52:34.0  |
+---------------------+------------+---------------------+------------------------+------------------------+

After the merge on the plain Hive cluster (without compacting the deltas):

+---------------------+------------+---------------------+------------------------+------------------------+
| order_created_date  | col_count  | col_distinct_count  |        min_lmd         |        max_lmd         |
+---------------------+------------+---------------------+------------------------+------------------------+
| 20200615            | 272409     | 272409              | 2020-06-15 00:00:12.0  | 2020-07-27 22:52:34.0  |
+---------------------+------------+---------------------+------------------------+------------------------+

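This is the inconsistency we observed: before compaction, the hive-llap cluster reports a distinct count of 272392 instead of 272409, which is 17 fewer and matches the number of rows the merge updated.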

However, after compacting the table on hive-llap, the inconsistency disappears and both clusters return the same result.
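For reference, a compaction can be requested manually with standard Hive DDL. The sketch below assumes a hypothetical table name `orders`; if the table is partitioned, a `PARTITION (...)` clause must be added to the `ALTER TABLE` statement.

```sql
-- Hypothetical table name; add PARTITION (...) for a partitioned table.
ALTER TABLE orders COMPACT 'major';  -- queue a major compaction, merging base and delta files

SHOW COMPACTIONS;                    -- check the state of queued and completed compactions
```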

We thought it might be due to either caching or an LLAP issue, so we restarted the hive-server2 process to clear the cache. The issue still persists.

We also created a dummy table with the same schema on the plain Hive cluster and pointed its location at that of the LLAP table. Queries against this table return the expected result.

We even queried the table from Spark using the **Qubole spark-acid reader** (a direct reader for Hive managed tables), which also returns the expected result.

This is quite strange. Can anyone help here?

2 answers:

Answer 0: (score: 1)

We ran into a similar issue on an HDInsight Hive LLAP cluster. The problem was resolved after setting hive.llap.io.enabled to false.
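As a quick way to try this, the property can be set for the current session before re-running the query. This is a sketch that assumes your cluster allows the property to be overridden at session level; otherwise it would need to be changed in the cluster's Hive configuration.

```sql
-- Turn off the LLAP IO layer (and with it the LLAP data cache) for this session.
SET hive.llap.io.enabled=false;
```

If the counts then match the plain Hive cluster, that points at the LLAP IO path rather than at the merge itself.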

Answer 1: (score: 0)

Qubole does not support Hive LLAP yet. (However, we at Qubole are evaluating whether to support it in the future.)