Question

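I have a question about data locality in Impala. Suppose I have a cluster of 10 data nodes (with an impalad on each data node), and I run a query in Impala such as SELECT * FROM big_table WHERE dt='2017' AND blabla GROUP BY blabla ORDER BY blabla (assume it is a big query).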

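Also suppose the files under the partition (dt='2017') are on DNs 1, 3, and 5. When I run the query, will the coordinator use only daemons 1, 3, and 5 for data locality, or will it use all the daemons, with the other daemons reading this data remotely?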

Answer 1

Short answer to your question: it uses only daemons 1, 3, and 5, for data locality.

This is generally a scheduling question. Impala makes this decision in simple-scheduler.cc.

// We schedule greedily in this order:
// cached collocated replicas > collocated replicas > remote (cached or not) replicas.
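
To make the priority order in that comment concrete, here is a minimal Python sketch. It is not Impala's code: the names Replica, replica_rank, and pick_backend are hypothetical, and the fallback for the remote case (taking the first backend in sorted order) is an assumption chosen only to keep the example deterministic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Replica:
    host: str     # data node that holds this copy of the block
    cached: bool  # True if the block is cached on that host


def replica_rank(replica, backend_hosts):
    """Lower is better: 0 = cached and collocated, 1 = collocated, 2 = remote."""
    if replica.host in backend_hosts:
        return 0 if replica.cached else 1
    return 2


def pick_backend(replicas, backend_hosts):
    """Greedily choose the backend that should scan a block."""
    best = min(replicas, key=lambda r: replica_rank(r, backend_hosts))
    if replica_rank(best, backend_hosts) < 2:
        return best.host  # a backend runs on a host that holds the data
    # No backend is collocated with any replica, so the read is remote.
    # Assumption for this sketch: take the first backend in sorted order.
    return sorted(backend_hosts)[0]


backends = {f"dn{i}" for i in range(1, 11)}
replicas = [Replica("dn3", False), Replica("dn1", False), Replica("dn5", True)]
chosen = pick_backend(replicas, backends)
```

With an impalad on every data node, the chosen host is always one that holds a replica, which is why only daemons 1, 3, and 5 end up scanning in the scenario from the question.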

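If a backend is collocated with the data, Impala will not use other backends to scan it. For fragments without a scan node, such as a partitioned aggregation node, Impala places them on the same hosts as their input fragment.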

  // there is no leftmost scan; we assign the same hosts as those of our
  // leftmost input fragment (so that a partitioned aggregation fragment
  // runs on the hosts that provide the input data)
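
The host-inheritance rule in that comment can be sketched in the same way. This is only an illustration under simplifying assumptions: the dictionary layout and the function assign_hosts are hypothetical, and a fragment is modeled as either having scan hosts of its own or having a list of input fragments.

```python
def assign_hosts(name, fragments):
    """Return the set of hosts a fragment runs on.

    A fragment that has a scan keeps the hosts chosen for that scan.
    A fragment without one inherits the hosts of its leftmost input fragment.
    """
    fragment = fragments[name]
    if fragment["scan_hosts"] is not None:
        return set(fragment["scan_hosts"])
    leftmost_input = fragment["inputs"][0]
    return assign_hosts(leftmost_input, fragments)


# Hypothetical plan: a scan fragment placed on the three data nodes from the
# question, feeding a partitioned aggregation fragment that has no scan.
plan = {
    "scan": {"scan_hosts": {"dn1", "dn3", "dn5"}, "inputs": []},
    "agg": {"scan_hosts": None, "inputs": ["scan"]},
}
agg_hosts = assign_hosts("agg", plan)
```

In this model the aggregation fragment lands on the same three hosts as the scan, so the aggregation runs where its input is produced.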
