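I am joining multiple tables, and the total row count is about 25 billion. On top of that, I am doing an aggregation. Below are the Hive settings I use to generate the final output. I am not sure how to tune the query to make it run faster. So far I have been using trial and error to see whether anything helps, but that does not seem to be working. The mappers run fast, but the reducers take forever to finish. Could anyone share their thoughts on this? Thanks.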
SET hive.execution.engine=tez;
SET hive.exec.dynamic.partition.mode=nonstrict;
SET hive.qubole.cleanup.partial.data.on.failure=true;
SET hive.tez.container.size=8192;
SET tez.task.resource.memory.mb=8192;
SET tez.task.resource.cpu.vcores=2;
SET hive.mapred.mode=nonstrict;
SET hive.qubole.dynpart.use.prefix=true;
SET hive.vectorized.execution.enabled=true;
SET hive.vectorized.execution.reduce.enabled =true;
SET hive.cbo.enable=true;
SET hive.compute.query.using.stats=true;
SET hive.stats.fetch.column.stats=true;
SET hive.stats.fetch.partition.stats=true;
SET mapred.reduce.tasks = -1;
SET hive.auto.convert.join.noconditionaltask.size=2730;
SET hive.auto.convert.join=true;
SET hive.auto.convert.join.noconditionaltask=true;
SET hive.auto.convert.join.noconditionaltask.size=8053063680;
SET hive.compute.query.using.stats=true;
SET hive.stats.fetch.column.stats=true;
SET hive.stats.fetch.partition.stats=true;
SET mapreduce.job.reduce.slowstart.completedmaps=0.8;
set hive.tez.auto.reducer.parallelism = true;
set hive.exec.reducers.max=100;
set hive.exec.reducers.bytes.per.reducer=1024000000;
SQL:
SELECT D.d
,D.b
,COUNT(DISTINCT A.x) AS cnt
,SUM(c) AS sum
FROM A
LEFT JOIN
B
ON A.a = B.b
LEFT JOIN
C
ON B.b = C.c
JOIN
D
ON A.a >= D.d
AND A.a <= D.d
GROUP BY 1,2
CLUSTER BY D.d;
Answer 0 (score: 1)
There is no query plan yet, so there may be something else going on, but these settings definitely limit reducer parallelism:
set hive.exec.reducers.max=100;
set hive.exec.reducers.bytes.per.reducer=1024000000;
I suggest raising the maximum number of reducers and lowering the bytes per reducer, which increases reducer parallelism:
set hive.exec.reducers.max=5000;
set hive.exec.reducers.bytes.per.reducer=67108864;
Hive 1.2.0+ also provides an automatic rewrite optimization for count(distinct). Check this setting; it should be true by default:
hive.optimize.distinct.rewrite=true;
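To illustrate the idea behind this kind of rewrite, here is a minimal sketch of the same transformation done by hand. The table `t` and its columns `k`, `x`, and `c` are hypothetical stand-ins for the joined result in the question, and the plan Hive actually generates may differ from this.

```sql
-- Hypothetical table t(k, x, c); not the asker's actual schema.

-- Original shape: each group must deduplicate x in a single aggregation stage.
SELECT k,
       COUNT(DISTINCT x) AS cnt,
       SUM(c)            AS total
FROM t
GROUP BY k;

-- Manual two-stage equivalent:
-- stage 1 collapses duplicates of (k, x) and keeps a partial sum,
-- stage 2 counts the surviving non-NULL x values and adds up the partial sums.
SELECT k,
       COUNT(x)         AS cnt,
       SUM(partial_sum) AS total
FROM (
  SELECT k,
         x,
         SUM(c) AS partial_sum
  FROM t
  GROUP BY k, x
) s
GROUP BY k;
```

The first stage groups on `(k, x)`, so its work is spread over more distinct keys than a grouping on `k` alone, which may help when a few groups are very large.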
If your query gets stuck on the last few reducers, then there is skew in the join keys.
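One way to check for skew is to count rows per join key and look at the heaviest values. The sketch below uses table `A` and column `a` from the question; the `LIMIT` of 20 is an arbitrary choice, and the same check would need to be repeated for the other join keys.

```sql
-- List the most frequent join-key values in A.
-- A few keys (often NULL or a default value) with far more rows than
-- the rest would suggest that they all land on the same reducers.
SELECT a,
       COUNT(*) AS row_cnt
FROM A
GROUP BY a
ORDER BY row_cnt DESC
LIMIT 20;
```

If a handful of keys dominate, filtering them out or handling them in a separate query and combining the results is a common workaround.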