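Hive - Is there a way to further optimize a HiveQL query?

Time: 2018-03-07 13:51:02

Tags: sql hadoop hive hiveql

I have written a query that finds the 10 busiest airports in the USA during March and April. It produces the desired output, but I would like to optimize it further.

Are there any HiveQL-specific optimizations that can be applied to the query? Is GROUPING SETS applicable here? I am new to Hive, and this is the shortest query I have come up with so far.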

SELECT airports.airport, COUNT(Flights.FlightsNum) AS Total_Flights
FROM (
SELECT Origin AS Airport, FlightsNum 
  FROM flights_stats
  WHERE (Cancelled = 0 AND Month IN (3,4))
UNION ALL
SELECT Dest AS Airport, FlightsNum 
  FROM flights_stats
  WHERE (Cancelled = 0 AND Month IN (3,4))
) Flights
INNER JOIN airports ON (Flights.Airport = airports.iata AND airports.country = 'USA')
GROUP BY airports.airport
ORDER BY Total_Flights DESC
LIMIT 10;

The table columns are as follows:

airports

|iata|airport|city|state|country|

Flights_stats

|originAirport|destAirport|FlightsNum|Cancelled|Month|

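4 Answers:

Answer 0: (score: 2)

Filter by airport (inner join) and aggregate before the UNION ALL to reduce the dataset passed to the final aggregation reducer. The UNION ALL subqueries with joins should run in parallel, and this should be faster than joining a larger dataset after the UNION ALL.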

SELECT f.airport, SUM(cnt) AS Total_Flights
FROM (
      SELECT a.airport, COUNT(*) as cnt 
       FROM flights_stats f
            INNER JOIN airports a ON f.Origin=a.iata AND a.country='USA'
       WHERE Cancelled = 0 AND Month IN (3,4)
       GROUP BY a.airport
       UNION ALL
      SELECT a.airport, COUNT(*) as cnt
       FROM flights_stats f
            INNER JOIN airports a ON f.Dest=a.iata AND a.country='USA'
       WHERE Cancelled = 0 AND Month IN (3,4)
       GROUP BY a.airport
     ) f 
GROUP BY f.airport
ORDER BY Total_Flights DESC
LIMIT 10
;

Tune map joins and enable parallel execution:

set hive.exec.parallel=true;
set hive.auto.convert.join=true; --this enables map-join
set hive.mapjoin.smalltable.filesize=25000000; --size of table to fit in memory

Use Tez and vectorization, and tune mapper and reducer parallelism: https://stackoverflow.com/a/48487306/2700344
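
The Tez, vectorization, and parallelism settings mentioned above could look roughly like the sketch below. The property names are standard Hive and Tez settings, but every value is an illustrative assumption and not taken from the linked answer; tune them for your own cluster and Hive version.

```sql
-- Run on Tez instead of MapReduce (assumes Tez is installed on the cluster)
set hive.execution.engine=tez;
-- Process rows in batches; benefits mainly columnar formats such as ORC
set hive.vectorized.execution.enabled=true;
set hive.vectorized.execution.reduce.enabled=true;
-- Reducer parallelism: fewer bytes per reducer means more reducers (value is an assumption)
set hive.exec.reducers.bytes.per.reducer=67108864;
-- Mapper parallelism on Tez: bounds on the grouped split size (values are assumptions)
set tez.grouping.min-size=16777216;
set tez.grouping.max-size=134217728;
```

Running EXPLAIN on the query before and after changing these settings is a cheap way to see whether the plan actually changed.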

Answer 1: (score: 1)

Doing the aggregation earlier in the query might help.

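Answer 2: (score: 1)

I don't think GROUPING SETS applies here, because you are grouping by only one field.

From the Apache Wiki: "The GROUPING SETS clause in GROUP BY allows us to specify more than one GROUP BY option in the same record set."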
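
To illustrate what GROUPING SETS is meant for, here is a minimal sketch that produces several levels of aggregation in one query. It reuses the `Origin`, `Month`, and `Cancelled` columns from the question's query; the particular grouping sets and the `Total_Departures` alias are assumptions chosen purely for illustration.

```sql
-- One query that returns three levels of aggregation:
--   (Origin, Month): departures per airport per month
--   (Origin):        departures per airport over both months (Month is NULL)
--   ():              grand total (Origin and Month are both NULL)
SELECT Origin, Month, COUNT(*) AS Total_Departures
FROM flights_stats
WHERE Cancelled = 0 AND Month IN (3,4)
GROUP BY Origin, Month
GROUPING SETS ((Origin, Month), (Origin), ());
```

The top-10 query needs only a single grouping, by airport, so there is nothing for GROUPING SETS to combine.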

Answer 3: (score: 0)

You can try this, though another approach may turn out to be better, so you really need to test it and report back:

SELECT airports.airport,
SUM(
  CASE 
     WHEN T1.FlightsNum IS NOT NULL THEN 1
     WHEN T2.FlightsNum IS NOT NULL THEN 1
     ELSE 0
  END 
  ) AS Total_Flights
FROM airports
LEFT JOIN (SELECT  Origin AS Airport, FlightsNum 
    FROM flights_stats
   WHERE (Cancelled = 0 AND Month IN (3,4))) t1 
 on t1.Airport = airports.iata
LEFT JOIN (SELECT Dest AS Airport, FlightsNum 
   FROM flights_stats
   WHERE (Cancelled = 0 AND Month IN (3,4))) t2
 on t2.Airport = airports.iata
GROUP BY airports.airport
ORDER BY Total_Flights DESC
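
One thing to check when testing: joining both unaggregated subqueries to the same airport row pairs each origin flight with each destination flight for that airport, so the SUM counts row combinations rather than flights. A variant that avoids this by aggregating each side before the join might look like the sketch below. The aliases `a`, `o`, `d`, and `cnt` are introduced here for illustration, `COUNT(*)` is assumed to be acceptable in place of `COUNT(FlightsNum)`, and the query has not been benchmarked against the other answers.

```sql
-- Each subquery yields at most one row per airport code, so the joins cannot fan out.
-- Note: this returns one row per iata code rather than grouping by airport name.
SELECT a.airport,
       COALESCE(o.cnt, 0) + COALESCE(d.cnt, 0) AS Total_Flights
FROM airports a
LEFT JOIN (SELECT Origin AS iata, COUNT(*) AS cnt   -- departures per airport code
             FROM flights_stats
            WHERE Cancelled = 0 AND Month IN (3,4)
            GROUP BY Origin) o
  ON o.iata = a.iata
LEFT JOIN (SELECT Dest AS iata, COUNT(*) AS cnt     -- arrivals per airport code
             FROM flights_stats
            WHERE Cancelled = 0 AND Month IN (3,4)
            GROUP BY Dest) d
  ON d.iata = a.iata
WHERE a.country = 'USA'
ORDER BY Total_Flights DESC
LIMIT 10;
```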