How can I optimize a query over a large data set?

Time: 2017-04-21 16:22:17

Tags: performance hadoop hive bigdata

My original query -

CREATE TABLE admin.FctPrfitAmt_rpt AS 
SELECT rcn.* FROM 
(SELECT t1.* FROM (SELECT * FROM admin.FctPrfitAmt t2 WHERE t2.scenario_id NOT IN(SELECT DISTINCT t3.scenario_id FROM admin.FctPrfitAmt_incr t3)
UNION ALL
SELECT * FROM admin.FctPrfitAmt_incr) t1) rcn;

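The problem is that this query currently takes a long time, because a large number of records is involved.

Is there a way to tune this query?

I tried the following approach, but it does not work -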

CREATE TABLE admin.FctPrfitAmt_rpt AS
SELECT * FROM admin.FctPrfitAmt t2 
WHERE t2.scenario_id NOT exists (SELECT 1 from  admin.FctPrfitAmt_incr t3 where t2.scenario_id = t3.scenario_id)
UNION ALL
SELECT * FROM admin.FctPrfitAmt_incr 

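Error - it looks like "NOT EXISTS" is not supported by my Hive version, so with my approach I get the following error:

Error while compiling statement: FAILED: ParseException line 3:25 cannot recognize input near 'NOT' 'exists' '(' in expression specification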

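3 answers:

Answer 0: (score: 2)

It is better to LEFT JOIN the two tables in the 'select in' part and filter out the rows whose join key is not NULL, so that only the rows without a match in the second table remain.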
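
A minimal sketch of that idea, using the table and column names from the question. It is an illustration under assumptions, not a tested replacement: the DISTINCT subquery is added here to keep the join from multiplying rows, and the outer `rcn` wrapper is kept from the original query in case the Hive version in use only accepts UNION ALL inside a subquery.

```sql
-- Sketch only: a LEFT OUTER JOIN anti-join in place of NOT IN.
CREATE TABLE admin.FctPrfitAmt_rpt AS
SELECT rcn.*
FROM (
    -- Rows of the main table whose scenario_id has no match in the incremental table
    SELECT t2.*
    FROM admin.FctPrfitAmt t2
    LEFT OUTER JOIN (
        -- De-duplicate the keys first (assumption: added so the join cannot multiply rows)
        SELECT DISTINCT scenario_id
        FROM admin.FctPrfitAmt_incr
    ) t3
        ON t2.scenario_id = t3.scenario_id
    -- Keep only the unmatched rows, i.e. drop rows where the join key is not NULL
    WHERE t3.scenario_id IS NULL

    UNION ALL

    -- All rows of the incremental table
    SELECT * FROM admin.FctPrfitAmt_incr
) rcn;
```

One difference to be aware of: rows in admin.FctPrfitAmt whose scenario_id is NULL are kept by this join form, whereas a NOT IN predicate would drop them, so the two are only equivalent if scenario_id is never NULL.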


Answer 1: (score: 2)

  1. Your syntax is wrong. NOT EXISTS should not be preceded by t2.scenario_id.
  2. As we have seen, scenario_id is skewed in both tables, which creates a huge product on the join.

    select  * 
    from    admin.FctPrfitAmt   pa
    where   not exists 
            (
                select  null
    
                from   (select  distinct 
                                pfa.scenario_id 
    
                        from    admin.FctPrfitAmt_incr  pfa
                        ) pfa
    
                where   pfa.scenario_id = 
                        pa.scenario_id
            )
    
    union all
    
    select  * 
    from    admin.FctPrfitAmt_incr 
    

Answer 2: (score: 0)

select  *

from   (select  *
               ,max(tab) over (partition by scenario_id) as max_tab

        from    (           select *,1 as tab from master.FctPrfitAmt
                union all   select *,2 as tab from master.FctPrfitAmt_incr
                ) t
        ) t

where   tab     = 2
     or max_tab = 1
;

If all your data consists of primitive types (no arrays, maps, etc.),
you can use the following query:

select  inline(array(original_rowset))

from   (select  original_rowset
               ,tab
               ,max(tab) over (partition by scenario_id) as max_tab

        from    (           select struct(*) as original_rowset,scenario_id,1 as tab from FctPrfitAmt
                union all   select struct(*) as original_rowset,scenario_id,2 as tab from FctPrfitAmt_incr
                ) t
        ) t

where   tab     = 2
     or max_tab = 1