Question

How can I improve performance when using ROW_NUMBER() with PARTITION BY in a Hive query?

    select *
    from
    (
    SELECT
                      '123'                                                                         AS run_session_id
                    , tbl1.transaction_id
                    , tbl1.src_transaction_id
                    , tbl1.transaction_created_epoch_time
                    , tbl1.currency
                    , tbl1.event_type
                    , tbl1.event_sub_type
                    , tbl1.estimated_total_cost
                    , tbl1.actual_total_cost
                    , tbl1.tfc_export_created_epoch_time
                    , tbl1.authorizer
                    , tbl1.acquirer
                    , tbl1.processor
                    , tbl1.company_code
                    , tbl1.country_of_account
                    , tbl1.merchant_id
                    , tbl1.client_id
                    , tbl1.ft_id
                    , tbl1.transaction_created_date
                    , tbl1.event_pst_time
                    , tbl1.extract_id_seq
                    , tbl1.src_type
                    , ROW_NUMBER() OVER (PARTITION BY tbl1.transaction_id ORDER BY tbl1.event_pst_time DESC)   AS seq_num       -- when writing back to the pfit events table, write each event so that event_pst_time is populated correctly

                  FROM nest.nest_cost_events tbl1                                --<hiveFinalDB>--                           -- DB variables won't work, so the DB name needs to be changed accordingly for testing and PROD deployment
                  WHERE extract_id_seq     BETWEEN 275 - 60
                                           AND 275
                    AND event_type    IN ('ACT','CBR','SKU','CAL','KIT','BXT')) tbl1
    where seq_num=1;

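This table is partitioned by src_type. The query currently takes 20 minutes to process 154M records. I would like to bring that down to 10 minutes.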
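One direction that might be worth testing is replacing the window function with an aggregation over a struct, so that each transaction_id is reduced in a GROUP BY rather than ranked and then filtered. The following is only a sketch under assumptions, not a measured improvement: it assumes the Hive version in use supports MAX over struct values, it shows only a few of the columns (the rest would be added to the struct in the same way), and the alias names `latest` and `t` are made up for illustration. Note that ties on event_pst_time are resolved by the remaining struct fields instead of arbitrarily, so results can differ from the ROW_NUMBER() version when two events share the same event_pst_time.

    -- Sketch only: pick the latest event per transaction_id with MAX over a struct
    -- instead of ROW_NUMBER(). Structs compare field by field, so event_pst_time
    -- must be the first field.
    SELECT
          '123'                          AS run_session_id
        , t.transaction_id
        , t.latest.event_pst_time        AS event_pst_time
        , t.latest.event_type            AS event_type
        , t.latest.actual_total_cost     AS actual_total_cost
    FROM
    (
        SELECT
              transaction_id
            , MAX(NAMED_STRUCT(
                  'event_pst_time',    event_pst_time
                , 'event_type',        event_type
                , 'actual_total_cost', actual_total_cost
              ))                         AS latest      -- hypothetical alias
        FROM nest.nest_cost_events
        WHERE extract_id_seq BETWEEN 275 - 60 AND 275
          AND event_type IN ('ACT','CBR','SKU','CAL','KIT','BXT')
        GROUP BY transaction_id
    ) t;

Since the table is partitioned by src_type and the query has no predicate on that column, every partition is scanned. If only some src_type values are actually needed (an assumption, nothing above confirms it), adding a filter on src_type to the WHERE clause would let Hive prune partitions.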

Any suggestions?

Thanks

row_number partition-by performance

0 answers: