HiveQL to PySpark DataFrame code takes 3-4 hours

Time: 2018-12-19 20:10:17

Tags: apache-spark dataframe join rank

The following HiveQL code takes roughly 3 to 4 hours to run, and I am trying to convert it into efficient PySpark DataFrame code. Any input from DataFrame experts is appreciated.

INSERT OVERWRITE TABLE dlstage.DIBQtyRank_C11 PARTITION(fiscalyearmonth)
SELECT * FROM
    (SELECT a.matnr, a.werks, a.periodstartdate, a.fiscalyear, a.fiscalmonth, b.dy_id, MaterialType,
            COALESCE(a.salk3, 0) salk3, COALESCE(a.lbkum, 0) lbkum,
            SUM(a.valuatedquantity) AS valuatedquantity, SUM(a.InventoryValue) AS InventoryValue,
            RANK() OVER (PARTITION BY dy_id, werks, matnr ORDER BY a.max_date DESC) rnk,
            SUM(stprs) stprs, MAX(peinh) peinh, fcurr, fiscalyearmonth
     FROM dlstage.DIBmsegFinal a
     LEFT JOIN dlaggr.dim_fiscalcalendar b ON a.periodstartdate = b.fmth_begin_dte
     WHERE a.max_date >= b.fmth_begin_dte AND a.max_date <= b.dy_id
       AND fiscalyearmonth = CONCAT(fyr_id, LPAD(fmth_nbr, 2, 0))
     GROUP BY a.matnr, a.werks, dy_id, max_date, a.periodstartdate, a.fiscalyear, a.fiscalmonth,
              MaterialType, fcurr, COALESCE(a.salk3, 0), COALESCE(a.lbkum, 0), fiscalyearmonth) a
WHERE a.rnk = 1 AND a.fiscalYear = '%s'" %(year) + " AND a.fiscalmonth = '%s'" %(mnth)

0 Answers:

There are no answers yet.