Converting aggregation and window functions from Spark SQL to Hive

Time: 2018-12-24 15:38:02

Tags: hive apache-spark-sql hiveql

I am converting a Spark SQL script into a Hive script, but I am running into problems translating the aggregation and window functions.

The Spark SQL script I am trying to convert is:

val new_dataset = old_dataset
  .withColumn("flag_1", when($"flag_1".isNull, lit(0)).otherwise($"flag_1"))
  .withColumn("flag_2", when($"flag_2".isNull, lit(0)).otherwise($"flag_2"))
  .withColumn("flag_3", when($"flag_3".isNull, lit(0)).otherwise($"flag_3"))

val userWindow_1 = Window.partitionBy("person_name").orderBy("start_date","close_date")
val userWindow_2 = Window.partitionBy("person_name","session")

val new_dataset_session = (coalesce(datediff($"start_date", lag($"close_date", 1).over(userWindow_1)),lit(0)) > 1).cast("bigint")

val new_dataset_session_sum = new_dataset.withColumn("session", sum(new_dataset_session).over(userWindow_1))
val newresult = new_dataset_session_sum
  .withColumn("begin_date", min($"start_date").over(userWindow_2))
  .withColumn("end_date", max($"close_date").over(userWindow_2))
  .withColumn("flag_1", max($"flag_1").over(userWindow_2))
  .withColumn("flag_2", max($"flag_2").over(userWindow_2))
  .withColumn("flag_3", max($"flag_3").over(userWindow_2))

val new_dataset_session_agg = newresult
  .groupBy("person_name", "session")
  .agg(
    min("begin_date").as("begin_date"),
    max("end_date").as("end_date"),
    max("flag_1").as("flag_1"),
    max("flag_2").as("flag_2"),
    max("flag_3").as("flag_3"))
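The logic above is the classic gaps-and-islands sessionization pattern: flag a new session whenever `start_date` is more than one day after the previous row's `close_date` (the `lag`/`datediff`/`coalesce` expression), then take a running sum of those 0/1 flags as the session id. A minimal pure-Python sketch of the same idea, using hypothetical sample rows and no Spark:

```python
from datetime import date

# Hypothetical sample rows for one person: (start_date, close_date).
rows = [
    (date(2018, 1, 1), date(2018, 1, 2)),
    (date(2018, 1, 3), date(2018, 1, 4)),    # gap of 1 day  -> same session
    (date(2018, 1, 10), date(2018, 1, 11)),  # gap of 6 days -> new session
]

sessions = []
session_id = 0
prev_close = None  # plays the role of lag(close_date) over the window
for start, close in sorted(rows):
    gap = (start - prev_close).days if prev_close is not None else 0
    if gap > 1:           # same test as datediff(start_date, prev_close) > 1
        session_id += 1   # running sum of the 0/1 flags
    sessions.append(session_id)
    prev_close = close

print(sessions)  # -> [0, 0, 1]
```

Once every row carries a session id, the per-session `min`/`max` aggregates in `new_dataset_session_agg` become a plain group-by.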

I have translated this part into a Hive script as follows:

new_dataset AS (
SELECT DISTINCT 
    person_name, 
    person_id, 
    start_date,
    close_date,
    nvl(flag_1, 0) AS flag_1,        
    nvl(flag_2, 0) AS flag_2, 
    nvl(flag_3, 0) AS flag_3  
FROM old_dataset 
), 
new_dataset_session AS (      
SELECT 
    person_name, 
    person_id, 
    start_date, 
    close_date, 
    prev_enddate, 
    DATEDIFF(start_date, prev_enddate) AS date_diff, 
    CASE WHEN DATEDIFF(start_date, prev_enddate) > 1 THEN 1 ELSE 0 END datediff_flag 
FROM 
    (
    SELECT person_name, person_id, start_date, close_date, lag(close_date) over(PARTITION BY person_name ORDER BY start_date,close_date) AS prev_enddate
    FROM new_dataset  
    ) t
),
new_dataset_session_sum AS (      
SELECT 
    person_name, 
    person_id,  
    SUM(date_diff) OVER (PARTITION BY person_name ORDER BY start_date, close_date ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS sum_datediff_flg
FROM new_dataset_session 
WHERE datediff_flag = '1'
), 
newresult AS (      
SELECT 
    e.person_name, 
    e.start_date, 
    e.close_date, 
    e.person_id, 
    e.flag_1,        
    e.flag_2, 
    e.flag_3, 
    f.sum_datediff_flg  
FROM new_dataset e 
JOIN new_dataset_session_sum f ON (e.person_name = f.person_name AND e.person_id = f.person_id) 
),
new_dataset_session_agg AS (      
SELECT 
    person_name, 
    person_id, 
    MIN (start_date) AS begin_date, 
    MAX (close_date) AS end_date, 
    MAX (flag_1) AS flag_1,        
    MAX (flag_2) AS flag_2, 
    MAX (flag_3) AS flag_3, 
    sum_datediff_flg 
FROM newresult 
GROUP BY person_name, 
        person_id, 
        sum_datediff_flg
),
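One detail worth checking when comparing the two versions is the window frame: Spark's `sum(...).over(userWindow_1)` with an `orderBy` defaults to a running frame (UNBOUNDED PRECEDING to CURRENT ROW), while the Hive query above spells out `ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING`, which gives every row the whole partition's total. A pure-Python illustration of the two frames, with hypothetical flag values:

```python
# Hypothetical 0/1 flags within one partition, already ordered.
flags = [0, 1, 0, 1, 0]

# ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
# (the default frame when an ORDER BY is present): a running sum.
running = []
total = 0
for f in flags:
    total += f
    running.append(total)

# ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING:
# every row sees the entire partition's total.
whole = [sum(flags)] * len(flags)

print(running)  # -> [0, 1, 1, 2, 2]
print(whole)    # -> [2, 2, 2, 2, 2]
```

The running frame assigns a distinct id per island of rows; the full-partition frame collapses them all to a single value.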

It does not seem to be converted correctly. What am I missing here?

0 answers:

There are no answers.