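I am converting a Spark SQL script to a Hive script, but I am running into problems translating the aggregation and window functions.
The Spark SQL script I am trying to convert is: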
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
// The $"col" syntax also needs the session's implicits in scope (e.g. import spark.implicits._)

// Replace NULL flags with 0
val new_dataset = old_dataset
  .withColumn("flag_1", when($"flag_1".isNull, lit(0)).otherwise($"flag_1"))
  .withColumn("flag_2", when($"flag_2".isNull, lit(0)).otherwise($"flag_2"))
  .withColumn("flag_3", when($"flag_3".isNull, lit(0)).otherwise($"flag_3"))

val userWindow_1 = Window.partitionBy("person_name").orderBy("start_date", "close_date")
val userWindow_2 = Window.partitionBy("person_name", "session")

// 1 when the gap to the previous close_date is more than one day, else 0
val new_dataset_session =
  (coalesce(datediff($"start_date", lag($"close_date", 1).over(userWindow_1)), lit(0)) > 1).cast("bigint")

// Running sum of that 0/1 value gives a session number per person
val new_dataset_session_sum = new_dataset.withColumn("session", sum(new_dataset_session).over(userWindow_1))

val newresult = new_dataset_session_sum
  .withColumn("begin_date", min($"start_date").over(userWindow_2))
  .withColumn("end_date", max($"close_date").over(userWindow_2))
  .withColumn("flag_1", max($"flag_1").over(userWindow_2))
  .withColumn("flag_2", max($"flag_2").over(userWindow_2))
  .withColumn("flag_3", max($"flag_3").over(userWindow_2))

// One row per (person_name, session)
val new_dataset_session_agg = newresult
  .groupBy("person_name", "session")
  .agg(
    min("begin_date").as("begin_date"),
    max("end_date").as("end_date"),
    max("flag_1").as("flag_1"),
    max("flag_2").as("flag_2"),
    max("flag_3").as("flag_3"))
I have converted this part to a Hive script as follows:
new_dataset AS (
SELECT DISTINCT
person_name,
person_id,
start_date,
close_date,
nvl(flag_1, 0) AS flag_1,
nvl(flag_2, 0) AS flag_2,
nvl(flag_3, 0) AS flag_3
FROM old_dataset
),
new_dataset_session AS (
SELECT
person_name,
person_id,
start_date,
close_date,
prev_enddate,
DATEDIFF(start_date, prev_enddate) AS date_diff,
CASE WHEN DATEDIFF(start_date, prev_enddate) > 1 THEN 1 ELSE 0 END datediff_flag
FROM
(
SELECT person_name, person_id, start_date, close_date, lag(close_date) over(PARTITION BY person_name ORDER BY start_date,close_date) AS prev_enddate
FROM new_dataset
) t
),
new_dataset_session_sum AS (
SELECT
person_name,
person_id,
SUM(date_diff) OVER (PARTITION BY person_name ORDER BY start_date, close_date ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS sum_datediff_flg
FROM new_dataset_session
WHERE datediff_flag = '1'
),
newresult AS (
SELECT
e.person_name,
e.start_date,
e.close_date,
e.person_id,
e.flag_1,
e.flag_2,
e.flag_3,
f.sum_datediff_flg
FROM new_dataset e
JOIN new_dataset_session_sum f ON (e.person_name = f.person_name AND e.person_id = f.person_id)
),
new_dataset_session_agg AS (
SELECT
person_name,
person_id,
MIN (start_date) AS begin_date,
MAX (close_date) AS end_date,
MAX (flag_1) AS flag_1,
MAX (flag_2) AS flag_2,
MAX (flag_3) AS flag_3,
sum_datediff_flg
FROM newresult
GROUP BY person_name,
person_id,
sum_datediff_flg
),
It does not seem to be converted correctly. What am I missing here?
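
For comparison, here is a minimal sketch of how the Spark steps appear to map onto HiveQL, step for step. It is not a verified translation and has not been run against real data. The CTE name `flagged` and the columns `prev_close_date` and `new_session_flag` are made-up names for illustration. `person_id` is left out on the assumption that the grouping should match the Spark `groupBy("person_name", "session")`. The explicit `ROWS ... CURRENT ROW` frame is meant to approximate the running frame that Spark applies by default to an ordered window.

```sql
WITH new_dataset AS (
  -- Same as the three withColumn(... when isNull ...) calls: NULL flags become 0
  SELECT
    person_name,
    start_date,
    close_date,
    COALESCE(flag_1, 0) AS flag_1,
    COALESCE(flag_2, 0) AS flag_2,
    COALESCE(flag_3, 0) AS flag_3
  FROM old_dataset
),
flagged AS (
  -- Hypothetical helper CTE: 1 when the gap to the previous close_date exceeds one day.
  -- The first row per person has no previous row, so COALESCE turns its NULL gap into 0.
  SELECT
    t.*,
    CASE
      WHEN COALESCE(DATEDIFF(start_date, prev_close_date), 0) > 1 THEN 1
      ELSE 0
    END AS new_session_flag
  FROM (
    SELECT
      n.*,
      LAG(close_date, 1) OVER (
        PARTITION BY person_name
        ORDER BY start_date, close_date
      ) AS prev_close_date
    FROM new_dataset n
  ) t
),
new_dataset_session_sum AS (
  -- Running sum of the 0/1 flag (not of the day difference) over every row,
  -- so each row receives a session number
  SELECT
    f.*,
    SUM(new_session_flag) OVER (
      PARTITION BY person_name
      ORDER BY start_date, close_date
      ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
    ) AS session
  FROM flagged f
)
-- Equivalent of groupBy("person_name", "session").agg(...)
SELECT
  person_name,
  session,
  MIN(start_date) AS begin_date,
  MAX(close_date) AS end_date,
  MAX(flag_1)     AS flag_1,
  MAX(flag_2)     AS flag_2,
  MAX(flag_3)     AS flag_3
FROM new_dataset_session_sum
GROUP BY person_name, session;
```

In this sketch the session number is a running sum of the flag across all rows, with no `WHERE` filter and no join back to `new_dataset`, so every input row keeps exactly one `session` value before the final `GROUP BY`. The windowed `min`/`max` step from `newresult` is folded into that `GROUP BY`, since it produces the same per-session values.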