Anything in Spark SQL similar to CONDITIONAL_CHANGE_EVENT?

Date: 2018-05-14 15:21:23

Tags: apache-spark apache-spark-sql

Vertica has an analytic function called CONDITIONAL_CHANGE_EVENT, documented here:

https://my.vertica.com/docs/8.1.x/HTML/index.htm#Authoring/SQLReferenceManual/Functions/Analytic/CONDITIONAL_CHANGE_EVENTAnalytic.htm%3FTocPath%3DSQL%2520Reference%2520Manual%7CSQL%2520Functions%7CAnalytic%2520Functions%7C_____9
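For context, a minimal sketch of how the Vertica function might be applied to the sample data below. The table name events is hypothetical; per the linked documentation, the counter starts at 0 and increments each time the expression's value changes from one row to the next:

-- Group_ID increments whenever Channel changes within the partition.
SELECT Session_ID, Device_ID, Channel_Time, Channel,
       CONDITIONAL_CHANGE_EVENT(Channel)
           OVER (PARTITION BY Session_ID, Device_ID
                 ORDER BY Channel_Time) AS Group_ID
FROM events;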

I am wondering whether there is a simple way to mimic this functionality in Spark. See the simple example below.

Raw data

Session_ID,Device_ID,Channel_Time,Channel
1,1,4/9/2018 15:00:00,A
1,1,4/9/2018 15:01:00,A
1,1,4/9/2018 15:02:00,B
1,1,4/9/2018 15:03:00,B
1,1,4/9/2018 15:04:00,B
1,1,4/9/2018 15:05:00,C
1,1,4/9/2018 15:06:00,C
1,1,4/9/2018 15:07:00,A
1,1,4/9/2018 15:08:00,A
1,1,4/9/2018 15:09:00,B
1,1,4/9/2018 15:10:00,B

I want to generate a group-id for the above input by applying something like rank(channel) over (partition by Session_ID, Device_ID order by Channel_Time), which is not available in Spark.

Expected output

Session_ID,Device_ID,Channel_Time,Channel,Group-ID
1,1,4/9/2018 15:00:00,A,1
1,1,4/9/2018 15:01:00,A,1
1,1,4/9/2018 15:02:00,B,2
1,1,4/9/2018 15:03:00,B,2
1,1,4/9/2018 15:04:00,B,2
1,1,4/9/2018 15:05:00,C,3
1,1,4/9/2018 15:06:00,C,3
1,1,4/9/2018 15:07:00,A,4
1,1,4/9/2018 15:08:00,A,4
1,1,4/9/2018 15:09:00,B,5
1,1,4/9/2018 15:10:00,B,5

To achieve this I have to do four transformations, as below. Is there a better/simpler way to do it?

Spark code

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class ConditionalTrueEvent {

    public static void main(String[] args) {
        SparkSession sparkSession = SparkSession.builder()
                .appName(ConditionalTrueEvent.class.getName())
                .master("local[*]").getOrCreate();

        // Load the raw events and register them as a SQL view.
        Dataset<Row> eventsDataSet = sparkSession.read()
                .option("header", "true")
                .csv("D:\\dev\\workspace\\java\\simple-kafka\\data\\test.csv");
        eventsDataSet.createOrReplaceTempView("rawView");
        sparkSession.sql("select * from rawView").show();

        // Step 1: number the rows per (Session_ID, Device_ID) and flag each row
        // whose Channel differs from the previous row's. The 'XXX' default for
        // lag() makes the first row of every partition count as a change.
        Dataset<Row> channel_changed = sparkSession.sql("select * , " +
                " row_number() over group_1 as row_number_by_session_device , " +
                " (case when (lag(Channel,1,'XXX') over group_1 != Channel) then 1 else 0 end ) as channel_changed " +
                " from rawView " +
                "window group_1 as (partition by Session_ID , Device_ID order by Channel_Time )");
        channel_changed.createOrReplaceTempView("channel_changed");

        // Step 2: keep the row number only on change rows; all other rows get 0.
        Dataset<Row> channel_changed_filled = sparkSession.sql("select * , " +
                " ( case when channel_changed = 1 then row_number_by_session_device else 0 end ) as channel_changed_filled_row_num " +
                " from channel_changed ");
        channel_changed_filled.createOrReplaceTempView("channel_changed_filled");

        // Step 3: a running max over the ordered window carries the row number of
        // the most recent change forward, so every row gets its group's id.
        Dataset<Row> channel_changed_final = sparkSession.sql("select * , " +
                " ( case when channel_changed_filled_row_num = 0 then max(channel_changed_filled_row_num) over group_1 else channel_changed_filled_row_num end ) as Group_ID " +
                " from channel_changed_filled " +
                "window group_1 as (partition by Session_ID , Device_ID order by Channel_Time )");
        channel_changed_final.createOrReplaceTempView("channel_changed_final");
        channel_changed_final.show();

        sparkSession.close();
    }
}
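For comparison, the flag-then-running-aggregate idea above can often be collapsed into a single statement by summing the change flags in a subquery. A minimal, unverified sketch against the rawView registered in the code; on the sample data the running SUM yields exactly the expected Group-IDs 1,1,2,2,2,3,3,4,4,5,5:

-- Flag rows where Channel differs from the previous row (the first row of
-- each partition gets 1 because LAG() is NULL there), then take a running
-- sum of the flags to obtain the group id.
SELECT Session_ID, Device_ID, Channel_Time, Channel,
       SUM(changed) OVER (PARTITION BY Session_ID, Device_ID
                          ORDER BY Channel_Time) AS Group_ID
FROM (
    SELECT *,
           CASE WHEN LAG(Channel) OVER (PARTITION BY Session_ID, Device_ID
                                        ORDER BY Channel_Time) = Channel
                THEN 0 ELSE 1 END AS changed
    FROM rawView
) flagged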

0 Answers:

There are no answers yet.