Vertica has an analytic function called CONDITIONAL_CHANGE_EVENT, which assigns an event counter to each row of a window and increments it whenever the value of its argument expression changes from one row to the next. I would like to know whether there is a simple way to mimic this in Spark. See the simple example below.
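For reference, the Vertica call would look roughly like this (a sketch; the table name raw_events is a placeholder, and CONDITIONAL_CHANGE_EVENT starts counting at 0, so it sits one below the Group-ID values in the expected output further down):

SELECT Session_ID, Device_ID, Channel_Time, Channel,
       CONDITIONAL_CHANGE_EVENT(Channel)
           OVER (PARTITION BY Session_ID, Device_ID ORDER BY Channel_Time) AS Group_ID
FROM raw_events;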
Raw data
Session_ID,Device_ID,Channel_Time,Channel
1,1,4/9/2018 15:00:00,A
1,1,4/9/2018 15:01:00,A
1,1,4/9/2018 15:02:00,B
1,1,4/9/2018 15:03:00,B
1,1,4/9/2018 15:04:00,B
1,1,4/9/2018 15:05:00,C
1,1,4/9/2018 15:06:00,C
1,1,4/9/2018 15:07:00,A
1,1,4/9/2018 15:08:00,A
1,1,4/9/2018 15:09:00,B
1,1,4/9/2018 15:10:00,B
I want to generate a group-id for the above input by applying something like conditional_change_event(Channel) over (partition by Session_ID, Device_ID order by Channel_Time), which is not available in Spark.

Expected output
Session_ID,Device_ID,Channel_Time,Channel,Group-ID
1,1,4/9/2018 15:00:00,A,1
1,1,4/9/2018 15:01:00,A,1
1,1,4/9/2018 15:02:00,B,2
1,1,4/9/2018 15:03:00,B,2
1,1,4/9/2018 15:04:00,B,2
1,1,4/9/2018 15:05:00,C,3
1,1,4/9/2018 15:06:00,C,3
1,1,4/9/2018 15:07:00,A,4
1,1,4/9/2018 15:08:00,A,4
1,1,4/9/2018 15:09:00,B,5
1,1,4/9/2018 15:10:00,B,5
To achieve this I had to perform the four transformations shown below. Is there a better or simpler way to do it?

My Spark code
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class ConditionalTrueEvent {
    public static void main(String[] args) {
        SparkSession sparkSession = SparkSession.builder()
                .appName(ConditionalTrueEvent.class.getName())
                .master("local[*]")
                .getOrCreate();

        Dataset<Row> eventsDataSet = sparkSession.read()
                .option("header", "true")
                .csv("D:\\dev\\workspace\\java\\simple-kafka\\data\\test.csv");
        eventsDataSet.createOrReplaceTempView("rawView");
        sparkSession.sql("select * from rawView").show();

        // Step 1: number the rows per (Session_ID, Device_ID) and flag every row
        // whose Channel differs from the previous row; the first row of each
        // partition is always flagged because lag() falls back to 'XXX'.
        Dataset<Row> channel_changed = sparkSession.sql("select * , " +
                " row_number() over group_1 as row_number_by_session_device , " +
                " (case when (lag(Channel,1,'XXX') over group_1 != Channel) then 1 else 0 end ) as channel_changed " +
                " from rawView " +
                "window group_1 as (partition by Session_ID , Device_ID order by Channel_Time )");
        channel_changed.createOrReplaceTempView("channel_changed");

        // Step 2: keep the row number only where the channel changed, 0 elsewhere.
        Dataset<Row> channel_changed_filled = sparkSession.sql("select * , " +
                " ( case when channel_changed = 1 then row_number_by_session_device else 0 end ) as channel_changed_filled_row_num " +
                " from channel_changed ");
        channel_changed_filled.createOrReplaceTempView("channel_changed_filled");

        // Step 3: replace the zeros with the running max, i.e. the row number of
        // the most recent change, so every row of a run shares one Group_ID.
        Dataset<Row> channel_changed_final = sparkSession.sql("select * , " +
                " ( case when channel_changed_filled_row_num = 0 then max(channel_changed_filled_row_num) over group_1 else channel_changed_filled_row_num end ) as Group_ID " +
                " from channel_changed_filled " +
                "window group_1 as (partition by Session_ID , Device_ID order by Channel_Time )");
        channel_changed_final.createOrReplaceTempView("channel_changed_final");
        channel_changed_final.show();

        sparkSession.close();
    }
}
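If I read the default window frame right, the three SQL steps above produce Group_ID values that are distinct per run but not consecutive (1, 1, 3, 3, 3, 6, 6, 8, 8, 10, 10 for the sample data, since each group inherits the row number of its first row). One common way to get the consecutive numbering of the expected output in a single statement is a running SUM over the change flag; a sketch against the rawView registered above, not tested:

select *,
       sum(channel_changed) over w as Group_ID
from (
    select *,
           case when lag(Channel, 1, 'XXX') over w != Channel then 1 else 0 end as channel_changed
    from rawView
    window w as (partition by Session_ID, Device_ID order by Channel_Time)
) flagged
window w as (partition by Session_ID, Device_ID order by Channel_Time)

Because the first row of each partition is flagged as a change, the running sum starts at 1 and then increments exactly when Channel changes, matching the expected Group-ID column. The statement can be run as-is through sparkSession.sql(...).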