Spark SQL: assign a sequence number to each sub-group

Date: 2017-08-09 19:32:13

Tags: apache-spark apache-spark-sql

The table below shows a user watching TV channels. In Spark SQL, I want to create a column named "CHANGE_SEQUENCE" with the following logic:

CASE WHEN ROW_NUMBER = 1
    THEN 1
    ELSE CASE WHEN CHANNEL_CHANGED = true
        THEN (lag(CHANGE_SEQUENCE, 1, CHANGE_SEQUENCE) over (partition by CUSTOMER_ID order by ROW_NUMBER)) + 1
        ELSE lag(CHANGE_SEQUENCE, 1, CHANGE_SEQUENCE) over (partition by CUSTOMER_ID order by ROW_NUMBER)
    END
END AS CHANGE_SEQUENCE

(This is pseudo-code only: lag cannot reference CHANGE_SEQUENCE, the very column being defined, so the expression cannot run as written.)

Input
CUSTOMER_ID | TV_CHANNEL_ID | PREV_CHANNEL_ID | ROW_NUMBER | CHANNEL_CHANGED
1 | 100 | NULL | 1 | FALSE
1 | 100 | 100 | 2 | FALSE
1 | 100 | 100 | 3 | FALSE
1 | 200 | 100 | 4 | TRUE
1 | 200 | 200 | 5 | FALSE
1 | 200 | 200 | 6 | FALSE
1 | 300 | 200 | 7 | TRUE
1 | 300 | 300 | 8 | FALSE
1 | 300 | 300 | 9 | FALSE

Output
CUSTOMER_ID | TV_CHANNEL_ID | PREV_CHANNEL_ID | ROW_NUMBER | CHANNEL_CHANGED | CHANGE_SEQUENCE
1 | 100 | NULL | 1 | FALSE | 1
1 | 100 | 100 | 2 | FALSE | 1
1 | 100 | 100 | 3 | FALSE | 1
1 | 200 | 100 | 4 | TRUE | 2
1 | 200 | 200 | 5 | FALSE | 2
1 | 200 | 200 | 6 | FALSE | 2
1 | 300 | 200 | 7 | TRUE | 3
1 | 300 | 300 | 8 | FALSE | 3
1 | 300 | 300 | 9 | FALSE | 3

Since the customer watches channel 100 for the first 3 records, those rows share one sequence number; likewise for channel 200 and channel 300.
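
For reference, the input above can be rebuilt as a DataFrame for testing; a minimal sketch (the session setup and the inferred schema are assumptions, not part of the original post):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('change-sequence-demo').getOrCreate()

# Rows copied from the Input table above
rows = [
    (1, 100, None, 1, False),
    (1, 100, 100, 2, False),
    (1, 100, 100, 3, False),
    (1, 200, 100, 4, True),
    (1, 200, 200, 5, False),
    (1, 200, 200, 6, False),
    (1, 300, 200, 7, True),
    (1, 300, 300, 8, False),
    (1, 300, 300, 9, False),
]
df = spark.createDataFrame(
    rows,
    ['CUSTOMER_ID', 'TV_CHANNEL_ID', 'PREV_CHANNEL_ID', 'ROW_NUMBER', 'CHANNEL_CHANGED'])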

Please advise.

1 Answer:

Answer 0 (score: 1)

In pyspark:

from pyspark import SparkConf
from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as f
from pyspark.sql.types import IntegerType

cfg = SparkConf().setAppName('s')
spark = SparkSession.builder.config(conf=cfg).getOrCreate()

# header=True so the columns from the tables above are available by name
df = spark.read.csv('...', header=True)

# Rows that start a new sequence (first row or a channel change) get a
# running number per customer
df_p1 = df.filter((df['ROW_NUMBER'] == '1') | (df['CHANNEL_CHANGED'] == 'TRUE')) \
    .withColumn('CHANGE_SEQUENCE',
                f.row_number().over(Window.partitionBy('CUSTOMER_ID').orderBy('ROW_NUMBER')))

# All other rows get a null placeholder (cast so the union schemas match)
df_p2 = df.filter((df['ROW_NUMBER'] != '1') & (df['CHANNEL_CHANGED'] == 'FALSE')) \
    .withColumn('CHANGE_SEQUENCE', f.lit(None).cast(IntegerType()))

# collect_set over the ordered window gathers every sequence number seen so
# far; the current row's value is the largest of them (use max, since the
# order of elements in the collected set is not guaranteed)
df = df_p1.union(df_p2) \
    .withColumn('CHANGE_SEQUENCE',
                f.collect_set('CHANGE_SEQUENCE').over(Window.partitionBy('CUSTOMER_ID').orderBy('ROW_NUMBER'))) \
    .withColumn('CHANGE_SEQUENCE', f.udf(lambda x: max(x), IntegerType())('CHANGE_SEQUENCE'))
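
The union-plus-UDF pipeline above can also be replaced by a single conditional running sum over the same window; a minimal sketch, assuming CHANNEL_CHANGED is a boolean column (as in the sample DataFrame built earlier; with the string values from the CSV read, compare against 'TRUE' instead):

from pyspark.sql import Window, functions as f

w = Window.partitionBy('CUSTOMER_ID').orderBy('ROW_NUMBER')

# 1 + number of channel changes seen so far = current sequence number
df = df.withColumn(
    'CHANGE_SEQUENCE',
    f.lit(1) + f.sum(f.when(f.col('CHANNEL_CHANGED'), 1).otherwise(0)).over(w))

This reproduces the Output table: rows before the first change keep 1, and each CHANNEL_CHANGED = TRUE row bumps the sequence by one for itself and all following rows.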