I am trying to compute a date diff and a count diff on event data in PySpark.
The data looks like this:
deviceid  techid  name  count  load_date
m1        1       a     30     23-01-2016
m2        1       b     40     23-01-2016
m1        1       a     45     29-01-2016
m1        2       a     50     30-01-2016
I want it to look like this:
deviceid  techid  name  count  load_date   datediff  countdiff
m1        1       a     30     23-01-2016  NA        NA
m2        1       b     40     23-01-2016  NA        NA
m1        1       a     45     29-01-2016  6         15
m1        2       a     50     30-01-2016  NA        NA
How can I create columns with these values in PySpark, taking the differences only when the grouping conditions (deviceid and techid) match?
Answer 0 (score: 0)
This can be solved using window functions.
(1) Create a DataFrame with the sample test data
df = spark.createDataFrame(
    [('m1', 1, 'a', 30, '23-01-2016'),
     ('m2', 1, 'b', 40, '23-01-2016'),
     ('m1', 1, 'a', 45, '29-01-2016'),
     ('m1', 2, 'a', 50, '30-01-2016')],
    ['deviceid', 'techid', 'name', 'count', 'load_date'])

# Parse the dd-MM-yyyy strings into timestamps so datediff works later
df1 = df.selectExpr("deviceid", "techid", "name", "count",
                    "to_timestamp(load_date, 'dd-MM-yyyy') AS load_date")
(2) Define the window and use the lag function to build previous-count and previous-load-date columns
from pyspark.sql.window import Window
from pyspark.sql.functions import lag

# Partition by device and tech so lag only looks back within the same group
windowSpec = Window.partitionBy('deviceid', 'techid').orderBy('load_date')
prev_count = lag('count').over(windowSpec).alias('prev_count')
prev_load_date = lag('load_date').over(windowSpec).alias('prev_load_date')

df2 = df1.withColumn("prev_count", prev_count) \
         .withColumn("prev_load_date", prev_load_date)
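To see what `lag().over(windowSpec)` computes without spinning up Spark, here is a plain-Python sketch of the same logic: within each (deviceid, techid) partition, ordered by load_date, every row sees the previous row's values (or None for the first row). This is only an illustration of the window semantics, not part of the Spark solution.

```python
from collections import defaultdict

# Sample rows: (deviceid, techid, name, count, load_date as ISO string)
rows = [
    ('m1', 1, 'a', 30, '2016-01-23'),
    ('m2', 1, 'b', 40, '2016-01-23'),
    ('m1', 1, 'a', 45, '2016-01-29'),
    ('m1', 2, 'a', 50, '2016-01-30'),
]

# partitionBy('deviceid', 'techid')
partitions = defaultdict(list)
for r in rows:
    partitions[(r[0], r[1])].append(r)

# orderBy('load_date'), then lag: each row maps to its predecessor (or None)
lagged = {}
for key, part in partitions.items():
    part.sort(key=lambda r: r[4])
    prev = None
    for r in part:
        lagged[r] = prev
        prev = r
```

Only the second (m1, 1) row gets a non-None predecessor; the other three rows are each the first in their partition, which is why they end up with null differences.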
(3) Subtract the previous-value columns from the original columns to compute the differences.
df2.selectExpr("deviceid",
"techid",
"name",
"count",
"load_date",
"datediff(load_date,prev_load_date) AS datediff",
"(count - prev_count) AS countdiff")\
.show()
#+--------+------+----+-----+-------------------+--------+---------+
#|deviceid|techid|name|count| load_date|datediff|countdiff|
#+--------+------+----+-----+-------------------+--------+---------+
#| m1| 1| a| 30|2016-01-23 00:00:00| null| null|
#| m1| 1| a| 45|2016-01-29 00:00:00| 6| 15|
#| m1| 2| a| 50|2016-01-30 00:00:00| null| null|
#| m2| 1| b| 40|2016-01-23 00:00:00| null| null|
#+--------+------+----+-----+-------------------+--------+---------+
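As a quick sanity check (plain Python, no Spark needed), the two non-null differences in the output can be reproduced directly from the sample data: the only row with a predecessor in its (m1, 1) partition is the one dated 29-01-2016, whose previous row is dated 23-01-2016.

```python
from datetime import date

# Gap between the two (m1, 1) load_dates, and between their counts
datediff = (date(2016, 1, 29) - date(2016, 1, 23)).days
countdiff = 45 - 30
print(datediff, countdiff)  # -> 6 15
```

These match the 6 and 15 shown in the datediff and countdiff columns above.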