I have the following code, which produces the df shown below:
import datetime

import pyspark.sql.functions as psf
from pyspark.sql import Window
from pyspark.sql.types import (DoubleType, StringType, StructField,
                               StructType, TimestampType)

# Rows are (customid, procid, speed, wait, timestamp)
l = [('CM1', 'aa1', 3.0, None, datetime.datetime(2017, 5, 30, 20, 0, 1)),
     ('CM1', 'aa1', None, .1, datetime.datetime(2017, 5, 30, 20, 0, 4)),
     ('CM1', 'aa1', None, .2, datetime.datetime(2017, 5, 30, 20, 0, 8)),
     ('CM1', 'aa1', None, .3, datetime.datetime(2017, 5, 30, 20, 0, 12)),
     ('CM1', 'aa1', None, .4, datetime.datetime(2017, 5, 30, 20, 0, 30)),
     ('CM1', 'aa1', None, .0, datetime.datetime(2017, 5, 30, 20, 0, 33)),
     ('CM1', 'aa1', 2.0, None, datetime.datetime(2017, 5, 30, 20, 0, 37)),
     ('CM1', 'aa1', None, .1, datetime.datetime(2017, 5, 30, 20, 0, 39)),
     ('CM1', 'aa1', None, .0, datetime.datetime(2017, 5, 30, 20, 0, 39)),
     ('CM1', 'aa1', None, .2, datetime.datetime(2017, 5, 30, 20, 0, 49)),
     ('CM1', 'aa1', None, .8, datetime.datetime(2017, 5, 30, 20, 0, 55)),
     ('CM1', 'aa1', 4.0, None, datetime.datetime(2017, 5, 30, 20, 0, 59))]
schema = StructType([StructField('customid', StringType(), True),
                     StructField('procid', StringType(), True),
                     StructField('speed', DoubleType(), True),
                     StructField('wait', DoubleType(), True),
                     StructField('timestamp', TimestampType(), True)])
# sc and sqlContext are assumed to be the ones predefined by the PySpark shell
rdd = sc.parallelize(l)
df = sqlContext.createDataFrame(rdd, schema)
df = df.withColumn('u_ts', psf.unix_timestamp(df.timestamp))
w = Window.partitionBy(df['procid']).orderBy(df['timestamp'].asc())  # .rangeBetween(-1, 0)
# delay = seconds since the previous row of the same procid (lag(x, 0) is just x)
df = df.withColumn('delay', df.u_ts - psf.lag(df.u_ts, 1).over(w))
df.show()
+--------+------+-----+----+-------------------+----------+-----+
|customid|procid|speed|wait| timestamp| u_ts|delay|
+--------+------+-----+----+-------------------+----------+-----+
| CM1| aa1| 3.0|null|2017-05-30 20:00:01|1496167201| null|
| CM1| aa1| null| 0.1|2017-05-30 20:00:04|1496167204| 3|
| CM1| aa1| null| 0.2|2017-05-30 20:00:08|1496167208| 4|
| CM1| aa1| null| 0.3|2017-05-30 20:00:12|1496167212| 4|
| CM1| aa1| null| 0.4|2017-05-30 20:00:30|1496167230| 18|
| CM1| aa1| null| 0.0|2017-05-30 20:00:33|1496167233| 3|
| CM1| aa1| 2.0|null|2017-05-30 20:00:37|1496167237| 4|
| CM1| aa1| null| 0.1|2017-05-30 20:00:39|1496167239| 2|
| CM1| aa1| null| 0.0|2017-05-30 20:00:39|1496167239| 0|
| CM1| aa1| null| 0.2|2017-05-30 20:00:49|1496167249| 10|
| CM1| aa1| null| 0.8|2017-05-30 20:00:55|1496167255| 6|
| CM1| aa1| 4.0|null|2017-05-30 20:00:59|1496167259| 4|
+--------+------+-----+----+-------------------+----------+-----+
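The goal is to compute and fill in every speed entry that is null, according to the following (s, w, d refer to the speed, wait and delay columns, indexed by row):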
+--------+------+-----+----+-------------------+----------+-----+
|customid|procid|speed |wait| timestamp| u_ts|delay|
+--------+------+-----+----+-------------------+----------+-----+
| CM1| aa1| 3.0 |null|2017-05-30 20:00:01|1496167201| null|
| CM1| aa1| s[0]+w[1]*d[1]| 0.1|2017-05-30 20:00:04|1496167204| 3|
| CM1| aa1| s[1]+w[2]*d[2]| 0.2|2017-05-30 20:00:08|1496167208| 4|
| CM1| aa1| s[2]+w[3]*d[3]| 0.3|2017-05-30 20:00:12|1496167212| 4|
| CM1| aa1| s[3]+w[4]*d[4]| 0.4|2017-05-30 20:00:30|1496167230| 18|
| CM1| aa1| s[4]+w[5]*d[5]| 0.0|2017-05-30 20:00:33|1496167233| 3|
| CM1| aa1| 2.0 |null|2017-05-30 20:00:37|1496167237| 4|
| CM1| aa1| s[6]+w[7]*d[7]| 0.1|2017-05-30 20:00:39|1496167239| 2|
| CM1| aa1| s[7]+w[8]*d[8]| 0.0|2017-05-30 20:00:39|1496167239| 0|
| CM1| aa1| s[8]+w[9]*d[9]| 0.2|2017-05-30 20:00:49|1496167249| 10|
| CM1| aa1| s[9]+w[10]*d[10]| 0.8|2017-05-30 20:00:55|1496167255| 6|
| CM1| aa1| 4.0 |null|2017-05-30 20:00:59|1496167259| 4|
+--------+------+-----+----+-------------------+----------+-----+
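To spell out the rule without Spark, here is a minimal plain-Python sketch of the recurrence s[i] = s[i-1] + w[i]*d[i] on the sample rows. The names `rows` and `fill_speed` are made up for this illustration, and it assumes the first row of each group has a known speed and that wait and delay are present wherever speed is missing.

```python
# Plain-Python reference for the fill rule (no Spark involved).
# Each row is (speed, wait, delay), in timestamp order within one procid.
rows = [
    (3.0, None, None),
    (None, 0.1, 3),
    (None, 0.2, 4),
    (None, 0.3, 4),
    (None, 0.4, 18),
    (None, 0.0, 3),
    (2.0, None, 4),
    (None, 0.1, 2),
    (None, 0.0, 0),
    (None, 0.2, 10),
    (None, 0.8, 6),
    (4.0, None, 4),
]


def fill_speed(rows):
    """Return the speed column with nulls filled by s[i] = s[i-1] + w[i] * d[i]."""
    filled = []
    for speed, wait, delay in rows:
        if speed is None:
            # Assumes the previous row already has a speed (known or just filled)
            # and that wait/delay are not None on rows where speed is missing.
            speed = filled[-1] + wait * delay
        filled.append(speed)
    return filled


filled = fill_speed(rows)
# Round only for display; the raw values carry floating-point noise.
print([round(s, 2) for s in filled])
```

A known speed (rows 0, 6 and 11) restarts the chain, so each filled value only depends on the most recent known speed and the wait*delay terms after it.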
I implemented the solution as follows:
# Each pass fills the null speeds whose previous row already has a speed
for i in range(5):
    df = df.withColumn('speed',
                       psf.when(df.speed.isNull(),
                                df.wait * df.delay + psf.lag(df.speed, 1).over(w))
                       .otherwise(df.speed))
    # df = df.withColumn('speed', psf.coalesce(df.speed, df.result))
df.show()
The result is correct:
+--------+------+-----+----+-------------------+----------+-----+
|customid|procid|speed|wait| timestamp| u_ts|delay|
+--------+------+-----+----+-------------------+----------+-----+
| CM1| aa1| 3.0|null|2017-05-30 20:00:01|1496167201| null|
| CM1| aa1| 3.3| 0.1|2017-05-30 20:00:04|1496167204| 3|
| CM1| aa1| 4.1| 0.2|2017-05-30 20:00:08|1496167208| 4|
| CM1| aa1| 5.3| 0.3|2017-05-30 20:00:12|1496167212| 4|
| CM1| aa1| 12.5| 0.4|2017-05-30 20:00:30|1496167230| 18|
| CM1| aa1| 12.5| 0.0|2017-05-30 20:00:33|1496167233| 3|
| CM1| aa1| 2.0|null|2017-05-30 20:00:37|1496167237| 4|
| CM1| aa1| 2.2| 0.1|2017-05-30 20:00:39|1496167239| 2|
| CM1| aa1| 2.2| 0.0|2017-05-30 20:00:39|1496167239| 0|
| CM1| aa1| 4.2| 0.2|2017-05-30 20:00:49|1496167249| 10|
| CM1| aa1| 9.0| 0.8|2017-05-30 20:00:55|1496167255| 6|
| CM1| aa1| 4.0|null|2017-05-30 20:00:59|1496167259| 4|
+--------+------+-----+----+-------------------+----------+-----+
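As far as I understand my own loop, each pass can only fill a null speed whose previous row already has a speed, so the number of passes has to be at least the longest run of consecutive nulls in any group. The sketch below checks that on the sample speed column in plain Python; `longest_null_run` is a hypothetical helper written for this illustration, not part of PySpark.

```python
def longest_null_run(values):
    """Length of the longest run of consecutive None entries."""
    longest = current = 0
    for v in values:
        # Extend the current run on None, reset it on a known value.
        current = current + 1 if v is None else 0
        longest = max(longest, current)
    return longest


# speed column of the sample, in timestamp order
speed = [3.0, None, None, None, None, None, 2.0, None, None, None, None, 4.0]
passes_needed = longest_null_run(speed)
print(passes_needed)
```

For the sample this gives the 5 used in `range(5)`; on real data with longer gaps the hard-coded 5 would leave some speeds null.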
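It does run on a few hundred procid groups, but processing is very slow. Is this the right way to implement the solution, or does it waste computing power?
I am not sure what happens with the loop and the when/otherwise expression: does it only operate within the window, or is the whole column of the df recomputed by withColumn / the case expression on every pass?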