I am new to PySpark and I am trying to convert some Python code that derives a new variable, COUNT_IDX. The new variable starts at 1 and is incremented by 1 whenever a condition is met; otherwise it takes the same value as in the previous record.
The condition for incrementing is: TRIP_CD does not equal the previous record's TRIP_CD, or SIGN does not equal the previous record's SIGN, or time_diff does not equal 1.
Python code (pandas DataFrame):
df['COUNT_IDX'] = 1

for i in range(1, len(df)):
    if ((df['TRIP_CD'].iloc[i] != df['TRIP_CD'].iloc[i - 1])
            or (df['SIGN'].iloc[i] != df['SIGN'].iloc[i - 1])
            or df['time_diff'].iloc[i] != 1):
        df['COUNT_IDX'].iloc[i] = df['COUNT_IDX'].iloc[i - 1] + 1
    else:
        df['COUNT_IDX'].iloc[i] = df['COUNT_IDX'].iloc[i - 1]
Here is the expected result:
TRIP_CD SIGN time_diff COUNT_IDX
2711 - 1 1
2711 - 1 1
2711 + 2 2
2711 - 1 3
2711 - 1 3
2854 - 1 4
2854 + 1 5
In PySpark, I initialize COUNT_IDX to 1. Then, using a Window function, I take the lag of TRIP_CD and SIGN and calculate time_diff, and then I tried:
df = sqlContext.sql('''
select TRIP, TRIP_CD, SIGN, TIME_STAMP, seconds_diff,
case when TRIP_CD != TRIP_lag or SIGN != SIGN_lag or seconds_diff != 1
then (lag(COUNT_INDEX) over(partition by TRIP order by TRIP, TIME_STAMP))+1
else (lag(COUNT_INDEX) over(partition by TRIP order by TRIP, TIME_STAMP))
end as COUNT_INDEX from df''')
This gives me something like:
TRIP_CD SIGN time_diff COUNT_IDX
2711 - 1 1
2711 - 1 1
2711 + 2 2
2711 - 1 2
2711 - 1 1
2854 - 1 2
2854 + 1 2
When COUNT_IDX is updated on a previous record, the current record's COUNT_IDX does not pick up that change for its own calculation. It is as if COUNT_IDX is never overwritten, or is not evaluated row by row. Any ideas on how to solve this?
Answer 0 (score: 1)
You need a cumulative sum. lag(COUNT_INDEX) reads the original input column, which is still all 1s; a window function cannot see the values that the same expression is in the process of producing, so the increments never accumulate. Instead, turn each break condition into a 0/1 flag and take a running sum of the flags:
-- cumulative sum
SUM(CAST(
    -- 1 if at least one condition is satisfied, 0 otherwise
    TRIP_CD != TRIP_lag OR SIGN != SIGN_lag OR seconds_diff != 1 AS LONG
)) OVER W
...
WINDOW W AS (PARTITION BY TRIP ORDER BY TIME_STAMP)
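For reference, here is a minimal sketch of the same cumulative-sum idea written with the DataFrame API, assuming Spark 2.1+ and an input df that already carries the TRIP, TRIP_CD, SIGN, TIME_STAMP and time_diff columns from the question. The when(...).otherwise(0) makes the flag 0 on the first row of each partition, where the lag comparisons evaluate to NULL, and the + 1 makes the index start at 1 as in the expected output:

from pyspark.sql import Window
from pyspark.sql import functions as F

# Window over each trip, ordered by timestamp; the running sum uses an
# explicit ROWS frame so tied timestamps are not summed together.
w = Window.partitionBy("TRIP").orderBy("TIME_STAMP")
w_cum = w.rowsBetween(Window.unboundedPreceding, Window.currentRow)

result = (
    df
    # previous record's values within the trip (NULL on the first row)
    .withColumn("TRIP_CD_lag", F.lag("TRIP_CD").over(w))
    .withColumn("SIGN_lag", F.lag("SIGN").over(w))
    # 1 when any break condition holds, 0 otherwise; otherwise(0) also
    # covers the first row, where the lag comparisons are NULL
    .withColumn(
        "break_flag",
        F.when(
            (F.col("TRIP_CD") != F.col("TRIP_CD_lag"))
            | (F.col("SIGN") != F.col("SIGN_lag"))
            | (F.col("time_diff") != 1),
            1,
        ).otherwise(0),
    )
    # running sum of the flags gives the group index; + 1 so it starts at 1
    .withColumn("COUNT_IDX", F.sum("break_flag").over(w_cum) + 1)
    .drop("TRIP_CD_lag", "SIGN_lag", "break_flag")
)

On the sample data this yields COUNT_IDX = 1, 1, 2, 3, 3, 4, 5, matching the expected result, without any row-by-row dependence on the column being built.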