PySpark conditional increment

Date: 2016-07-31 21:17:10

Tags: sql apache-spark apache-spark-sql window-functions

I am new to PySpark and I am trying to convert some Python code that derives a new variable, 'COUNT_IDX'. The new variable has an initial value of 1, but is incremented by 1 when a condition is met. Otherwise, the new variable takes the same value as in the previous record.

The conditions for incrementing are any of the following:

- TRIP_CD is not equal to the previous record's TRIP_CD
- SIGN is not equal to the previous record's SIGN
- time_diff is not equal to 1

Python code (pandas DataFrame):

df['COUNT_IDX'] = 1

for i in range(1, len(df)):
    if ((df['TRIP_CD'].iloc[i] != df['TRIP_CD'].iloc[i - 1])
          or (df['SIGN'].iloc[i] != df['SIGN'].iloc[i - 1])
          or df['time_diff'].iloc[i] != 1):
        # .loc avoids chained-indexing assignment, which pandas does not
        # guarantee to write back to the frame
        df.loc[df.index[i], 'COUNT_IDX'] = df['COUNT_IDX'].iloc[i - 1] + 1
    else:
        df.loc[df.index[i], 'COUNT_IDX'] = df['COUNT_IDX'].iloc[i - 1]

Here is the expected result:

TRIP_CD   SIGN   time_diff  COUNT_IDX
2711      -      1          1
2711      -      1          1
2711      +      2          2
2711      -      1          3
2711      -      1          3
2854      -      1          4
2854      +      1          5
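
For reference, the loop above reproduces this table given input like the following (a hypothetical reconstruction of the sample data, input columns only):

import pandas as pd

# Hypothetical sample input matching the expected result above.
df = pd.DataFrame({
    'TRIP_CD':   [2711, 2711, 2711, 2711, 2711, 2854, 2854],
    'SIGN':      ['-', '-', '+', '-', '-', '-', '+'],
    'time_diff': [1, 1, 2, 1, 1, 1, 1],
})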

In PySpark, I initialized COUNT_IDX to 1. Then, using a Window function, I took the lag of TRIP_CD and SIGN, calculated time_diff, and tried:

df = sqlContext.sql('''
   select TRIP, TRIP_CD, SIGN, TIME_STAMP, seconds_diff,
   case when TRIP_CD != TRIP_lag or SIGN != SIGN_lag  or  seconds_diff != 1 
        then (lag(COUNT_INDEX) over(partition by TRIP order by TRIP, TIME_STAMP))+1
        else (lag(COUNT_INDEX) over(partition by TRIP order by TRIP, TIME_STAMP)) 
        end as COUNT_INDEX from df''')

This gives me something like:

TRIP_CD   SIGN   time_diff  COUNT_IDX
2711      -      1          1
2711      -      1          1
2711      +      2          2
2711      -      1          2
2711      -      1          1
2854      -      1          2
2854      +      1          2

When COUNT_IDX is updated on a previous record, the current record does not pick up that change for its own calculation. It is as if COUNT_IDX is never overwritten, or is not evaluated row by row. Any ideas on how to solve this?

1 Answer:

Answer 0 (score: 1)

Window functions in Spark SQL operate on the input columns, not on values computed in the same SELECT, so lag(COUNT_INDEX) only ever sees the initial value of 1 rather than the running result; the query cannot update row by row the way the pandas loop does. What you need here is a cumulative sum of the break condition:

-- cumulative sum
SUM(CAST(  
  -- if at least one condition has been satisfied
  -- we take 1 otherwise 0
  TRIP_CD != TRIP_lag OR SIGN != SIGN_lag OR seconds_diff != 1 AS LONG
)) OVER W
...
WINDOW W AS (PARTITION BY TRIP ORDER BY TIME_STAMP)
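
Putting it together with the DataFrame API (a minimal sketch rather than the answer's exact code: the SparkSession setup and sample data are assumed, and the NULL flag on each partition's first row is coalesced to TRUE so the running sum starts at 1, matching the expected output):

from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical sample data; TIME_STAMP is simplified to a sequence number.
df = spark.createDataFrame(
    [(1, 2711, "-", 1, 1),
     (1, 2711, "-", 2, 1),
     (1, 2711, "+", 3, 2),
     (1, 2711, "-", 4, 1),
     (1, 2711, "-", 5, 1),
     (1, 2854, "-", 6, 1),
     (1, 2854, "+", 7, 1)],
    ["TRIP", "TRIP_CD", "SIGN", "TIME_STAMP", "seconds_diff"])

w = Window.partitionBy("TRIP").orderBy("TIME_STAMP")

# A row starts a new group when any condition holds; lag() returns NULL on
# the first row of each partition, which makes the whole flag NULL there,
# so it is coalesced to True to seed the count at 1.
is_break = (
    (F.col("TRIP_CD") != F.lag("TRIP_CD").over(w))
    | (F.col("SIGN") != F.lag("SIGN").over(w))
    | (F.col("seconds_diff") != 1)
)

result = df.withColumn(
    "COUNT_IDX",
    F.sum(F.coalesce(is_break, F.lit(True)).cast("long")).over(w))

result.show()

Because the flag is just 0 or 1, the running sum increments exactly where the pandas loop did, and rows where nothing changed inherit the previous value automatically.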