Getting the difference between a value and its lag in Spark

Time: 2017-08-14 15:28:34

Tags: apache-spark pyspark spark-dataframe sparkr

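I have a SparkR DataFrame as shown below. I want to create a monthdiff column that holds the number of months between each value of dates and the previous one, computed separately for each name. How can I do this?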

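#Load SparkR and start a session (assumes a local Spark installation)
library(SparkR)
sparkR.session()
#Set up data frame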
team <- data.frame(name = c("Thomas", "Thomas", "Thomas", "Thomas", "Bill", "Bill", "Bill"),
  dates = c('2017-01-05', '2017-02-23', '2017-03-16', '2017-04-08', '2017-06-08','2017-07-24','2017-09-05'))
#Create Spark DataFrame
team <- createDataFrame(team)
#Convert dates to date type
team <- withColumn(team, 'dates', cast(team$dates, 'date'))

Here is what I have tried so far; every attempt resulted in an error:

team <- agg(groupBy(team, 'name'), monthdiff=c(NA, months_between(team$dates, lag(team$dates))))
team <- agg(groupBy(team, 'name'), monthdiff=months_between(team$dates, lag(team$dates)))
team <- agg(groupBy(team, 'name'), monthdiff=months_between(select(team, 'dates'), lag(select(team, 'dates'))))

Expected output:

name    | dates     | monthdiff
-------------------------------
Thomas  |2017-01-05 |  NA
Thomas  |2017-02-23 |  1
Thomas  |2017-03-16 |  1
Thomas  |2017-04-08 |  1
Bill    |2017-06-08 |  NA
Bill    |2017-07-24 |  1
Bill    |2017-09-05 |  2

1 answer:

Answer 0: (score: 0)

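Based on this post, I adapted the code to SparkR and got an answer. lag is a window function, so it has to be evaluated with over() on a window partitioned by name and ordered by dates; it cannot be used inside agg().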

#Create 'lagdates' variable with lag of dates
window <- orderBy(windowPartitionBy("name"), team$dates)
team <- withColumn(team, 'lagdates', over(lag(team$dates), window))

#Get months_between dates and lagdates
team <- withColumn(team, 'monthdiff', round(months_between(team$dates, team$lagdates)))

name   | dates      | lagdates   | monthdiff
--------------------------------------------
Bill   | 2017-06-08 | null       | null
Bill   | 2017-07-24 | 2017-06-08 | 2
Bill   | 2017-09-05 | 2017-07-24 | 1
Thomas | 2017-01-05 | null       | null
Thomas | 2017-02-23 | 2017-01-05 | 2
Thomas | 2017-03-16 | 2017-02-23 | 1
Thomas | 2017-04-08 | 2017-03-16 | 1
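
Note that these monthdiff values differ from the expected output in the question (for example 2 instead of 1 for Thomas on 2017-02-23). As far as I can tell, that is because months_between returns fractional months (using a 31-day month for the day-of-month remainder), and rounding 1.58 gives 2, whereas the expected table looks like a plain difference of calendar months. The sketch below is plain Python, not Spark code; it only mirrors the logic to make the difference visible. The helper names (months_between, calendar_month_diff, round_half_up, month_diffs) are hypothetical, and the months_between formula is my approximation of Spark's behaviour for dates without a time part.

```python
import calendar
import math
from datetime import date
from itertools import groupby


def is_month_end(d):
    """True if d is the last day of its month."""
    return d.day == calendar.monthrange(d.year, d.month)[1]


def months_between(later, earlier):
    """Approximation (assumption) of Spark's months_between for pure dates."""
    whole = (later.year - earlier.year) * 12 + (later.month - earlier.month)
    # Same day of month, or both month-end: result is a whole number.
    if later.day == earlier.day or (is_month_end(later) and is_month_end(earlier)):
        return float(whole)
    # Otherwise the day difference is counted against a 31-day month.
    return whole + (later.day - earlier.day) / 31.0


def calendar_month_diff(later, earlier):
    """Difference in calendar months, ignoring the day of month."""
    return (later.year - earlier.year) * 12 + (later.month - earlier.month)


def round_half_up(x):
    """Round half up; adequate here because all values are non-negative."""
    return int(math.floor(x + 0.5))


def month_diffs(rows):
    """Return (name, date, lag date, rounded months_between, calendar diff)."""
    out = []
    ordered = sorted(rows)  # order by name, then by date
    for name, group in groupby(ordered, key=lambda r: r[0]):
        previous = None  # the lag is undefined for the first row of a group
        for _, current in group:
            if previous is None:
                out.append((name, current, None, None, None))
            else:
                out.append((name, current, previous,
                            round_half_up(months_between(current, previous)),
                            calendar_month_diff(current, previous)))
            previous = current
    return out


rows = [
    ("Thomas", date(2017, 1, 5)), ("Thomas", date(2017, 2, 23)),
    ("Thomas", date(2017, 3, 16)), ("Thomas", date(2017, 4, 8)),
    ("Bill", date(2017, 6, 8)), ("Bill", date(2017, 7, 24)),
    ("Bill", date(2017, 9, 5)),
]

result = month_diffs(rows)
for row in result:
    print(row)
```

The fourth field reproduces the monthdiff column in the table above, and the fifth reproduces the expected output from the question. If the calendar-month version is what you need, it could presumably be built in SparkR from the year() and month() column functions applied to dates and lagdates, instead of round(months_between(...)); I have not run that variant.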