Date difference between consecutive rows - Pyspark Dataframe

Time: 2016-07-02 04:04:51

Tags: python apache-spark pyspark pyspark-sql

I have a table with the following structure:

USER_ID     Tweet_ID                 Date
  1           1001       Thu Aug 05 19:11:39 +0000 2010
  1           6022       Mon Aug 09 17:51:19 +0000 2010
  1           1041       Sun Aug 19 11:10:09 +0000 2010
  2           9483       Mon Jan 11 10:51:23 +0000 2012
  2           4532       Fri May 21 11:11:11 +0000 2012
  3           4374       Sat Jul 10 03:21:23 +0000 2013
  3           4334       Sun Jul 11 04:53:13 +0000 2013

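Basically, what I want to do is use a PySpark SQL query to compute the date difference (in seconds) between consecutive records that share the same user_id. The expected result would be: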

1      Sun Aug 19 11:10:09 +0000 2010 - Mon Aug 09 17:51:19 +0000 2010     839930
1      Mon Aug 09 17:51:19 +0000 2010 - Thu Aug 05 19:11:39 +0000 2010     340780
2      Fri May 21 11:11:11 +0000 2012 - Mon Jan 11 10:51:23 +0000 2012     1813212
3      Sun Jul 11 04:53:13 +0000 2013 - Sat Jul 10 03:21:23 +0000 2013     5510

3 answers:

Answer 0: (score: 7)

Another way could be:

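from pyspark.sql.functions import lag
from pyspark.sql.window import Window

# Seconds since the same user's previous tweet (null for each user's first tweet).
# Assumes df.date is a timestamp column, so cast("bigint") yields epoch seconds.
w = Window.partitionBy("user_id").orderBy("date")
df.withColumn("time_intertweet",
              (df.date.cast("bigint") - lag(df.date.cast("bigint"), 1).over(w)).cast("bigint"))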
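
The window expression above pairs each row with the previous row of the same user, ordered by date, and subtracts the two epoch values. As a rough sanity check, the same logic can be sketched in plain Python (standard library only, not Spark). The sample rows are the first user's rows from the question; the function name `intertweet_seconds` and the format string are assumptions made for this illustration.

```python
from datetime import datetime
from itertools import groupby

# Rows copied from the question (user 1 only): (user_id, tweet_id, date string)
rows = [
    (1, 1001, "Thu Aug 05 19:11:39 +0000 2010"),
    (1, 6022, "Mon Aug 09 17:51:19 +0000 2010"),
    (1, 1041, "Sun Aug 19 11:10:09 +0000 2010"),
]

# Assumed layout of the Date column once the leading weekday token is dropped
DATE_FORMAT = "%b %d %H:%M:%S %z %Y"


def parse(date_string):
    # The weekday name is redundant for the arithmetic, so skip it before parsing
    return datetime.strptime(date_string.split(" ", 1)[1], DATE_FORMAT)


def intertweet_seconds(rows):
    """Return (user_id, tweet_id, seconds since that user's previous tweet or None)."""
    parsed = [(user_id, tweet_id, parse(d)) for user_id, tweet_id, d in rows]
    parsed.sort(key=lambda r: (r[0], r[2]))  # partition by user, order by date
    result = []
    for _, group in groupby(parsed, key=lambda r: r[0]):
        previous = None  # like lag(), the first row of a partition has no predecessor
        for user_id, tweet_id, ts in group:
            delta = None if previous is None else int((ts - previous).total_seconds())
            result.append((user_id, tweet_id, delta))
            previous = ts
    return result


diffs = intertweet_seconds(rows)
print(diffs)  # [(1, 1001, None), (1, 6022, 340780), (1, 1041, 839930)]
```

The two non-null values agree with the figures the question lists for user 1.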

Answer 1: (score: 2)

Like this:

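# registerTempTable is deprecated since Spark 2.0; createOrReplaceTempView replaces it.
df.registerTempTable("df")

sqlContext.sql("""
     SELECT *, CAST(date AS bigint) - CAST(lag(date, 1) OVER (
              PARTITION BY user_id ORDER BY date) AS bigint) AS time_intertweet
     FROM df""")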

Answer 2: (score: 1)

Edit, thanks to @cool_kid

@Joesemy's answer is really good, but it did not work for me because cast("bigint") threw an error. So I used the datediff function from the pyspark.sql.functions module like this, and it worked:

from pyspark.sql.functions import datediff, lag
from pyspark.sql.window import Window

# Note: datediff returns the difference in whole days, not seconds.
df.withColumn("time_intertweet", datediff(df.date, lag(df.date, 1)
    .over(Window.partitionBy("user_id")
    .orderBy("date"))))
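
One thing worth checking with this approach is the unit: as far as I know, Spark's datediff counts calendar days between two dates, while the question asks for seconds. The plain-Python sketch below (standard library only, not Spark) contrasts the two measures on the first pair of rows from the question. The helper name `parse` and the format string are assumptions made for this illustration.

```python
from datetime import datetime

# Assumed layout of the Date column once the leading weekday token is dropped
DATE_FORMAT = "%b %d %H:%M:%S %z %Y"


def parse(date_string):
    # Skip the weekday name, then parse the rest as a timezone-aware datetime
    return datetime.strptime(date_string.split(" ", 1)[1], DATE_FORMAT)


earlier = parse("Thu Aug 05 19:11:39 +0000 2010")
later = parse("Mon Aug 09 17:51:19 +0000 2010")

# Elapsed time in seconds, which is what the question asks for
seconds_apart = int((later - earlier).total_seconds())

# Difference between the calendar dates only, ignoring the time of day
calendar_days_apart = (later.date() - earlier.date()).days

print(seconds_apart)        # 340780
print(calendar_days_apart)  # 4
```

If seconds are required, a day count cannot simply be multiplied by 86400, because the time-of-day part has already been discarded.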