Subtracting two timestamp columns in PySpark (Python)

Asked: 2017-12-07 18:16:19

Tags: python-3.x pyspark pyspark-sql

I am trying to subtract two columns of a PySpark DataFrame in Python and I am running into a number of problems. Both columns are of timestamp type: one column holds date1 = 2011-01-03 13:25:59 and the other holds date2 = 2011-01-03 13:27:00. I want to compute date2 - date1 and put the result in a separate timeDiff column that shows the difference between the two, e.g. timeDiff = 00:01:01.

How can I do this in PySpark?

I tried the following code:

#timeDiff = df.withColumn(('timeDiff', col(df['date2']) - col(df['date1'])))

That code didn't work.

Then I tried something as simple as:

timeDiff = df['date2'] - df['date1']

This actually works, but when I then try to add the result as a separate column to my DataFrame with the following code:

df = df.withColumn("Duration", timeDiff)

it fails with the following error:

Py4JJavaError: An error occurred while calling o107.withColumn.
: org.apache.spark.sql.AnalysisException: cannot resolve '(`date2` - `date1`)' due to data type mismatch: '(`date2` - `date1`)' requires (numeric or calendarinterval) type, not timestamp;;
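
Reading the message, it sounds like Spark refuses to subtract two timestamp columns directly, so they would first have to be converted to a numeric type. A minimal sketch of that idea (my own reading of the error; the Duration name is just illustrative):

from pyspark.sql.functions import col

# casting a timestamp to long yields epoch seconds, so the subtraction becomes numeric
df = df.withColumn("Duration", col("date2").cast("long") - col("date1").cast("long"))

But I am not sure whether this is the right approach.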

Can anyone help me with another approach, or with how to resolve this error?

2 Answers:

Answer 0 (score: 3)

Hope this helps!

from pyspark.sql.functions import unix_timestamp

# sample data
df = sc.parallelize([
    ['2011-01-03 13:25:59', '2011-01-03 13:27:00'],
    ['2011-01-03 3:25:59',  '2011-01-03 3:30:00']
]).toDF(('date1', 'date2'))

# unix_timestamp parses each string into epoch seconds,
# so the difference is the duration in seconds
timeDiff = (unix_timestamp('date2', "yyyy-MM-dd HH:mm:ss")
            - unix_timestamp('date1', "yyyy-MM-dd HH:mm:ss"))
df = df.withColumn("Duration", timeDiff)
df.show()

The output is:

+-------------------+-------------------+--------+
|              date1|              date2|Duration|
+-------------------+-------------------+--------+
|2011-01-03 13:25:59|2011-01-03 13:27:00|      61|
| 2011-01-03 3:25:59| 2011-01-03 3:30:00|     241|
+-------------------+-------------------+--------+
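
Since the question asked for the difference in HH:mm:ss form (e.g. timeDiff = 00:01:01) rather than raw seconds, the Duration column can be formatted afterwards. A minimal sketch of one way to do it (this formatting step is my own addition, not part of the original answer):

from pyspark.sql.functions import col, format_string

# split the seconds into hours, minutes and seconds, then render as HH:mm:ss
df = df.withColumn(
    "timeDiff",
    format_string(
        "%02d:%02d:%02d",
        (col("Duration") / 3600).cast("int"),
        ((col("Duration") % 3600) / 60).cast("int"),
        (col("Duration") % 60).cast("int"),
    ),
)

For the first row this gives 00:01:01.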

Answer 1 (score: 0)

Agreed with the answer above, thanks!

But I think the column references may need to be changed to use F.col, i.e.:


from pyspark.sql import functions as F
from pyspark.sql.functions import unix_timestamp

timeDiff = (unix_timestamp(F.col('date2'), "yyyy-MM-dd HH:mm:ss")
            - unix_timestamp(F.col('date1'), "yyyy-MM-dd HH:mm:ss"))
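
Both variants should behave the same here, since unix_timestamp accepts either a column-name string or a Column object; wrapping the name in F.col simply makes the column reference explicit.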