pyspark: change the day in a datetime column

Time: 2017-03-03 16:03:09

Tags: python date apache-spark pyspark pyspark-sql

What is wrong with this code, which tries to change the day of a datetime column?

import pyspark
import pyspark.sql.functions as sf
import pyspark.sql.types as sparktypes
import datetime

sc = pyspark.SparkContext(appName="test")
sqlcontext = pyspark.SQLContext(sc)

rdd = sc.parallelize([('a',datetime.datetime(2014, 1, 9, 0, 0)),
                      ('b',datetime.datetime(2014, 1, 27, 0, 0)),
                      ('c',datetime.datetime(2014, 1, 31, 0, 0))])
testdf = sqlcontext.createDataFrame(rdd, ["id", "date"])

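# show() and printSchema() print directly and return None,
# so wrapping them in print() would only add a stray "None"
testdf.show()
testdf.printSchema()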

This gives a test dataframe:

+---+--------------------+
| id|                date|
+---+--------------------+
|  a|2014-01-09 00:00:...|
|  b|2014-01-27 00:00:...|
|  c|2014-01-31 00:00:...|
+---+--------------------+


root
 |-- id: string (nullable = true)
 |-- date: timestamp (nullable = true)

Then I define a udf to change the day of the date column:

def change_day_(date, day):
    return date.replace(day=day)

change_day = sf.udf(change_day_, sparktypes.TimestampType())
testdf.withColumn("PaidMonth", change_day(testdf.date, 1)).show(1)

This raises an error:

Py4JError: An error occurred while calling z:org.apache.spark.sql.functions.col. Trace:
py4j.Py4JException: Method col([class java.lang.Integer]) does not exist
    at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318)
    at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:339)
    at py4j.Gateway.invoke(Gateway.java:274)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:214)
    at java.lang.Thread.run(Thread.java:745)

2 answers:

Answer 0 (score: 1)

Thanks to @ArthurTacca's comment, the trick is to wrap the literal with the pyspark.sql.functions.lit() function, like this:

testdf.withColumn("PaidMonth", change_day(testdf.date, sf.lit(1))).show()

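Alternative answers are welcome!

Answer 1 (score: 1)

A udf that takes multiple arguments is assumed to receive multiple columns. The "1" is not a column, which is why the call ends up in functions.col with an Integer and fails.

This means you can do one of the following. Either make it a column, as suggested in the comments: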

testdf.withColumn("PaidMonth", change_day(testdf.date, sf.lit(1))).show(1)

sf.lit(1) is a column (lit is accessed through the sf alias imported in the question).
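To make this concrete, here is a minimal plain-Python sketch (no Spark needed) of what the wrapped function sees for each row once both arguments are columns: an ordinary datetime and an ordinary int. The None check is my own addition and not part of the original code; it assumes that null timestamps reach the Python function as None.

```python
import datetime

def change_day_(date, day):
    # Assumption: a null timestamp arrives as None, so return None
    # instead of letting date.replace raise AttributeError.
    if date is None:
        return None
    return date.replace(day=day)

# Simulated per-row inputs: the values of the "date" column,
# each paired with the literal 1.
rows = [
    datetime.datetime(2014, 1, 9, 0, 0),
    datetime.datetime(2014, 1, 27, 0, 0),
    None,
]
result = [change_day_(d, 1) for d in rows]
print(result)
```

Note that day=1 is valid for every month, whereas a value such as 31 would make datetime.replace raise ValueError for shorter months.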

Or turn the original function into a higher-order function that returns the function to wrap:

def change_day_(day):
    return lambda date: date.replace(day=day)

change_day = sf.udf(change_day_(1), sparktypes.TimestampType())
testdf.withColumn("PaidMonth", change_day(testdf.date)).show(1)

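This essentially creates a function that replaces the day with 1, so the integer can be passed as a plain Python value. The resulting udf then takes a single column.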
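The closure idea can be checked in plain Python before involving Spark. The sketch below also shows functools.partial as a comparable way to fix the day argument; change_day_two_args and the variable names are hypothetical, and I have not verified how sf.udf treats a partial object on every Spark version, so treat that part as an illustration only.

```python
import datetime
import functools

def change_day_(day):
    # Returns a one-argument function; `day` is captured in the closure.
    return lambda date: date.replace(day=day)

def change_day_two_args(date, day):
    # Hypothetical two-argument form, same as the function in the question.
    return date.replace(day=day)

# Two ways to obtain a single-argument callable with day fixed to 1.
first_of_month = change_day_(1)
first_of_month_partial = functools.partial(change_day_two_args, day=1)

d = datetime.datetime(2014, 1, 27, 0, 0)
print(first_of_month(d))
print(first_of_month_partial(d))
```

Both callables take only the date, which matches the single-column call change_day(testdf.date) used above.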