PySpark: passing a timestamp to a udf

Time: 2018-07-19 14:01:28

Tags: python pyspark

I am trying to check a condition on the timestamps below, but it throws an error. Can anyone point out what I am doing wrong here?

timestamp1 = pd.to_datetime('2018-02-14 12:09:36.0')
timestamp2 = pd.to_datetime('2018-02-14 12:10:00.0')
def check_formula(timestamp2, timestamp1, interval):
        if ((timestamp2-timestamp1)<=datetime.timedelta(minutes=(interval/2))):
            return True
        else:
            return False

chck_formula = udf(check_formula, BooleanType())
ts= chck_formula(timestamp2, timestamp1, 5)
print(ts)

This is the error I get:

An error occurred while calling z:org.apache.spark.sql.functions.col. Trace:
py4j.Py4JException: Method col([class java.sql.Timestamp]) does not exist
    at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318)
    at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:339)
    at py4j.Gateway.invoke(Gateway.java:274)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:214)
    at java.lang.Thread.run(Thread.java:748)

1 answer:

Answer 0: (score: 0)

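Spark operations work on an RDD or a DataFrame, so a udf can only be applied to the columns of one of them, not to standalone Python values such as the pandas timestamps in the question. When the udf is called with a plain timestamp, Spark tries to interpret the argument as a column name, which is where the col([class java.sql.Timestamp]) error comes from. You therefore need to change the way you apply the udf.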

There are two ways to do this:

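from pyspark.sql import functions as F
import datetime

df = sqlContext.createDataFrame([
    ['2018-02-14 12:09:36.0', '2018-02-14 12:10:00.0'],
], ["t1", "t2"])

interval = 5  # minutes

# Approach 1: built-in column functions, no udf.
# datediff() counts whole days, so compare epoch seconds instead.
elapsed = F.col("t2").cast("timestamp").cast("long") - F.col("t1").cast("timestamp").cast("long")
df.withColumn("check", elapsed <= datetime.timedelta(minutes=interval / 2.0).total_seconds()).show(truncate=False)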


+---------------------+---------------------+-----+
|t1                   |t2                   |check|
+---------------------+---------------------+-----+
|2018-02-14 12:09:36.0|2018-02-14 12:10:00.0|true |
+---------------------+---------------------+-----+


from pyspark.sql.functions import udf
from pyspark.sql.types import BooleanType

# Approach 2: keep the udf, but pass it columns instead of Python values.
def check_formula(timestamp2, timestamp1, interval):
    # Timestamp columns arrive in the udf as datetime.datetime objects.
    return (timestamp2 - timestamp1) <= datetime.timedelta(minutes=interval / 2.0)


chck_formula = udf(check_formula, BooleanType())

# Cast the string columns to timestamps before handing them to the udf.
df.withColumn("check", chck_formula(F.col("t2").cast("timestamp"), F.col("t1").cast("timestamp"), F.lit(5))).show(truncate=False)

+---------------------+---------------------+-----+
|t1                   |t2                   |check|
+---------------------+---------------------+-----+
|2018-02-14 12:09:36.0|2018-02-14 12:10:00.0|true |
+---------------------+---------------------+-----+
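
If you only have two standalone timestamps, as in the question, you do not need Spark or a udf at all: the plain Python function can be called directly. Below is a minimal sketch of that, assuming only the standard library. The question used pd.to_datetime; datetime.strptime is swapped in here so the example has no dependencies, and the format string fmt is my own assumption about the input layout.

```python
import datetime

def check_formula(timestamp2, timestamp1, interval):
    """Return True if timestamp2 is at most interval/2 minutes after timestamp1."""
    return (timestamp2 - timestamp1) <= datetime.timedelta(minutes=interval / 2.0)

# Assumed input layout: date, time, and a fractional-seconds suffix.
fmt = "%Y-%m-%d %H:%M:%S.%f"
timestamp1 = datetime.datetime.strptime("2018-02-14 12:09:36.0", fmt)
timestamp2 = datetime.datetime.strptime("2018-02-14 12:10:00.0", fmt)

# 24 seconds elapsed, compared against a 150-second threshold (5 / 2 minutes).
result = check_formula(timestamp2, timestamp1, 5)
print(result)  # True
```

The udf wrapper is only worth it when the same check has to run over every row of a DataFrame.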