PySpark - applying a UDF function right after creating a column

Date: 2019-10-15 11:40:04

Tags: python apache-spark pyspark

I am trying to apply a UDF function to a column immediately after creating that column.

But I am running into this error:

Cannot resolve column name "previous_status" among

which means the column does not exist.

I could rewrite the UDF so that it is no longer a UDF but just an ordinary expression built with F.when/otherwise. The point is that, as you can see, I need a global dictionary to determine whether I have already seen a given ID.

from pyspark.sql import functions as F
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType
from pyspark.sql.window import Window

# Global state: ids that have already authorized successfully.
alreadyAuthorized = {}

def previously_authorized_spark(id, failed, alreadyAuthorized = alreadyAuthorized):
    if id in alreadyAuthorized:
        previously_authorized = 1
    else:
        previously_authorized = 0

    # Remember this id unless the current attempt failed.
    if not failed:
        alreadyAuthorized[id] = True

    return previously_authorized

previously_authorized_udf = udf(lambda x, y: previously_authorized_spark(x, y), IntegerType())
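The tracking helper itself is plain Python, so its behavior can be checked without Spark. A minimal standalone sketch (note that on a real cluster this global dict lives per executor process, so it is not reliably shared when the UDF runs distributed):

```python
# Standalone re-implementation of the tracking logic, for illustration only.
alreadyAuthorized = {}

def previously_authorized(id, failed):
    seen = 1 if id in alreadyAuthorized else 0  # have we seen this id before?
    if not failed:
        alreadyAuthorized[id] = True  # remember successful authorizations
    return seen

print(previously_authorized("a", False))  # 0 - first sighting of "a"
print(previously_authorized("a", False))  # 1 - "a" was already authorized
print(previously_authorized("b", True))   # 0 - failed attempt, "b" not recorded
print(previously_authorized("b", False))  # 0 - so "b" still counts as unseen
```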

def get_previous_status(data):
    partition = Window.partitionBy("id").orderBy("date")

    data = data.withColumn("previous_status", F.lag(F.col("failed")).over(partition))\
                .withColumn("previously_authorized", previously_authorized_udf(data["id"], data["previous_status"]))

    return data

data = get_previous_status(data)

1 Answer:

Answer 0 (score: 1)

Try using the col function to get the columns, because, as @LaSul pointed out, you are referencing data before it has been reassigned:

from pyspark.sql.functions import col

...
    data = data.withColumn("previous_status", F.lag(F.col("failed")).over(partition))\
                .withColumn("previously_authorized", previously_authorized_udf(col("id"), col("previous_status")))

...