Question

I thought this would be relatively easy to do, but I get an error when I register the UDF and execute it.

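I wrote some Python functions in a modular way, so one function calls others. For example, I have a Python function that takes a column and is a relatively simple UDF, but inside it I need to use another UDF, which I call check_existence(col). That function first runs a SQL query against a table to check whether the current value of col exists, before the rest of the logic runs.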

I tried registering both functions as UDFs, since that seemed to make sense:

def check_exists(error):
  query_string = """select count(*) as count from my_data_frame where error_code = \"{0}\"""".format(error)
  result = spark.sql(query_string).toPandas()["count"][0]
  if result:
    return True 
  else: 
    return False 

#register as a udf 
from pyspark.sql.functions import udf 
from pyspark.sql.types import ArrayType, FloatType, BooleanType
check_exists_udf = udf("check_exists",BooleanType())

The function I actually want to run against the table, which calls check_exists:

def detect_col(error_code):
  if check_exists_udf(error_code):
    return 1 
  return 0
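
For what it's worth, calling check_exists_udf(error_code) inside a plain Python function returns a Column expression rather than a boolean, and spark.sql is generally not usable from code running on executors. One possible restructuring, offered as a sketch rather than a confirmed fix, is to collect the known error codes once on the driver and close over them in an ordinary function. The names make_detect_col and existing_codes are hypothetical, and the Spark calls in the comments are assumptions about how it would be wired up.

```python
# Sketch only: make_detect_col and existing_codes are hypothetical names.
def make_detect_col(existing_codes):
    """Build a plain Python function that closes over the known codes."""
    known = frozenset(existing_codes)

    def detect_col(error_code):
        # Pure Python membership test; no SparkSession or nested UDF needed.
        return 1 if error_code in known else 0

    return detect_col


# On the driver, the codes could be collected once, for example (assumption):
#   codes = [r["error_code"] for r in
#            spark.sql("select distinct error_code from my_data_frame").collect()]
# and the result registered with an integer return type (assumption):
#   spark.udf.register("detect_col_udf", make_detect_col(codes), IntegerType())
detect_col = make_detect_col(["E42", "E7"])
print(detect_col("E42"), detect_col("E99"))
```

This only fits if the set of distinct error codes is small enough to hold in memory; for a large table a join against the distinct codes would likely be the better route.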

Registering it as a UDF with spark.udf.register:

spark.udf.register("detect_col_udf", lambda error_code: detect_col(error_code), StringType())

%sql

select detect_col_udf(error_code, count), error_code, count, time
from time_series_view

Actual result:

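Error in SQL statement: SparkException: Job aborted due to stage failure: Task 0 in stage 218.0 failed 1 times, most recent failure: Lost task 0.0 in stage 218.0 (TID 4597, localhost, executor driver): org.apache.spark.api.python.PythonException: Traceback (most recent call last): File "/databricks/spark/python/pyspark/worker.py", line 403, in main process()

File "/databricks/spark/python/pyspark/sql/udf.py", line 84, in __init__ "{0}".format(type(func))) TypeError: Invalid function: not a function or callable (__call__ is not defined):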
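
A note on the message itself: the TypeError says the object handed to udf was not callable. In the code earlier in this post, udf("check_exists", BooleanType()) receives the string "check_exists" rather than the function check_exists. Assuming PySpark's check amounts to Python's built-in callable(), the difference can be seen without Spark. The function body below is a stand-in, not the real query.

```python
def check_exists(error):
    # Stand-in body for illustration; the real function queries a table.
    return bool(error)


as_string = "check_exists"   # what the post passes to udf(...)
as_function = check_exists   # the function object itself

print(callable(as_string))    # False
print(callable(as_function))  # True
```

Passing the function object would get past this particular error, though the nested spark.sql call may still fail for other reasons.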

How do I properly use a UDF inside another UDF? (PySpark SQL)
