PySpark UDF - TypeError: 'module' object is not callable

Time: 2019-03-01 08:37:17

Tags: python pyspark user-defined-functions

I am trying to run the following code, based on some tutorials I found online:

import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql import functions
from pyspark.sql import udf
df_pd = pd.DataFrame(
    data={'integers': [1, 2, 3],
          'floats': [-1.0, 0.5, 2.7],
          'integer_arrays': [[1, 2], [3, 4, 5], [6, 7, 8, 9]]}
)

df = spark.createDataFrame(df_pd)
df.show()

def square(x):
    return x**2
from pyspark.sql.types import IntegerType
square_udf_int = udf(lambda z: square(z), IntegerType())

But when I run the last line, I get the following error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'module' object is not callable

I am using Spark 2.3.3 on Hadoop 2.7.

Thanks

2 answers:

Answer 0 (score: 0)

It looks like you are importing `udf` from `pyspark.sql`, when it should come from `pyspark.sql.functions`, like...

import pyspark.sql.functions as F

udf_fun = F.udf(lambda ..., Type())
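The cause can be reproduced with any module in plain Python: `from pyspark.sql import udf` binds the name `udf` to the `pyspark.sql.udf` *submodule*, and calling a module object raises exactly this TypeError. A minimal illustration using the standard-library `math` module as a stand-in:

```python
import math

# `math` is a module object; calling it fails the same way `udf(...)` did
# in the question, because there `udf` was bound to a module, not a function.
try:
    math(2)
except TypeError as exc:
    print(exc)  # 'module' object is not callable
```

Importing the function itself (`from pyspark.sql.functions import udf`, or the `F.udf` form above) binds a callable instead, which is why the corrected import resolves the error.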

Answer 1 (score: -2)

It seems you were calling the UDF in a non-Pythonic way; in Python, following convention is critical. I made the following changes and it works fine:

import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql import functions

df_pd = pd.DataFrame(
    data={'integers': [1, 2, 3],
          'floats': [-1.0, 0.5, 2.7],
          'integer_arrays': [[1, 2], [3, 4, 5], [6, 7, 8, 9]]}
)

df = spark.createDataFrame(df_pd)
df.show()

def square(x):
    return x**2

def call_udf():
    # import the udf *function*, not the pyspark.sql.udf module
    from pyspark.sql.functions import udf
    from pyspark.sql.types import IntegerType
    square_udf_int = udf(lambda z: square(z), IntegerType())
    return square_udf_int
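Note that the wrapper function by itself does not change anything: name resolution works the same inside a function, so what matters is importing the `udf` function rather than the `udf` module. A sketch of the distinction using standard-library stand-ins (hypothetical analogy: `os.path` plays the role of the wrongly imported submodule, `os.path.basename` the role of the correctly imported function):

```python
from os import path            # binds a MODULE object, like `from pyspark.sql import udf`
from os.path import basename  # binds a FUNCTION object, like importing from pyspark.sql.functions

def call_module():
    return path("/tmp/x")      # fails: modules are not callable

def call_function():
    return basename("/tmp/x")  # works: returns the final path component

try:
    call_module()
except TypeError as exc:
    print(exc)                 # 'module' object is not callable

print(call_function())         # x
```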