Applying a UDF in Spark 2.0

Date: 2016-10-10 20:10:05

Tags: apache-spark pyspark spark-dataframe apache-spark-mllib

I am trying to apply a UDF to a column of a PySpark DataFrame that contains SparseVectors (created with pyspark.ml.feature.IDF). Originally I was applying a more complex function, but I get the same error for any function I apply. So, as a minimal example:

from pyspark.sql.functions import udf
from pyspark.sql.types import FloatType
import numpy as np

udfSum = udf(lambda x: np.sum(x.values), FloatType())
df = df.withColumn("vec_sum", udfSum(df.idf))
df.take(10)

I get this error:

Py4JJavaError: An error occurred while calling 
z:org.apache.spark.sql.execution.python.EvaluatePython.takeAndServe. 
: org.apache.spark.SparkException: Job aborted due to stage failure: 
Task 0 in stage 55.0 failed 4 times, most recent failure: Lost task 0.3 
in stage 55.0 (TID 111, 10.0.11.102): net.razorvine.pickle.PickleException:
expected zero arguments for construction of ClassDict (for numpy.dtype)

If I convert the DataFrame to Pandas and apply the function there, I can confirm that FloatType() is the correct return type. This may be related, although the error is different: Issue with UDF on a column of Vectors in PySpark DataFrame
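The root cause can be reproduced without Spark at all: `np.sum` returns a NumPy scalar (`numpy.float64`), not a built-in Python `float`, and it is this NumPy type that Spark's pickler fails to map onto the declared `FloatType()`. A minimal demonstration of the type mismatch:

```python
import numpy as np

values = np.array([0.5, 1.5, 2.0])

raw = np.sum(values)            # a NumPy scalar, not a Python float
fixed = float(np.sum(values))   # a plain Python float

print(type(raw).__name__)    # float64
print(type(fixed).__name__)  # float
```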

Thanks!

1 Answer:

Answer 0 (score: 2)

Convert the output to float:

udf(lambda x: float(np.sum(x.values)), FloatType())
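The corrected UDF body can be checked outside Spark as well. The sketch below uses a `SimpleNamespace` as a hypothetical stand-in for a `SparseVector`, since both expose a `.values` attribute:

```python
import numpy as np
from types import SimpleNamespace

# Corrected UDF body: cast NumPy's result to a built-in float so that
# Spark's pickler can serialize it as FloatType.
sum_values = lambda x: float(np.sum(x.values))

# Hypothetical stand-in for a SparseVector, which also exposes .values
vec = SimpleNamespace(values=np.array([0.25, 0.75]))
print(sum_values(vec))  # 1.0
```

In Spark itself this becomes `udf(lambda x: float(np.sum(x.values)), FloatType())`, exactly as in the answer above.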