I am adding a column to a Spark DataFrame using some business logic, written in Scala, that returns true/false. The implementation is done with a UDF, and the UDF has more than 10 arguments, so we need to register it before using it. This is done:
spark.udf.register("new_col", new_col)
// writing the UDF
val new_col: (String, String, ..., Timestamp) => Boolean = (col1: String, col2: String, ..., col12: Timestamp) => {
if ( ... ) true
else false
}
Now, when I try to use it in the following Spark/Scala job, it does not work:
val result = df.withColumn("new_col", new_col(col1, col2, ..., col12))
I get the following error:
<console>:56: error: overloaded method value udf with alternatives:
(f: AnyRef,dataType: org.apache.spark.sql.types.DataType)org.apache.spark.sql.expressions.UserDefinedFunction <and>
(f: org.apache.spark.sql.api.java.UDF10[_, _, _, _, _, _, _, _, _, _, _],returnType: org.apache.spark.sql.types.DataType)org.apache.spark.sql.expressions.UserDefinedFunction <and>
(f: org.apache.spark.sql.api.java.UDF9[_, _, _, _, _, _, _, _, _, _],returnType: org.apache.spark.sql.types.DataType)org.apache.spark.sql.expressions.UserDefinedFunction <and>
(f: org.apache.spark.sql.api.java.UDF8[_, _, _, _, _, _, _, _, _],returnType: org.apache.spark.sql.types.DataType)org.apache.spark.sql.expressions.UserDefinedFunction <and>
(f: org.apache.spark.sql.api.java.UDF7[_, _, _, _, _, _, _, _],returnType: org.apache.spark.sql.types.DataType)org.apache.spark.sql.expressions.UserDefinedFunction <and>
(f: org.apache.spark.sql.api.java.UDF6[_, _, _, _, _, _, _],returnType: org.apache.spark.sql.types.DataType)org.apache.spark.sql.expressions.UserDefinedFunction <and>
(f: org.apache.spark.sql.api.java.UDF5[_, _, _, _, _, _],returnType: org.apache.spark.sql.types.DataType)org.apache.spark.sql.expressions.UserDefinedFunction <and>
(f: org.apache.spark.sql.api.java.UDF4[_, _, _, _, _],returnType: org.apache.spark.sql.types.DataType)org.apache.spark.sql.expressions.UserDefinedFunction <and>
(f: org.apache.spark.sql.api.java.UDF3[_, _, _, _],returnType: org.apache.spark.sql.types.DataType)org.apache.spark.sql.expressions.UserDefinedFunction <and>
(f: org.apache.spark.sql.api.java.UDF2[_, _, _],returnType: org.apache.spark.sql.types.DataType)org.apache.spark.sql.expressions.UserDefinedFunction <and>
(f: org.apache.spark.sql.api.java.UDF1[_, _],returnType: org.apache.spark.sql.types.DataType)org.apache.spark.sql.expressions.UserDefinedFunction <and>
(f: org.apache.spark.sql.api.java.UDF0[_],returnType: org.apache.spark.sql.types.DataType)org.apache.spark.sql.expressions.UserDefinedFunction <and> ...
On the other hand, if I create a temporary view and use spark.sql, it works perfectly well, as below:
df.createOrReplaceTempView("data")
val result = spark.sql(
s"""
SELECT *, new_col(col1, col2, ..., col12) AS new_col FROM data
"""
)
Am I missing something? What is the way to make such a query work in Spark/Scala?
Answer 0 (score: 1)
There are different ways to register a UDF depending on whether you want to use it with DataFrames or in Spark SQL.
To use it in Spark SQL, the UDF should be registered as:
spark.sqlContext.udf.register("function_name", function)
To use it with DataFrames, wrap the function with udf:
val my_udf = org.apache.spark.sql.functions.udf(function)
When you register it with spark.sqlContext.udf.register, it is available in Spark SQL, but not through the DataFrame API.
Edit: the code below should work. I have used only 2 columns here, but it works for up to 22 columns:
// plain Scala function
val new_col: (String, String) => Boolean = (col1: String, col2: String) => {
  true
}

// wrap it for use with the DataFrame API
val new_col_udf = udf(new_col)

// register it for use in Spark SQL
spark.sqlContext.udf.register("new_col", new_col)

var df = Seq((1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11)).toDF()
df.createOrReplaceTempView("data")

// Spark SQL path: call the UDF by its registered name
val result = spark.sql(
  s"""SELECT *, new_col(_1, _2) AS new_col FROM data"""
)
result.show()

// DataFrame API path: call the udf-wrapped function on columns
df = df.withColumn("test", new_col_udf($"_1", $"_2"))
df.show()
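As an aside (a sketch, not part of the answer above): in Spark 2.x and later, `spark.udf.register` returns the `UserDefinedFunction` it registers, so a single registration can serve both the SQL and DataFrame APIs. This assumes a `SparkSession` named `spark` is already in scope (e.g. in spark-shell), as in the snippets above; the column names `c1`/`c2` are illustrative:

```scala
// assumes a SparkSession named `spark` is in scope, as in spark-shell
import spark.implicits._

// register() returns the UserDefinedFunction it registers,
// so one call covers both the SQL and DataFrame APIs
val new_col_udf = spark.udf.register("new_col", (col1: String, col2: String) => true)

val df = Seq(("a", "b")).toDF("c1", "c2")
df.createOrReplaceTempView("data")

// SQL path: call the UDF by its registered name
val viaSql = spark.sql("SELECT *, new_col(c1, c2) AS new_col FROM data")
viaSql.show()

// DataFrame path: reuse the returned function object directly
val viaDf = df.withColumn("new_col", new_col_udf($"c1", $"c2"))
viaDf.show()
```

This avoids keeping the plain function, the `udf()`-wrapped value, and the registered name in sync by hand.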