How to declare a udf
That is, as a member of a base class or trait that subclasses must override.
This is the declaration for a similarity udf:
val simUdf: udf( (entityA: Seq[String], entityB: Seq[String]) => Double)
But it does not compile:
Error:(29, 18) ';' expected but '(' found.
val simUdf: udf( (entityA: Seq[String], entityB: Seq[String]) => Double)
Note that using def instead of val results in the same error.
Answer 0 (score: 1)
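udf(...) is a function call that produces a value, not a type, so it cannot appear in a type annotation. The value it returns is just a UserDefinedFunction, so you can declare the member like this: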
val simUdf: /*org.apache.spark.sql.expressions.*/UserDefinedFunction
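To make this concrete, here is a minimal sketch of how that declaration could sit in a trait and be overridden. The names SimilarityProvider and JaccardSimilarity, and the choice of Jaccard similarity, are illustrative assumptions and not part of the question.

```scala
import org.apache.spark.sql.expressions.UserDefinedFunction
import org.apache.spark.sql.functions.udf

// Hypothetical trait: the udf is an abstract member that implementations must supply.
trait SimilarityProvider {
  val simUdf: UserDefinedFunction
}

// Hypothetical implementation based on Jaccard similarity (an assumption for illustration).
object JaccardSimilarity extends SimilarityProvider {
  // Plain Scala function, kept separate so it can be checked without a SparkSession.
  def jaccard(entityA: Seq[String], entityB: Seq[String]): Double = {
    val a = entityA.toSet
    val b = entityB.toSet
    val unionSize = (a union b).size
    // Two empty inputs are treated as having similarity 0.0 to avoid dividing by zero.
    if (unionSize == 0) 0.0 else (a intersect b).size.toDouble / unionSize
  }

  // The annotation names the type of the value; udf(...) appears only on the right-hand side.
  override val simUdf: UserDefinedFunction = udf(jaccard _)
}
```

Assuming a DataFrame with two array-of-string columns (hypothetically named a and b), the udf would be applied as JaccardSimilarity.simUdf(col("a"), col("b")).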
Answer 1 (score: 0)
The udf function returns a value of type UserDefinedFunction. You can pass different implementations to the udf function. Here is what I mean:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.expressions.UserDefinedFunction
import org.apache.spark.sql.functions._ // for udf, col and lit
trait Foo {
  def name: String
  def calc(n: Int): Int
}
class Bar extends Foo {
  def name: String = "Bar"
  def calc(n: Int): Int = n + 1
}
class Baz extends Foo {
  def name: String = "Baz"
  def calc(n: Int): Int = n * 2
}
val foo1: Foo = new Bar()
val foo2: Foo = new Baz()
def process(df: DataFrame, foo: Foo): DataFrame = {
  // Pass the concrete implementation to the udf function
  val udf1: UserDefinedFunction = udf(foo.calc _) // `_` eta-expands the method into a function value
  df.withColumn("calc", udf1(col("n")))
    .withColumn("name", lit(foo.name))
}
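// toDF needs the SparkSession implicits: automatic in spark-shell, otherwise `import spark.implicits._`
val data: Seq[Int] = Seq(1, 2, 3)
val df: DataFrame = data.toDF("n")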
val r1 = process(df, foo1)
println("foo1: ")
r1.show()
val r2 = process(df, foo2)
println("foo2: ")
r2.show()
Result:
foo1:
+---+----+----+
|  n|calc|name|
+---+----+----+
|  1|   2| Bar|
|  2|   3| Bar|
|  3|   4| Bar|
+---+----+----+
foo2:
+---+----+----+
|  n|calc|name|
+---+----+----+
|  1|   2| Baz|
|  2|   4| Baz|
|  3|   6| Baz|
+---+----+----+
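Applied to the similarity udf from the question, the same idea could look like the sketch below: subclasses override only a plain method, and the trait wraps it in a udf once. The names Similarity and OverlapSimilarity and the overlap formula are hypothetical. The trait extends Serializable on the assumption that the udf closure, which captures the instance, will be shipped to executors.

```scala
import org.apache.spark.sql.expressions.UserDefinedFunction
import org.apache.spark.sql.functions.udf

// Hypothetical trait: implementations override the plain method only.
trait Similarity extends Serializable {
  def similarity(entityA: Seq[String], entityB: Seq[String]): Double

  // lazy: built on first use, after the subclass is fully initialised.
  // @transient: the wrapper itself is not serialized along with the instance.
  @transient lazy val simUdf: UserDefinedFunction = udf(similarity _)
}

// Hypothetical implementation: fraction of entityA's elements that also occur in entityB.
class OverlapSimilarity extends Similarity {
  def similarity(entityA: Seq[String], entityB: Seq[String]): Double =
    if (entityA.isEmpty) 0.0
    else entityA.count(e => entityB.contains(e)).toDouble / entityA.size
}
```

Compared with declaring the udf itself as the abstract member, this keeps the implementations free of Spark types, so they can be unit-tested as ordinary Scala methods.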