How to declare (not define) a Spark UDF

Time: 2018-03-31 07:13:03

Tags: scala apache-spark

How do I declare a UDF without defining it, i.e. as an abstract member of a base class/trait that subclasses must override?

Here is my attempt at declaring a similarity UDF:

val simUdf: udf( (entityA: Seq[String], entityB: Seq[String]) => Double)

But it does not compile:

Error:(29, 18) ';' expected but '(' found.
  val simUdf: udf( (entityA: Seq[String], entityB: Seq[String]) => Double)

Note that using def instead of val produces the same error.


2 answers:

Answer 0 (score: 1)

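A type annotation must be a type, and udf(...) is a function call, not a type. The value that udf returns is just a UserDefinedFunction, so you can declare the member like this: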

val simUdf: /*org.apache.spark.sql.expressions.*/UserDefinedFunction
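
As a minimal sketch of how that declaration might be used, the trait below only declares the member and a concrete object supplies it. The names Similarity, JaccardSimilarity and jaccard are hypothetical, and Jaccard similarity is just an illustrative choice; nothing in the question says which measure is intended.

```scala
import org.apache.spark.sql.expressions.UserDefinedFunction
import org.apache.spark.sql.functions.udf

// Hypothetical trait: declares the UDF member without defining it.
trait Similarity {
  def simUdf: UserDefinedFunction
}

// Hypothetical implementation, using Jaccard similarity as an example measure.
object JaccardSimilarity extends Similarity {
  // Plain Scala function: |A intersect B| / |A union B|, or 0.0 when both inputs are empty.
  def jaccard(entityA: Seq[String], entityB: Seq[String]): Double = {
    val (a, b) = (entityA.toSet, entityB.toSet)
    val unionSize = (a union b).size
    if (unionSize == 0) 0.0 else (a intersect b).size.toDouble / unionSize
  }

  // The udf(...) call belongs on the right-hand side; the declared type is UserDefinedFunction.
  override val simUdf: UserDefinedFunction = udf(jaccard _)
}
```

With two array<string> columns, say colA and colB (assumed names), this could then be applied as df.withColumn("sim", JaccardSimilarity.simUdf(col("colA"), col("colB"))).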

Answer 1 (score: 0)

The udf function returns a value of type UserDefinedFunction, and you can pass different implementations to it. Here is what I mean:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.expressions.UserDefinedFunction
import org.apache.spark.sql.functions._ // for udf and lit

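// Serializable because udf(foo.calc _) captures the Foo instance, which Spark may ship to executors
trait Foo extends Serializable {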
  def name: String
  def calc(n: Int): Int
}
class Bar extends Foo {
  def name: String = "Bar"
  def calc(n: Int): Int = n + 1
}
class Baz extends Foo{
  def name: String = "Baz"
  def calc(n: Int): Int = n * 2
}

val foo1: Foo = new Bar()
val foo2: Foo = new Baz()

def process(df: DataFrame, foo: Foo): DataFrame = {
  // Pass the concrete implementation to the udf function
  val udf1: UserDefinedFunction = udf(foo.calc _) // foo.calc _ turns the method into a function value (eta-expansion)
  df.withColumn("calc", udf1(col("n")))
    .withColumn("name", lit(foo.name))
}

// toDF needs spark.implicits._ in scope (spark-shell imports it automatically)
val data: Seq[Int] = Seq(1, 2, 3)
val df: DataFrame = data.toDF("n")

val r1 = process(df, foo1)
println("foo1: ")
r1.show()

val r2 = process(df, foo2)
println("foo2: ")
r2.show()

Result:

foo1: 
+---+----+----+
|  n|calc|name|
+---+----+----+
|  1|   2| Bar|
|  2|   3| Bar|
|  3|   4| Bar|
+---+----+----+

foo2: 
+---+----+----+
|  n|calc|name|
+---+----+----+
|  1|   2| Baz|
|  2|   4| Baz|
|  3|   6| Baz|
+---+----+----+
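
The same pattern could be applied to the similarity function from the question. The following is only a sketch under assumptions: EntitySimilarity, OverlapSimilarity and the overlap-coefficient formula are hypothetical names and choices, not part of the original post. The trait declares a plain method for subclasses to override and builds the UDF from it once; it extends Serializable because the wrapped function captures the instance.

```scala
import org.apache.spark.sql.expressions.UserDefinedFunction
import org.apache.spark.sql.functions.udf

// Hypothetical trait: subclasses override sim; the UDF is derived from it.
trait EntitySimilarity extends Serializable {
  def sim(entityA: Seq[String], entityB: Seq[String]): Double

  // lazy so that it is built only after the subclass is fully initialised.
  lazy val simUdf: UserDefinedFunction = udf(sim _)
}

// Hypothetical implementation: overlap coefficient, |A intersect B| / min(|A|, |B|).
class OverlapSimilarity extends EntitySimilarity {
  def sim(entityA: Seq[String], entityB: Seq[String]): Double = {
    val (a, b) = (entityA.toSet, entityB.toSet)
    val smaller = math.min(a.size, b.size)
    if (smaller == 0) 0.0 else (a intersect b).size.toDouble / smaller
  }
}
```

Declaring the plain method rather than the UDF keeps the similarity logic testable without a SparkSession, while callers still get a ready-made simUdf.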