How can I call a UDF in a UDF?

Time: 2016-04-15 14:58:16

Tags: scala apache-spark hive apache-spark-sql

Hopefully my title correctly describes what I am trying to accomplish. I have weather data that is aggregated by week, with each row being one week, and this data is sorted by time. I then have a mathematical expression that I evaluate against this weather data in a Spark UDF. The expressions are evaluated using dynamically generated code that is injected back into the JVM. I eventually want to replace this with a Scala macro, but for now it uses Janino and SimpleCompiler to cook the code and load the class back in.
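To make the setup concrete, here is a minimal sketch of cooking an expression with Janino's SimpleCompiler and calling it through reflection. It is only an illustration of the general technique, not my actual code: the names `CookSketch`, `compile`, `evaluate`, and `GeneratedModel` are made up, and it assumes Janino is on the classpath and that the expression is valid Java over a single variable `x`.

```scala
import org.codehaus.janino.SimpleCompiler

object CookSketch {
  // Wraps a Java expression over `x` in a class with one static method,
  // compiles it in memory, and returns the loaded class.
  // The class and method names are hypothetical, chosen for this sketch.
  def compile(className: String, expression: String): Class[_] = {
    val source =
      s"""public class $className {
         |  public static double eval(double x) { return $expression; }
         |}""".stripMargin
    val compiler = new SimpleCompiler()
    compiler.cook(source)                        // compile the source string
    compiler.getClassLoader.loadClass(className) // load the compiled class
  }

  // Calls the generated static method through reflection.
  def evaluate(cls: Class[_], x: Double): Double =
    cls.getMethod("eval", classOf[Double])
      .invoke(null, Double.box(x))
      .asInstanceOf[Double]

  def main(args: Array[String]): Unit = {
    val cls = compile("GeneratedModel", "Math.pow(x, 2) + 1")
    println(evaluate(cls, 3.0))
  }
}
```

The generated method only ever sees plain `double` values, which is why anything that needs neighbouring rows is awkward to express inside it.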

Sometimes these model strings contain variables and functions. The variables are easy to handle, since they can be string-replaced in the generated code. The functions are mostly easy too: if a function's name maps to an existing static function, then that function is simply executed when the model is evaluated. For instance, an exponent maps to Math.pow in scala.Math.
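The variable replacement can be sketched roughly as follows. This is a simplified stand-in rather than my real generator: `ModelTemplate`, `substitute`, and the variable names `temp` and `rain` are hypothetical, and it assumes variables are plain identifiers that can be matched on word boundaries.

```scala
import java.util.regex.{Matcher, Pattern}

object ModelTemplate {
  // Replaces each whole-word variable name in the model string with its value.
  // Function names such as `pow` are left alone so they can resolve to
  // static functions when the generated code is compiled.
  def substitute(model: String, vars: Map[String, Double]): String =
    vars.foldLeft(model) { case (acc, (name, value)) =>
      acc.replaceAll(
        "\\b" + Pattern.quote(name) + "\\b",
        Matcher.quoteReplacement(value.toString))
    }

  def main(args: Array[String]): Unit = {
    val model = "pow(temp, 2) + rain"
    println(substitute(model, Map("temp" -> 3.0, "rain" -> 1.5)))
  }
}
```

Matching on word boundaries keeps a variable called `temp` from also rewriting part of a longer name such as `temperature`.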

My specific issue is implementing lag and lead functions for this analysis. Spark has these two functions built in, but they live in the DataFrame layer above, while my function would be called inside a UDF, so I am having trouble referencing that data from the layer above.

So I have this code

import org.apache.spark.sql.expressions.{Window, WindowSpec}
import org.apache.spark.sql.functions.{lag => slag, udf}
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.{SparkConf, SparkContext}

object Functions {
  val conf: SparkConf = new SparkConf().setAppName("Blah").setMaster("local[*]")
  val ctx: SparkContext = new SparkContext(conf)
  val hctx: HiveContext = new HiveContext(ctx)

  import hctx.implicits._

  def lag(x: Double, window: Int): Double = {
    x
  }

  def lag(c: Column, window: Int = 1)(implicit windowSpec: WindowSpec): Column = {
    slag(c, window).over(windowSpec).as(c.toString() + "_lag")
  }

  def main(args: Array[String]): Unit = {
    val funcUdf = udf((f: Column) => lag(f))
    val data: DataFrame = ctx.parallelize(Seq(0, 1, 2, 3, 4, 5)).toDF("value")
    implicit val spec: WindowSpec = Window.orderBy($"value")
    data.select(funcUdf($"value")).show()
  }
}

Is there a way to accomplish this? This code doesn't work because of a forward reference. Is there some way to do it, or do I have to compute the lag windows ahead of time and pass them all around?
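For reference, the "compute lag ahead of time" option might look something like the sketch below. A UDF receives row values rather than Column objects, so the lagged column is built with the window function first and then passed to the UDF as an ordinary argument. The names `LagThenUdf`, `model`, `withModel`, `value_lag`, and `result` are hypothetical, and the model itself (current value minus previous value) is just a placeholder for the generated expression. It assumes a numeric `value` column, as in the code above.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, lag, udf}

object LagThenUdf {
  // Placeholder for the generated model: current value minus previous value.
  // `prev` is a boxed Double because the first row has no previous row,
  // so the lag column is null there.
  val model = udf((x: Double, prev: java.lang.Double) =>
    if (prev == null) Double.NaN else x - prev.doubleValue())

  // Adds the lagged column in the DataFrame layer, then feeds both the
  // current and the lagged value into the UDF.
  def withModel(df: DataFrame): DataFrame = {
    val ordered = Window.orderBy(col("value"))
    df.withColumn("value_lag", lag(col("value"), 1).over(ordered))
      .withColumn("result",
        model(col("value").cast("double"), col("value_lag").cast("double")))
  }
}
```

With the `data` DataFrame from the code above, this would be called as `LagThenUdf.withModel(data).show()`. The drawback is the one I was hoping to avoid: every lag or lead offset the model uses has to be known up front and materialized as its own column.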

0 answers:

No answers