Can I use a UDAF with a window function?

Asked: 2016-05-14 18:33:26

Tags: apache-spark apache-spark-sql

I have created a user-defined aggregate function (UDAF) that concatenates all accumulated values into a list (ArrayType). It is called EdgeHistory.

It works fine when I don't specify a window: it returns an array of all the lists. However, with the following example it fails:

case class ExampleRow(n: Int, list: List[(String, String, Float, Float)])

val x = Seq(
  ExampleRow(1, List(("a", "b", 1f, 2f), ("c", "d", 2f, 3f))),
  ExampleRow(2, List(("a", "b", 2f, 4f), ("c", "d", 4f, 6f))),
  ExampleRow(3, List(("a", "b", 4f, 8f), ("c", "d", 8f, 12f)))
)

val df = sc.parallelize(x).toDF()

val edgeHistory = new EdgeHistory()

val y = df.agg(edgeHistory('list).over(Window.orderBy("n").rangeBetween(-1, 0)))

It throws this error:

STDERR: Exception in thread "main" java.lang.UnsupportedOperationException: EdgeHistory('list) is not supported in a window operation.
    at org.apache.spark.sql.expressions.WindowSpec.withAggregate(WindowSpec.scala:177)
    at org.apache.spark.sql.Column.over(Column.scala:1052)
    at szdavid92.AnalyzeGraphStream$.main(AnalyzeGraphStream.scala:75)

The error message seems straightforward: apparently you cannot use a UDAF in a window operation. Am I understanding this correctly? Why does this limitation exist?
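For context, here is a minimal sketch of what a UDAF like EdgeHistory might look like on this Spark version (the actual implementation is not shown in the question, so the field names and buffer layout below are assumptions):

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types._

class EdgeHistory extends UserDefinedAggregateFunction {
  // Each input row carries a list of (String, String, Float, Float) edges
  private val edgeListType = ArrayType(StructType(Seq(
    StructField("_1", StringType), StructField("_2", StringType),
    StructField("_3", FloatType), StructField("_4", FloatType))))

  def inputSchema: StructType  = StructType(StructField("list", edgeListType) :: Nil)
  def bufferSchema: StructType = StructType(StructField("history", ArrayType(edgeListType)) :: Nil)
  def dataType: DataType       = ArrayType(edgeListType)
  def deterministic: Boolean   = true

  def initialize(buffer: MutableAggregationBuffer): Unit =
    buffer(0) = Seq.empty[Seq[Row]]

  // Append the incoming list to the accumulated history
  def update(buffer: MutableAggregationBuffer, input: Row): Unit =
    buffer(0) = buffer.getSeq[Seq[Row]](0) :+ input.getSeq[Row](0)

  def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit =
    buffer1(0) = buffer1.getSeq[Seq[Row]](0) ++ buffer2.getSeq[Seq[Row]](0)

  def evaluate(buffer: Row): Any = buffer.getSeq[Seq[Row]](0)
}
```

Note that the failure above does not depend on the UDAF body: WindowSpec.withAggregate rejects any non-built-in aggregate before it is ever evaluated.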

Update

I tried the SQL syntax instead, and got a related error:

df.registerTempTable("data")
sqlContext.udf.register("edge_history", edgeHistory)

val y = sqlContext.sql(
  """
    |SELECT n, list, edge_history(list) OVER (ORDER BY n ROWS BETWEEN 1 PRECEDING AND CURRENT ROW)
    |FROM data
  """.stripMargin)

Exception in thread "main" org.apache.spark.sql.AnalysisException: Couldn't find window function edge_history;
    at org.apache.spark.sql.hive.ResolveHiveWindowFunction$$anonfun$apply$1$$anonfun$applyOrElse$1$$anonfun$3.apply(hiveUDFs.scala:288)
    at org.apache.spark.sql.hive.ResolveHiveWindowFunction$$anonfun$apply$1$$anonfun$applyOrElse$1$$anonfun$3.apply(hiveUDFs.scala:288)
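This restriction was lifted in later releases: since Spark 2.0 the window execution path was rewritten natively, and user-defined aggregates (as well as built-ins such as collect_list) are accepted in window expressions. On 2.0+ a sketch along these lines should work (a hedged example, not tested against the asker's setup; the `history` column name is an assumption):

```scala
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.collect_list

// 1 PRECEDING to CURRENT ROW, matching the SQL frame in the question
val w = Window.orderBy("n").rowsBetween(-1, 0)

val edgeHistory = new EdgeHistory()
val y = df.select(df("n"), edgeHistory(df("list")).over(w).as("history"))

// Or, with a built-in aggregate instead of the UDAF:
val z = df.select(df("n"), collect_list(df("list")).over(w).as("history"))
```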

0 Answers:

There are no answers yet.