I created a user-defined aggregate function (UDAF) called EdgeHistory. It concatenates all accumulated values into a list (ArrayType).
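The source of EdgeHistory is not included in this post. As a rough illustration only, a UDAF with the behavior described above might look like the sketch below. It assumes Spark's UserDefinedAggregateFunction API, and the struct field names (_1 to _4), the buffer column name "history", and the buffer layout are all assumptions rather than the actual implementation.

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types._

// Hypothetical sketch, NOT the original EdgeHistory implementation.
class EdgeHistory extends UserDefinedAggregateFunction {
  // Assumed element type: the struct Spark derives from (String, String, Float, Float).
  private val edgeType = StructType(Seq(
    StructField("_1", StringType),
    StructField("_2", StringType),
    StructField("_3", FloatType),
    StructField("_4", FloatType)))
  private val listType = ArrayType(edgeType)

  // One input column holding a list of edges.
  override def inputSchema: StructType = StructType(Seq(StructField("list", listType)))
  // The buffer accumulates every list seen so far (assumed layout).
  override def bufferSchema: StructType =
    StructType(Seq(StructField("history", ArrayType(listType))))
  // The result is an array of all the lists.
  override def dataType: DataType = ArrayType(listType)
  // Note: the order of the collected lists depends on how partial buffers are merged.
  override def deterministic: Boolean = true

  override def initialize(buffer: MutableAggregationBuffer): Unit =
    buffer(0) = Seq.empty[Seq[Row]]

  // Append the incoming list to the history, skipping null inputs.
  override def update(buffer: MutableAggregationBuffer, input: Row): Unit =
    if (!input.isNullAt(0)) {
      buffer(0) = buffer.getSeq[Seq[Row]](0) :+ input.getSeq[Row](0)
    }

  // Combine two partial histories.
  override def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit =
    buffer1(0) = buffer1.getSeq[Seq[Row]](0) ++ buffer2.getSeq[Seq[Row]](0)

  override def evaluate(buffer: Row): Any = buffer.getSeq[Seq[Row]](0)
}
```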
If I don't specify a window, it works fine: it returns an array of all the lists. But with the following example, it fails:
case class ExampleRow(n: Int, list: List[(String, String, Float, Float)])
val x = Seq(
ExampleRow(1, List(("a", "b", 1f, 2f), ("c", "d", 2f, 3f))),
ExampleRow(2, List(("a", "b", 2f, 4f), ("c", "d", 4f, 6f))),
ExampleRow(3, List(("a", "b", 4f, 8f), ("c", "d", 8f, 12f)))
)
val df = sc.parallelize(x).toDF()
val edgeHistory = new EdgeHistory()
val y = df.agg(edgeHistory('list).over(Window.orderBy("n").rangeBetween(-1, 0)))
It throws this error:
STDERR: Exception in thread "main" java.lang.UnsupportedOperationException: EdgeHistory('list) is not supported in a window operation.
at org.apache.spark.sql.expressions.WindowSpec.withAggregate(WindowSpec.scala:177)
at org.apache.spark.sql.Column.over(Column.scala:1052)
at szdavid92.AnalyzeGraphStream$.main(AnalyzeGraphStream.scala:75)
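The error message seems pretty straightforward: it looks like a UDAF cannot be used in a window operation. Am I understanding this correctly? Why does this limitation exist?
Update
I tried the SQL syntax instead and got a related error: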
df.registerTempTable("data")
sqlContext.udf.register("edge_history", edgeHistory)
val y = sqlContext.sql(
"""
|SELECT n, list, edge_history(list) OVER (ORDER BY n ROWS BETWEEN 1 PRECEDING AND CURRENT ROW)
|FROM data
""".stripMargin)
namely:
Exception in thread "main" org.apache.spark.sql.AnalysisException: Couldn't find window function edge_history;
at org.apache.spark.sql.hive.ResolveHiveWindowFunction$$anonfun$apply$1$$anonfun$applyOrElse$1$$anonfun$3.apply(hiveUDFs.scala:288)
at org.apache.spark.sql.hive.ResolveHiveWindowFunction$$anonfun$apply$1$$anonfun$applyOrElse$1$$anonfun$3.apply(hiveUDFs.scala:288)
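To make the intended result concrete, here is a plain-Scala sketch (no Spark involved) of what the window query above would be expected to return for the sample rows if the UDAF were accepted: for each row, ordered by n, the lists of the preceding row and the current row collected into one array. The object name WindowedHistorySketch and the helper windowedHistory are hypothetical and only illustrate the frame ROWS BETWEEN 1 PRECEDING AND CURRENT ROW.

```scala
object WindowedHistorySketch {
  type Edge = (String, String, Float, Float)
  case class ExampleRow(n: Int, list: List[Edge])

  // Hypothetical helper: emulates ROWS BETWEEN `preceding` PRECEDING AND CURRENT ROW
  // over rows ordered by n, collecting the lists that fall inside each frame.
  def windowedHistory(rows: Seq[ExampleRow], preceding: Int): Seq[(Int, List[List[Edge]])] = {
    val sorted = rows.sortBy(_.n)
    sorted.indices.map { i =>
      val start = math.max(0, i - preceding) // the frame cannot start before the first row
      (sorted(i).n, sorted.slice(start, i + 1).map(_.list).toList)
    }
  }

  def main(args: Array[String]): Unit = {
    val x = Seq(
      ExampleRow(1, List(("a", "b", 1f, 2f), ("c", "d", 2f, 3f))),
      ExampleRow(2, List(("a", "b", 2f, 4f), ("c", "d", 4f, 6f))),
      ExampleRow(3, List(("a", "b", 4f, 8f), ("c", "d", 8f, 12f)))
    )
    // Print one line per row: n and the lists collected in its frame.
    windowedHistory(x, 1).foreach { case (n, history) => println(s"$n -> $history") }
  }
}
```

With the sample data, the row with n = 1 gets only its own list, while the rows with n = 2 and n = 3 each get two lists: the previous row's and their own.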