Question

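Using Scala Spark, how can I use the typed Dataset API to round an aggregated column?

Also, how can I preserve the type of my Dataset through a groupBy operation?

This is what I currently have: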

case class MyRow(
  k1: String,
  k2: String,
  c1: Double,
  c2: Double
)

def groupTyped(ds: Dataset[MyRow]): Dataset[MyRow] = {
  import org.apache.spark.sql.expressions.scalalang.typed._
  ds.groupByKey(row => (row.k1, row.k2))
    .agg(
      avg(_.c1),
      avg(_.c2)
    )
    .map(r => MyRow(r._1._1, r._1._2, r._2, r._3))
}

If I replace avg(_.c1) with round(avg(_.c1)), I get a type error. What is the correct way to round my values?
The .map(...) line feels wrong. Is there a more elegant way to preserve the type of my Dataset?

Thanks!

Answer 1

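The accepted answer works and is more general, but in this case you can also use round directly. You just need to give the column a type again after rounding, using .as[T] (you also need to specify the input type for avg).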

.agg(
  // Alternative ways to define a type to avg
  round(avg((r: MyRow) => r.c1)).as[Double],
  round(avg[MyRow](_.c2)).as[Double]
)

Answer 2

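Using round does indeed fail with a type error, because agg expects aggregation functions of type TypedColumn[IN, OUT], whereas round returns a plain Column (suitable for DataFrames).

What you need here is a rounding average aggregation function. One is not provided in org.apache.spark.sql.expressions.scalalang.typed._, but you can easily create one yourself by extending the class that performs the average aggregation: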

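// Imports this snippet needs (TypedAverage lives in an internal Spark package)
import org.apache.spark.sql.{Dataset, TypedColumn}
import org.apache.spark.sql.execution.aggregate.TypedAverage

// Extend TypedAverage - round the result before returning it
class TypedRoundAverage[IN](f: IN => Double) extends TypedAverage[IN](f) {
  // math.round returns a Long (nearest whole number), so convert back to Double explicitly
  override def finish(reduction: (Double, Long)): Double = math.round(super.finish(reduction)).toDouble
}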

// A nice wrapper to create the TypedRoundAverage for a given function  
def roundAvg[IN](f: IN => Double): TypedColumn[IN, Double] = new TypedRoundAverage(f).toColumn

// Now you can use "roundAvg" instead of "round"  
def groupTyped(ds: Dataset[MyRow]): Dataset[MyRow] = {
  ds.groupByKey(row => (row.k1, row.k2))
    .agg(
      roundAvg(_.c1),
      roundAvg(_.c2)
    )
    .map { case ((k1, k2), c1, c2) => MyRow(k1, k2, c1, c2) } // just a nicer way to put it
}

I couldn't find a way to get rid of the map operation, since a group-by necessarily returns a tuple, but pattern matching makes it a little nicer.
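
One thing to be aware of: math.round in the finish override above rounds to the nearest whole number. If you want to keep a fixed number of decimal places instead, something along these lines might work. This is only a sketch in plain Scala, and roundTo and RoundingSketch are hypothetical names made up for illustration, not part of Spark.

```scala
// Hypothetical helper (not a Spark API): round a Double to a fixed number of decimal places.
object RoundingSketch {
  def roundTo(value: Double, scale: Int): Double =
    BigDecimal(value)                                    // built from the Double's decimal string form
      .setScale(scale, BigDecimal.RoundingMode.HALF_UP)  // keep `scale` decimals, ties round up
      .toDouble

  def main(args: Array[String]): Unit = {
    println(math.round(2.4))       // whole-number rounding, as used in finish above
    println(roundTo(3.14159, 2))   // keeps two decimal places
  }
}
```

Assuming you go this way, the finish override could call roundTo(super.finish(reduction), 2) in place of math.round, with the number of decimals chosen to suit your data.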
