How to use the agg method of Spark's KeyValueGroupedDataset?

Date: 2017-05-17 01:43:19

Tags: scala apache-spark

We have code like this:

// think of class A as a table with two columns
case class A(property1: String, property2: Long)

// class B adds a column to class A
case class B(property1: String, property2: Long, property3: String)

df.as[A].map { a =>
  // some code here which creates a user-defined function of type A => String
  val my_udf: A => String = ???
  B(a.property1, a.property2, my_udf(a))
}

where df is a DataFrame. Next we want to create a Dataset of type C:

// we want to group objects of type B by property1 and property3,
// compute the average of property2, and also store a count
case class C(property1: String, property3: String, average: Long, count: Long)

In SQL we would write something like:

select property1, property3, avg(property2), count(*) from B group by property1, property3

How can we do this in Spark? We are trying to use groupByKey, which returns a KeyValueGroupedDataset that provides agg, but we can't get it to work. We can't figure out how to use agg.

1 answer:

Answer 0 (score: 2):

If you have a Dataset of class B named ds_c, then you can do this (using groupBy.agg):

import org.apache.spark.sql.functions.{avg, count}
import spark.implicits._ // assumes a SparkSession named spark; enables the $ column syntax

ds_c.groupBy("property1", "property3")
    .agg(count($"property2").as("count"),
         avg($"property2").as("mean"))
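
Note that groupBy.agg returns an untyped DataFrame, and that count($"property2") counts non-null values, which matches count(*) here because property2 is a non-nullable Long. To end up with a Dataset[C], you can name the aggregate columns after C's fields, cast the average to Long (avg produces a Double, while C declares average as Long), and finish with as[C]. A minimal sketch, assuming ds_c is a Dataset[B] and a SparkSession named spark is in scope:

import org.apache.spark.sql.Dataset
import org.apache.spark.sql.functions.{avg, count}
import spark.implicits._

val ds_result: Dataset[C] = ds_c
  .groupBy("property1", "property3")
  .agg(avg($"property2").cast("long").as("average"),
       count($"property2").as("count"))
  .as[C] // maps columns to C's fields by name

As for the agg on KeyValueGroupedDataset that the question asks about: it takes TypedColumn arguments, which in Spark 2.x can be built with the typed aggregators in org.apache.spark.sql.expressions.scalalang.typed (later deprecated in favor of custom Aggregator implementations). A sketch under the same assumptions:

import org.apache.spark.sql.expressions.scalalang.typed
import spark.implicits._

val ds_typed: Dataset[C] = ds_c
  .groupByKey(b => (b.property1, b.property3))  // KeyValueGroupedDataset[(String, String), B]
  .agg(typed.avg[B](_.property2.toDouble),      // TypedColumn[B, Double]
       typed.count[B](_.property2))             // TypedColumn[B, Long]
  .map { case ((p1, p3), mean, cnt) => C(p1, p3, mean.toLong, cnt) }

Both routes produce the same rows; groupBy stays in the untyped Column world until the final as[C], while groupByKey keeps the computation typed end to end at the cost of a little more boilerplate.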