How to group by using a Dataset

Time: 2017-06-07 06:12:38

Tags: apache-spark dataset apache-spark-2.0

I have a requirement that I currently solve with an RDD:

val test = Seq(("New York", "Jack"),
    ("Los Angeles", "Tom"),
    ("Chicago", "David"),
    ("Houston", "John"),
    ("Detroit", "Michael"),
    ("Chicago", "Andrew"),
    ("Detroit", "Peter"),
    ("Detroit", "George")
  )
sc.parallelize(test).groupByKey().mapValues(_.toList).foreach(println)

The result is:

(New York,List(Jack))
(Detroit,List(Michael, Peter, George))
(Los Angeles,List(Tom))
(Houston,List(John))
(Chicago,List(David, Andrew))

How can I do this with a Dataset in Spark 2.0?

I have a way using a custom function, but it feels complicated. Is there a simpler approach?

3 Answers:

Answer 0 (score: 4)

I would suggest you first create a case class, as follows:

case class Monkey(city: String, firstName: String)

The case class should be defined outside your main class. Then you can simply use the toDS function, group with groupBy, and aggregate with collect_list, as shown below:

import sqlContext.implicits._
import org.apache.spark.sql.functions._

val test = Seq(("New York", "Jack"),
  ("Los Angeles", "Tom"),
  ("Chicago", "David"),
  ("Houston", "John"),
  ("Detroit", "Michael"),
  ("Chicago", "Andrew"),
  ("Detroit", "Peter"),
  ("Detroit", "George")
)
sc.parallelize(test)
  .map(row => Monkey(row._1, row._2))
  .toDS()
  .groupBy("city")
  .agg(collect_list("firstName") as "list")
  .show(false)

Your output will be:

+-----------+------------------------+
|city       |list                    |
+-----------+------------------------+
|Los Angeles|[Tom]                   |
|Detroit    |[Michael, Peter, George]|
|Chicago    |[David, Andrew]         |
|Houston    |[John]                  |
|New York   |[Jack]                  |
+-----------+------------------------+

You can always convert back to an RDD by simply calling the .rdd function.
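Since .rdd is only mentioned in passing, here is a minimal sketch of that round trip (my illustration, not part of the original answer; it assumes the same imports and the Monkey case class from above, with the grouped result bound to a val instead of shown):

val grouped = sc.parallelize(test)
  .map(row => Monkey(row._1, row._2))
  .toDS()
  .groupBy("city")
  .agg(collect_list("firstName") as "list")

// .rdd returns an RDD[Row]; pull the columns out to get plain Scala pairs back
val backToRdd = grouped.rdd
  .map(row => (row.getString(0), row.getSeq[String](1).toList))
backToRdd.foreach(println)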

Answer 1 (score: 1)

To create a Dataset, first define a case class outside your class, as follows:

case class Employee(city: String, name: String)

Then you can convert the list to a Dataset:

  val spark =
    SparkSession.builder().master("local").appName("test").getOrCreate()
    import spark.implicits._
    val test = Seq(("New York", "Jack"),
    ("Los Angeles", "Tom"),
    ("Chicago", "David"),
    ("Houston", "John"),
    ("Detroit", "Michael"),
    ("Chicago", "Andrew"),
    ("Detroit", "Peter"),
    ("Detroit", "George")
    ).toDF("city", "name")
    val data = test.as[Employee]

Alternatively, you can map the tuples to the case class and build the Dataset with toDS directly:

    import spark.implicits._
    val test = Seq(("New York", "Jack"),
      ("Los Angeles", "Tom"),
      ("Chicago", "David"),
      ("Houston", "John"),
      ("Detroit", "Michael"),
      ("Chicago", "Andrew"),
      ("Detroit", "Peter"),
      ("Detroit", "George")
    )

    val data = test.map(r => Employee(r._1, r._2)).toDS()

Now you can call groupBy and perform any aggregation:

data.groupBy("city").count().show

data.groupBy("city").agg(collect_list("name")).show

Hope this helps!

Answer 2 (score: 0)

First, I would convert your RDD to a Dataset:

val spark: org.apache.spark.sql.SparkSession = ???
import spark.implicits._

val testDs = test.toDS()

This gives you your column names :) use them wisely!

testDs.schema.fields.foreach(x => println(x))

Finally, you just need to use groupBy:

testDs.groupBy("City?", "Name?")
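As a sketch of how that call could be completed (my addition, assuming testDs was built with toDS() from the question's Seq of tuples, so its columns print as _1 and _2 in the schema above):

import org.apache.spark.sql.functions.collect_list

// group on the city column (_1) and collect the names (_2) into a list
testDs.groupBy("_1")
  .agg(collect_list("_2") as "names")
  .show(false)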

RDDs are not really the way to go in 2.0, in my opinion. If you have any questions, just ask.