I have a requirement that I currently solve with an RDD:
val test = Seq(("New York", "Jack"),
("Los Angeles", "Tom"),
("Chicago", "David"),
("Houston", "John"),
("Detroit", "Michael"),
("Chicago", "Andrew"),
("Detroit", "Peter"),
("Detroit", "George")
)
sc.parallelize(test).groupByKey().mapValues(_.toList).foreach(println)
The result is:
(New York,List(Jack))
(Detroit,List(Michael, Peter, George))
(Los Angeles,List(Tom))
(Houston,List(John))
(Chicago,List(David, Andrew))
How can I do the same thing with a Spark 2.0 Dataset?
I do have a way using a custom function, but it feels complicated. Is there a simpler, dot-notation style approach?
Answer 0: (score: 4)
I would suggest you start by creating a case class, as
case class Monkey(city: String, firstName: String)
This case class should be defined outside the main class. Then you simply use the toDS function, group with groupBy, and apply the collect_list aggregation function, as shown below:
import sqlContext.implicits._
import org.apache.spark.sql.functions._
val test = Seq(("New York", "Jack"),
("Los Angeles", "Tom"),
("Chicago", "David"),
("Houston", "John"),
("Detroit", "Michael"),
("Chicago", "Andrew"),
("Detroit", "Peter"),
("Detroit", "George")
)
sc.parallelize(test)
.map(row => Monkey(row._1, row._2))
.toDS()
.groupBy("city")
.agg(collect_list("firstName") as "list")
.show(false)
Your output will be:
+-----------+------------------------+
|city       |list                    |
+-----------+------------------------+
|Los Angeles|[Tom]                   |
|Detroit    |[Michael, Peter, George]|
|Chicago    |[David, Andrew]         |
|Houston    |[John]                  |
|New York   |[Jack]                  |
+-----------+------------------------+
You can always convert back to an RDD simply by calling the .rdd function.
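For example, if you need the grouped result back as an RDD, a minimal sketch of that last step could look like this (resultDf and resultRdd are just illustrative names for the aggregated DataFrame above and its RDD form):
// Minimal sketch: keep the aggregated result instead of showing it,
// then call .rdd to get an RDD[Row] back (illustrative names).
val resultDf = sc.parallelize(test)
  .map(row => Monkey(row._1, row._2))
  .toDS()
  .groupBy("city")
  .agg(collect_list("firstName") as "list")

// Extract typed values from each Row if needed.
val resultRdd = resultDf.rdd
  .map(r => (r.getString(0), r.getSeq[String](1).toList))
resultRdd.foreach(println)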
Answer 1: (score: 1)
To create a Dataset, first define a case class outside of your class, as
case class Employee(city: String, name: String)
Then you can convert the list to a Dataset as
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local").appName("test").getOrCreate()
import spark.implicits._
val test = Seq(("New York", "Jack"),
("Los Angeles", "Tom"),
("Chicago", "David"),
("Houston", "John"),
("Detroit", "Michael"),
("Chicago", "Andrew"),
("Detroit", "Peter"),
("Detroit", "George")
).toDF("city", "name")
val data = test.as[Employee]
Or
import spark.implicits._
val test = Seq(("New York", "Jack"),
("Los Angeles", "Tom"),
("Chicago", "David"),
("Houston", "John"),
("Detroit", "Michael"),
("Chicago", "Andrew"),
("Detroit", "Peter"),
("Detroit", "George")
)
val data = test.map(r => Employee(r._1, r._2)).toDS()
Now you can groupBy and perform any aggregation, for example
import org.apache.spark.sql.functions.collect_list

data.groupBy("city").count().show
data.groupBy("city").agg(collect_list("name")).show
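If you would rather stay with the typed Dataset API than the untyped DataFrame-style aggregation, a rough sketch of the equivalent of the original groupByKey().mapValues(_.toList) could look like this (assuming the data Dataset[Employee] built above):
// Sketch: typed grouping with groupByKey/mapGroups on the Dataset[Employee].
val grouped = data.groupByKey(_.city)
  .mapGroups { (city, employees) => (city, employees.map(_.name).toList) }

grouped.show(false)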
Hope this helps!
Answer 2: (score: 0)
First, I would convert your RDD into a Dataset:
val spark: org.apache.spark.sql.SparkSession = ???
import spark.implicits._
val testDs = test.toDS()
testDs.schema.fields.foreach(x => println(x))
Finally, you just need to use groupBy:
testDs.groupBy("City?", "Name?")
RDDs are not really the Spark 2.0 way of doing things, in my opinion. If you have any questions, just ask.
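For completeness, a small sketch of how that groupBy could be finished with the columns from the question and an actual aggregation (this assumes testDs was built from the (city, name) tuples above and that org.apache.spark.sql.functions._ is imported):
// Sketch: rename the tuple columns, then collect the names per city (assumed setup).
testDs.toDF("city", "name")
  .groupBy("city")
  .agg(collect_list("name") as "names")
  .show(false)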