I am trying to do a group-by with an aggregation, using Spark 1.5.2.
Please tell me why this does not work.
in is a DataFrame.
scala> in
res28: org.apache.spark.sql.DataFrame = [id: int, city: string]
scala> in.show
+---+--------+
| id| city|
+---+--------+
| 10|Bathinda|
| 20|Amritsar|
| 30|Bathinda|
+---+--------+
scala> in.groupBy("city").agg(Map{
| "id" -> "sum"
| }).show(true)
+----+-------+
|city|sum(id)|
+----+-------+
+----+-------+
Thanks. I expect the output to be each city together with the sum of its ids.
Edit: I don't know why, but the next time I started a new spark-shell it worked.

Answer 0 (score: 2):
Consider the following DataFrame:
val in = sc.parallelize(Seq(
(10, "Bathinda"), (20, "Amritsar"), (30, "Bathinda"))).toDF("id", "city")
You can see that all of the following lines of code produce the same output:
scala> in.groupBy("city").agg(Map("id" -> "sum")).show
+--------+-------+
| city|sum(id)|
+--------+-------+
|Bathinda| 40|
|Amritsar| 20|
+--------+-------+
scala> in.groupBy("city").agg(Map{ "id" -> "sum"}).show
+--------+-------+
| city|sum(id)|
+--------+-------+
|Bathinda| 40|
|Amritsar| 20|
+--------+-------+
scala> in.groupBy("city").agg(Map{ "id" -> "sum"}).show(true)
+--------+-------+
| city|sum(id)|
+--------+-------+
|Bathinda| 40|
|Amritsar| 20|
+--------+-------+
scala> in.groupBy("city").agg(sum($"id")).show(true)
+--------+-------+
| city|sum(id)|
+--------+-------+
|Bathinda| 40|
|Amritsar| 20|
+--------+-------+
scala> in.groupBy("city").agg(sum(in("id"))).show(true)
+--------+-------+
| city|sum(id)|
+--------+-------+
|Bathinda| 40|
|Amritsar| 20|
+--------+-------+
Note: the truncate parameter of show defaults to true. It only controls whether long field values are truncated in the printed output (sometimes a field is too long and you only want a preview); it has no effect on the aggregation itself, so show, show(true) and show(false) all return the same rows here.
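For example, on a hypothetical DataFrame with a value longer than 20 characters (the longNames name and its data are made up here just to illustrate the flag):

scala> // made-up example data, only to show what truncate does
scala> val longNames = sc.parallelize(Seq((1, "AVeryLongCityNameThatDoesNotFit"))).toDF("id", "city")
scala> longNames.show(true)   // values longer than 20 characters are cut off and end with "..."
scala> longNames.show(false)  // the full value is printed and the column widens to fit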