How do I group a Dataset in Apache Spark with Scala?

Asked: 2019-06-18 17:28:45

Tags: sql scala apache-spark group-by dataset

I want to group a dataset by the first part of each string, that is, by "SC Freiburg", "Arsenal", and so on. Besides grouping, I also need to count the number of rows in each group.

scala> res61.foreach(println)
SC Freiburg,2014,Germany,7747
Arsenal,2014,Germany,7745
Arsenal,2014,Germany,7750
Arsenal,2014,Germany,7758
Bayern Munich,2014,Germany,7737
Bayern Munich,2014,Germany,7744
Bayern Munich,2014,Germany,7746
Bayern Munich,2014,Germany,7749
Bayern Munich,2014,Germany,7752
Bayern Munich,2014,Germany,7754
Bayern Munich,2014,Germany,7755
Borussia Dortmund,2014,Germany,7739
Borussia Dortmund,2014,Germany,7740
Borussia Dortmund,2014,Germany,7742
Borussia Dortmund,2014,Germany,7743
Borussia Dortmund,2014,Germany,7756
Borussia Mönchengladbach,2014,Germany,7757
Schalke 04,2014,Germany,7741
Schalke 04,2014,Germany,7753
Chelsea,2014,Germany,7751
Hannover 96,2014,Germany,7738
Real Madrid,2014,Germany,7748
Lazio,2014,Germany,7759

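Note: I have to use RDD operations, so please do not suggest DataFrames. I have seen this post: spark dataset group by and sum, but I don't know how to reproduce it with my example.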

This is the result output from my PostgreSQL database:

result

3 answers:

Answer 0 (score: 3)

RDDs provide the groupBy() and groupByKey() methods for this. For example, you can count the rows per group like this:

val str ="""SC Freiburg,2014,Germany,7747
   Arsenal,2014,Germany,7745
   ...
"""
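// Trim each line so the indentation inside the triple-quoted string does not become part of the key
val rdd = sc.parallelize(str.split("\n").map(_.trim).filter(_.nonEmpty))
// Key each row by its first field (the team name), then count the rows per key
rdd.map(_.split(",")).keyBy(_(0)).groupByKey().map { case (k, v) => (k, v.size) }.collect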
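
If only the counts are needed, groupByKey() does more work than necessary, because it shuffles every row of a group before the size is taken. The following is a minimal sketch of the same count using reduceByKey(), which combines partial counts on each partition first. The object name TeamCounts, the helper countByTeam, and the local[*] master are assumptions made for illustration, not part of the answer above.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object TeamCounts {
  // Hypothetical helper: count rows per team name (the first comma-separated field).
  def countByTeam(sc: SparkContext, lines: Seq[String]): Map[String, Int] =
    sc.parallelize(lines)
      .map(_.trim)                           // drop stray indentation
      .filter(_.nonEmpty)                    // skip blank lines
      .map(line => (line.split(",")(0), 1))  // (team, 1) pairs
      .reduceByKey(_ + _)                    // sum the 1s per team
      .collect()
      .toMap

  def main(args: Array[String]): Unit = {
    // Assumption: a local master, suitable only for trying the sketch out
    val conf = new SparkConf().setMaster("local[*]").setAppName("TeamCounts")
    val sc = new SparkContext(conf)
    try {
      val sample = Seq(
        "SC Freiburg,2014,Germany,7747",
        "Arsenal,2014,Germany,7745",
        "Arsenal,2014,Germany,7750"
      )
      countByTeam(sc, sample).foreach(println)
    } finally sc.stop()
  }
}
```

In spark-shell, where sc already exists, the chain inside countByTeam can be applied to the rdd from the snippet above directly.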

Answer 1 (score: 2)

In the code below, mycsv.csv is your CSV file.

groupByKey(_._1.toLowerCase)

is what you need.


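Note: With large data, the RDD approach can become a performance bottleneck, because RDDs use Java serialization by default, whereas DataFrames and Datasets use Tungsten as their internal storage format.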

package com.examples

import org.apache.log4j.Level
import org.apache.spark.sql.{Dataset, KeyValueGroupedDataset, SparkSession}


object DataSetGroupTest {
  // Silence Spark's own logging so the output below stays readable
  org.apache.log4j.Logger.getLogger("org").setLevel(Level.ERROR)

  def main(args: Array[String]): Unit = {

    val spark = SparkSession.builder
      .master("local")
      .appName("DataSetGroupTest")
      .getOrCreate()

    import spark.implicits._
    // if you have a file: read each line as a single String
    val csvData: Dataset[String] = spark.read.text("mycsv.csv").as[String]

    csvData.show(false)
    //csvData.foreach(println(_))
    // Split each line into its comma-separated fields
    val words: Dataset[Array[String]] = csvData.map(value => value.split(","))
    println("convert to array")
    // Turn each 4-field array into a tuple (a line with a different field count throws a MatchError)
    val finalwords: Dataset[(String, String, String, String)] = words.map { case Array(f1, f2, f3, f4) => (f1, f2, f3, f4) }
    finalwords.foreach(println(_))
    // Group by the lower-cased first field (the team name)
    val groupedWords: KeyValueGroupedDataset[String, (String, String, String, String)] = finalwords.groupByKey(_._1.toLowerCase)
    // Count the rows per group and sort by count, descending
    val counts: Dataset[(String, Long)] = groupedWords.count().sort($"count(1)".desc)
    counts.show(false)
  }
}

Result:

Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
+------------------------------------------+
|value                                     |
+------------------------------------------+
|Freiburg,2014,Germany,7747                |
|Arsenal,2014,Germany,7745                 |
|Arsenal,2014,Germany,7750                 |
|Arsenal,2014,Germany,7758                 |
|Bayern Munich,2014,Germany,7737           |
|Bayern Munich,2014,Germany,7744           |
|Bayern Munich,2014,Germany,7746           |
|Bayern Munich,2014,Germany,7749           |
|Bayern Munich,2014,Germany,7752           |
|Bayern Munich,2014,Germany,7754           |
|Bayern Munich,2014,Germany,7755           |
|Borussia Dortmund,2014,Germany,7739       |
|Borussia Dortmund,2014,Germany,7740       |
|Borussia Dortmund,2014,Germany,7742       |
|Borussia Dortmund,2014,Germany,7743       |
|Borussia Dortmund,2014,Germany,7756       |
|Borussia Mönchengladbach,2014,Germany,7757|
|Schalke 04,2014,Germany,7741              |
|Schalke 04,2014,Germany,7753              |
|Chelsea,2014,Germany,7751                 |
+------------------------------------------+
only showing top 20 rows

convert to array
(Freiburg,2014,Germany,7747)
(Arsenal,2014,Germany,7745)
(Arsenal,2014,Germany,7750)
(Arsenal,2014,Germany,7758)
(Bayern Munich,2014,Germany,7737)
(Bayern Munich,2014,Germany,7744)
(Bayern Munich,2014,Germany,7746)
(Bayern Munich,2014,Germany,7749)
(Bayern Munich,2014,Germany,7752)
(Bayern Munich,2014,Germany,7754)
(Bayern Munich,2014,Germany,7755)
(Borussia Dortmund,2014,Germany,7739)
(Borussia Dortmund,2014,Germany,7740)
(Borussia Dortmund,2014,Germany,7742)
(Borussia Dortmund,2014,Germany,7743)
(Borussia Dortmund,2014,Germany,7756)
(Borussia Mönchengladbach,2014,Germany,7757)
(Schalke 04,2014,Germany,7741)
(Schalke 04,2014,Germany,7753)
(Chelsea,2014,Germany,7751)
(Hannover 96,2014,Germany,7738)
(Real Madrid,2014,Germany,7748)
(Lazio,2014,Germany,7759)
+------------------------+--------+
|value                   |count(1)|
+------------------------+--------+
|bayern munich           |7       |
|borussia dortmund       |5       |
|arsenal                 |3       |
|schalke 04              |2       |
|lazio                   |1       |
|hannover 96             |1       |
|chelsea                 |1       |
|real madrid             |1       |
|freiburg                |1       |
|borussia mönchengladbach|1       |
+------------------------+--------+

Answer 2 (score: 1)

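Assuming "yourrdd" holds the data you showed earlier, with each row already split on commas into an Array[String], you can get the result with something like the line below. If it is still an RDD[String], _(0) would select the first character of each line, so call .map(_.split(",")) first.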

yourrdd.groupBy(_(0)).map(x => (x._1,x._2.size)).sortBy((x => x._2),false).collect.foreach(println)
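
When the input is still an RDD[String] of raw lines and the number of distinct teams is small, the counting and sorting can also be sketched with countByValue(), which is an action that returns a local Map on the driver. The names SortedTeamCounts and sortedCounts and the local[*] master are hypothetical and chosen for illustration only. Ties are broken alphabetically so the order is deterministic.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

object SortedTeamCounts {
  // Hypothetical helper: split raw CSV lines, then rank teams by row count, highest first.
  def sortedCounts(lines: RDD[String]): Seq[(String, Long)] =
    lines
      .map(_.split(","))  // RDD[Array[String]]
      .map(_(0))          // first field = team name
      .countByValue()     // action: local Map[String, Long] on the driver
      .toSeq
      .sortBy { case (team, n) => (-n, team) } // count descending, then name

  def main(args: Array[String]): Unit = {
    // Assumption: a local master, suitable only for trying the sketch out
    val conf = new SparkConf().setMaster("local[*]").setAppName("SortedTeamCounts")
    val sc = new SparkContext(conf)
    try {
      val yourrdd = sc.parallelize(Seq(
        "Arsenal,2014,Germany,7745",
        "Schalke 04,2014,Germany,7741",
        "Arsenal,2014,Germany,7750"
      ))
      sortedCounts(yourrdd).foreach(println)
    } finally sc.stop()
  }
}
```

Because countByValue() brings every distinct key to the driver, this variant fits a handful of teams; with very many keys, the distributed map and sortBy chain in the line above is the safer choice.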