Spark: Aggregation based on a column

Date: 2016-04-26 22:37:27

Tags: scala apache-spark apache-spark-sql

I have a file consisting of 3 fields (Emp_ids, Groups, Salaries):

  • 100 A 430
  • 101 A 500
  • 201 B 300

I want to get the following results:

1) group name and count(*)

2) group name and max(salary)

import org.apache.spark.{SparkConf, SparkContext}

val myfile = "/home/hduser/ScalaDemo/Salary.txt"
val conf = new SparkConf().setAppName("Salary").setMaster("local[2]")
val sc = new SparkContext(conf)
val sal = sc.textFile(myfile)

1 Answer:

Answer 0 (score: 2)

Scala DSL:

import org.apache.spark.sql.functions.{count, max}
import sqlContext.implicits._

case class Data(empId: Int, group: String, salary: Int)

// lst is assumed to hold the raw input lines, e.g. the sal RDD from the question
val df = sqlContext.createDataFrame(lst.map { v =>
  val arr = v.split(' ').map(_.trim())
  Data(arr(0).toInt, arr(1), arr(2).toInt)
})
df.show()
+-----+-----+------+
|empId|group|salary|
+-----+-----+------+
|  100|    A|   430|
|  101|    A|   500|
|  201|    B|   300|
+-----+-----+------+
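
The schema is inferred from the Data case class, so empId and salary come out as integer columns and group as a string column; this can be checked with:

df.printSchema()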

df.groupBy($"group").agg(count("*") as "count").show()
+-----+-----+
|group|count|
+-----+-----+
|    A|    2|
|    B|    1|
+-----+-----+


df.groupBy($"group").agg(max($"salary") as "maxSalary").show()
+-----+---------+
|group|maxSalary|
+-----+---------+
|    A|      500|
|    B|      300|
+-----+---------+
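
Both aggregates can also be computed in a single pass with one agg call (a small additional sketch, not part of the original answer; the result column names are only illustrative):

df.groupBy($"group")
  .agg(count("*") as "count", max($"salary") as "maxSalary")
  .show()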

Or using plain SQL:

df.registerTempTable("salaries")

sqlContext.sql("select group, count(*) as count from salaries group by group").show()
+-----+-----+
|group|count|
+-----+-----+
|    A|    2|
|    B|    1|
+-----+-----+

sqlContext.sql("select group, max(salary) as maxSalary from salaries group by group").show()
+-----+---------+
|group|maxSalary|
+-----+---------+
|    A|      500|
|    B|      300|
+-----+---------+

Although Spark SQL is recommended for this kind of aggregation for performance reasons, it can also be done easily with the RDD API:

val rdd = sc.parallelize(Seq(Data(100, "A", 430), Data(101, "A", 500), Data(201, "B", 300)))

rdd.map(v => (v.group, 1)).reduceByKey(_ + _).collect()
res0: Array[(String, Int)] = Array((B,1), (A,2))

rdd.map(v => (v.group, v.salary)).reduceByKey((s1, s2) => if (s1 > s2) s1 else s2).collect()
res1: Array[(String, Int)] = Array((B,300), (A,500))
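
For completeness, the rdd above is built with sc.parallelize; to start from the text file in the question instead, the lines can be parsed into the same Data case class first (a minimal sketch assuming the space-separated layout shown in the question), after which the same reduceByKey aggregations apply unchanged:

val rdd = sc.textFile("/home/hduser/ScalaDemo/Salary.txt")
  .map(_.split(' ').map(_.trim()))          // split each line into fields
  .map(arr => Data(arr(0).toInt, arr(1), arr(2).toInt))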