How to unwrap multiple keys in a Spark DataSet

Asked: 2017-03-19 22:49:14

Tags: scala apache-spark apache-spark-dataset

I have a DataSet with the following structure.

case class Person(age: Int, gender: String, salary: Double)

I want to determine the average salary by gender and age, so I group the DS by those two keys. I ran into two main problems: one is that the two keys are merged into a single column, but I want to keep them in two separate columns; the other is that the aggregated column gets a ridiculously long name, and I cannot figure out how to rename it (apparently as and alias will not work with the DS API).

val df = sc.parallelize(List(Person(27, "male", 100000.00),
  Person(27, "male", 120000.00),
  Person(26, "male", 95000),
  Person(31, "female", 89000),
  Person(51, "female", 250000),
  Person(51, "female", 120000)
)).toDF.as[Person]

import org.apache.spark.sql.expressions.scalalang.typed
df.groupByKey(p => (p.gender, p.age)).agg(typed.avg(_.salary)).show()

+-----------+------------------------------------------------------------------------------------------------+
|        key| TypedAverage(line2503618a50834b67a4b132d1b8d2310b12.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$Person)|          
+-----------+------------------------------------------------------------------------------------------------+ 
|[female,31]|  89000.0... 
|[female,51]| 185000.0...
|  [male,27]| 110000.0...
|  [male,26]|  95000.0...
+-----------+------------------------------------------------------------------------------------------------+

2 Answers:

Answer 0 (score: 4)

Aliasing is an untyped operation, so you have to re-type the column afterwards. And the only way to unwrap the key is to do it afterwards, via a select or something like it:

// alias the aggregate (untyped), then re-type it with as[Double]
df.groupByKey(p => (p.gender, p.age))
  .agg(typed.avg[Person](_.salary).as("average_salary").as[Double])
  // unwrap the (gender, age) key struct into separate columns
  .select($"key._1", $"key._2", $"average_salary").show()

Answer 1 (score: 2)

The easiest way to achieve both goals is to map() from the aggregation result back to Person instances:

.map{case ((gender, age), salary) => Person(gender, age, salary)}

The result looks best if you also rearrange the order of the parameters in the case class constructor slightly:

case class Person(gender: String, age: Int, salary: Double)

+------+---+--------+
|gender|age|  salary|
+------+---+--------+
|female| 31| 89000.0|
|female| 51|185000.0|
|  male| 27|110000.0|
|  male| 26| 95000.0|
+------+---+--------+
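
For reference, df.groupByKey(p => (p.gender, p.age)).agg(typed.avg(_.salary)) produces a Dataset[((String, Int), Double)], which is why the map above can destructure the key tuple and the averaged value directly. A minimal type-annotated sketch, assuming the rearranged case class and import org.apache.spark.sql.Dataset:

val grouped: Dataset[((String, Int), Double)] =
  df.groupByKey(p => (p.gender, p.age)).agg(typed.avg(_.salary))

val people: Dataset[Person] =
  grouped.map { case ((gender, age), avgSalary) => Person(gender, age, avgSalary) }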

Full code:

import session.implicits._  // session is the active SparkSession
val df = session.sparkContext.parallelize(List(
  Person("male", 27, 100000),
  Person("male", 27, 120000),
  Person("male", 26, 95000),
  Person("female", 31, 89000),
  Person("female", 51, 250000),
  Person("female", 51, 120000)
)).toDS

import org.apache.spark.sql.expressions.scalalang.typed
df.groupByKey(p => (p.gender, p.age))
  .agg(typed.avg(_.salary))
  // unwrap the (gender, age) key and rebuild typed Person rows
  .map{case ((gender, age), salary) => Person(gender, age, salary)}
  .show()
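
Since map() returns a Dataset[Person], the result keeps the readable case-class column names. If you prefer a different name for the averaged column (say avg_salary, chosen here purely for illustration), one sketch is to map to a tuple and finish with toDF:

df.groupByKey(p => (p.gender, p.age))
  .agg(typed.avg(_.salary))
  .map{case ((gender, age), salary) => (gender, age, salary)}
  .toDF("gender", "age", "avg_salary")
  .show()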