Reducing a DataFrame in Spark to omit empty cells

Time: 2016-02-29 12:01:05

Tags: scala apache-spark dataframe apache-spark-sql

I have a DataFrame like this:

val df = sc.parallelize(List((1, 2012, 3, 5), (2, 2012, 4, 7), (1,2013, 1, 3), (2, 2013, 9, 5))).toDF("id", "year", "propA", "propB")

The following code, inspired by the question "Pivot Spark Dataframe", spreads the properties into one column per year:

import org.apache.spark.sql.functions._
import sq.implicits._  // sq is presumably the SQLContext instance

val years = List("2012", "2013")
val numYears = years.length - 1

// Build one pair of CASE WHEN columns per year. The last year is handled
// separately so that the select list does not end with a trailing comma.
var query2 = "select id, "
for (i <- 0 until numYears) {
    query2 += "case when year = " + years(i) + " then propA else 0 end as " + "propA" + years(i) + ", "
    query2 += "case when year = " + years(i) + " then propB else 0 end as " + "propB" + years(i) + ", "
}
query2 += "case when year = " + years.last + " then propA else 0 end as " + "propA" + years.last + ", "
query2 += "case when year = " + years.last + " then propB else 0 end as " + "propB" + years.last + " from myTable"

// Register the DataFrame so that the SQL query can refer to it as myTable.
df.registerTempTable("myTable")

val myDF1 = sq.sql(query2)
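
As a side note, the same query string can be assembled without the index loop and the special case for the last year. The following is only a sketch in plain Scala (no Spark needed to run it); the names `props`, `caseColumns` and `query` are introduced here for illustration and are not part of the original code.

```scala
// Years and property names taken from the example above.
val years = List("2012", "2013")
val props = List("propA", "propB")

// One CASE WHEN expression per (year, property) combination.
val caseColumns = for {
  year <- years
  prop <- props
} yield s"case when year = $year then $prop else 0 end as $prop$year"

// mkString inserts the separators, so there is no trailing comma to avoid.
val query = s"select id, ${caseColumns.mkString(", ")} from myTable"
println(query)
```

Because `mkString` only places the separator between elements, adding another year or property just means extending one of the two lists.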

This gives me:

+---+---------+---------+---------+---------+
| id|propA2012|propB2012|propA2013|propB2013|
+---+---------+---------+---------+---------+
|  1|        3|        5|        0|        0|
|  2|        4|        7|        0|        0|
|  1|        0|        0|        1|        3|
|  2|        0|        0|        9|        5|
+---+---------+---------+---------+---------+

I managed to reduce it to one row per id:

id propA-2012 propB-2012 propA-2013 propB-2013
 1          3          5          1          3
 2          4          7          9          5

using:

val df2 = myDF1.groupBy("id").agg(
                "propA2012" -> "sum",
                "propA2013" -> "sum",
                "propB2013" -> "sum",
                "propB2012" -> "sum") 

Is there a way to iterate over all the columns without specifying each column name?

1 Answer:

Answer 0: (score: 4)

Off the top of my head, this can be done with a list of aggregate expressions:

import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.sum

val funs: List[(String => Column)] = List(sum)
val exprs = myDF1.dtypes.filter(_._1.contains("prop")).flatMap(ct => funs.map(fun => fun(ct._1))).toList

myDF1.groupBy('id).agg(exprs.head, exprs.tail :_*).show

// +---+--------------+--------------+--------------+--------------+
// | id|sum(propA2012)|sum(propB2012)|sum(propA2013)|sum(propB2013)|
// +---+--------------+--------------+--------------+--------------+
// |  1|             3|             5|             1|             3|
// |  2|             4|             7|             9|             5|
// +---+--------------+--------------+--------------+--------------+
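
A possible variation, closer to the `"column" -> "sum"` pairs used in the question, is to derive those pairs from the column names instead of typing them out. The sketch below is plain Scala so that it runs without Spark; the `columns` array is hard-coded here as a stand-in for `myDF1.columns`, and `aggPairs` is a name made up for this example.

```scala
// Stand-in for myDF1.columns (hard-coded so the sketch runs without Spark).
val columns = Array("id", "propA2012", "propB2012", "propA2013", "propB2013")

// Pair every column except the grouping key with the "sum" aggregate.
val aggPairs: Array[(String, String)] =
  columns.filter(_ != "id").map(c => c -> "sum")

// With Spark available, the pairs could be handed to agg in the same way
// as the hand-written ones in the question:
//   myDF1.groupBy("id").agg(aggPairs.head, aggPairs.tail: _*)
aggPairs.foreach(println)
```

Filtering on the grouping key rather than on a name pattern such as `"prop"` means the sketch does not rely on a naming convention, at the cost of summing every other column whether or not that makes sense for it.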