How do I apply "cube" to only specific fields of a Spark DataFrame?

Time: 2016-11-23 11:30:31

Tags: scala apache-spark dataframe apache-spark-sql cube

I'm using Spark 1.6.1, and I have a DataFrame like this.

|     scene_id|  action_id|       classifier|os_name|country|app_ver|   p0value|p1value|p2value|p3value|p4value|
|    test_home|scene_enter|        test_home|android|     KR|  5.6.3|__OTHERS__|  false|   test|   test|   test|


Desired result (group by all fields, but cube only the "os_name", "country", and "app_ver" fields):

|     scene_id|  action_id|       classifier|os_name|country|app_ver|   p0value|p1value|p2value|p3value|p4value|cnt|
|    test_home|scene_enter|        test_home|android|     KR|  5.6.3|__OTHERS__|  false|   test|   test|   test|  9|
|    test_home|scene_enter|        test_home|   null|     KR|  5.6.3|__OTHERS__|  false|   test|   test|   test| 35|
|    test_home|scene_enter|        test_home|android|   null|  5.6.3|__OTHERS__|  false|   test|   test|   test| 98|
|    test_home|scene_enter|        test_home|android|     KR|   null|__OTHERS__|  false|   test|   test|   test|101|
|    test_home|scene_enter|        test_home|   null|   null|  5.6.3|__OTHERS__|  false|   test|   test|   test|301|
|    test_home|scene_enter|        test_home|   null|     KR|   null|__OTHERS__|  false|   test|   test|   test|225|
|    test_home|scene_enter|        test_home|android|   null|   null|__OTHERS__|  false|   test|   test|   test|312|
|    test_home|scene_enter|        test_home|   null|   null|   null|__OTHERS__|  false|   test|   test|   test|521|
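
To make the desired semantics concrete, here is a minimal plain-Scala sketch (not Spark code) of a "partial cube" over in-memory rows. The names `PartialCubeSketch`, `partialCube`, `fixed`, and `cubed` are hypothetical and chosen only for illustration; a rolled-up column is represented by `None` instead of `null`.

```scala
object PartialCubeSketch {
  // A row is modelled as column name -> value; illustration only, not a Spark type.
  type Row = Map[String, String]
  type Key = Map[String, Option[String]]

  // Counts rows per grouping key. `fixed` columns are always kept; every subset
  // of the `cubed` columns is rolled up (replaced by None) in turn.
  def partialCube(rows: Seq[Row], fixed: Seq[String], cubed: Seq[String]): Map[Key, Int] = {
    val keys: Seq[Key] = for {
      row  <- rows
      kept <- cubed.toSet.subsets().toList // cubed columns that stay visible
    } yield {
      val fixedPart = fixed.map(c => c -> Option(row(c)))
      val cubedPart = cubed.map(c => c -> (if (kept(c)) Option(row(c)) else None))
      (fixedPart ++ cubedPart).toMap
    }
    keys.groupBy(identity).map { case (k, v) => k -> v.size }
  }

  // Tiny sample: one fixed column, two cubed columns.
  val sample: Seq[Row] = Seq(
    Map("scene_id" -> "test_home", "os_name" -> "android", "country" -> "KR"),
    Map("scene_id" -> "test_home", "os_name" -> "ios",     "country" -> "KR"),
    Map("scene_id" -> "test_home", "os_name" -> "android", "country" -> "US")
  )

  val result: Map[Key, Int] = partialCube(sample, Seq("scene_id"), Seq("os_name", "country"))
}
```

In this sketch `scene_id` is never rolled up, so every key in `result` still carries it, while `os_name` and `country` appear in all four kept/rolled-up combinations.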


My current attempt:

// Cube over every column, count, then drop the rows where a non-cubed column was rolled up.
val cubed = df
  .cube($"scene_id", $"action_id", $"classifier", $"country", $"os_name", $"app_ver", $"p0value", $"p1value", $"p2value", $"p3value", $"p4value")
  .count // cube returns grouped data, so aggregate before filtering
  .where("scene_id IS NOT NULL AND action_id IS NOT NULL AND classifier IS NOT NULL AND p0value IS NOT NULL AND p1value IS NOT NULL AND p2value IS NOT NULL AND p3value IS NOT NULL AND p4value IS NOT NULL")
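
A rough way to see the cost of the query above: a cube over n columns produces one grouping set per subset of those columns, that is 2^n of them. The sketch below only does that arithmetic for the eleven columns used here; `GroupingSetCount` and `groupingSets` are hypothetical names.

```scala
object GroupingSetCount {
  // A cube over n columns yields 2^n grouping sets (one per subset of the columns).
  def groupingSets(columns: Int): Long = 1L << columns

  val fullCube: Long    = groupingSets(11) // all eleven columns passed to cube
  val partialCube: Long = groupingSets(3)  // only os_name, country, app_ver

  def main(args: Array[String]): Unit =
    println(s"computed: $fullCube grouping sets, needed: $partialCube")
}
```

So the full cube computes 2048 grouping sets and the filter keeps 8 of them. Note also that the `IS NOT NULL` filter cannot tell a rolled-up column from a source value that is genuinely null, so it is only safe when the non-cubed columns contain no nulls.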


1 Answer:

Answer 0 (score: 4)



val df = Seq((1, 2, 3, 4, 5, 6)).toDF("a", "b", "c", "d", "e", "f")


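import org.apache.spark.sql.functions.struct
// `sparkSql` is presumably your SQLContext instance (often named `sqlContext` in Spark 1.6);
// its implicits provide the $"..." syntax and toDF.
import sparkSql.implicits._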

// alias here may not work in Spark 1.6
val rest = struct(Seq($"a", $"b", $"c"): _*).alias("rest")


val cubed = Seq($"d", $"e")

// If there is a problem with aliasing rest, it can be done here.
val tmp = df.cube(rest.alias("rest") +: cubed: _*).count


tmp.where($"rest".isNotNull).select($"rest.*" +: cubed :+ $"count": _*)


|  a|  b|  c|   d|   e|count|
|  1|  2|  3|null|   5|    1|
|  1|  2|  3|null|null|    1|
|  1|  2|  3|   4|   5|    1|
|  1|  2|  3|   4|null|    1|
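
Applied to the column names from the question, the same struct approach might look like the sketch below. The helper name `cubeOnly` and its parameters are hypothetical, the rename of `count` to `cnt` is only there to match the desired output, and the caveat above about aliasing a struct in Spark 1.6 still applies; treat this as an untested sketch, not a verified Spark 1.6.1 solution.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, struct}

object PartialCube {
  // Groups by all `fixed` columns (packed into one struct) and cubes only `cubed`.
  // Hypothetical helper following the struct approach shown in this answer.
  def cubeOnly(df: DataFrame, fixed: Seq[String], cubed: Seq[String]): DataFrame = {
    val rest      = struct(fixed.map(col): _*).alias("rest")
    val cubedCols = cubed.map(col)

    df.cube(rest +: cubedCols: _*)
      .count()
      // The struct is null exactly in the grouping sets where it was rolled up.
      .where(col("rest").isNotNull)
      // Unpack the struct back into its original columns.
      .select(col("rest.*") +: cubedCols :+ col("count").alias("cnt"): _*)
  }
}

// Usage with the question's schema (assumes `df` is the DataFrame from the question):
// val result = PartialCube.cubeOnly(
//   df,
//   fixed = Seq("scene_id", "action_id", "classifier",
//               "p0value", "p1value", "p2value", "p3value", "p4value"),
//   cubed = Seq("os_name", "country", "app_ver"))
```

The result lists the fixed columns first, then the cubed ones, then `cnt`; a final `select` can restore the original column order if it matters. Because the eight fixed columns travel as a single struct, the cube covers four expressions (the struct plus three columns) instead of eleven.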