Spark SQL conditional maximum

Time: 2017-06-14 14:48:01

Tags: scala apache-spark apache-spark-sql window-functions

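I have a tall table that contains up to 10 values per group. How can I convert this table to a wide format, i.e. add 2 columns that hold the values less than or equal to a threshold?

I want to find the maximum per group, but it must not exceed a specified value, like: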

min(max('value1), lit(5)).over(Window.partitionBy('grouping))

But min() only works on a column, not on the Scala value returned from the inner function?

The problem can be described as:

Seq(Seq(1,2,3,4).max,5).min

where the window returns Seq(1,2,3,4).
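Spelled out with plain Scala collections (no Spark involved), the intended per-group computation might look like the sketch below. The rows and the names rows, cap and cappedMax are made up for illustration only.

```scala
// Plain Scala collections only; the (grouping, value) rows are illustration data.
val rows = Seq((1, 1), (1, 2), (1, 3), (1, 4), (2, 7), (2, 8))
val cap = 5 // threshold that the group maximum must not exceed

// For each group: take the maximum value, then cap it at the threshold,
// i.e. Seq(groupValues.max, cap).min
val cappedMax: Map[Int, Int] =
  rows.groupBy(_._1).map { case (g, vs) => g -> math.min(vs.map(_._2).max, cap) }

// cappedMax(1) is 4 (max of 1..4 is below the cap)
// cappedMax(2) is 5 (max of 7, 8 is capped at 5)
```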

How can I formulate this in Spark SQL?

Edit

E.g.

+--------+-----+---------+
|grouping|value|something|
+--------+-----+---------+
|       1|    1|    first|
|       1|    2|   second|
|       1|    3|    third|
|       1|    4|   fourth|
|       1|    7|        7|
|       1|   10|       10|
|      21|    1|    first|
|      21|    2|   second|
|      21|    3|    third|
+--------+-----+---------+

created with

case class MyThing(grouping: Int, value: Int, something: String)
val df = Seq(
  MyThing(1, 1, "first"), MyThing(1, 2, "second"), MyThing(1, 3, "third"),
  MyThing(1, 4, "fourth"), MyThing(1, 7, "7"), MyThing(1, 10, "10"),
  MyThing(21, 1, "first"), MyThing(21, 2, "second"), MyThing(21, 3, "third")
).toDS

where

df
.withColumn("somethingAtLeast5AndMaximum5", max('value).over(Window.partitionBy('grouping)))
.withColumn("somethingAtLeast6OupToThereshold2", max('value).over(Window.partitionBy('grouping)))
.show

returns

+--------+-----+---------+----------------------------+---------------------------------+
|grouping|value|something|somethingAtLeast5AndMaximum5|somethingAtLeast6OupToThereshold2|
+--------+-----+---------+----------------------------+---------------------------------+
|       1|    1|    first|                          10|                               10|
|       1|    2|   second|                          10|                               10|
|       1|    3|    third|                          10|                               10|
|       1|    4|   fourth|                          10|                               10|
|       1|    7|        7|                          10|                               10|
|       1|   10|       10|                          10|                               10|
|      21|    1|    first|                           3|                                3|
|      21|    2|   second|                           3|                                3|
|      21|    3|    third|                           3|                                3|
+--------+-----+---------+----------------------------+---------------------------------+

Instead, I would rather formulate:

lit(Seq(max('value).asInstanceOf[java.lang.Integer], new java.lang.Integer(2)).min).over(Window.partitionBy('grouping))

But this does not work, because max('value) is not a scalar value.

The expected output should look like

+--------+-----+---------+----------------------------+---------------------------------+
|grouping|value|something|somethingAtLeast5AndMaximum5|somethingAtLeast6OupToThereshold2|
+--------+-----+---------+----------------------------+---------------------------------+
|       1|    4|   fourth|                           4|                                7|
|      21|    1|    first|                           3|                             NULL|
+--------+-----+---------+----------------------------+---------------------------------+

Edit 2

When trying a pivot:

df.groupBy("grouping").pivot("value").agg(first('something)).show
+--------+-----+------+-----+------+----+----+
|grouping|    1|     2|    3|     4|   7|  10|
+--------+-----+------+-----+------+----+----+
|       1|first|second|third|fourth|   7|  10|
|      21|first|second|third|  null|null|null|
+--------+-----+------+-----+------+----+----+

The second part of the problem remains: some columns might not exist or might be null.

When aggregating to arrays:

df.groupBy("grouping").agg(collect_list('value).alias("value"), collect_list('something).alias("something"))
+--------+-------------------+--------------------+
|grouping|              value|           something|
+--------+-------------------+--------------------+
|       1|[1, 2, 3, 4, 7, 10]|[first, second, t...|
|      21|          [1, 2, 3]|[first, second, t...|
+--------+-------------------+--------------------+

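The values are already next to each other, but the correct ones still need to be selected. This might still be more efficient than a join or a window function.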
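One way to select the right values without building arrays first is conditional aggregation. The sketch below assumes that the two wanted columns mean "largest value at most 5" and "smallest value at least 6", which is an interpretation that happens to match the expected output above. The column names maxUpTo5 and minAtLeast6 are hypothetical, and the local SparkSession is only there to make the snippet self-contained.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, max, min, when}

// Local session for illustration; in spark-shell a session named spark already exists.
val spark = SparkSession.builder().master("local[*]").appName("conditional-agg-sketch").getOrCreate()
import spark.implicits._

val df = Seq((1, 1), (1, 2), (1, 3), (1, 4), (1, 7), (1, 10), (21, 1), (21, 2), (21, 3))
  .toDF("grouping", "value")

// when(...) without otherwise yields null for non-matching rows, and max/min
// skip nulls, so each aggregate only sees the rows inside its range.
val wide = df.groupBy("grouping").agg(
  max(when(col("value") <= 5, col("value"))).alias("maxUpTo5"),      // assumed meaning of column 1
  min(when(col("value") >= 6, col("value"))).alias("minAtLeast6")    // assumed meaning of column 2
)

wide.orderBy("grouping").show()
// grouping 1  -> maxUpTo5 = 4, minAtLeast6 = 7
// grouping 21 -> maxUpTo5 = 3, minAtLeast6 = null (no value of 6 or more)
```

A group with no value in a range simply gets null in that column, which covers the "might not exist or might be null" case mentioned above.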

1 answer:

Answer 0: (score: 3)

This is easier to do in two separate steps: compute max over the Window, then use when...otherwise on the result to produce min(x, 5):

df.withColumn("tmp", max('value1).over(Window.partitionBy('grouping)))
  .withColumn("result", when('tmp > lit(5), 5).otherwise('tmp))

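Edit: some sample data to clarify this:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{lit, max, when}
import spark.implicits._ // assumes a SparkSession named spark, as in spark-shell

val df = Seq((1, 1),(1, 2),(1, 3),(1, 4),(2, 7),(2, 8))
  .toDF("grouping", "value1")

df.withColumn("result", max('value1).over(Window.partitionBy('grouping)))
  .withColumn("result", when('result > lit(5), 5).otherwise('result))
  .show()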

// +--------+------+------+
// |grouping|value1|result|
// +--------+------+------+
// |       1|     1|     4| // 4, because Seq(Seq(1,2,3,4).max,5).min = 4
// |       1|     2|     4|
// |       1|     3|     4|
// |       1|     4|     4|
// |       2|     7|     5| // 5, because Seq(Seq(7,8).max,5).min = 5
// |       2|     8|     5|
// +--------+------+------+
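
As a possible alternative to the when...otherwise step, Spark's built-in least function returns the smallest of several columns row by row (skipping nulls), so the cap could be expressed in a single expression. This is only a sketch on the same sample data; the names byGroup and capped and the local SparkSession are assumptions made to keep it self-contained.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, least, lit, max}

// Local session for illustration; in spark-shell a session named spark already exists.
val spark = SparkSession.builder().master("local[*]").appName("least-sketch").getOrCreate()
import spark.implicits._

val df = Seq((1, 1), (1, 2), (1, 3), (1, 4), (2, 7), (2, 8)).toDF("grouping", "value1")

val byGroup = Window.partitionBy("grouping")

// max(...).over(byGroup) is an ordinary Column, so it can be passed to least
// together with the literal cap: least(groupMax, 5) == min(groupMax, 5).
val capped = df.withColumn("result", least(max(col("value1")).over(byGroup), lit(5)))

capped.show()
// grouping 1 rows get result 4, grouping 2 rows get result 5
```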