Spark: querying all the keys of a map

Time: 2017-08-09 07:58:05

Tags: apache-spark apache-spark-sql apache-spark-dataset

I have a parquet file with the following schema:

 |-- Name: string (nullable = true)
 |-- Attendance: long (nullable = true)
 |-- Efficiency: map (nullable = true)
 |    |-- key: string
 |    |-- value: double (valueContainsNull = true)

The Efficiency values range from -1 to +1, and the keys are categories such as Sports, Academics, etc. I have up to 20 distinct keys.

I am trying to fetch the top 100 names, ordered by Attendance descending, where Efficiency[Key] is less than 0. I can do this for a single key, but I cannot figure out how to achieve it for all keys at once. Snippet for one key:

spark.sql("select Name,Attendance,Efficiency['Sports'] from data where Efficiency['Sports'] < 0 order by Attendance desc limit 100")

From some analysis I found that we need to explode the map. But whenever I explode, the number of rows in the table goes up and I can no longer fetch the top 100 per key.
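For reference, exploding the map looks roughly like this (a minimal sketch, assuming the parquet file has been read into a DataFrame named df); every input row becomes one row per map entry, which is why the row count grows:

import org.apache.spark.sql.functions.explode
import spark.implicits._ // for the $"" column syntax

// each (Name, Attendance) row is duplicated once per map entry,
// the map itself being split into `key` and `value` columns
val exploded = df.select($"Name", $"Attendance", explode($"Efficiency"))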

Sample data for one key. The actual table has a map column rather than the single column shown here:

+--------------------+------------------+------------------+
|Name                |Attendance        |Efficiency[Sports]|
+--------------------+------------------+------------------+
|A                   |1000              |0.002             |
|B                   |365               |0.0               |
|C                   |1080              |0.193             |
|D                   |245               |-0.002            |
|E                   |1080              |-0.515            |
|F                   |905               |0.0               |
|G                   |900               |-0.001            |
+--------------------+------------------+------------------+
Expected output: a list of 100 names for each key

+-----------------------+--------------+                                        
|Sports                 |Academics     |
+-----------------------+--------------+
|A                      |A             |
|B                      |C             |
|C                      |D             |
|D                      |E             |

Any help in solving this would be greatly appreciated.

Thanks

2 answers:

Answer 0 (score: 0)

I hope this is what you are looking for:

import org.apache.spark.sql.functions._
import spark.implicits._ // needed for toDF and the $"" column syntax

//dummy data
val d = Seq(
  ("a", 10, Map("Sports" -> -0.2, "Academics" -> 0.1)),
  ("b", 20, Map("Sports" -> -0.1, "Academics" -> -0.1)),
  ("c", 5, Map("Sports" -> -0.2, "Academics" -> 0.5)),
  ("d", 15, Map("Sports" -> -0.2, "Academics" -> 0.0))
).toDF("Name", "Attendance", "Efficiency")

//explode the map and get key value
val result = d.select($"Name", $"Attendance", explode($"Efficiency"))

//select value less than 0 and show 100
result.select("*").where($"value".lt(0))
  .sort($"Attendence".desc)
  .show(100)

Output:

+----+----------+---------+-----+
|Name|Attendance|key      |value|
+----+----------+---------+-----+
|b   |20        |Sports   |-0.1 |
|b   |20        |Academics|-0.1 |
|d   |15        |Sports   |-0.2 |
|a   |10        |Sports   |-0.2 |
|c   |5         |Sports   |-0.2 |
+----+----------+---------+-----+
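To get the top 100 per key rather than 100 rows overall, one option (not part of the original answer, just a sketch built on the result DataFrame above) is to rank the exploded rows within each key with a window function and, if the side-by-side layout from the question is wanted, pivot on the key. The names byKey and top100PerKey are illustrative:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// rank rows within each key by Attendance, highest first
val byKey = Window.partitionBy($"key").orderBy($"Attendance".desc)

val top100PerKey = result
  .where($"value".lt(0))                        // keep only negative efficiencies
  .withColumn("rank", row_number().over(byKey))
  .where($"rank" <= 100)                        // at most 100 names per key

// optional: one column per key, names listed in rank order
top100PerKey
  .groupBy("rank")
  .pivot("key")
  .agg(first("Name"))
  .orderBy("rank")
  .show(100, false)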

Hope this helps!

Answer 1 (score: 0)

Given the input dataframe as

+----+----------+-----------------------------------------+
|Name|Attendance|Efficiency                               |
+----+----------+-----------------------------------------+
|A   |1000      |Map(Sports -> 0.002, Academics -> 0.002) |
|B   |365       |Map(Sports -> 0.0, Academics -> 0.0)     |
|C   |1080      |Map(Sports -> 0.193, Academics -> 0.193) |
|D   |245       |Map(Sports -> -0.002, Academics -> -0.46)|
|E   |1080      |Map(Sports -> -0.515, Academics -> -0.5) |
|F   |905       |Map(Sports -> 0.0, Academics -> 0.0)     |
|G   |900       |Map(Sports -> -0.001, Academics -> -0.0) |
+----+----------+-----------------------------------------+

A udf function can iterate over the Map and check for values less than zero. This can be done as follows:

import org.apache.spark.sql.functions._
import spark.implicits._ // needed for the $"" and 'symbol column syntax

// true if any value in the Efficiency map is negative
val isLessThan0 = udf((maps: Map[String, Double]) => maps.exists(_._2 < 0))

df.withColumn("lessThan0", isLessThan0('Efficiency))
    .filter($"lessThan0" === true)
    .orderBy($"Attendance".desc)
    .drop("lessThan0")
    .show(100, false)

You will get the output as

+----+----------+-----------------------------------------+
|Name|Attendance|Efficiency                               |
+----+----------+-----------------------------------------+
|E   |1080      |Map(Sports -> -0.515, Academics -> -0.5) |
|G   |900       |Map(Sports -> -0.001, Academics -> -0.0) |
|D   |245       |Map(Sports -> -0.002, Academics -> -0.46)|
+----+----------+-----------------------------------------+
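As a side note, on Spark 2.4 or later the same row filter can be written without a udf by combining the built-in map_values and array_min functions (a sketch under that version assumption):

import org.apache.spark.sql.functions.{array_min, map_values}

// keep rows whose smallest Efficiency value is below zero,
// i.e. at least one category has a negative efficiency
df.filter(array_min(map_values($"Efficiency")) < 0)
  .orderBy($"Attendance".desc)
  .show(100, false)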