I have a parquet file with the following schema
|-- Name: string (nullable = true)
|-- Attendance: long (nullable = true)
|-- Efficiency: map (nullable = true)
| |-- key: string
| |-- value: double (valueContainsNull = true)
The Efficiency values range from -1 to +1, and the keys are various categories such as Sports, Academics, etc. I have up to 20 different keys.
I am trying to fetch the top 100 names ordered by Attendance in descending order where Efficiency[Key] is less than 0. I am able to do this for a single key, but I cannot figure out how to achieve it for all keys at the same time. Code snippet for one key:
spark.sql("select Name,Attendance,Efficiency['Sports'] from data where Efficiency['Sports'] < 0 order by Attendance desc limit 100")
While doing some analysis I found that we need to explode our map. But whenever I explode it, the number of rows goes up and I am unable to fetch the top 100 per key.
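For example, exploding the map as below turns each record into one row per key, so LIMIT 100 caps the whole exploded result instead of giving 100 names per key (the query is only an illustration of the row blow-up, assuming the table is registered as the temp view data used above):

// each record with an N-key Efficiency map becomes N rows after the lateral view
spark.sql("""
  SELECT Name, Attendance, key, value
  FROM data
  LATERAL VIEW explode(Efficiency) eff AS key, value
  WHERE value < 0
  ORDER BY Attendance DESC
  LIMIT 100
""").show(false)
// LIMIT 100 caps the total exploded rows, not 100 rows per key -- hence the problem above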
Sample data for one key. The actual table has a map rather than the single column shown here:
+----+----------+------------------+
|Name|Attendance|Efficiency[Sports]|
+----+----------+------------------+
|A   |1000      |0.002             |
|B   |365       |0.0               |
|C   |1080      |0.193             |
|D   |245       |-0.002            |
|E   |1080      |-0.515            |
|F   |905       |0.0               |
|G   |900       |-0.001            |
+----+----------+------------------+
Expected output: a list of 100 names for each key
+------+---------+
|Sports|Academics|
+------+---------+
|A     |A        |
|B     |C        |
|C     |D        |
|D     |E        |
+------+---------+
Any help with solving this would be much appreciated.
Thanks
Answer 0 (score: 0)
I hope this is what you are looking for:
import org.apache.spark.sql.functions._
import spark.implicits._ // needed for toDF and the $-syntax; assumes a SparkSession named spark (as in spark-shell)

// dummy data
val d = Seq(
  ("a", 10, Map("Sports" -> -0.2, "Academics" -> 0.1)),
  ("b", 20, Map("Sports" -> -0.1, "Academics" -> -0.1)),
  ("c", 5, Map("Sports" -> -0.2, "Academics" -> 0.5)),
  ("d", 15, Map("Sports" -> -0.2, "Academics" -> 0.0))
).toDF("Name", "Attendance", "Efficiency")

// explode the map into key/value columns
val result = d.select($"Name", $"Attendance", explode($"Efficiency"))

// keep values less than 0, sort by Attendance descending, show up to 100 rows
result.where($"value".lt(0))
  .sort($"Attendance".desc)
  .show(100)
Output:
+----+----------+---------+-----+
|Name|Attendance|key      |value|
+----+----------+---------+-----+
|b   |20        |Sports   |-0.1 |
|b   |20        |Academics|-0.1 |
|d   |15        |Sports   |-0.2 |
|a   |10        |Sports   |-0.2 |
|c   |5         |Sports   |-0.2 |
+----+----------+---------+-----+
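Building on the exploded result above: if the requirement is 100 names per key rather than 100 rows overall, one possible sketch (not part of the original answer; it assumes the result DataFrame and imports defined above) is to rank rows within each key with a window function and keep at most 100 per key:

import org.apache.spark.sql.expressions.Window

// rank rows within each key by Attendance (descending) and keep the first 100 per key
val byKey = Window.partitionBy($"key").orderBy($"Attendance".desc)

result.where($"value".lt(0))
  .withColumn("rank", row_number().over(byKey))
  .where($"rank" <= 100)
  .drop("rank")
  .show(100 * 20, false) // enough rows to display up to 100 names for each of the ~20 keys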
Hope this helps!
Answer 1 (score: 0)
Given the input dataframe as
+----+----------+-----------------------------------------+
|Name|Attendance|Efficiency                               |
+----+----------+-----------------------------------------+
|A   |1000      |Map(Sports -> 0.002, Academics -> 0.002) |
|B   |365       |Map(Sports -> 0.0, Academics -> 0.0)     |
|C   |1080      |Map(Sports -> 0.193, Academics -> 0.193) |
|D   |245       |Map(Sports -> -0.002, Academics -> -0.46)|
|E   |1080      |Map(Sports -> -0.515, Academics -> -0.5) |
|F   |905       |Map(Sports -> 0.0, Academics -> 0.0)     |
|G   |900       |Map(Sports -> -0.001, Academics -> -0.0) |
+----+----------+-----------------------------------------+
you can use a udf function to iterate over the Map and check whether it contains any value less than zero. This can be done as follows:
import org.apache.spark.sql.functions._
import spark.implicits._ // needed for the $ and 'column syntax; assumes a SparkSession named spark

// true if any value in the Efficiency map is negative
val isLessThan0 = udf((maps: Map[String, Double]) => maps.exists(_._2 < 0))

df.withColumn("lessThan0", isLessThan0('Efficiency)) // flag rows with at least one negative value
  .filter($"lessThan0" === true)
  .orderBy($"Attendance".desc)
  .drop("lessThan0")
  .show(100, false)
You will get the output as
+----+----------+-----------------------------------------+
|Name|Attendance|Efficiency                               |
+----+----------+-----------------------------------------+
|E   |1080      |Map(Sports -> -0.515, Academics -> -0.5) |
|G   |900       |Map(Sports -> -0.001, Academics -> -0.0) |
|D   |245       |Map(Sports -> -0.002, Academics -> -0.46)|
+----+----------+-----------------------------------------+
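If the final result should have one column of names per key, as in the expected output of the question, a rough sketch (an assumption on my part, not part of this answer; it reuses the df and imports above) is to explode the map, rank names within each key by Attendance, and then pivot the key column:

import org.apache.spark.sql.expressions.Window

// explode the map, keep negative values, rank names per key by Attendance,
// then turn each key (Sports, Academics, ...) into its own column
val exploded = df.select($"Name", $"Attendance", explode($"Efficiency"))
val byKey    = Window.partitionBy($"key").orderBy($"Attendance".desc)

exploded.where($"value" < 0)
  .withColumn("rank", row_number().over(byKey))
  .where($"rank" <= 100)
  .groupBy("rank")
  .pivot("key")            // one column per map key
  .agg(first("Name"))      // the rank-th name for that key
  .orderBy("rank")
  .drop("rank")
  .show(100, false)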