Question

I have a DataFrame with the following schema:

 |-- A: map (nullable = true)
 |    |-- key: string
 |    |-- value: array (valueContainsNull = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- id: string (nullable = true)
 |    |    |    |-- type: string (nullable = true)
 |    |    |    |-- index: boolean (nullable = false)
 |-- idkey: string (nullable = true)

Since the values in the map are of array type, I need to extract the index field of the element whose id matches the "foreign" key field idkey.

For example, I have the following data:

 {"A":{
 "innerkey_1":[{"id":"1","type":"0.01","index":true},
               {"id":"6","type":"4.3","index":false}]},
 "idkey":"1"}

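Since idkey is 1, we need to output the index value of the element with "id":"1", so the result should be true. I'm really not sure how to achieve this, whether with a UDF or some other way.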

The expected output is:

+---------+
| indexout|
+---------+
|   true  |
+---------+

Answer 1

If your DataFrame has the following schema

root
 |-- A: map (nullable = true)
 |    |-- key: string
 |    |-- value: array (valueContainsNull = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- id: string (nullable = true)
 |    |    |    |-- type: string (nullable = true)
 |    |    |    |-- index: boolean (nullable = false)
 |-- idkey: string (nullable = true)

then you can use two explode functions, one for the map and another for the inner array, use filter to keep only the matching element, and finally select the index as

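import org.apache.spark.sql.functions._
// First explode: one row per map entry, producing the columns "key" and "value"
df.select(col("idkey"), explode(col("A")))
  // Second explode: one row per struct inside the array
  .select(col("idkey"), explode(col("value")).as("value"))
  // Keep only the struct whose id equals idkey
  .filter(col("idkey") === col("value.id"))
  .select(col("value.index").as("indexout"))
  .show(false)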

You should get

+--------+
|indexout|
+--------+
|true    |
+--------+
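
To try the explode approach end to end, here is a minimal sketch that builds the one-row sample from the question and runs the same pipeline. The object name ExplodeLookup, the local master setting, and the "idkey" key in the JSON string are assumptions made for illustration; it also assumes Spark 2.2 or later, where spark.read.json accepts a Dataset[String].

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, explode}
import org.apache.spark.sql.types._

// Hypothetical wrapper object; not part of the original answer.
object ExplodeLookup {
  // Builds the sample row from the question and returns the matching index values.
  def indexOut(spark: SparkSession): Seq[Boolean] = {
    import spark.implicits._

    // Schema from the question: map<string, array<struct<id, type, index>>> plus idkey.
    val element = StructType(Seq(
      StructField("id", StringType),
      StructField("type", StringType),
      StructField("index", BooleanType)))
    val schema = StructType(Seq(
      StructField("A", MapType(StringType, ArrayType(element))),
      StructField("idkey", StringType)))

    // Sample record; the "idkey" key is assumed so that the JSON is well formed.
    val json =
      """{"A":{"innerkey_1":[{"id":"1","type":"0.01","index":true},{"id":"6","type":"4.3","index":false}]},"idkey":"1"}"""
    val df = spark.read.schema(schema).json(Seq(json).toDS())

    val result = df
      .select(col("idkey"), explode(col("A")))                 // columns: idkey, key, value
      .select(col("idkey"), explode(col("value")).as("value")) // one row per struct
      .filter(col("idkey") === col("value.id"))                // keep the matching id
      .select(col("value.index").as("indexout"))

    result.show(false)
    result.collect().map(_.getBoolean(0)).toSeq
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("explode-lookup")
      .getOrCreate()
    try indexOut(spark) finally spark.stop()
  }
}
```

Note that explode drops rows whose map or array is null or empty, so an input row with no matching id simply produces no output row with this approach.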

Using a udf function

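You can also do the above with a udf function, which avoids the two explode calls and the filter. All of the exploding and filtering is done inside the udf function itself. You can modify it to suit your needs.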

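import org.apache.spark.sql.Row
import org.apache.spark.sql.functions._
// Search the arrays under every map key for the first struct whose id equals idkey
// and return its index; the result is null when no element matches.
val indexoutUdf = udf((a: Map[String, Seq[Row]], idkey: String) =>
  a.values.flatten
    .find(_.getAs[String]("id") == idkey)
    .map(_.getAs[Boolean]("index"))
)
df.select(indexoutUdf(col("A"), col("idkey")).as("indexout")).show(false)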
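
The lookup that such a udf performs can also be tried without a Spark session. The following is only a sketch on plain Scala collections: the case class Element and the object IndexLookup are hypothetical stand-ins (inside a real udf the elements arrive as Row values), and the field is named tpe because type is a reserved word in Scala.

```scala
// Hypothetical stand-in for the struct elements; Spark would pass these as Row.
final case class Element(id: String, tpe: String, index: Boolean)

object IndexLookup {
  // Looks through the arrays under every map key and returns the index flag
  // of the first element whose id equals idkey, or None when nothing matches.
  def indexOut(a: Map[String, Seq[Element]], idkey: String): Option[Boolean] =
    a.values.flatten.find(_.id == idkey).map(_.index)

  def main(args: Array[String]): Unit = {
    val a = Map(
      "innerkey_1" -> Seq(Element("1", "0.01", true), Element("6", "4.3", false)))

    println(indexOut(a, "1"))  // Some(true)
    println(indexOut(a, "6"))  // Some(false)
    println(indexOut(a, "42")) // None
  }
}
```

Returning an Option instead of calling head means a missing id yields an empty result rather than an exception; a Spark udf that returns Option[Boolean] surfaces None as null in the output column.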

I hope the answer is helpful

Get a value from an array in a map based on a key in Scala
