Question

I recently started playing with Spark SQL (2.1), and I am working with nested data.

Here is my schema:

 root
 |-- a: string (nullable = true)
 |-- b: map (nullable = true)
 |    |-- bb: string
 |    |-- bbb: string (valueContainsNull = true)
 |-- c: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- cc: map (nullable = true)
 |    |    |    |-- cca: string
 |    |    |    |-- ccb: struct (valueContainsNull = true)
 |    |    |    |    |-- member0: string (nullable = true)
 |    |    |    |    |-- member1: long (nullable = true)
 |    |    |-- ccc: map (nullable = true)
 |    |    |    |-- ccca: string
 |    |    |    |-- cccb: string (valueContainsNull = true)
 |    |    |-- cccc: map (nullable = true)
 |    |    |    |-- cccca: string
 |    |    |    |-- ccccb: string (valueContainsNull = true)

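I am trying to filter my data as follows: keep all rows where c.ccc.key == 'data'.

I found a very relevant feature in the Databricks documentation, but I am wondering whether something similar exists outside of Databricks notebooks.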

https://docs.databricks.com/spark/latest/spark-sql/higher-order-functions-lambda-functions.html#exists-array-t-function-t-v-boolean-boolean

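I am open to doing this in SQL or programmatically; I am just not sure whether a DataFrame is a typed object.

From reading this email thread, http://apache-spark-developers-list.1001551.n3.nabble.com/Will-higher-order-functions-in-spark-SQL-be-pushed-upstream-td21703.html , it seems that Databricks' higher-order functions will become available to everyone. In the meantime, is there an interim solution that anyone can share?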

Answer 1

If your DataFrame has the following schema

root
 |-- a: string (nullable = true)
 |-- b: map (nullable = true)
 |    |-- key: string
 |    |-- value: string (valueContainsNull = true)
 |-- c: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- cc: map (nullable = true)
 |    |    |    |-- key: string
 |    |    |    |-- value: struct (valueContainsNull = true)
 |    |    |    |    |-- member0: string (nullable = true)
 |    |    |    |    |-- member1: string (nullable = true)
 |    |    |-- ccc: map (nullable = true)
 |    |    |    |-- key: string
 |    |    |    |-- value: string (valueContainsNull = true)
 |    |    |-- cccc: map (nullable = true)
 |    |    |    |-- key: string
 |    |    |    |-- value: string (valueContainsNull = true)

then you can write a udf function like this:

import org.apache.spark.sql.Row
import org.apache.spark.sql.functions._
// x(1) is the second field of each struct in c, i.e. the ccc map
def filterUdf = udf((column: Seq[Row]) => column.exists(x => x(1).asInstanceOf[Map[String, String]].contains("data")))

This checks, for each row, whether any element of column c has the key data in its ccc map. You can then pass the udf to filter:

df.filter(filterUdf(col("c")))

So in the end you should have only the rows where c.ccc.key is data.
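The udf above reaches ccc by position (x(1)) and assumes that neither the array nor the map is ever null, even though the schema marks both as nullable. Below is a minimal sketch of a more defensive variant that looks the field up by name and takes the key as a parameter. The name hasCccKey and the two-row sample data are made up for illustration, and only the a and c.ccc parts of the schema are modelled. It is written to be pasted into spark-shell or run as a Scala script.

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.functions.{col, lit, udf}

// Local session for illustration; in spark-shell the existing `spark` can be used instead.
val spark = SparkSession.builder().master("local[*]").appName("ccc-key-filter").getOrCreate()

// Hypothetical sample data: only `a` and `c.ccc` from the schema are modelled here.
val df = spark.sql(
  """SELECT 'r1' AS a, array(named_struct('ccc', map('data', 'x'))) AS c
    |UNION ALL
    |SELECT 'r2' AS a, array(named_struct('ccc', map('other', 'y'))) AS c""".stripMargin)

// True when at least one element of `c` has `key` in its `ccc` map.
// A null array, null element, or null map counts as "no match".
val hasCccKey = udf { (c: Seq[Row], key: String) =>
  c != null && c.exists { elem =>
    elem != null && {
      val ccc = elem.getAs[scala.collection.Map[String, String]]("ccc")
      ccc != null && ccc.contains(key)
    }
  }
}

// Keep only the rows where some element of `c` has the key "data" in `ccc`.
val filtered = df.filter(hasCccKey(col("c"), lit("data")))
filtered.show(false)
```

Looking the field up by name means the udf keeps working if the struct's field order changes, at the cost of failing at runtime if ccc is ever renamed.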

Answer 2

Thanks! A few improvements:

spark.sqlContext.udf.register("contains_key", (field: Seq[Row], key: String, value: String) => field.exists(item => item.getAs[Map[String, String]]("ccc").get(key).getOrElse("").equals(value)))

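You can then call it from Spark SQL (note the single quotes for the SQL string literals, so that they do not terminate the Scala string):

spark.sql("select contains_key(c, 'key', 'data') from mytable")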
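For anyone who is able to move past Spark 2.1: as far as I know, open-source Spark 2.4 and later ship a built-in exists higher-order function in SQL, which would make the udf unnecessary. The following is only a sketch under that assumption (it will not parse on 2.1). The view name mytable comes from the answer above, while the sample rows are made up for illustration.

```scala
import org.apache.spark.sql.SparkSession

// Local session for illustration; assumes Spark 2.4 or later on the classpath.
val spark = SparkSession.builder().master("local[*]").appName("exists-sketch").getOrCreate()

// Hypothetical sample data registered under the view name used in the answer.
spark.sql(
  """SELECT 'r1' AS a, array(named_struct('ccc', map('data', 'x'))) AS c
    |UNION ALL
    |SELECT 'r2' AS a, array(named_struct('ccc', map('other', 'y'))) AS c""".stripMargin)
  .createOrReplaceTempView("mytable")

// exists(array, lambda) is true when the lambda holds for at least one element;
// map_keys + array_contains test whether the `ccc` map has the key 'data'.
val result = spark.sql(
  """SELECT a
    |FROM mytable
    |WHERE exists(c, x -> array_contains(map_keys(x.ccc), 'data'))""".stripMargin)
result.show(false)
```

Because the predicate is expressed in SQL rather than hidden inside a udf, Spark's optimizer can see it, which is the main reason to prefer this form where the version allows it.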

Working with nested data in Spark SQL / higher-order functions
