Question

I am trying to read JSON into a Dataset (Spark 2.3.2). Unfortunately, it does not work well.

Here is the data. It is a JSON file with an inner array:

{"Name": "helloworld", "info": {"privateInfo": [{"salary": 1200}, {"sex": "M"}], "house": "sky road"}, "otherinfo": 2}
{"Name": "helloworld2", "info": {"privateInfo": [{"sex": "M"}], "house": "sky road"}, "otherinfo": 3}

I use a SparkSession to select the columns, but the result is not the values themselves: each value comes back wrapped in an array.

val sqlDF = spark.sql("SELECT name, info.privateInfo.salary, info.privateInfo.sex FROM people1")
sqlDF.show()

But the salary and sex columns come back as arrays:

+-----------+-------+-----+
|       name| salary|  sex|
+-----------+-------+-----+
| helloworld|[1200,]|[, M]|
|helloworld2|     []|  [M]|
+-----------+-------+-----+

How can I get the values as their own data types rather than as arrays?
For example:

+-----------+---------+---+
|       name|   salary|sex|
+-----------+---------+---+
| helloworld|     1200|  M|
|helloworld2|none/null|  M|
+-----------+---------+---+

Answer 1

Short answer

spark.sql("SELECT name , " +
      "element_at(filter(info.privateInfo.salary, salary -> salary is not null), 1) AS salary ," +
      "element_at(filter(info.privateInfo.sex, sex -> sex is not null), 1) AS sex" +
      "   FROM people1 ")

+-----------+------+---+
|       name|salary|sex|
+-----------+------+---+
| helloworld|  1200|  M|
|helloworld2|  null|  M|
+-----------+------+---+

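Long answer
The main concern is the nullability of the array elements: privateInfo is an array of structs in which both salary and sex are nullable, so selecting info.privateInfo.salary returns one entry per struct, nulls included.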

root
 |-- Name: string (nullable = true)
 |-- info: struct (nullable = true)
 |    |-- house: string (nullable = true)
 |    |-- privateInfo: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- salary: long (nullable = true)
 |    |    |    |-- sex: string (nullable = true)
 |-- otherinfo: long (nullable = true)
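
To reproduce this schema without a file on disk, the two records can be loaded from in-memory strings. This is only a minimal sketch: the local master, the app name, and the key spelling "Name" (taken from the schema printout) are my assumptions, while people1 is the view name used in the question.

```scala
import org.apache.spark.sql.SparkSession

// Local session for experimentation; master and appName are illustrative assumptions.
val spark = SparkSession.builder()
  .master("local[1]")
  .appName("inner-array-json")
  .getOrCreate()
import spark.implicits._

// The two records from the question, one JSON object per line.
val lines = Seq(
  """{"Name":"helloworld","info":{"privateInfo":[{"salary":1200},{"sex":"M"}],"house":"sky road"},"otherinfo":2}""",
  """{"Name":"helloworld2","info":{"privateInfo":[{"sex":"M"}],"house":"sky road"},"otherinfo":3}"""
)

// spark.read.json accepts a Dataset[String] (Spark 2.2+); the schema is inferred.
val df = spark.read.json(lines.toDS())
df.printSchema()

// Register the view name that the SQL queries refer to.
df.createOrReplaceTempView("people1")
```

Because every struct in privateInfo is inferred with both fields, a struct that only carries sex still has a salary field, and that field is null.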

So we need a way to filter out the null values. Luckily, Spark 2.4 has built-in higher-order functions.

My first attempt was array_remove, but unfortunately null can never equal null.
It is still possible with the more verbose filter syntax:

df.selectExpr("filter(info.privateInfo.salary, salary -> salary is not null)")

+------+
|salary|
+------+
|[1200]|
|    []|
+------+

Now we need some way to unpack the array. Luckily, Spark has the explode function!

df.selectExpr(
 "explode(filter(info.privateInfo.salary, salary -> salary is not null)) AS salary",
 "explode(filter(info.privateInfo.sex, sex -> sex is not null)) AS sex")

Boom:

Exception in thread "main" org.apache.spark.sql.AnalysisException: Only one generator allowed per select clause but found 2
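
Even with a single generator, explode would not give the shape the question asks for: it emits one row per array element, so a row whose filtered array is empty drops out of the result. A minimal sketch of that behaviour, assuming Spark 2.4+ and a local session (the session settings and the in-memory sample are illustrative, not part of the original setup):

```scala
import org.apache.spark.sql.SparkSession

// Illustrative local session; master and appName are assumptions.
val spark = SparkSession.builder()
  .master("local[1]")
  .appName("explode-sketch")
  .getOrCreate()
import spark.implicits._

// Same two records as in the question, loaded from memory.
val df = spark.read.json(Seq(
  """{"Name":"helloworld","info":{"privateInfo":[{"salary":1200},{"sex":"M"}],"house":"sky road"},"otherinfo":2}""",
  """{"Name":"helloworld2","info":{"privateInfo":[{"sex":"M"}],"house":"sky road"},"otherinfo":3}"""
).toDS())

// One explode per select is allowed, but helloworld2 has no non-null salary,
// so its filtered array is empty and explode emits no row for it.
val exploded = df.selectExpr(
  "Name",
  "explode(filter(info.privateInfo.salary, s -> s is not null)) AS salary")
exploded.show()

// explode_outer keeps such rows and yields null for the missing value.
val explodedOuter = df.selectExpr(
  "Name",
  "explode_outer(filter(info.privateInfo.salary, s -> s is not null)) AS salary")
explodedOuter.show()
```

So explode is the wrong tool here for two reasons: only one generator is allowed per select, and rows with an empty array would be lost.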

Since we know there should be only one value in the array, we can use element_at:

df.selectExpr(
  "element_at(filter(info.privateInfo.salary, salary -> salary is not null), 1) AS salary",
  "element_at(filter(info.privateInfo.sex, sex -> sex is not null), 1) AS sex")
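
The null for helloworld2 comes from element_at returning NULL when the index lies beyond the end of the array. As far as I know that only holds with ANSI mode off; with spark.sql.ansi.enabled set to true in Spark 3.x and later, an out-of-range index raises an error instead. Here is an end-to-end sketch under those assumptions (Spark 2.4+, a local session, and the ANSI setting switched off explicitly, all of which are my additions):

```scala
import org.apache.spark.sql.SparkSession

// Illustrative local session. The ANSI flag is an assumption of this sketch:
// it is ignored on Spark 2.4 and keeps element_at lenient on newer versions.
val spark = SparkSession.builder()
  .master("local[1]")
  .appName("element-at-sketch")
  .config("spark.sql.ansi.enabled", "false")
  .getOrCreate()
import spark.implicits._

// Same two records as in the question, loaded from memory.
val df = spark.read.json(Seq(
  """{"Name":"helloworld","info":{"privateInfo":[{"salary":1200},{"sex":"M"}],"house":"sky road"},"otherinfo":2}""",
  """{"Name":"helloworld2","info":{"privateInfo":[{"sex":"M"}],"house":"sky road"},"otherinfo":3}"""
).toDS())

// Drop the nulls from each extracted array, then take the first element (1-based).
val result = df.selectExpr(
  "Name AS name",
  "element_at(filter(info.privateInfo.salary, s -> s is not null), 1) AS salary",
  "element_at(filter(info.privateInfo.sex, s -> s is not null), 1) AS sex")

result.printSchema() // salary is a long and sex a string, no longer arrays
result.show()
```

Note that element_at silently takes the first match, so if a record ever carried two salary entries the second one would be ignored.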

P.S. I did not notice that this was asked 10 months ago.
