Question

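I am trying to create a table from a JSON data source.

The problem is that one field is not present in every entry of the JSON data. It looks something like this:

[ { "k1" : "someValue",
    "optK" : { "nestedK" : true } },
  { "k1" : "someOtherValue" }
]

When I declare the optional field in the schema, every entry that lacks the field ends up with null in all of its columns:

columns:  k1        |  optK
row1:   "someValue"    [true]
row2:    null           null

Is it possible to write a schema so that I get null only in the column whose value is missing?

Like this:

columns:  k1            |  optK
row1:   "someValue"       "optV"
row2:   "someOtherValue"   null

My current code: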

Answer 1

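Your code and input data have the following problems:

Input data: the JSON keys are not quoted.

You can work around this in either of two ways:

Update the input data by adding quotes around the JSON keys.

Pass .option("allowUnquotedFieldNames", true) to the reader, like this: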

val df = session.read.option("allowUnquotedFieldNames",true).schema(schema).json("data.json")

The input data holds a string in a field that the schema declares as boolean. The schema should be updated to:

val schema = StructType(Seq(
  StructField("k1", StringType, false),
  StructField("optK", StructType(Seq(StructField("nestedK", StringType, false))), false)
))

JSON data format: I changed the sample JSON input to the JSON Lines format, with one object per line:

{ k1 : "someValue", optK : { nestedK : "optV" } }
{ k1 : "someOtherValue" }
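If rewriting the input as JSON Lines is not an option, the reader's multiLine option (available in Spark 2.2 and later) may be worth trying, since it parses a whole file as one JSON document and turns each element of a top-level array into a row. The following is only a sketch under assumptions: the temporary file, the local session, and the use of an inferred schema are illustrative choices, not part of the original code.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

// Assumption: a local session for experimentation; in spark-shell this returns the existing one.
val spark = SparkSession.builder().master("local[1]").appName("multiline-json").getOrCreate()

// Write the array-style sample to a temporary file (hypothetical location).
val path = Files.createTempFile("optk-sample", ".json")
val content =
  """[ { "k1" : "someValue",
    |    "optK" : { "nestedK" : "optV" } },
    |  { "k1" : "someOtherValue" }
    |]""".stripMargin
Files.write(path, content.getBytes(StandardCharsets.UTF_8))

// multiLine = true makes the reader treat the file as a single JSON document
// instead of expecting one object per line.
val arrayDf = spark.read
  .option("multiLine", true)
  .json(path.toUri.toString)

arrayDf.show()
```

The schema is inferred here for brevity; an explicit schema can be passed with .schema(...) in the same way as above.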

Running the modified code against the JSON Lines input shows the following:

Spark context available as 'sc' (master = yarn, app id = application_xxx).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.2.0
      /_/

Using Scala version 2.11.8 (OpenJDK 64-Bit Server VM, Java 1.8.0_141)   
scala> :paste
// Entering paste mode (ctrl-D to finish)   
import org.apache.spark.sql.expressions.scalalang._
import org.apache.spark.sql.types._
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}    
val schema = StructType(Seq(
  StructField("k1", StringType, false),
  StructField("optK", StructType(Seq(StructField("nestedK", StringType, false))), false)
))    
val df = spark.read.option("allowUnquotedFieldNames",true).schema(schema).json("s3 location of data.json")       
// Exiting paste mode, now interpreting.

import org.apache.spark.sql.expressions.scalalang._
import org.apache.spark.sql.types._
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}
schema: org.apache.spark.sql.types.StructType = StructType(StructField(k1,StringType,false), StructField(optK,StructType(StructField(nestedK,StringType,false)),false))
df: org.apache.spark.sql.DataFrame = [k1: string, optK: struct<nestedK: string>]

scala> df.show
+--------------+------+
|            k1|  optK|
+--------------+------+
|     someValue|[optV]|
|someOtherValue|  null|
+--------------+------+
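The output above shows optK as a struct ([optV]), while the table in the question shows the plain value. If the plain value is what you want, one possible approach is to select the nested field and rename it. This is a minimal sketch that assumes the same two sample records, held in memory rather than in a file; the names lines, nestedDf, and flatDf are made up for the example.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types._

// Assumption: a local session for experimentation; in spark-shell this returns the existing one.
val spark = SparkSession.builder().master("local[1]").appName("flatten-optional").getOrCreate()
import spark.implicits._

// The two sample records in JSON Lines form, with unquoted keys as in the answer.
val lines = Seq(
  """{ k1 : "someValue", optK : { nestedK : "optV" } }""",
  """{ k1 : "someOtherValue" }"""
).toDS()

val schema = StructType(Seq(
  StructField("k1", StringType, nullable = true),
  StructField("optK", StructType(Seq(StructField("nestedK", StringType, nullable = true))), nullable = true)
))

val nestedDf = spark.read
  .option("allowUnquotedFieldNames", true)
  .schema(schema)
  .json(lines)

// Selecting a field of a null struct yields null, so the row without optK keeps its k1 value.
val flatDf = nestedDf.select(col("k1"), col("optK.nestedK").as("optK"))
flatDf.show()
```

The fields are declared nullable here because optK is optional in the data.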

Schema with optional values when importing from JSON
