Error when trying to flatten JSON in Spark

Time: 2019-07-17 15:45:44

Tags: json scala apache-spark

I'm trying to learn how to process JSON data with Spark, and I have a fairly simple JSON file that looks like this:

{"key": { "defaultWeights":"1" }, "measures": { "m1":-0.01, "m2":-0.5.....}}

When I load this file into a Spark DataFrame and run the following code:

val flattened = dff.withColumn("default_weights", json_tuple(col("key"), "defaultWeights")).show

I get this error:

org.apache.spark.sql.AnalysisException: cannot resolve 'json_tuple(`key`, 'defaultWeights')' due to data type mismatch: json_tuple requires that all arguments are strings;;
'Project [key#6, measures#7, json_tuple(key#6, defaultWeights) AS default_weights#13]
+- Relation[key#6,measures#7] json

If I change the code so that both arguments are strings, I get this error instead:

<console>:25: error: type mismatch;
 found   : String
 required: org.apache.spark.sql.Column
       val flattened = dff.withColumn("default_weights", json_tuple("key", "defaultWeights")).show

As you can see, I'm just going around in circles!

1 Answer:

Answer 0 (score: 1)

json_tuple will work if your key column is text rather than a struct. Let me show you:

import java.io.File
import org.apache.commons.io.FileUtils

// "key" here is a nested JSON object, so Spark will infer it as a struct
val contentStruct =
  """|{"key": { "defaultWeights":"1", "c": "a" }, "measures": { "m1":-0.01, "m2":-0.5}}""".stripMargin
FileUtils.writeStringToFile(new File("/tmp/test_flat.json"), contentStruct)

import org.apache.spark.sql.SparkSession

val sparkSession: SparkSession = SparkSession.builder()
  .appName("Spark SQL json_tuple")
  .master("local[*]")
  .getOrCreate()
import sparkSession.implicits._

sparkSession.read.json("/tmp/test_flat.json").printSchema()

The schema will be:

root
 |-- key: struct (nullable = true)
 |    |-- c: string (nullable = true)
 |    |-- defaultWeights: string (nullable = true)
 |-- measures: struct (nullable = true)
 |    |-- m1: double (nullable = true)
 |    |-- m2: double (nullable = true)

In fact, you don't need json_tuple at all here. You can reach the nested field directly with a JSON path (key.defaultWeights):

sparkSession.read.json("/tmp/test_flat.json").select("key.defaultWeights").show()
+--------------+
|defaultWeights|
+--------------+
|             1|
+--------------+
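
The same dotted-path access works for every nested field, so the whole file can be flattened in a single select. A minimal sketch (the aliases default_weights, m1 and m2 are just illustrative names, not anything Spark requires):

// Pull each nested field up to a top-level column, renaming as we go
sparkSession.read.json("/tmp/test_flat.json")
  .select(
    $"key.defaultWeights".as("default_weights"),
    $"measures.m1".as("m1"),
    $"measures.m2".as("m2"))
  .show()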

Otherwise, to use json_tuple, your JSON should look like this:

// here "key" is a quoted, escaped JSON string rather than a nested object
val contentString =
  """|{"key": "{ \"defaultWeights\":\"1\", \"c\": \"a\" }", "measures": { "m1":-0.01, "m2":-0.5}}""".stripMargin
FileUtils.writeStringToFile(new File("/tmp/test_flat.json"), contentString)

In this case, the schema will be:

root
 |-- key: string (nullable = true)
 |-- measures: struct (nullable = true)
 |    |-- m1: double (nullable = true)
 |    |-- m2: double (nullable = true)

And:

import org.apache.spark.sql.functions

sparkSession.read.json("/tmp/test_flat.json")
  .withColumn("default_weights", functions.json_tuple($"key", "defaultWeights"))
  .show(false)

will return:

+----------------------------------+-------------+---------------+
|key                               |measures     |default_weights|
+----------------------------------+-------------+---------------+
|{ "defaultWeights":"1", "c": "a" }|[-0.01, -0.5]|1              |
+----------------------------------+-------------+---------------+
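
If you can't change the input file, one way to break the circle from the question is to serialize the struct back into a JSON string with to_json, which gives json_tuple the string column it requires. A sketch, assuming the question's dff DataFrame is already loaded:

import org.apache.spark.sql.functions.{col, json_tuple, to_json}

// to_json renders the struct column as a JSON string, which satisfies
// json_tuple's "all arguments are strings" requirement
dff
  .withColumn("default_weights", json_tuple(to_json(col("key")), "defaultWeights"))
  .show(false)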