I'm trying to learn how to process JSON data with Spark, and I have a fairly simple JSON file that looks like this:
{"key": { "defaultWeights":"1" }, "measures": { "m1":-0.01, "m2":-0.5.....}}
When I load this file into a Spark DataFrame and run the following code:
val flattened = dff.withColumn("default_weights", json_tuple(col("key"), "defaultWeights")).show
I get this error:
org.apache.spark.sql.AnalysisException: cannot resolve 'json_tuple(`key`, 'defaultWeights')' due to data type mismatch: json_tuple requires that all arguments are strings;;
'Project [key#6, measures#7, json_tuple(key#6, defaultWeights) AS default_weights#13]
+- Relation[key#6,measures#7] json
If I change the code so that both arguments are strings, I get this error instead:
<console>:25: error: type mismatch;
found : String
required: org.apache.spark.sql.Column
val flattened = dff.withColumn("default_weights", json_tuple("key", "defaultWeights")).show
As you can see, I'm going round in circles!
Answer 0 (score: 1)
json_tuple works if the key column is text rather than a struct. Let me show you:
import java.io.File
import org.apache.commons.io.FileUtils
import org.apache.spark.sql.{SparkSession, functions}

// "key" is a nested JSON object here, so Spark infers it as a struct
val contentStruct =
  """|{"key": { "defaultWeights":"1", "c": "a" }, "measures": { "m1":-0.01, "m2":-0.5}}""".stripMargin
FileUtils.writeStringToFile(new File("/tmp/test_flat.json"), contentStruct)
val sparkSession: SparkSession = SparkSession.builder()
  .appName("Spark SQL json_tuple")
  .master("local[*]").getOrCreate()
import sparkSession.implicits._
sparkSession.read.json("/tmp/test_flat.json").printSchema()
The schema will be:
root
 |-- key: struct (nullable = true)
 |    |-- c: string (nullable = true)
 |    |-- defaultWeights: string (nullable = true)
 |-- measures: struct (nullable = true)
 |    |-- m1: double (nullable = true)
 |    |-- m2: double (nullable = true)
In fact, you don't need any extra step to extract defaultWeights. Because key is a struct, you can select the field directly by its nested path (key.defaultWeights):
sparkSession.read.json("/tmp/test_flat.json").select("key.defaultWeights").show()
+--------------+
|defaultWeights|
+--------------+
|             1|
+--------------+
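Applied to the code in the question, the same nested path can be used inside withColumn. The following is a minimal sketch rather than part of the original answer: the object name NestedFieldExample and the helper withDefaultWeights are illustrative, and the input is built from an in-memory Dataset[String] (which assumes Spark 2.2 or later) instead of a file.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

// Illustrative object and helper names; they do not come from the original post.
object NestedFieldExample {
  // Copies the nested struct field key.defaultWeights into a top-level column.
  def withDefaultWeights(df: DataFrame): DataFrame =
    df.withColumn("default_weights", col("key.defaultWeights"))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("nested field access")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Same shape as the question's file, so "key" is inferred as a struct.
    val dff = spark.read.json(Seq(
      """{"key": {"defaultWeights": "1"}, "measures": {"m1": -0.01, "m2": -0.5}}"""
    ).toDS())

    // A struct field is addressed with dot notation; no JSON parsing is involved.
    withDefaultWeights(dff).show(false)

    spark.stop()
  }
}
```

Since col returns a Column, this also avoids the String versus Column type mismatch from the second attempt in the question.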
Otherwise, to use json_tuple, your JSON should look like this, with key holding a JSON document encoded as a string:
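val contentString =
  """|{"key": "{ \"defaultWeights\":\"1\", \"c\": \"a\" }", "measures": { "m1":-0.01, "m2":-0.5}}""".stripMargin
// Overwrite the test file so that the read further down picks up the new content
FileUtils.writeStringToFile(new File("/tmp/test_flat.json"), contentString)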
In this case, the schema will be:
root
 |-- key: string (nullable = true)
 |-- measures: struct (nullable = true)
 |    |-- m1: double (nullable = true)
 |    |-- m2: double (nullable = true)
And:
sparkSession.read.json("/tmp/test_flat.json")
.withColumn("default_weights", functions.json_tuple($"key", "defaultWeights")).show(false)
will return:
+----------------------------------+-------------+---------------+
|key                               |measures     |default_weights|
+----------------------------------+-------------+---------------+
|{ "defaultWeights":"1", "c": "a" }|[-0.01, -0.5]|1              |
+----------------------------------+-------------+---------------+
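If the input file cannot be changed and key stays a struct, json_tuple can still be applied by first serializing the struct back to a JSON string with to_json. This is a sketch of an alternative that the original answer does not mention. It assumes a Spark version where to_json accepts struct columns (2.1 or later) and where spark.read.json accepts a Dataset[String] (2.2 or later). The object name StructToJsonTupleExample and the helper withDefaultWeights are illustrative.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, json_tuple, to_json}

// Illustrative object and helper names; they do not come from the original post.
object StructToJsonTupleExample {
  // Serializes the struct column "key" to a JSON string, then extracts one field
  // from that string, which satisfies json_tuple's string-only requirement.
  def withDefaultWeights(df: DataFrame): DataFrame =
    df.withColumn("default_weights", json_tuple(to_json(col("key")), "defaultWeights"))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("to_json with json_tuple")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // "key" is a nested object, so it is inferred as a struct, as in the question.
    val dff = spark.read.json(Seq(
      """{"key": {"defaultWeights": "1", "c": "a"}, "measures": {"m1": -0.01, "m2": -0.5}}"""
    ).toDS())

    withDefaultWeights(dff).show(false)

    spark.stop()
  }
}
```

The round trip through a string is redundant when a nested path such as key.defaultWeights is enough, so this variant is mainly useful when the rest of a pipeline already expects json_tuple.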