I am seeing strange behavior with Spark 2.1.1 and Scala 2.11.8:
import spark.implicits._
val df = Seq(
(1,Seq(("a","b"))),
(2,Seq(("c","d")))
).toDF("id","data")
df.show(false)
df.printSchema()
+---+-------+
|id |data   |
+---+-------+
|1  |[[a,b]]|
|2  |[[c,d]]|
+---+-------+
root
 |-- id: integer (nullable = false)
 |-- data: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- _1: string (nullable = true)
 |    |    |-- _2: string (nullable = true)
Now I want to rename my struct fields as suggested in https://stackoverflow.com/a/39781382/1138523:
df
.select($"id",$"data".cast("array<struct<k:string,v:string>>"))
.show()
This produces the correct schema, but the content of the dataframe is now:
+---+-------+
| id|   data|
+---+-------+
|  1|[[c,d]]|
|  2|[[c,d]]|
+---+-------+
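Both rows now show the same array. What am I doing wrong?
EDIT: With Spark 2.1.2 (and also with Spark 2.3.0) I get the expected output. If I cache the dataframe, I also get the expected output: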
val df = Seq(
(1,Seq(("a","b"))),
(2,Seq(("c","d")))
).toDF("id","data")
.cache
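A small check that might help narrow this down (it is not part of the runs shown above): collecting the rows to the driver and printing them shows whether the duplicated array is in the data itself or only in what `show()` prints. This is only a sketch. The object name `CastCheck`, the app name and the local master are placeholders, and I have not verified what it prints on 2.1.1.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical standalone check; CastCheck and "cast-check" are placeholder names.
object CastCheck {
  def main(args: Array[String]): Unit = {
    // Assumes a local Spark installation matching the version under test.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("cast-check")
      .getOrCreate()
    import spark.implicits._

    // Same input as in the question.
    val df = Seq(
      (1, Seq(("a", "b"))),
      (2, Seq(("c", "d")))
    ).toDF("id", "data")

    // Same cast as in the question, used to rename the struct fields.
    val renamed = df.select($"id", $"data".cast("array<struct<k:string,v:string>>"))

    // Print the collected rows instead of relying on show().
    renamed.collect().foreach(println)

    // Print the physical plan, to compare the uncached and cached variants.
    renamed.explain()

    spark.stop()
  }
}
```

Running the same check with `.cache` appended to `df` should make it possible to compare the two cases side by side.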