I am trying out the Spark 2.2.0 Dataset withColumn feature, but I am seeing strange behavior: it generates a schema whose column order differs from the case class parameter order, which makes it impossible to use the Dataset.union(other) function.

Sample code:
import org.apache.spark.sql.functions.lit
import spark.implicits._

case class OnlyAge(age: Int)
case class NameAge(name: String, age: Int)
val ds1 = spark.emptyDataset[NameAge]
val ds2 = spark.createDataset(Seq(OnlyAge(1))).withColumn("name", lit("henriquedsg89")).as[NameAge]
ds1.show()
ds2.show()
Output:
+----+---+
|name|age|
+----+---+
+----+---+
+---+-------------+
|age| name|
+---+-------------+
| 1|henriquedsg89|
+---+-------------+
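For reference, the same ordering shows up in the schemas themselves (a quick check with printSchema on the datasets above):

ds1.printSchema() // name: string, then age: int (matches the case class)
ds2.printSchema() // age: int, then name: string (withColumn appends the new column last)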
Is this the expected behavior? Shouldn't the schema columns follow the case class parameter order? I had to declare my parameters in the reverse order to make it work, e.g.:
case class NameAge(age: Int, name: String)
If I try to union the two datasets, I get an error (because of the column order, it tries to union age with name):
ds1.union(ds2)
Cannot up cast `age` from string to int as it may truncate
The type path of the target object is:
- field (class: "scala.Int", name: "age")
- root class: "dw.NameAge"
You can either add an explicit cast to the input data or choose a higher precision type of the field in the target object;
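A workaround sketch that avoids redefining the case class: since union in Spark 2.2.0 resolves columns by position rather than by name, reordering ds2's columns with select before the union lines the schemas up (names here are from the snippet above):

// Reorder ds2's columns to match ds1's schema, then re-type and union.
val aligned = ds2.select("name", "age").as[NameAge]
ds1.union(aligned).show()

Spark 2.3.0 adds Dataset.unionByName, which resolves columns by name and avoids the issue, but it is not available in 2.2.0.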