I am trying out the Spark 2.2.0 Dataset withColumn feature, but I am seeing strange behavior: it generates a schema whose column order differs from the case class parameter order, which makes it impossible to use the Dataset.union(other) function.

Sample code:
import org.apache.spark.sql.functions.lit
import spark.implicits._

case class OnlyAge(age: Int)
case class NameAge(name: String, age: Int)
val ds1 = spark.emptyDataset[NameAge]
val ds2 = spark.createDataset(Seq(OnlyAge(1))).withColumn("name", lit("henriquedsg89")).as[NameAge]
ds1.show()
ds2.show()
Output:
+----+---+
|name|age|
+----+---+
+----+---+
+---+-------------+
|age| name|
+---+-------------+
| 1|henriquedsg89|
+---+-------------+
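For reference, the same ordering shows up in the schemas themselves (a quick check with printSchema on the datasets above):

ds1.printSchema() // name: string, then age: int (matches the case class)
ds2.printSchema() // age: int, then name: string (withColumn appends the new column last)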
Is this the expected behavior? Shouldn't the schema columns follow the case class parameter order? I had to declare my parameters in the reverse order to make it work, e.g.:
case class NameAge(age: Int, name: String)
If I try to union the two datasets, I get an error (because of the column order, it tries to union age with name):
ds1.union(ds2)
Cannot up cast `age` from string to int as it may truncate
The type path of the target object is:
- field (class: "scala.Int", name: "age")
- root class: "dw.NameAge"
You can either add an explicit cast to the input data or choose a higher precision type of the field in the target object;
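A workaround sketch that avoids redefining the case class: since union in Spark 2.2.0 resolves columns by position rather than by name, reordering ds2's columns with select before the union lines the schemas up (names here are from the snippet above):

// Reorder ds2's columns to match ds1's schema, then re-type and union.
val aligned = ds2.select("name", "age").as[NameAge]
ds1.union(aligned).show()

Spark 2.3.0 adds Dataset.unionByName, which resolves columns by name and avoids the issue, but it is not available in 2.2.0.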