I am trying to convert a simple DataFrame to a Dataset, following the example in the Spark SQL programming guide: https://spark.apache.org/docs/latest/sql-programming-guide.html
case class Person(name: String, age: Int)
import spark.implicits._
val path = "examples/src/main/resources/people.json"
val peopleDS = spark.read.json(path).as[Person]
peopleDS.show()
but I get the following exception:
Exception in thread "main" org.apache.spark.sql.AnalysisException: Cannot up cast `age` from bigint to int as it may truncate
The type path of the target object is:
- field (class: "scala.Int", name: "age")
- root class: ....
Can anyone help me?
EDIT: I noticed that it works with Long instead of Int! Why is that?
Also:
val primitiveDS = Seq(1,2,3).toDS()
val augmentedDS = primitiveDS.map(i => ("var_" + i.toString, (i + 1).toLong))
augmentedDS.show()
augmentedDS.as[Person].show()
which prints:
+-----+---+
| _1| _2|
+-----+---+
|var_1| 2|
|var_2| 3|
|var_3| 4|
+-----+---+
Exception in thread "main"
org.apache.spark.sql.AnalysisException: cannot resolve '`name`' given input columns: [_1, _2];
Can anyone help me understand what is going on here?
Answer 0 (score: 4)
It works fine if you change Int to Long (or BigInt):
case class Person(name: String, age: Long)
import spark.implicits._
val path = "examples/src/main/resources/people.json"
val peopleDS = spark.read.json(path).as[Person]
peopleDS.show()
Output:
+----+-------+
| age| name|
+----+-------+
|null|Michael|
| 30| Andy|
| 19| Justin|
+----+-------+
EDIT:
By default, spark.read.json infers numeric values as Long, which is the safer choice. You can change the column type afterwards using cast or a UDF.
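If you really need an Int field, here is a minimal sketch of the cast approach (PersonInt is a hypothetical case class; this assumes the values fit into Int and relies on the import spark.implicits._ above):

// The inferred schema confirms that age comes in as long (bigint):
spark.read.json(path).printSchema()
// root
//  |-- age: long (nullable = true)
//  |-- name: string (nullable = true)

case class PersonInt(name: String, age: Int)

// Narrow the column before converting. Note that null ages survive the
// cast, so collecting such rows into a non-nullable Int would still fail.
val peopleIntDS = spark.read.json(path)
  .withColumn("age", $"age".cast("int"))
  .as[PersonInt]
peopleIntDS.show()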
EDIT 2:
To answer your second question: you need to name the columns properly before converting to Person:
val primitiveDS = Seq(1,2,3).toDS()
val augmentedDS = primitiveDS
  .map(i => ("var_" + i.toString, (i + 1).toLong))
  .withColumnRenamed("_1", "name")
  .withColumnRenamed("_2", "age")
augmentedDS.as[Person].show()
Output:
+-----+---+
| name|age|
+-----+---+
|var_1| 2|
|var_2| 3|
|var_3| 4|
+-----+---+
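Equivalently, a sketch of the same fix using toDF, which assigns the column names positionally in one step:

val renamedDS = primitiveDS
  .map(i => ("var_" + i.toString, (i + 1).toLong))
  .toDF("name", "age")
  .as[Person]
renamedDS.show()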
Answer 1 (score: 1)
This is how to create a Dataset from a case class:
case class Person(name: String, age: Long)
Keep the case class outside of the class that contains the code below:
val primitiveDS = Seq(1, 2, 3).toDS()
val augmentedDS = primitiveDS.map(i => Person("var_" + i.toString, (i + 1).toLong))
augmentedDS.show()
augmentedDS.as[Person].show()
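For reference, a minimal sketch of the file layout this implies (the object name DatasetExample and the local master are assumptions):

import org.apache.spark.sql.SparkSession

// The case class lives at top level, outside the object holding the Spark
// code, so that an encoder can be derived for it.
case class Person(name: String, age: Long)

object DatasetExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("DatasetExample")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    val augmentedDS = Seq(1, 2, 3).toDS()
      .map(i => Person("var_" + i.toString, (i + 1).toLong))
    augmentedDS.show()

    spark.stop()
  }
}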
Hope this helps.