How do I convert a simple DataFrame to a Dataset using a case class in Spark Scala?

Asked: 2017-07-10 16:45:24

Tags: scala apache-spark apache-spark-sql

I am trying to convert a simple DataFrame to a Dataset, following the example in the Spark documentation: https://spark.apache.org/docs/latest/sql-programming-guide.html

case class Person(name: String, age: Int)    
import spark.implicits._

val path = "examples/src/main/resources/people.json"

val peopleDS = spark.read.json(path).as[Person]
peopleDS.show()

But I get the following error:

Exception in thread "main" org.apache.spark.sql.AnalysisException: Cannot up cast `age` from bigint to int as it may truncate
The type path of the target object is:
- field (class: "scala.Int", name: "age")
- root class: ....

Can anyone help me?

Edit: I noticed that it works if I use Long instead of Int! Why is that?

Also:

val primitiveDS = Seq(1,2,3).toDS()
val augmentedDS = primitiveDS.map(i => ("var_" + i.toString, (i + 1).toLong))
augmentedDS.show()

augmentedDS.as[Person].show()

This prints:

+-----+---+
|   _1| _2|
+-----+---+
|var_1|  2|
|var_2|  3|
|var_3|  4|
+-----+---+

Exception in thread "main"
org.apache.spark.sql.AnalysisException: cannot resolve '`name`' given input columns: [_1, _2];

Can anyone help me understand what is happening here?

2 Answers:

Answer 0 (score: 4)

If you change Int to Long (or BigInt), it works:

case class Person(name: String, age: Long)
import spark.implicits._

val path = "examples/src/main/resources/people.json"

val peopleDS = spark.read.json(path).as[Person]
peopleDS.show()

Output:

+----+-------+
| age|   name|
+----+-------+
|null|Michael|
|  30|   Andy|
|  19| Justin|
+----+-------+

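Edit: By default, spark.read.json infers integer values as Long (bigint in Spark SQL), which is the safer choice because it cannot truncate. That is why the implicit conversion to Int is rejected while Long works. You can change the column type afterwards with cast or with UDFs.

Edit 2:

To answer the second question: you need to give the columns the right names before converting to Person, because as[Person] matches columns to case class fields by name: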
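
If you do want an Int field, the cast mentioned above might look like the following. This is only a minimal sketch under several assumptions: a local SparkSession, Spark 2.2 or later (for spark.read.json on a Dataset[String]), and an in-memory JSON sample standing in for people.json. The names PersonInt, CastExample and loadPeople are hypothetical. The age field is declared as Option[Int] because one record in the sample has no age.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.IntegerType

// Hypothetical case class: Option[Int] tolerates the record without an age.
case class PersonInt(name: String, age: Option[Int])

object CastExample {
  // Reads the sample JSON, casts age from bigint to int, then converts to a Dataset.
  def loadPeople(spark: SparkSession): Dataset[PersonInt] = {
    import spark.implicits._

    // In-memory stand-in for people.json (an assumption of this sketch).
    val json = Seq(
      """{"name":"Michael"}""",
      """{"name":"Andy","age":30}""",
      """{"name":"Justin","age":19}"""
    ).toDS()

    val df = spark.read.json(json) // age is inferred as bigint (Long)

    // The explicit cast makes the narrowing intentional, so as[...] accepts it.
    df.withColumn("age", col("age").cast(IntegerType)).as[PersonInt]
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("cast-example").getOrCreate()
    loadPeople(spark).show()
    spark.stop()
  }
}
```

Note that an explicit cast silently narrows the value, so it is only appropriate when you know the values fit in an Int.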

val primitiveDS = Seq(1,2,3).toDS()
val augmentedDS = primitiveDS.map(i => ("var_" + i.toString, (i + 1).toLong)).
  withColumnRenamed("_1", "name").
  withColumnRenamed("_2", "age")
augmentedDS.as[Person].show()

Output:

+-----+---+
| name|age|
+-----+---+
|var_1|  2|
|var_2|  3|
|var_3|  4|
+-----+---+
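
As a possible alternative to two withColumnRenamed calls, toDF can rename all columns in one step. The sketch below assumes a local SparkSession; the object name RenameExample and the method build are hypothetical, and Person is the case class with age: Long used in this answer.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

// Same shape as the Person case class used in this answer.
case class Person(name: String, age: Long)

object RenameExample {
  // Builds the tuples, names both columns at once, then converts to Dataset[Person].
  def build(spark: SparkSession): Dataset[Person] = {
    import spark.implicits._

    Seq(1, 2, 3).toDS()
      .map(i => ("var_" + i.toString, (i + 1).toLong))
      .toDF("name", "age") // replaces the default column names _1 and _2
      .as[Person]
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("rename-example").getOrCreate()
    build(spark).show()
    spark.stop()
  }
}
```

toDF takes the new names positionally, so the order must match the tuple fields.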

Answer 1 (score: 1)

This is how to create a Dataset from a case class:

case class Person(name: String, age: Long)

Keep the case class outside the class that contains the code below:

val primitiveDS = Seq(1,2,3).toDS()
val augmentedDS = primitiveDS.map(i => Person("var_" + i.toString, (i + 1).toLong))
augmentedDS.show()

augmentedDS.as[Person].show()

Hope this helps.