Caused by: java.lang.ClassCastException: Person cannot be cast to Person

Date: 2017-07-29 03:29:34

Tags: scala, apache-spark

I am testing a Spark application on the docker all-spark-notebook image. The Scala code is:

val p = spark.sparkContext.textFile("../Data/person.txt")
val pmap = p.map(_.split(","))
pmap.collect()

The output is: Array(Array(Barack, Obama, 53), Array(George, Bush, 68), Array(Bill, Clinton, 68))

case class Person(first_name: String, last_name: String, age: Int)
val personRDD = pmap.map(p => Person(p(0), p(1), p(2).toInt))
val personDF = personRDD.toDF
personDF.collect()

The error message is:

Name: org.apache.spark.SparkException
Message: Job aborted due to stage failure: Task 1 in stage 12.0 failed 1 times, most recent failure: Lost task 1.0 in stage 12.0 (TID 17, localhost, executor driver): java.lang.ClassCastException: $line145.$read$$iw$$iw$Person cannot be cast to $line145.$read$$iw$$iw$Person
    ................
Caused by: java.lang.ClassCastException: Person cannot be cast to Person

In fact, this code runs correctly when I try it in spark-shell, so I suspect the error above is related to the docker environment rather than to the code itself. I also tried to display personRDD with:

personRDD.collect 

and received this error message:

org.apache.spark.SparkDriverExecutionException: Execution error
  at org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1186)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1711)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1669)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1658)
  at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
  at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:630)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2022)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2043)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2062)
  at org.apache.spark.rdd.RDD$$anonfun$take$1.apply(RDD.scala:1354)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
  at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
  at org.apache.spark.rdd.RDD.take(RDD.scala:1327)
  ... 37 elided
Caused by: java.lang.ArrayStoreException: [LPerson;
  at scala.runtime.ScalaRunTime$.array_update(ScalaRunTime.scala:90)
  at org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:2043)
  at org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:2043)
  at org.apache.spark.scheduler.JobWaiter.taskSucceeded(JobWaiter.scala:59)
  at org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1182)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1711)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1669)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1658)
  at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)

I cannot figure out what is causing this problem. Can anyone give me some clues? Thanks.
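For what it's worth, the parsing logic itself is sound: the same map runs fine on a plain Scala collection without Spark, which supports the suspicion that the failure lies in the notebook environment (the "Person cannot be cast to Person" message indicates two versions of the class from different REPL-generated wrappers) rather than in the data or the code. A minimal Spark-free sketch using the sample values from the output above:

```scala
// Plain-Scala sketch (no Spark): the same split/map pipeline applied
// to an in-memory copy of the sample data.
case class Person(first_name: String, last_name: String, age: Int)

val lines = Seq("Barack,Obama,53", "George,Bush,68", "Bill,Clinton,68")
val people = lines.map(_.split(",")).map(p => Person(p(0), p(1), p(2).toInt))
people.foreach(println)
```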

1 Answer:

Answer 0 (score: 1)

As cricket_007 suggested in the comments, you should use sqlContext (Spark SQL).

Given an input data file with a header, such as:
first_name,last_name,age
Barack,Obama,53
George,Bush,68
Bill,Clinton,68

you can do the following:

val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", true)
  .load("../Data/person.txt")

which gives the dataframe:

+----------+---------+---+
|first_name|last_name|age|
+----------+---------+---+
|Barack    |Obama    |53 |
|George    |Bush     |68 |
|Bill      |Clinton  |68 |
+----------+---------+---+

The schema is generated as:

root
 |-- first_name: string (nullable = true)
 |-- last_name: string (nullable = true)
 |-- age: string (nullable = true)

You can define the schema yourself and apply it as:

import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType}

val schema = StructType(Array(
  StructField("first_name", StringType, true),
  StructField("last_name", StringType, true),
  StructField("age", IntegerType, true)))

val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", true)
  .schema(schema)
  .load("/home/anahcolus/IdeaProjects/scalaTest/src/test/resources/t1.csv")

You should then get the schema as:

root
 |-- first_name: string (nullable = true)
 |-- last_name: string (nullable = true)
 |-- age: integer (nullable = true)
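In Spark 2.x (which the question's `spark` session implies), the csv reader is built in, and defining the case class in the same cell as the read lets you go straight to a typed Dataset, sidestepping the split/map step entirely. A sketch under those assumptions; it writes a temporary copy of the sample data and builds its own local session so it runs standalone, whereas in a notebook `spark` already exists and you would point `csv(...)` at your real path:

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType}

// Temp copy of the sample data, only so this sketch is self-contained.
val tmp = Files.createTempFile("person", ".csv")
Files.write(tmp, "first_name,last_name,age\nBarack,Obama,53\nGeorge,Bush,68\nBill,Clinton,68\n".getBytes)

// In a notebook the `spark` session already exists; built here for the sketch.
val spark = SparkSession.builder().master("local[*]").appName("person").getOrCreate()
import spark.implicits._

case class Person(first_name: String, last_name: String, age: Int)

val schema = StructType(Array(
  StructField("first_name", StringType, true),
  StructField("last_name", StringType, true),
  StructField("age", IntegerType, true)))

// Built-in csv reader (Spark 2.x) with the explicit schema, mapped
// straight to the case class as a typed Dataset.
val people = spark.read
  .option("header", "true")
  .schema(schema)
  .csv(tmp.toString)
  .as[Person]
```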

If your file does not have a header, you can drop the header option.
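Note that without a header the Spark 2.x built-in reader names the columns _c0, _c1, _c2, so you would typically rename them with toDF. A standalone sketch (the temp file and local session are only there to make it runnable as-is):

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

// Headerless temp copy of the sample data, for a self-contained sketch.
val tmp = Files.createTempFile("person_noheader", ".csv")
Files.write(tmp, "Barack,Obama,53\nGeorge,Bush,68\nBill,Clinton,68\n".getBytes)

val spark = SparkSession.builder().master("local[*]").appName("noheader").getOrCreate()

// With header=false the columns come in as _c0.._c2; rename them.
val df = spark.read
  .option("header", "false")
  .csv(tmp.toString)
  .toDF("first_name", "last_name", "age")
```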