java.lang.ClassCastException when creating a DataFrame with a schema

Asked: 2017-07-05 07:19:20

Tags: apache-spark apache-spark-sql

I have some incoming data as rowValues, and I need to apply a particular schema to it and create a DataFrame. Here is my code:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types._
import scala.collection.mutable.ListBuffer

val rowValues = List("12","F","1980-10-11,1980-10-11T10:10:20")
val rdd = sqlContext.sparkContext.parallelize(Seq(rowValues))
val rowRdd = rdd.map(v => Row(v: _*))

var fieldSchema = ListBuffer[StructField]()

fieldSchema += StructField("C0", IntegerType, true, null)
fieldSchema += StructField("C1", StringType, true, null)
fieldSchema += StructField("C2", TimestampType, true, null)
val schema = StructType(fieldSchema.toList)

val newRow = sqlContext.createDataFrame(rowRdd, schema)
newRow.printSchema()   // new schema prints here
newRow.show()   // This fails with ClassCast exception

This fails with:

org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 16.0 failed 1 times, most recent failure: Lost task 0.0 in stage 16.0 (TID 16, localhost): java.lang.ClassCastException: java.lang.String cannot be cast to java.sql.Timestamp

How can I apply this schema?

2 Answers:

Answer 0 (score: 2):

Rather than applying a schema, you can cast the DataFrame's columns to the types the schema requires. Use withColumn together with the cast function to change a column's data type.

Here is a simple example:

import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
import spark.implicits._

val df = spark.sparkContext.parallelize(Seq(
  ("12","F","1980-10-11T10:10:20"),
  ("12","F","1980-10-11T10:10:20")
)).toDF("c0", "c1", "c2")

// cast the string date to timestamp; the string-to-timestamp cast
// understands the ISO "yyyy-MM-dd'T'HH:mm:ss" form directly
val newDf = df.withColumn("c0", df("c0").cast(IntegerType))
  .withColumn("c2", df("c2").cast(TimestampType))

// alternatively, parse with an explicit format: unix_timestamp returns
// epoch seconds, which cast back to a timestamp
val altDf = df.withColumn("c0", df("c0").cast(IntegerType))
  .withColumn("c2", unix_timestamp(df("c2"), "yyyy-MM-dd'T'HH:mm:ss").cast(TimestampType))

newDf.show(false)

newDf.printSchema()

Hope this helps!

Answer 1 (score: 1):

Your input data are all strings, but the schema declares C0 as integer, C1 as string and C2 as timestamp, which is why you get the cast error. On top of that, your C2 string is more complex than a plain timestamp.
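The failure is deferred: createDataFrame does not validate Row contents against the schema, so printSchema succeeds and the error only surfaces when show() forces evaluation and the stored String is cast to the declared type. A minimal plain-Scala sketch of the same mismatch (no Spark needed):

```scala
// Spark keeps Row fields as Any and casts them to the schema's type on
// access; a String field under a non-string column fails exactly like this.
val field: Any = "12"

val result = try {
  field.asInstanceOf[java.lang.Integer]  // String cannot be cast to Integer
  "ok"
} catch {
  case _: ClassCastException => "ClassCastException"
}
println(result) // → ClassCastException
```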

If you just want a dataframe, change all the column datatypes to string and it will work fine:

fieldSchema += StructField("C0", StringType, true)
fieldSchema += StructField("C1", StringType, true)
fieldSchema += StructField("C2", StringType, true)

and you should get:

+---+---+------------------------------+
|C0 |C1 |C2                            |
+---+---+------------------------------+
|12 |F  |1980-10-11,1980-10-11T10:10:20|
+---+---+------------------------------+
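Before committing to a schema, it helps to see the string surgery on its own. A plain-Scala sketch (no Spark needed) of pulling a parseable timestamp out of the compound C2 value:

```scala
// The incoming C2 value holds a date and an ISO-style datetime joined by a comma.
val c2 = "1980-10-11,1980-10-11T10:10:20"

// Keep the datetime half and turn the 'T' separator into a space so that
// standard "yyyy-MM-dd HH:mm:ss" parsers accept it.
val normalised = c2.split(",")(1).replace("T", " ")
println(normalised) // → 1980-10-11 10:10:20
```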

If you insist on your schema, the following code should give you a better idea:

import org.apache.spark.sql.Row
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
import scala.collection.mutable.ListBuffer

val rowValues = List("12","F","1980-10-11,1980-10-11T10:10:20")
val rdd = sqlContext.sparkContext.parallelize(Seq(rowValues))

// keep the datetime half of C2 and turn the 'T' separator into a space
val rowRdd = rdd.map(v => Row(v(0).toInt, v(1), v(2).split(",")(1).replace("T", " ")))

var fieldSchema = ListBuffer[StructField]()

fieldSchema += StructField("C0", IntegerType, true)
fieldSchema += StructField("C1", StringType, true)
fieldSchema += StructField("C2", StringType, true)
val schema = StructType(fieldSchema.toList)

// unix_timestamp parses C2 into epoch seconds; cast it back to a timestamp
val newRow = sqlContext.createDataFrame(rowRdd, schema)
  .withColumn("C2", unix_timestamp(col("C2")).cast(TimestampType))
newRow.printSchema()   // new schema prints here
newRow.show(false)

You can also use a case class, as:

import java.sql.Timestamp
import java.text.SimpleDateFormat

import sqlContext.implicits._

def convertToDate(dateTime: String): Timestamp = {
  // note MM (month) and HH (24-hour clock), not mm (minutes) and hh (12-hour)
  val formatter = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss")
  val utilDate = formatter.parse(dateTime)
  new Timestamp(utilDate.getTime)
}
val rowValues = List("12","F","1980-10-11 10:10:20")
val rdd = sqlContext.sparkContext.parallelize(Seq(rowValues))

val rowRdd = rdd.map(v => Pratap(v(0).toInt, v(1), convertToDate(v(2))))

val newRow = rowRdd.toDF
newRow.printSchema()   
newRow.show(false)

And your case class should sit outside the main class, as:

case class Pratap(C0: Int, C1: String, C2: java.sql.Timestamp)
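The SimpleDateFormat pattern is worth double-checking here, since mm means minutes and hh means 12-hour clock. A quick plain-Scala check of the convertToDate helper with the corrected "yyyy-MM-dd HH:mm:ss" pattern:

```scala
import java.sql.Timestamp
import java.text.SimpleDateFormat

def convertToDate(dateTime: String): Timestamp = {
  // MM = month, HH = 24-hour clock; mm/hh here would silently mis-parse
  val formatter = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss")
  new Timestamp(formatter.parse(dateTime).getTime)
}

println(convertToDate("1980-10-11 10:10:20")) // → 1980-10-11 10:10:20.0
```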