I have some incoming data as rowValues; I need to apply a specific schema to it and create a DataFrame. Here is my code:
val rowValues = List("12","F","1980-10-11,1980-10-11T10:10:20")
val rdd = sqlContext.sparkContext.parallelize(Seq(rowValues))
val rowRdd = rdd.map(v => Row(v: _*))
var fieldSchema = ListBuffer[StructField]()
fieldSchema += StructField("C0", IntegerType, true, null)
fieldSchema += StructField("C1", StringType, true, null)
fieldSchema += StructField("C2", TimestampType, true, null)
val schema = StructType(fieldSchema.toList)
val newRow = sqlContext.createDataFrame(rowRdd, schema)
newRow.printSchema() // new schema prints here
newRow.show() // This fails with ClassCast exception
This operation fails with org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 16.0 failed 1 times, most recent failure: Lost task 0.0 in stage 16.0 (TID 16, localhost): java.lang.ClassCastException: java.lang.String cannot be cast to java.sql.Timestamp
How do I apply this schema?
Answer 0 (score: 2)
Instead of applying a schema up front, you can cast the DataFrame's columns to the types you want: use withColumn together with the cast function to change a column's data type. Here is a simple example:
import spark.implicits._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.{IntegerType, TimestampType}

val df = spark.sparkContext.parallelize(Seq(
  ("12", "F", "1980-10-11T10:10:20"),
  ("12", "F", "1980-10-11T10:10:20")
)).toDF("c0", "c1", "c2")

// cast the string columns to the desired types
val newDf = df.withColumn("c0", df("c0").cast(IntegerType))
  .withColumn("c2", df("c2").cast(TimestampType))

// alternatively, parse the string timestamp with an explicit format:
// unix_timestamp parses the string into epoch seconds, and cast turns that into a timestamp
// (note that to_utc_timestamp takes a timezone, not a date format, as its second argument)
val newDf2 = df.withColumn("c0", df("c0").cast(IntegerType))
  .withColumn("c2", unix_timestamp(df("c2"), "yyyy-MM-dd'T'HH:mm:ss").cast(TimestampType))

newDf.show(false)
newDf.printSchema()
Hope this helps!
Answer 1 (score: 1)
Your input values are all strings, but your schema declares c0 as integer, c1 as string, and c2 as timestamp, so you get the cast error. And your timestamp string is more complex than a plain timestamp. If you just want to get a dataframe, change all of the column datatypes to string and it will work:
fieldSchema += StructField("C0", StringType, true)
fieldSchema += StructField("C1", StringType, true)
fieldSchema += StructField("C2", StringType, true)
and you should get:
+---+---+------------------------------+
|C0 |C1 |C2 |
+---+---+------------------------------+
|12 |F |1980-10-11,1980-10-11T10:10:20|
+---+---+------------------------------+
If you insist on your schema, the following code should give you a better idea:
val rowValues = List("12","F","1980-10-11,1980-10-11T10:10:20")
val rdd = sqlContext.sparkContext.parallelize(Seq(rowValues))
// parse the int, and keep only the part of C2 after the comma, with the ISO 'T' replaced by a space
val rowRdd = rdd.map(v => Row(v(0).toInt, v(1), v(2).split(",")(1).replace("T", " ")))
var fieldSchema = ListBuffer[StructField]()
fieldSchema += StructField("C0", IntegerType, true)
fieldSchema += StructField("C1", StringType, true)
fieldSchema += StructField("C2", StringType, true)
val schema = StructType(fieldSchema.toList)
// unix_timestamp parses the string into epoch seconds; cast the result to get a timestamp column
val newRow = sqlContext.createDataFrame(rowRdd, schema)
  .withColumn("C2", unix_timestamp(col("C2")).cast(TimestampType))
newRow.printSchema() // C2 is now a timestamp
newRow.show(false)
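The string surgery inside the map above can be checked without Spark; a minimal standalone sketch, using the sample value from the question:

```scala
object SplitCheck extends App {
  val raw = "1980-10-11,1980-10-11T10:10:20"
  // keep the part after the comma and replace the ISO 'T' separator with a space
  val cleaned = raw.split(",")(1).replace("T", " ")
  println(cleaned) // 1980-10-11 10:10:20
}
```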
You can also use a case class, as:
import java.sql.Timestamp
import java.text.SimpleDateFormat
import sqlContext.implicits._

def convertToDate(dateTime: String): Timestamp = {
  // MM = month and HH = 24-hour clock; mm/hh would mean minutes and 12-hour clock
  val formatter = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss")
  val utilDate = formatter.parse(dateTime)
  new Timestamp(utilDate.getTime)
}
val rowValues = List("12","F","1980-10-11 10:10:20")
val rdd = sqlContext.sparkContext.parallelize(Seq(rowValues))
val rowRdd = rdd.map(v => Pratap(v(0).toInt, v(1), convertToDate(v(2))))
val newRow = rowRdd.toDF
newRow.printSchema()
newRow.show(false)
Your case class should live outside the main class, as:
case class Pratap(C0: Int, C1: String, C2: java.sql.Timestamp)
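The date-parsing helper can be sanity-checked on its own, with no Spark needed. A minimal sketch; note that the SimpleDateFormat pattern must use MM for month and HH for hours, since mm and hh mean minutes and a 12-hour clock:

```scala
import java.sql.Timestamp
import java.text.SimpleDateFormat

object ConvertCheck extends App {
  def convertToDate(dateTime: String): Timestamp = {
    val formatter = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss")
    new Timestamp(formatter.parse(dateTime).getTime)
  }
  println(convertToDate("1980-10-11 10:10:20")) // 1980-10-11 10:10:20.0
}
```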