Data type TimestampType of column is not supported

Date: 2018-05-31 15:38:55

Tags: scala apache-spark linear-regression

I am using Spark with Scala and I am having a problem with a TimestampType column.
import org.apache.log4j.{Level, Logger}
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.regression.LinearRegression
import org.apache.spark.mllib.evaluation.RegressionMetrics
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{abs, col}

object regressionLinear {

  case class X(
      time: String, nodeID: Int, posX: Double, posY: Double,
      speed: Double, period: Int)

  def main(args: Array[String]) {

    Logger.getLogger("org").setLevel(Level.OFF)
    Logger.getLogger("akka").setLevel(Level.OFF)

    /**
      * Read the input data (path can be overridden by the first argument)
      */
    var dataset = "C:\\spark\\A6-d07-h08.csv"
    if (args.length > 0) {
      dataset = args(0)
    }

    val spark = SparkSession
      .builder
      .appName("regressionsol")
      .master("local[4]")
      .getOrCreate()

    import spark.implicits._

    // Parse the CSV into a DataFrame and cast the first column to timestamp
    val data = spark.sparkContext.textFile(dataset)
      .map(line => line.split(","))
      .map(userRecord => (userRecord(0).trim.toString,
        userRecord(1).trim.toInt, userRecord(2).trim.toDouble,
        userRecord(3).trim.toDouble, userRecord(4).trim.toDouble,
        userRecord(5).trim.toInt))
      .toDF("time", "nodeID", "posX", "posY", "speed", "period")
      .withColumn("time", $"time".cast("timestamp"))

    val assembler = new VectorAssembler()
      .setInputCols(Array(
        "time", "nodeID", "posX", "posY", "speed", "period"))
      .setOutputCol("features")

    val lr = new LinearRegression()
      .setLabelCol("period")
      .setFeaturesCol("features")
      .setRegParam(0.1)
      .setMaxIter(100)
      .setSolver("l-bfgs")

    val steps = Array(assembler, lr)

    val pipeline = new Pipeline()
      .setStages(steps)

    val Array(training, test) = data.randomSplit(Array(0.75, 0.25), seed = 12345)

    val model = pipeline.fit(training)

    val holdout = model.transform(test)
    holdout.show(20)

    val prediction = holdout.select("prediction", "period", "nodeID")
      .orderBy(abs(col("prediction") - col("period")))
    prediction.show(20)

    val rm = new RegressionMetrics(prediction.rdd.map {
      x => (x(0).asInstanceOf[Double], x(1).asInstanceOf[Double])
    })
    println(s"RMSE = ${rm.rootMeanSquaredError}")
    println(s"R-squared = ${rm.r2}")

    spark.stop()
  }
}

The error is:

  
    

Exception in thread "main" java.lang.IllegalArgumentException: Data type TimestampType of column time is not supported.
    at org.apache.spark.ml.feature.VectorAssembler.transformSchema(VectorAssembler.scala:124)
    at org.apache.spark.ml.Pipeline$$anonfun$transformSchema$4.apply(Pipeline.scala:184)
    at org.apache.spark.ml.Pipeline$$anonfun$transformSchema$4.apply(Pipeline.scala:184)
    at scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:57)
    at scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:66)
    at scala.collection.mutable.ArrayOps$ofRef.foldLeft(ArrayOps.scala:186)
    at org.apache.spark.ml.Pipeline.transformSchema(Pipeline.scala:184)
    at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:74)
    at org.apache.spark.ml.Pipeline.fit(Pipeline.scala:136)
    at regressionLinear$.main(regressionLinear.scala:100)
    at regressionLinear.main(regressionLinear.scala)

  

1 Answer:

Answer 0 (score: 0)

VectorAssembler accepts only numeric columns. Columns of other types have to be encoded first. And considering that you apply LinearRegression, the data will have to be encoded anyway.
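
For example, replacing the timestamp with a numeric representation is already enough to get past the VectorAssembler check. Below is a minimal sketch, assuming the data DataFrame built in the question; converting to epoch seconds with unix_timestamp is just one possible numeric encoding, and the column name timeNumeric is made up for illustration:

import org.apache.spark.sql.functions.{col, unix_timestamp}

// Replace the TimestampType column with epoch seconds (LongType),
// which VectorAssembler accepts as a numeric input.
val numericData = data.withColumn("timeNumeric", unix_timestamp(col("time")))

// Column list mirrors the question's code, with "time" swapped for the
// numeric version; note that "period" is also used as the label there.
val assembler = new VectorAssembler()
  .setInputCols(Array("timeNumeric", "nodeID", "posX", "posY", "speed", "period"))
  .setOutputCol("features")

With this change the randomSplit and pipeline.fit calls would operate on numericData instead of data, and the pipeline should fit without hitting the TimestampType exception.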

The exact steps will depend on domain-specific knowledge:

  • If you expect a linear trend based on time, cast the field to a numeric type first (as in the sketch above).
  • If you expect some kind of seasonal effect, you might have to extract individual components (day of week, hour of day, month, and so on) and typically apply StringIndexer + OneHotEncoder (see the sketch after this list).
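
As a sketch of the second bullet, the snippet below derives day-of-week and hour-of-day components from the time column and one-hot encodes them. This is only an illustration under some assumptions: dayofweek requires Spark 2.3+, the single-column OneHotEncoder shown here is the pre-3.0 transformer API, and column names such as dayOfWeekVec are made up:

import org.apache.spark.ml.feature.{OneHotEncoder, VectorAssembler}
import org.apache.spark.sql.functions.{col, dayofweek, hour}

// Derive categorical components from the timestamp as numeric indices.
val withComponents = data
  .withColumn("dayOfWeek", dayofweek(col("time")).cast("double"))
  .withColumn("hourOfDay", hour(col("time")).cast("double"))

// One-hot encode the components; they are already numeric indices,
// so a StringIndexer is not strictly needed for these two columns.
val dowEncoder = new OneHotEncoder()
  .setInputCol("dayOfWeek")
  .setOutputCol("dayOfWeekVec")
val hourEncoder = new OneHotEncoder()
  .setInputCol("hourOfDay")
  .setOutputCol("hourOfDayVec")

// "period" is left out of the features here because it is the regression label.
val seasonalAssembler = new VectorAssembler()
  .setInputCols(Array("dayOfWeekVec", "hourOfDayVec", "nodeID", "posX", "posY", "speed"))
  .setOutputCol("features")

// These stages would precede the LinearRegression stage in the Pipeline, e.g.
// new Pipeline().setStages(Array(dowEncoder, hourEncoder, seasonalAssembler, lr))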