I am using Spark with Scala and have a problem with the TimestampType type.
object regressionLinear {
  case class X(
    time: String, nodeID: Int, posX: Double, posY: Double,
    speed: Double, period: Int)

  def main(args: Array[String]) {
    Logger.getLogger("org").setLevel(Level.OFF)
    Logger.getLogger("akka").setLevel(Level.OFF)
    /**
     * Read the input data
     */
    var dataset = "C:\\spark\\A6-d07-h08.csv"
    if (args.length > 0) {
      dataset = args(0)
    }
    val spark = SparkSession
      .builder
      .appName("regressionsol")
      .master("local[4]")
      .getOrCreate()
    import spark.implicits._
    val data = spark.sparkContext.textFile(dataset)
      .map(line => line.split(","))
      .map(userRecord => (userRecord(0).trim.toString,
        userRecord(1).trim.toInt, userRecord(2).trim.toDouble,
        userRecord(3).trim.toDouble, userRecord(4).trim.toDouble,
        userRecord(5).trim.toInt))
      .toDF("time", "nodeID", "posX", "posY", "speed", "period")
      .withColumn("time", $"time".cast("timestamp"))
    val assembler = new VectorAssembler()
      .setInputCols(Array(
        "time", "nodeID", "posX", "posY", "speed", "period"))
      .setOutputCol("features")
    val lr = new LinearRegression()
      .setLabelCol("period")
      .setFeaturesCol("features")
      .setRegParam(0.1)
      .setMaxIter(100)
      .setSolver("l-bfgs")
    val steps =
      Array(assembler, lr)
    val pipeline = new Pipeline()
      .setStages(steps)
    val Array(training, test) = data.randomSplit(Array(0.75, 0.25), seed = 12345)
    val model = pipeline.fit {
      training
    }
    val holdout = model.transform(test)
    holdout.show(20)
    val prediction = holdout.select("prediction", "period", "nodeID")
      .orderBy(abs(col("prediction") - col("period")))
    prediction.show(20)
    val rm = new RegressionMetrics(prediction.rdd.map {
      x => (x(0).asInstanceOf[Double], x(1).asInstanceOf[Double])
    })
    println(s"RMSE = ${rm.rootMeanSquaredError}")
    println(s"R-squared = ${rm.r2}")
    spark.stop()
  }
}
The error is:
Exception in thread "main" java.lang.IllegalArgumentException: Data type TimestampType of column time is not supported.
    at org.apache.spark.ml.feature.VectorAssembler.transformSchema(VectorAssembler.scala:124)
    at org.apache.spark.ml.Pipeline$$anonfun$transformSchema$4.apply(Pipeline.scala:184)
    at org.apache.spark.ml.Pipeline$$anonfun$transformSchema$4.apply(Pipeline.scala:184)
    at scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:57)
    at scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:66)
    at scala.collection.mutable.ArrayOps$ofRef.foldLeft(ArrayOps.scala:186)
    at org.apache.spark.ml.Pipeline.transformSchema(Pipeline.scala:184)
    at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:74)
    at org.apache.spark.ml.Pipeline.fit(Pipeline.scala:136)
    at regressionLinear$.main(regressionLinear.scala:100)
    at regressionLinear.main(regressionLinear.scala)
Answer 0 (score: 0)
VectorAssembler accepts only numeric columns. Columns of other types must be encoded first. Since you are applying LinearRegression, the data has to be encoded anyway.
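As a rough illustration of the simplest numeric encoding, the timestamp can be cast to a long (seconds since the Unix epoch) before it reaches VectorAssembler. This is only a sketch: the object name `TimestampToNumeric`, the helper `withEpochSeconds`, and the column name `timeSeconds` are made up for this example, and whether raw epoch seconds are a meaningful feature depends on the data.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// Hypothetical helper, not part of the original question or answer.
object TimestampToNumeric {
  // Adds a numeric "timeSeconds" column (seconds since the Unix epoch)
  // and drops the original TimestampType column, which VectorAssembler rejects.
  def withEpochSeconds(df: DataFrame): DataFrame =
    df.withColumn("timeSeconds", col("time").cast("long"))
      .drop("time")
}
```

With this in place, the assembler's input columns would list "timeSeconds" instead of "time".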
The exact steps will depend on domain-specific knowledge, for example StringIndexer + OneHotEncoder.
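The StringIndexer + OneHotEncoder route might look roughly like the sketch below. It assumes the hour of day is the relevant component of the timestamp, which is an illustrative choice and not something the answer states; the object name `TimestampFeatures`, its helpers, and the column names `hour`, `hourIndex`, and `hourVec` are likewise hypothetical. The label column `period` is deliberately left out of the feature vector here, which differs from the code in the question.

```scala
import org.apache.spark.ml.{Pipeline, PipelineStage}
import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer, VectorAssembler}
import org.apache.spark.ml.regression.LinearRegression
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, hour}

// Hypothetical helper object, not part of the original question or answer.
object TimestampFeatures {
  // Derive an integer hour-of-day column from the TimestampType column "time".
  def withHour(df: DataFrame): DataFrame =
    df.withColumn("hour", hour(col("time")))

  def buildPipeline(): Pipeline = {
    // Treat the hour as a categorical value and map it to an index.
    val hourIndexer = new StringIndexer()
      .setInputCol("hour")
      .setOutputCol("hourIndex")
      .setHandleInvalid("keep") // tolerate hours unseen during fitting

    // Expand the index into a sparse one-hot vector.
    val hourEncoder = new OneHotEncoder()
      .setInputCol("hourIndex")
      .setOutputCol("hourVec")

    // Only numeric and vector columns are assembled; "time" itself is not used.
    val assembler = new VectorAssembler()
      .setInputCols(Array("hourVec", "nodeID", "posX", "posY", "speed"))
      .setOutputCol("features")

    val lr = new LinearRegression()
      .setLabelCol("period")
      .setFeaturesCol("features")

    new Pipeline().setStages(
      Array[PipelineStage](hourIndexer, hourEncoder, assembler, lr))
  }
}
```

A possible usage, given the `training` DataFrame from the question, would be `TimestampFeatures.buildPipeline().fit(TimestampFeatures.withHour(training))`. If all rows fall within the same hour, a different component (minute, day of week) or a plain numeric cast may be a better fit.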