I am using Spark with Scala for time series analysis. I have a dataset taken from a Cassandra database that looks like this:
scala> train.printSchema
root
 |-- timestamp: timestamp (nullable = true)
 |-- vx: double (nullable = true)
 |-- speed: double (nullable = true)
I tried Linear Regression as shown here, just to see how it works.
scala> val lr = new LinearRegression().
| setMaxIter(10).
| setRegParam(0.3).
| setElasticNetParam(0.8)
scala> val lrModel = lr.fit(train)
However, I get an error:
java.lang.IllegalArgumentException: Field "features" does not exist.
  at org.apache.spark.sql.types.StructType$$anonfun$apply$1.apply(StructType.scala:266)
  at org.apache.spark.sql.types.StructType$$anonfun$apply$1.apply(StructType.scala:266)
  at scala.collection.MapLike$class.getOrElse(MapLike.scala:128)
  at scala.collection.AbstractMap.getOrElse(Map.scala:59)
  at org.apache.spark.sql.types.StructType.apply(StructType.scala:265)
  at org.apache.spark.ml.util.SchemaUtils$.checkColumnType(SchemaUtils.scala:40)
  at org.apache.spark.ml.PredictorParams$class.validateAndTransformSchema(Predictor.scala:51)
  at org.apache.spark.ml.Predictor.validateAndTransformSchema(Predictor.scala:82)
  at org.apache.spark.ml.Predictor.transformSchema(Predictor.scala:144)
  at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:74)
  at org.apache.spark.ml.Predictor.fit(Predictor.scala:100)
  ... 66 elided
It seems that I have to use VectorAssembler to create a features column containing the predictors,
scala> val assembler = new VectorAssembler().
| setInputCols(Array("timestamp","speed")).
| setOutputCol("features")
scala> val output = assembler.transform(train)
but it throws the error TimestampType is not supported.
java.lang.IllegalArgumentException: Data type TimestampType is not supported.
  at org.apache.spark.ml.feature.VectorAssembler$$anonfun$transformSchema$1.apply(VectorAssembler.scala:121)
  at org.apache.spark.ml.feature.VectorAssembler$$anonfun$transformSchema$1.apply(VectorAssembler.scala:117)
  at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
  at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
  at org.apache.spark.ml.feature.VectorAssembler.transformSchema(VectorAssembler.scala:117)
  at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:74)
  at org.apache.spark.ml.feature.VectorAssembler.transform(VectorAssembler.scala:54)
  ... 66 elided
If I leave out timestamp and use only the speed column in the VectorAssembler, it throws another error. See below:
scala> val assembler = new VectorAssembler().
| setInputCols(Array("speed")).
| setOutputCol("features")
scala> val output = assembler.transform(train)
scala> val lrModel = lr.fit(output)
java.lang.IllegalArgumentException: Field "label" does not exist.
  at org.apache.spark.sql.types.StructType$$anonfun$apply$1.apply(StructType.scala:266)
  at org.apache.spark.sql.types.StructType$$anonfun$apply$1.apply(StructType.scala:266)
  at scala.collection.MapLike$class.getOrElse(MapLike.scala:128)
  at scala.collection.AbstractMap.getOrElse(Map.scala:59)
  at org.apache.spark.sql.types.StructType.apply(StructType.scala:265)
  at org.apache.spark.ml.util.SchemaUtils$.checkNumericType(SchemaUtils.scala:71)
  at org.apache.spark.ml.PredictorParams$class.validateAndTransformSchema(Predictor.scala:53)
  at org.apache.spark.ml.Predictor.validateAndTransformSchema(Predictor.scala:82)
  at org.apache.spark.ml.Predictor.transformSchema(Predictor.scala:144)
  at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:74)
  at org.apache.spark.ml.Predictor.fit(Predictor.scala:100)
  ... 66 elided
I don't know why it says Field "label" does not exist when I use speed alone as the predictor. Any help is much appreciated.
Answer 0 (score: 1)
You need to define which column(s) to use as the features and which one as the label. If you use several columns as features, then VectorAssembler() is the right tool, just as you did. Otherwise, simply call the setFeaturesCol() method with the column name. Note that in that case the input column must already contain vectors, not doubles. For the label (the value you want to predict), use setLabelCol() to define which column to use. In your case, since the timestamp and speed columns are the predictors, I assume the vx column is the label.
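For example, a minimal sketch of how the estimator could be configured with your column names (assuming vx really is the label and the features column comes from a VectorAssembler):

import org.apache.spark.ml.regression.LinearRegression

// Point the estimator at your own columns instead of the default
// "label" and "features" columns it otherwise looks for.
val lr = new LinearRegression()
  .setMaxIter(10)
  .setRegParam(0.3)
  .setElasticNetParam(0.8)
  .setLabelCol("vx")          // column to predict
  .setFeaturesCol("features") // vector column produced by VectorAssembler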
To use the timestamp, you just need to convert it to Unix epoch time:
val df2 = df.withColumn("unix_time", unix_timestamp(df("timestamp")))
This gives you the number of seconds since January 1, 1970.
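Putting it all together, a rough end-to-end sketch might look like this (the unix_time column name is just for illustration):

import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.regression.LinearRegression
import org.apache.spark.sql.functions.unix_timestamp

// Turn the timestamp into seconds since the epoch so it becomes a numeric feature.
val trainWithTime = train.withColumn("unix_time", unix_timestamp(train("timestamp")))

// Assemble both predictors into a single "features" vector column.
val assembler = new VectorAssembler()
  .setInputCols(Array("unix_time", "speed"))
  .setOutputCol("features")

val assembled = assembler.transform(trainWithTime)

val lr = new LinearRegression()
  .setMaxIter(10)
  .setRegParam(0.3)
  .setElasticNetParam(0.8)
  .setLabelCol("vx")
  .setFeaturesCol("features")

val lrModel = lr.fit(assembled)
println(s"Coefficients: ${lrModel.coefficients}  Intercept: ${lrModel.intercept}")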
Hope it helps!