java.lang.IllegalArgumentException: Field "label" does not exist using SparkML

Date: 2017-08-05 12:12:25

Tags: scala linear-regression apache-spark-ml apache-spark-dataset apache-spark-2.0

I am using Spark with Scala for time series analysis. I have a dataset taken from a Cassandra database that looks like this:

scala> train.printSchema
root
 |-- timestamp: timestamp (nullable = true)
 |-- vx: double (nullable = true)
 |-- speed: double (nullable = true)

I tried linear regression, as shown here, just to see how it works.

scala> val lr = new LinearRegression().
 |   setMaxIter(10).
 |   setRegParam(0.3).
 |   setElasticNetParam(0.8)
scala> val lrModel = lr.fit(train)

However, I get an error:

java.lang.IllegalArgumentException: Field "features" does not exist.
  at org.apache.spark.sql.types.StructType$$anonfun$apply$1.apply(StructType.scala:266)
  at org.apache.spark.sql.types.StructType$$anonfun$apply$1.apply(StructType.scala:266)
  at scala.collection.MapLike$class.getOrElse(MapLike.scala:128)
  at scala.collection.AbstractMap.getOrElse(Map.scala:59)
  at org.apache.spark.sql.types.StructType.apply(StructType.scala:265)
  at org.apache.spark.ml.util.SchemaUtils$.checkColumnType(SchemaUtils.scala:40)
  at org.apache.spark.ml.PredictorParams$class.validateAndTransformSchema(Predictor.scala:51)
  at org.apache.spark.ml.Predictor.validateAndTransformSchema(Predictor.scala:82)
  at org.apache.spark.ml.Predictor.transformSchema(Predictor.scala:144)
  at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:74)
  at org.apache.spark.ml.Predictor.fit(Predictor.scala:100)
  ... 66 elided

It seems that I have to use VectorAssembler to create a features column containing the predictors,

scala> val assembler = new VectorAssembler().
 |   setInputCols(Array("timestamp","speed")).
 |   setOutputCol("features")
scala> val output = assembler.transform(train)

but it throws an error saying that TimestampType is not supported.

java.lang.IllegalArgumentException: Data type TimestampType is not supported.
  at org.apache.spark.ml.feature.VectorAssembler$$anonfun$transformSchema$1.apply(VectorAssembler.scala:121)
  at org.apache.spark.ml.feature.VectorAssembler$$anonfun$transformSchema$1.apply(VectorAssembler.scala:117)
  at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
  at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
  at org.apache.spark.ml.feature.VectorAssembler.transformSchema(VectorAssembler.scala:117)
  at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:74)
  at org.apache.spark.ml.feature.VectorAssembler.transform(VectorAssembler.scala:54)
  ... 66 elided

If I leave out timestamp and use only the speed column in the VectorAssembler, fitting the model throws another error:

scala> val assembler = new VectorAssembler().
     |   setInputCols(Array("speed")).
     |   setOutputCol("features")
scala> val output = assembler.transform(train)
scala> val lrModel = lr.fit(output)

java.lang.IllegalArgumentException: Field "label" does not exist.
  at org.apache.spark.sql.types.StructType$$anonfun$apply$1.apply(StructType.scala:266)
  at org.apache.spark.sql.types.StructType$$anonfun$apply$1.apply(StructType.scala:266)
  at scala.collection.MapLike$class.getOrElse(MapLike.scala:128)
  at scala.collection.AbstractMap.getOrElse(Map.scala:59)
  at org.apache.spark.sql.types.StructType.apply(StructType.scala:265)
  at org.apache.spark.ml.util.SchemaUtils$.checkNumericType(SchemaUtils.scala:71)
  at org.apache.spark.ml.PredictorParams$class.validateAndTransformSchema(Predictor.scala:53)
  at org.apache.spark.ml.Predictor.validateAndTransformSchema(Predictor.scala:82)
  at org.apache.spark.ml.Predictor.transformSchema(Predictor.scala:144)
  at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:74)
  at org.apache.spark.ml.Predictor.fit(Predictor.scala:100)
  ... 66 elided

I don't know why it says Field "label" does not exist when I pass speed alone as the predictor. Any help is much appreciated.

1 answer:

Answer 0 (score: 1)

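You need to define which column or columns to use as the features and which as the label. If several columns are used as features, VectorAssembler() is the right tool, as you have done. Otherwise, just call setFeaturesCol() with the column name. Note that the features column must contain vectors; a plain double column will not work.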

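For the label (the value the model should predict; for a classifier, the class), use setLabelCol() to name the column. If you do not set it, the estimator looks for a column called "label", which is why you get the error. In your case, since the timestamp and speed columns are the predictors, I assume the vx column is the label.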
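As a minimal sketch of the point above, assuming vx is the label and speed is the only predictor: the object name LabelColSketch, the helper fitOnSpeed, and the sample rows are made up for illustration, and the hyperparameters are copied from the question.

```scala
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.regression.{LinearRegression, LinearRegressionModel}
import org.apache.spark.sql.{DataFrame, SparkSession}

object LabelColSketch {
  // Fits vx ~ speed. Column names follow the schema shown in the question.
  def fitOnSpeed(train: DataFrame): LinearRegressionModel = {
    // The features column must hold vectors, so even a single Double
    // column goes through VectorAssembler.
    val assembler = new VectorAssembler()
      .setInputCols(Array("speed"))
      .setOutputCol("features")

    val lr = new LinearRegression()
      .setMaxIter(10)
      .setRegParam(0.3)
      .setElasticNetParam(0.8)
      .setFeaturesCol("features") // the default name, spelled out for clarity
      .setLabelCol("vx")          // replaces the default name "label"

    lr.fit(assembler.transform(train))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("LabelColSketch")
      .getOrCreate()
    import spark.implicits._

    // Made-up rows standing in for the Cassandra data.
    val train = Seq((1.0, 10.0), (2.1, 20.0), (2.9, 30.0), (4.2, 40.0))
      .toDF("vx", "speed")

    val model = fitOnSpeed(train)
    println(s"coefficients=${model.coefficients} intercept=${model.intercept}")
    spark.stop()
  }
}
```

An alternative with the same effect would be to rename the column with withColumnRenamed("vx", "label") and leave the estimator's defaults alone.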

To use the timestamp, simply convert it to Unix epoch time:

import org.apache.spark.sql.functions.unix_timestamp
val df2 = df.withColumn("unix_time", unix_timestamp(df("timestamp")))

This gives you the number of seconds since 1970-01-01 00:00:00 UTC.
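The conversion can then feed VectorAssembler, which rejected the raw TimestampType column in the question. The following is only a sketch under assumptions: the object name TimestampFeatureSketch, the helper withFeatures, and the sample rows are hypothetical, and the cast to double is an optional choice made here to keep all predictors in one numeric type.

```scala
import java.sql.Timestamp
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, unix_timestamp}

object TimestampFeatureSketch {
  // Adds a numeric "unix_time" column and assembles it with "speed"
  // into a vector column named "features".
  def withFeatures(df: DataFrame): DataFrame = {
    // unix_timestamp yields whole seconds as a long; the cast is optional.
    val withEpoch = df.withColumn(
      "unix_time",
      unix_timestamp(col("timestamp")).cast("double"))

    new VectorAssembler()
      .setInputCols(Array("unix_time", "speed"))
      .setOutputCol("features")
      .transform(withEpoch)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("TimestampFeatureSketch")
      .getOrCreate()
    import spark.implicits._

    // Made-up rows with the same schema as the question's train dataset.
    val train = Seq(
      (Timestamp.valueOf("2017-08-05 12:00:00"), 1.0, 10.0),
      (Timestamp.valueOf("2017-08-05 12:00:01"), 2.1, 20.0)
    ).toDF("timestamp", "vx", "speed")

    withFeatures(train).select("timestamp", "unix_time", "features").show(false)
    spark.stop()
  }
}
```

The resulting DataFrame can be passed to lr.fit once the label column has been set with setLabelCol("vx").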

Hope it helps!