Simple PySpark regression fails with scala.MatchError on Spark 2.0?

Asked: 2016-09-13 21:56:57

Tags: apache-spark pyspark apache-spark-ml

I am trying to run a linear regression on a two-column DataFrame with a single categorical variable, predicting performance as a function of device_class:

df.select('performance', 'device_class').show(5, False)

+----------------+------------+
|performance     |device_class|
+----------------+------------+
|35              |2           |
|35              |2           |
|35              |2           |
|25              |2           |
|5               |1           |
+----------------+------------+
only showing top 5 rows

df.select('performance', 'device_class').printSchema()

root
 |-- performance: integer (nullable = true)
 |-- device_class: integer (nullable = true)
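
For reference, a throwaway DataFrame with the same shape can be built like this (a sketch; it assumes a SparkSession named spark, and the values are just the sample rows shown above):

from pyspark.sql.types import StructType, StructField, IntegerType

# Minimal reproduction of the DataFrame shape; values copied from the sample rows
schema = StructType([
    StructField('performance', IntegerType()),
    StructField('device_class', IntegerType())])
df = spark.createDataFrame(
    [(35, 2), (35, 2), (35, 2), (25, 2), (5, 1)], schema)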

from pyspark.ml.regression import LinearRegression
from pyspark.ml import Pipeline
from pyspark.ml.feature import OneHotEncoder, VectorAssembler

numEncoder = OneHotEncoder(dropLast=False, inputCol="device_class", outputCol="dev_class_cat")

fAssembler = VectorAssembler(
    inputCols=['dev_class_cat'],
    outputCol='features')

lr = LinearRegression(maxIter=10, regParam=0.3, elasticNetParam=0.8,
                      labelCol='performance', featuresCol='features')

pipeline = Pipeline(stages=[numEncoder, fAssembler])

modelTmp = pipeline.fit(df)
modelTmp.transform(df).select('performance', 'dev_class_cat', 'features').show(5, False)

+----------------+-------------+-------------+
|performance     |dev_class_cat|features     |
+----------------+-------------+-------------+
|35              |(5,[2],[1.0])|(5,[2],[1.0])|
|35              |(5,[2],[1.0])|(5,[2],[1.0])|
|35              |(5,[2],[1.0])|(5,[2],[1.0])|
|25              |(5,[2],[1.0])|(5,[2],[1.0])|
|5               |(5,[1],[1.0])|(5,[1],[1.0])|
+----------------+-------------+-------------+
only showing top 5 rows
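
Side note: the transformed schema can be inspected as well; in Spark 2.0 both feature stages emit pyspark.ml.linalg vectors (VectorUDT), which is the type ml's LinearRegression expects:

# Check the types produced by the feature stages
modelTmp.transform(df).select('dev_class_cat', 'features').printSchema()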

That works; so far so good. However, if I add the regression stage to the pipeline:

pipeline = Pipeline(stages=[numEncoder, fAssembler, lr])
modelTmp = pipeline.fit(df)

I get this error:

Py4JJavaError: An error occurred while calling o2503.fit.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 19 in stage 188.0 failed 4 times, most recent failure: Lost task 19.3 in stage 188.0 (TID 375286, rs119.hadoop.pvt): scala.MatchError: [25,1.0,(5,[1],[1.0])] (of class org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema)
    at org.apache.spark.ml.regression.LinearRegression$$anonfun$5.apply(LinearRegression.scala:200)
    at org.apache.spark.ml.regression.LinearRegression$$anonfun$5.apply(LinearRegression.scala:200)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
    at org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:214)
    at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:919)
    at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:910)
    at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:866)
    at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:910)
    at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:668)
    at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:330)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:281)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
    at org.apache.spark.scheduler.Task.run(Task.scala:85)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

Actually I get a lot of these errors. I found the post MatchError while accessing vector column in Spark 2.0, which reports a similar error, but I am not using anything from mllib. Maybe there is a problem with the sparse vectors?
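
Update: looking more closely at the row in the MatchError, [25,1.0,(5,[1],[1.0])], the label comes through as the integer 25, while the pattern match at LinearRegression.scala:200 presumably expects a Double label. A possible workaround (untested sketch) is to cast the label column to double before fitting:

from pyspark.sql.types import DoubleType

# Untested: cast the integer label to double so it matches the
# Row(label: Double, ...) pattern inside LinearRegression
dfDouble = df.withColumn('performance', df['performance'].cast(DoubleType()))
modelTmp = pipeline.fit(dfDouble)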

0 Answers
