Program runs on Spark 2.1.1 but not on 2.0.2

Asked: 2017-09-25 12:30:55

Tags: scala apache-spark apache-spark-mllib

I have a very simple clustering program that I developed in IntelliJ IDEA with Spark 2.1.1. But when I launch the .jar on my cluster, which runs Spark 2.0.2, I get the following error:

17/09/25 14:23:11 ERROR Executor: Exception in task 2.0 in stage 3.0 (TID 7)
org.apache.spark.SparkException: Failed to execute user defined function($anonfun$2: (vector) => vector)
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
        at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
        at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
        at org.apache.spark.sql.execution.columnar.InMemoryRelation$$anonfun$1$$anon$1.next(InMemoryRelation.scala:106)
        at org.apache.spark.sql.execution.columnar.InMemoryRelation$$anonfun$1$$anon$1.next(InMemoryRelation.scala:98)
        at org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:214)
        at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:935)
        at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:926)
        at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:866)
        at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:926)
        at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:670)
        at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:330)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:281)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
        at org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:89)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
        at org.apache.spark.scheduler.Task.run(Task.scala:86)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
        at java.lang.Thread.run(Unknown Source)
Caused by: java.lang.IllegalArgumentException: Do not support vector type class org.apache.spark.mllib.linalg.SparseVector
        at org.apache.spark.mllib.feature.StandardScalerModel.transform(StandardScaler.scala:160)
        at org.apache.spark.ml.feature.StandardScalerModel$$anonfun$2.apply(StandardScaler.scala:167)
        at org.apache.spark.ml.feature.StandardScalerModel$$anonfun$2.apply(StandardScaler.scala:167)
        ... 37 more
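For context on the last line of the trace: Spark 2.x ships two separate vector hierarchies with the same class names in different packages, and they are not interchangeable. The snippet below is a minimal, self-contained sketch for illustration only (it is not code from the question):

    // Minimal sketch (not taken from the question) of Spark 2.x's two vector types.
    // The spark.ml pipeline stages used below (VectorAssembler, StandardScaler, KMeans)
    // work with org.apache.spark.ml.linalg vectors; the exception above refers to the
    // older org.apache.spark.mllib.linalg type.
    import org.apache.spark.ml.linalg.{Vector => MLVector, Vectors => MLVectors}
    import org.apache.spark.mllib.linalg.{Vector => MLlibVector, Vectors => MLlibVectors}

    val mllibVec: MLlibVector = MLlibVectors.sparse(3, Array(0, 2), Array(1.0, 3.0))
    val mlVec: MLVector = MLVectors.sparse(3, Array(0, 2), Array(1.0, 3.0))

    // The two classes are not interchangeable; since Spark 2.0 an mllib vector
    // can be converted explicitly with asML.
    val converted: MLVector = mllibVec.asML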

Here is my code:

import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.feature.{StandardScaler, VectorAssembler}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.DoubleType

def main(args: Array[String]): Unit = {

    val spark = SparkSession.builder.config("spark.eventLog.enabled", "true").config("spark.eventLog.dir", "").appName("S1").getOrCreate()

    // Read the CSV, drop the card-number column, and cast all remaining columns to Double
    val df = spark.read.format("csv").option("header", true).csv("petitexport.csv")
    var dff = df.drop("numeroCarte")
    dff.cache()
    for (field <- dff.schema.fields) {
      dff = dff.withColumn(field.name, dff(field.name).cast(DoubleType))
    }

    // Assemble the feature columns into a single vector column
    val featureCols = Array("NB de trx", "NB de trx RD", "Somme RD", "Somme refus", "NB Pays visite", "NB trx nocturnes")
    val assembler = new VectorAssembler().setInputCols(featureCols).setOutputCol("features")
    val dff2 = assembler.transform(dff)

    // Standardize the features to zero mean and unit variance
    val scaler = new StandardScaler().setWithStd(true).setWithMean(true).setInputCol("features").setOutputCol("scaledFeatures")
    val scalerModel = scaler.fit(dff2)
    val scaledData2 = scalerModel.transform(dff2)
    scaledData2.cache

    // Cluster the scaled features with k-means and show the assignments
    val kmeans = new KMeans().setK(5).setMaxIter(10).setTol(0.001).setSeed(200).setFeaturesCol("scaledFeatures")
    val model = kmeans.fit(scaledData2)
    val predictions = model.transform(scaledData2)
    predictions.show
}

Is it possible to fix this so that it runs on Spark 2.0.2? I understand it has to do with SparseVector, but I don't really see a solution.
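One direction that is sometimes considered for this kind of compile-time/runtime version mismatch, shown here only as a hedged sketch and not as a verified fix, is to build the job against the cluster's Spark version (2.0.2) and mark the Spark artifacts as provided. Assuming an sbt build (the actual build file is not shown in the question):

    // build.sbt sketch (hypothetical; adjust the Scala version to the one the 2.0.2 cluster uses)
    scalaVersion := "2.11.8"

    libraryDependencies ++= Seq(
      // Match the cluster's Spark version instead of the 2.1.1 used for local development
      "org.apache.spark" %% "spark-sql"   % "2.0.2" % "provided",
      "org.apache.spark" %% "spark-mllib" % "2.0.2" % "provided"
    )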

0 Answers:

No answers