Code that works in spark-shell does not work in Eclipse

Time: 2016-07-20 04:47:43

Tags: scala apache-spark eclipse-plugin

I have a small piece of Scala code that runs fine in spark-shell but not in Eclipse with the Scala plug-in. Using the same plug-in I can write another file that accesses HDFS, and that one works.

FirstSpark.scala

package bigdata.spark

import org.apache.spark.SparkConf
import java.io._
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._

object FirstSpark {

  def main(args: Array[String]) = {
    val conf = new SparkConf().setMaster("local").setAppName("FirstSparkProgram")
    val sparkcontext = new SparkContext(conf)
    val textFile = sparkcontext.textFile("hdfs://pranay:8020/spark/linkage")
    val m = new Methods()
    val q = textFile.filter(x => !m.isHeader(x)).map(x => m.parse(x))
    q.saveAsTextFile("hdfs://pranay:8020/output")
  }
}

Methods.scala

package bigdata.spark

import java.util.function.ToDoubleFunction

class Methods {

  def isHeader(s: String): Boolean = {
    s.contains("id_1")
  }

  def parse(line: String) = {
    val pieces = line.split(',')
    val id1 = pieces(0).toInt
    val id2 = pieces(1).toInt
    val matches = pieces(11).toBoolean
    val mapArray = pieces.slice(2, 11).map(toDouble)
    MatchData(id1, id2, mapArray, matches)
  }

  def toDouble(s: String) = {
    if ("?".equals(s)) Double.NaN else s.toDouble
  }
}

case class MatchData(id1: Int, id2: Int, scores: Array[Double], matched: Boolean)

Error message:

Exception in thread "main" org.apache.spark.SparkException: Task not serializable
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:304)
at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:294)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:122)
at org.apache.spark.SparkContext.clean(SparkContext.scala:2032)
at org.apache.spark.rdd.RDD$$anonfun$filter$1.apply(RDD.scala:335)
at org.apache.spark.rdd.RDD$$anonfun$filter$1.apply(RDD.scala:334)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:310)

Can anyone help me with this?

1 answer:

Answer 0 (score: 0)

Try changing class Methods { .. } to object Methods { .. }.
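
A minimal sketch of what that change could look like, assuming the same package, method bodies, and MatchData case class as in the question (only the class keyword and call style change):

package bigdata.spark

// As an object, Methods has no instance that must be captured and
// serialized into the filter/map closures; its methods are reached
// through the singleton, which each executor resolves locally.
object Methods {

  def isHeader(s: String): Boolean = s.contains("id_1")

  def toDouble(s: String): Double =
    if ("?".equals(s)) Double.NaN else s.toDouble

  def parse(line: String): MatchData = {
    val pieces = line.split(',')
    val id1 = pieces(0).toInt
    val id2 = pieces(1).toInt
    val matches = pieces(11).toBoolean
    val scores = pieces.slice(2, 11).map(toDouble)
    MatchData(id1, id2, scores, matches)
  }
}

// Unchanged from the question; case classes are serializable by default.
case class MatchData(id1: Int, id2: Int, scores: Array[Double], matched: Boolean)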

I think the problem occurs at val q = textFile.filter(x => !m.isHeader(x)).map(x => m.parse(x)). When Spark sees the filter and map operations, it tries to serialize the functions passed to them (x => !m.isHeader(x) and x => m.parse(x)) so that it can ship the work of executing them to all the executors (this is the "task" referred to in the error). To do that, however, it has to serialize m, because that object is referenced inside the functions (it is in the closure of both anonymous functions), and it cannot, because Methods is not serializable. You could add extends Serializable to the Methods class, but in this case an object is more appropriate (and it is already serializable).
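
With Methods as an object, the driver no longer instantiates it, so the closures capture nothing non-serializable. A sketch of how the driver might then look, assuming the same HDFS paths and app name from the question:

package bigdata.spark

import org.apache.spark.{SparkConf, SparkContext}

object FirstSpark {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local").setAppName("FirstSparkProgram")
    val sparkContext = new SparkContext(conf)

    val textFile = sparkContext.textFile("hdfs://pranay:8020/spark/linkage")

    // Call the object's methods directly: the closures only reference the
    // statically accessible Methods singleton, so they serialize cleanly.
    val parsed = textFile
      .filter(line => !Methods.isHeader(line))
      .map(line => Methods.parse(line))

    parsed.saveAsTextFile("hdfs://pranay:8020/output")
  }
}

Note that the line val m = new Methods() is gone; it was that captured instance which made the original closures unserializable.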