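I have a small Scala program that runs fine in spark-shell but fails in Eclipse with the Scala plugin. With the same plugin I can write another file that accesses HDFS, and that one works.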
FirstSpark.scala
package bigdata.spark

import org.apache.spark.SparkConf
import java.io._
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._

object FirstSpark {
  def main(args: Array[String]) = {
    val conf = new SparkConf().setMaster("local").setAppName("FirstSparkProgram")
    val sparkcontext = new SparkContext(conf)
    val textFile = sparkcontext.textFile("hdfs://pranay:8020/spark/linkage")
    val m = new Methods()
    val q = textFile.filter(x => !m.isHeader(x)).map(x => m.parse(x))
    q.saveAsTextFile("hdfs://pranay:8020/output")
  }
}
Methods.scala
package bigdata.spark

import java.util.function.ToDoubleFunction

class Methods {
  def isHeader(s: String): Boolean = {
    s.contains("id_1")
  }

  def parse(line: String) = {
    val pieces = line.split(',')
    val id1 = pieces(0).toInt
    val id2 = pieces(1).toInt
    val matches = pieces(11).toBoolean
    val mapArray = pieces.slice(2, 11).map(toDouble)
    MatchData(id1, id2, mapArray, matches)
  }

  def toDouble(s: String) = {
    if ("?".equals(s)) Double.NaN else s.toDouble
  }
}

case class MatchData(id1: Int, id2: Int,
  scores: Array[Double], matched: Boolean)
Error message:
Exception in thread "main" org.apache.spark.SparkException: Task not serializable
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:304)
at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:294)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:122)
at org.apache.spark.SparkContext.clean(SparkContext.scala:2032)
at org.apache.spark.rdd.RDD$$anonfun$filter$1.apply(RDD.scala:335)
at org.apache.spark.rdd.RDD$$anonfun$filter$1.apply(RDD.scala:334)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:310)
Can anyone help me with this?
Answer 0 (score: 0)
Try changing class Methods { .. } to object Methods { .. }.
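A minimal sketch of that change, reusing the names and HDFS paths from the question. The explicit return types and the final stop() call are additions that are not in the original code, and it assumes the same Spark dependency and cluster setup as the question. Note that with an object there is no new Methods() any more; the call sites go through Methods directly.

```scala
package bigdata.spark

import org.apache.spark.{SparkConf, SparkContext}

case class MatchData(id1: Int, id2: Int,
  scores: Array[Double], matched: Boolean)

// Singleton instead of a class: no instance has to be created on the driver.
object Methods {
  def isHeader(s: String): Boolean = s.contains("id_1")

  def toDouble(s: String): Double =
    if ("?".equals(s)) Double.NaN else s.toDouble

  def parse(line: String): MatchData = {
    val pieces = line.split(',')
    val id1 = pieces(0).toInt
    val id2 = pieces(1).toInt
    val matches = pieces(11).toBoolean
    val mapArray = pieces.slice(2, 11).map(toDouble)
    MatchData(id1, id2, mapArray, matches)
  }
}

object FirstSpark {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local").setAppName("FirstSparkProgram")
    val sparkcontext = new SparkContext(conf)
    val textFile = sparkcontext.textFile("hdfs://pranay:8020/spark/linkage")
    // The functions below refer to the Methods singleton, not to a local instance.
    val q = textFile.filter(x => !Methods.isHeader(x)).map(x => Methods.parse(x))
    q.saveAsTextFile("hdfs://pranay:8020/output")
    sparkcontext.stop() // added here for tidiness, not in the question's code
  }
}
```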
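I think the problem is in val q = textFile.filter(x => !m.isHeader(x)).map(x => m.parse(x)). When Spark sees the filter and map calls, it tries to serialize the functions passed to them (x => !m.isHeader(x) and x => m.parse(x)) so that it can ship the work of running them to the executors (this is the "Task" in the error message). To do that it also has to serialize m, because that object is referenced inside the functions (it is in the closure of both anonymous functions), and it cannot, because Methods is not Serializable. You could add extends Serializable to the Methods class, but in this case an object is a better fit: its methods are called through the singleton, so the closures no longer capture an instance that would have to be serialized.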
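The capture behaviour can be checked without Spark, using plain Java serialization. This is only an illustration under some assumptions: PlainMethods, ClosureCheck and serializes are hypothetical names made up for this sketch, and it assumes the code is compiled as a normal source file (in a REPL or script the wrapper objects can change what a closure captures).

```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

// Hypothetical stand-in for the question's Methods class: a plain, non-Serializable class.
class PlainMethods {
  def isHeader(s: String): Boolean = s.contains("id_1")
}

// Singleton variant, as suggested in the answer.
object Methods {
  def isHeader(s: String): Boolean = s.contains("id_1")
}

object ClosureCheck {
  // Returns true if Java serialization accepts the value, false if it is rejected.
  def serializes(value: AnyRef): Boolean =
    try {
      val out = new ObjectOutputStream(new ByteArrayOutputStream())
      out.writeObject(value)
      out.close()
      true
    } catch {
      case _: NotSerializableException => false
    }

  def main(args: Array[String]): Unit = {
    val m = new PlainMethods()
    // This function captures the local instance m.
    val viaInstance: String => Boolean = x => !m.isHeader(x)
    // This function refers only to the singleton and captures nothing.
    val viaObject: String => Boolean = x => !Methods.isHeader(x)

    println("closure over instance serializes: " + serializes(viaInstance))
    println("closure over object serializes:   " + serializes(viaObject))
  }
}
```

The first function should be rejected, because serializing it means serializing the captured PlainMethods instance, while the second should go through. Spark performs a similar serialization check on the functions passed to filter and map, which is where the exception in the question is raised.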