Running Windows 8.1, Java 1.8, Scala 2.10.5, Spark 1.4.1, Scala IDE (Eclipse 4.4), IPython 3.0.0 and Jupyter Scala.
I'm relatively new to Scala and Spark, and I'm running into a problem where certain RDD commands such as collect and first return a "Task not serializable" error. What seems unusual to me is that I get this error in an IPython notebook with the Scala kernel and in Scala IDE, but when I run the same code directly in spark-shell I don't get the error.
I'd like to set up these two environments so I can do more advanced code evaluation outside the shell. I have little experience troubleshooting this kind of issue or knowing what to look for; any guidance on how to start diagnosing problems like this would be greatly appreciated.
Code:
val logFile = "s3n://[key:[key secret]@mortar-example-data/airline-data"
// take the first 100 lines, strip quote characters and the trailing character,
// then turn the result back into an RDD
val sample = sc.parallelize(sc.textFile(logFile).take(100).map(line => line.replace("'","").replace("\"","")).map(line => line.substring(0,line.length()-1)))
val header = sample.first
val data = sample.filter(_!= header)
// these actions fail with "Task not serializable" in the notebook / IDE
data.take(1)
data.count
data.collect
Stack trace:
org.apache.spark.SparkException: Task not serializable
org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:315)
org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:305)
org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:132)
org.apache.spark.SparkContext.clean(SparkContext.scala:1893)
org.apache.spark.rdd.RDD$$anonfun$filter$1.apply(RDD.scala:311)
org.apache.spark.rdd.RDD$$anonfun$filter$1.apply(RDD.scala:310)
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
org.apache.spark.rdd.RDD.withScope(RDD.scala:286)
org.apache.spark.rdd.RDD.filter(RDD.scala:310)
cmd49$$user$$anonfun$4.apply(Main.scala:188)
cmd49$$user$$anonfun$4.apply(Main.scala:187)
java.io.NotSerializableException: org.apache.spark.SparkConf
Serialization stack:
- object not serializable (class: org.apache.spark.SparkConf, value: org.apache.spark.SparkConf@5976e363)
- field (class: cmd12$$user, name: conf, type: class org.apache.spark.SparkConf)
- object (class cmd12$$user, cmd12$$user@39a7edac)
- field (class: cmd49, name: $ref$cmd12, type: class cmd12$$user)
- object (class cmd49, cmd49@3c2a0c4f)
- field (class: cmd49$$user, name: $outer, type: class cmd49)
- object (class cmd49$$user, cmd49$$user@774ea026)
- field (class: cmd49$$user$$anonfun$4, name: $outer, type: class cmd49$$user)
- object (class cmd49$$user$$anonfun$4, <function0>)
- field (class: cmd49$$user$$anonfun$4$$anonfun$apply$3, name: $outer, type: class cmd49$$user$$anonfun$4)
- object (class cmd49$$user$$anonfun$4$$anonfun$apply$3, <function1>)
org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40)
org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:47)
org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:81)
org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:312)
org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:305)
org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:132)
org.apache.spark.SparkContext.clean(SparkContext.scala:1893)
org.apache.spark.rdd.RDD$$anonfun$filter$1.apply(RDD.scala:311)
org.apache.spark.rdd.RDD$$anonfun$filter$1.apply(RDD.scala:310)
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
org.apache.spark.rdd.RDD.withScope(RDD.scala:286)
org.apache.spark.rdd.RDD.filter(RDD.scala:310)
cmd49$$user$$anonfun$4.apply(Main.scala:188)
cmd49$$user$$anonfun$4.apply(Main.scala:187)
Answer 0 (score: 1)
@Ashalynd is right about the fact that sc.textFile already creates an RDD. You don't need sc.parallelize in that case (documentation here).
So considering your example, this is what you need to do:
// Read your data from S3
val logFile = "s3n://[key:[key secret]@mortar-example-data/airline-data"
val rawRDD = sc.textFile(logFile)
// Fetch the header
val header = rawRDD.first
// Filter out the header, then map to clean each line
val sample = rawRDD.filter(!_.contains(header)).map { line =>
  // strip quotes, then drop the trailing character of the cleaned line
  // (using init on the cleaned string avoids an out-of-bounds substring)
  line.replaceAll("['\"]","").init
}.takeSample(false,100,12L) // takeSample returns a fixed-size sampled subset of this RDD in an array
It's better to use the takeSample function:
def takeSample(withReplacement: Boolean, num: Int, seed: Long = Utils.random.nextLong): Array[T]
withReplacement: whether to sample with or without replacement
num: the size of the returned sample
seed: the seed for the random number generator
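For instance, here is a minimal sketch of takeSample's behaviour; the toy numbers RDD, sample size and seed are made up purely for illustration:
// a hypothetical toy RDD, purely to illustrate takeSample
val numbers = sc.parallelize(1 to 1000)
// takeSample is an action: it returns an Array[Int] on the driver, not another RDD
val withoutReplacement = numbers.takeSample(false, 10, 42L)
val withReplacement = numbers.takeSample(true, 10, 42L) // the same element may appear more than once
println(withoutReplacement.mkString(", "))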
Note 1: sample is an Array[String], so if you wish to turn it into an RDD, you can use the parallelize function as follows:
val sampleRDD = sc.parallelize(sample.toSeq)
Note 2: If you wish to get a sample RDD directly from rawRDD.filter(...).map(...), you can use the sample function, which returns an RDD[T]. However, you need to specify the fraction of the data you want rather than a specific number, as shown in the sketch below.
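As a rough sketch of that approach (the 0.01 fraction and the seed are arbitrary values chosen for illustration, not taken from the question):
// sample keeps the data distributed: it returns an RDD[String],
// unlike takeSample, which collects an Array to the driver
val sampledRDD = rawRDD
  .filter(!_.contains(header))
  .map(line => line.replaceAll("['\"]","").init)
  .sample(withReplacement = false, fraction = 0.01, seed = 12L) // roughly 1% of the rows, not an exact count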
Answer 1 (score: 0)
sc.textFile already creates a distributed dataset (check the documentation). You don't need sc.parallelize in that case, but, as eliasah rightly pointed out, you need to turn the result into an RDD again if you want to have an RDD.
val selection = sc.textFile(logFile). // RDD
take(100). // collection (an Array[String] on the driver)
map(_.replaceAll("['\"]","")). // use a regex to match both quote characters
map(_.init) // init returns all characters except the last
// turn the resulting collection into an RDD again
val sample = sc.parallelize(selection)