I'm using Spark Streaming to process data coming from Kafka, and I'd like to write the results to a file (locally). When I print to the console everything works fine and I get my results, but when I try to write them to a file I get an error.
I'm using a PrintWriter to do that, but I get this error:
Exception in thread "main" java.io.NotSerializableException: DStream checkpointing has been enabled but the DStreams with their functions are not serializable
java.io.PrintWriter
Serialization stack:
- object not serializable (class: java.io.PrintWriter, value: java.io.PrintWriter@20f6f88c)
- field (class: streaming.followProduction$$anonfun$main$1, name: qualityWriter$1, type: class java.io.PrintWriter)
- object (class streaming.followProduction$$anonfun$main$1, <function1>)
- field (class: streaming.followProduction$$anonfun$main$1$$anonfun$apply$1, name: $outer, type: class streaming.followProduction$$anonfun$main$1)
- object (class streaming.followProduction$$anonfun$main$1$$anonfun$apply$1, <function1>)
- field (class: org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1$$anonfun$apply$mcV$sp$3, name: cleanedF$1, type: interface scala.Function1)
- object (class org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1$$anonfun$apply$mcV$sp$3, <function2>)
- writeObject data (class: org.apache.spark.streaming.dstream.DStreamCheckpointData)
- object (class org.apache.spark.streaming.kafka010.DirectKafkaInputDStream$DirectKafkaInputDStreamCheckpointData,
I guess I can't use the writer like this inside foreachRDD!

Here's the code I'm running:
object followProduction extends Serializable {

  def main(args: Array[String]) = {

    val qualityWriter = new PrintWriter(new File("diskQuality.txt"))
    qualityWriter.append("dateTime , quality , status \n")

    val sparkConf = new SparkConf().setMaster("spark://address:7077").setAppName("followProcess").set("spark.streaming.concurrentJobs", "4")
    val sc = new StreamingContext(sparkConf, Seconds(10))

    sc.checkpoint("checkpoint")

    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "address:9092",
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> s"${UUID.randomUUID().toString}",
      "auto.offset.reset" -> "earliest",
      "enable.auto.commit" -> (false: java.lang.Boolean)
    )

    val topics = Array("A", "C")

    topics.foreach(t => {
      val stream = KafkaUtils.createDirectStream[String, String](
        sc,
        PreferConsistent,
        Subscribe[String, String](Array(t), kafkaParams)
      )

      stream.foreachRDD(rdd => {
        rdd.collect().foreach(i => {
          val record = i.value()
          val newCsvRecord = process(t, record)
          println(newCsvRecord)
          qualityWriter.append(newCsvRecord)
        })
      })
    })

    qualityWriter.close()

    sc.start()
    sc.awaitTermination()
  }

  var componentQuantity: componentQuantity = new componentQuantity("", 0.0, 0.0, 0.0)
  var diskQuality: diskQuality = new diskQuality("", 0.0)

  def process(topic: String, record: String): String = topic match {
    case "A" => componentQuantity.checkQuantity(record)
    case "C" => diskQuality.followQuality(record)
  }
}
How can I achieve this? I'm quite new to Spark and Scala, so maybe I'm not doing this right. Thanks for your time.
EDIT:
I've changed my code and I no longer get that error. But now my file only contains the first line, and no records get appended. The inner writer (handleWriter) doesn't actually seem to work.
Here's my code:
case class diskQuality(datetime: String, quality: Double) extends Serializable {

  def followQuality(record: String): String = {
    val dateFormat: SimpleDateFormat = new SimpleDateFormat("dd-mm-yyyy hh:mm:ss")
    var recQuality = msgParse(record).quality
    var date: Date = dateFormat.parse(msgParse(record).datetime)
    var recDateTime = new SimpleDateFormat("dd-mm-yyyy hh:mm:ss").format(date)
    // some operations here
    return recDateTime + " , " + recQuality
  }

  def msgParse(value: String): diskQuality = {
    import org.json4s._
    import org.json4s.native.JsonMethods._
    implicit val formats = DefaultFormats
    val res = parse(value).extract[diskQuality]
    return res
  }
}
What am I missing? Maybe I'm doing something wrong...
Answer 0 (score: 2)
The simplest thing to do would be to create the instance of the PrintWriter inside foreachRDD, which means it won't be captured by the function closure:
stream.foreachRDD(rdd => {
  val qualityWriter = new PrintWriter(new File("diskQuality.txt"))
  qualityWriter.append("dateTime , quality , status \n")
  rdd.collect().foreach(i => {
    val record = i.value()
    val newCsvRecord = process(t, record)
    qualityWriter.append(newCsvRecord)
  })
})
Answer 1 (score: 2)
A PrintWriter is a local resource, bound to a single machine, and it cannot be serialized.

To remove this object from the Java serialization plan, we can declare it @transient. That means the serialized form of the followProduction object will not attempt to serialize this field.
In the question's code, it should be declared as:
@transient val qualityWriter = new PrintWriter(new File("diskQuality.txt"))
It can then be used within the foreachRDD closure.
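As a side note, here is a minimal sketch of my own (not from either answer, and independent of Spark) showing what @transient does under plain Java serialization: the marked field is simply skipped instead of triggering java.io.NotSerializableException, and it comes back as null after deserialization.

import java.io._

// A serializable holder whose PrintWriter field is excluded from serialization.
class Holder(val name: String) extends Serializable {
  @transient val writer: PrintWriter = new PrintWriter(new File("demo.txt"))
}

object TransientDemo {
  def main(args: Array[String]): Unit = {
    val bytes = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(bytes)
    out.writeObject(new Holder("demo")) // no exception, even though PrintWriter is not serializable
    out.close()

    val copy = new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray))
      .readObject().asInstanceOf[Holder]

    println(copy.name)           // "demo": serialized as usual
    println(copy.writer == null) // true: the @transient field was skipped
  }
}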
However, this does not solve the issues related to properly handling the file: qualityWriter.close() will be executed on the first pass of the streaming job, so the file descriptor will already be closed for writing while the job runs. To properly use a local resource such as a File, I would follow Yuval's suggestion and re-create the PrintWriter inside the foreachRDD closure. The missing piece is declaring the new PrintWriter in append mode. The modified code inside foreachRDD would look like this (making some additional code changes):
// Initialization phase
val qualityWriter = new PrintWriter(new File("diskQuality.txt"))
qualityWriter.println("dateTime , quality , status")
qualityWriter.close()

....

dstream.foreachRDD { rdd =>
  val data = rdd.map(e => e.value())
    .collect()                     // get the data locally
    .map(i => process(topic, i))   // create csv records
  val allRecords = data.mkString("\n") // why do I/O if we can do in-mem?
  // PrintWriter has no append flag of its own, so wrap a FileOutputStream opened with append = true
  val handleWriter = new PrintWriter(new FileOutputStream("diskQuality.txt", true))
  handleWriter.append(allRecords + "\n") // trailing newline so the next batch starts on a fresh line
  handleWriter.close()
}
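To make the file handling explicit, here is a small helper sketch of my own (not part of the answer; placement inside followProduction is assumed) that opens the local file in append mode, writes one batch of records, and always releases the file handle, even if the write fails:

import java.io.{FileOutputStream, PrintWriter}

// Append a batch of CSV records to a local file, closing the writer afterwards.
def appendLines(path: String, lines: Seq[String]): Unit = {
  val writer = new PrintWriter(new FileOutputStream(path, true)) // true = append mode
  try {
    lines.foreach(line => writer.println(line))
  } finally {
    writer.close() // flush and release the file descriptor after every batch
  }
}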
A couple of notes about the code in the question:
"spark.streaming.concurrentJobs", "4"
This creates the problem of multiple threads writing to the same local file. It is probably also being misused in this context.
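If the concurrent jobs are kept, one way to avoid several streams appending to the same file (my own illustration, reusing stream, t and process from the question's code and the hypothetical appendLines helper sketched above) would be a per-topic output file:

stream.foreachRDD { rdd =>
  // t is the topic captured from the surrounding topics.foreach loop in the question
  appendLines(s"diskQuality-$t.txt", rdd.collect().map(r => process(t, r.value())))
}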
sc.checkpoint("checkpoint")
Checkpointing does not seem necessary for this job; note that enabling it is also what triggers the serializability check behind the error shown above.