I am receiving JSON strings from Kafka in Spark Streaming (Scala). Processing each string takes some time, so I want to distribute the processing over a cluster of X machines.
At the moment I am just testing on my laptop. So, for simplicity, let's assume that the processing I should apply to each JSON string is just some normalization of fields:
def normalize(json: String): String = {
  val parsedJson = Json.parse(json)
  val parsedRecord = (parsedJson \ "records")(0)
  val idField = parsedRecord \ "identifier"
  val titleField = parsedRecord \ "title"
  val output = Json.obj(
    "id" -> Json.parse(idField.get.toString().replace('/', '-')),
    "publicationTitle" -> titleField.get
  )
  output.toString()
}
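To make the intent concrete, here is a minimal sketch of what normalize is expected to produce for a hypothetical input (the identifier and title values are invented for illustration; Play JSON is assumed, as in the full code further down):

// Hypothetical input record; the field values are made up.
val sample = """{"records":[{"identifier":"10.1000/xyz123","title":"Some Title"}]}"""
println(normalize(sample))
// Expected output: the slash in the identifier is replaced by a dash:
// {"id":"10.1000-xyz123","publicationTitle":"Some Title"}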
Below is my attempt to distribute the normalize operation over the "cluster" (each JSON string should be processed as a whole; a JSON string cannot be split). How can I deal with the Task not serializable problem raised at the line val callRDD = JSONstrings.map(normalize(_))?
val conf = new SparkConf().setAppName("My Spark Job").setMaster("local[*]")
val ssc = new StreamingContext(conf, Seconds(5))
val topicMap = topic.split(",").map((_, numThreads)).toMap
val JSONstrings = KafkaUtils.createStream(ssc, zkQuorum, group, topicMap).map(_._2)
val callRDD = JSONstrings.map(normalize(_))
ssc.start()
ssc.awaitTermination()
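For context, the normalize call lives inside a class, roughly structured like this (a simplified sketch with invented names; the real code is in the UPDATE below):

// Simplified structural sketch (invented names), not the actual job code.
class MyConsumer(zkQuorum: String, group: String, topic: String) {
  def normalize(json: String): String = json.trim // placeholder body
  def run(lines: org.apache.spark.streaming.dstream.DStream[String]) = {
    lines.map(normalize(_)) // this is the line that fails with Task not serializable
  }
}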
UPDATE
Here is the complete code:
package org.consumer.kafka
import java.util.Properties
import java.util.concurrent._
import com.typesafe.config.ConfigFactory
import kafka.consumer.{Consumer, ConsumerConfig}
import kafka.utils.Logging
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerConfig, ProducerRecord}
import org.apache.log4j.{Level, Logger}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.kafka.KafkaUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}
import play.api.libs.json.{JsObject, JsString, JsValue, Json}
import scalaj.http.{Http, HttpResponse}
class KafkaJsonConsumer(val datasource: String,
                        val apiURL: String,
                        val zkQuorum: String,
                        val group: String,
                        val topic: String) extends Logging {
  val delay = 1000
  val config = createConsumerConfig(zkQuorum, group)
  val consumer = Consumer.create(config)
  var executor: ExecutorService = null

  def shutdown() = {
    if (consumer != null)
      consumer.shutdown()
    if (executor != null)
      executor.shutdown()
  }

  def createConsumerConfig(zkQuorum: String, group: String): ConsumerConfig = {
    val props = new Properties()
    props.put("zookeeper.connect", zkQuorum)
    props.put("group.id", group)
    props.put("auto.offset.reset", "largest")
    props.put("zookeeper.session.timeout.ms", "2000")
    props.put("zookeeper.sync.time.ms", "200")
    props.put("auto.commit.interval.ms", "1000")
    val config = new ConsumerConfig(props)
    config
  }
  def run(numThreads: Int) = {
    val conf = new SparkConf()
      .setAppName("TEST")
      .setMaster("local[*]")
      //.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    val ssc = new StreamingContext(conf, Seconds(5))
    ssc.checkpoint("checkpoint")
    val topicMap = topic.split(",").map((_, numThreads)).toMap
    val rawdata = KafkaUtils.createStream(ssc, zkQuorum, group, topicMap).map(_._2)
    val parsed = rawdata.map(Json.parse(_))
    val result = parsed.map(record => {
      val parsedRecord = (record \ "records")(0)
      val idField = parsedRecord \ "identifier"
      val titleField = parsedRecord \ "title"
      val journalTitleField = parsedRecord \ "publicationName"
      Json.obj(
        "id" -> Json.parse(idField.get.toString().replace('/', '-')),
        "publicationTitle" -> titleField.get,
        "journalTitle" -> journalTitleField.get)
    })
    result.print
    val callRDD = result.map(JsonUtils.normalize(_))
    callRDD.print()
    ssc.start()
    ssc.awaitTermination()
  }
  object JsonUtils {
    def normalize(json: JsValue): String = {
      (json \ "id").as[JsString].value
    }
  }
}
I launch the execution of this class KafkaJsonConsumer as follows:
package org.consumer
import org.consumer.kafka.KafkaJsonConsumer
object TestConsumer {
  def main(args: Array[String]) {
    if (args.length < 6) {
      System.exit(1)
    }
    val Array(datasource, apiURL, zkQuorum, group, topic, numThreads) = args
    val processor = new KafkaJsonConsumer(datasource, apiURL, zkQuorum, group, topic)
    processor.run(numThreads.toInt)
    //processor.shutdown()
  }
}
Answer 0 (score: 1)
It looks like the normalize method is part of some class. In the line where you use it in the map operation, Spark has to serialize not only the method itself but the whole enclosing instance. The simplest solution is to move normalize into a singleton object:
object JsonUtils {
  def normalize(json: String): String = ???
}
and call it like this:
val callRDD = JSONstrings.map(JsonUtils.normalize(_))
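For completeness, here is a sketch of the original normalize body moved into a top-level object, based on the code above (the streaming setup itself does not change):

import play.api.libs.json.Json

// Top-level singleton: Spark only has to serialize the function, not an enclosing instance.
object JsonUtils {
  def normalize(json: String): String = {
    val parsedJson = Json.parse(json)
    val parsedRecord = (parsedJson \ "records")(0)
    val idField = parsedRecord \ "identifier"
    val titleField = parsedRecord \ "title"
    Json.obj(
      "id" -> Json.parse(idField.get.toString().replace('/', '-')),
      "publicationTitle" -> titleField.get
    ).toString()
  }
}

// The call in the streaming job keeps the same shape:
// val callRDD = JSONstrings.map(JsonUtils.normalize(_))

Note that in the updated code the JsonUtils object is declared inside KafkaJsonConsumer, so referring to it from the map closure still goes through the enclosing instance; declaring it at the top level (or in a companion object), as sketched here, is what avoids capturing that instance.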