Task not serializable: processing JSON strings with Spark Streaming

Date: 2016-07-20 13:47:44

Tags: scala apache-spark spark-streaming

I receive JSON strings from Kafka in Spark Streaming (Scala). Processing each string takes some time, so I want to distribute the processing across a cluster of X nodes.

At the moment I am only testing on my laptop. So, for simplicity, let's assume the processing I need to apply to each JSON string is just some normalization of its fields:

  // Requires Play JSON (play.api.libs.json.Json) on the classpath
  def normalize(json: String): String = {
    val parsedJson = Json.parse(json)
    val parsedRecord = (parsedJson \ "records")(0)
    val idField = parsedRecord \ "identifier"
    val titleField = parsedRecord \ "title"

    // Build the normalized output, replacing '/' with '-' in the identifier
    val output = Json.obj(
      "id" -> Json.parse(idField.get.toString().replace('/', '-')),
      "publicationTitle" -> titleField.get
    )
    output.toString()
  }
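
For reference, here is a minimal sketch of the kind of payload normalize expects (the field values are hypothetical; real Kafka messages may carry additional fields):

    // Hypothetical sample input; only the fields read by normalize are shown
    val sample = """{"records":[{"identifier":"10.1000/xyz123","title":"Some Title"}]}"""

    // Expected to print roughly: {"id":"10.1000-xyz123","publicationTitle":"Some Title"}
    println(normalize(sample))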

Below is my attempt to distribute the normalize operation over the "cluster" (each JSON string must be processed as a whole; a JSON string cannot be split up). How can I deal with the Task not serializable exception thrown at the line val callRDD = JSONstrings.map(normalize(_))?

val conf = new SparkConf().setAppName("My Spark Job").setMaster("local[*]")
val ssc = new StreamingContext(conf, Seconds(5))

val topicMap = topic.split(",").map((_, numThreads)).toMap

// Keep only the message value (the JSON string) from each Kafka (key, value) pair
val JSONstrings = KafkaUtils.createStream(ssc, zkQuorum, group, topicMap).map(_._2)

// This map throws the Task not serializable exception
val callRDD = JSONstrings.map(normalize(_))

ssc.start()
ssc.awaitTermination()

Update

Here is the complete code:

package org.consumer.kafka

import java.util.Properties
import java.util.concurrent._
import com.typesafe.config.ConfigFactory
import kafka.consumer.{Consumer, ConsumerConfig}
import kafka.utils.Logging
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerConfig, ProducerRecord}
import org.apache.log4j.{Level, Logger}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.kafka.KafkaUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}
import play.api.libs.json.{JsObject, JsString, JsValue, Json}
import scalaj.http.{Http, HttpResponse}

class KafkaJsonConsumer(val datasource: String,
                        val apiURL: String,
                        val zkQuorum: String,
                        val group: String,
                        val topic: String) extends Logging
{
  val delay = 1000
  val config = createConsumerConfig(zkQuorum, group)
  val consumer = Consumer.create(config)
  var executor: ExecutorService = null

  def shutdown() = {
    if (consumer != null)
      consumer.shutdown();
    if (executor != null)
      executor.shutdown();
  }

  def createConsumerConfig(zkQuorum: String, group: String): ConsumerConfig = {
    val props = new Properties()
    props.put("zookeeper.connect", zkQuorum);
    props.put("group.id", group);
    props.put("auto.offset.reset", "largest");
    props.put("zookeeper.session.timeout.ms", "2000");
    props.put("zookeeper.sync.time.ms", "200");
    props.put("auto.commit.interval.ms", "1000");
    val config = new ConsumerConfig(props)
    config
  }

  def run(numThreads: Int) = {
    val conf = new SparkConf()
                              .setAppName("TEST")
                              .setMaster("local[*]")
                              //.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    val ssc = new StreamingContext(conf, Seconds(5))
    ssc.checkpoint("checkpoint")

    val topicMap = topic.split(",").map((_, numThreads)).toMap

    val rawdata = KafkaUtils.createStream(ssc, zkQuorum, group, topicMap).map(_._2)

    val parsed = rawdata.map(Json.parse(_))

    val result = parsed.map(record => {
      val parsedRecord = (record \ "records")(0)
      val idField = parsedRecord \ "identifier"
      val titleField = parsedRecord \ "title"
      val journalTitleField = parsedRecord \ "publicationName"
      Json.obj(
        "id" -> Json.parse(idField.get.toString().replace('/', '-')),
        "publicationTitle" -> titleField.get,
        "journalTitle" -> journalTitleField.get)
    })

    result.print

    // The Task not serializable exception described above is raised by this map
    val callRDD = result.map(JsonUtils.normalize(_))

    callRDD.print()

    ssc.start()
    ssc.awaitTermination()
  }

  // Nested inside KafkaJsonConsumer, so it is a member of the instance rather than a standalone singleton
  object JsonUtils {
    def normalize(json: JsValue): String = {
      (json \ "id").as[JsString].value
    }
  }

}

I launch the execution of this class KafkaJsonConsumer as follows:

package org.consumer

import org.consumer.kafka.KafkaJsonConsumer

object TestConsumer {

  def main(args: Array[String]) {

    if (args.length < 6) {
      System.exit(1)
    }

    val Array(datasource, apiURL, zkQuorum, group, topic, numThreads) = args

    val processor = new KafkaJsonConsumer(datasource, apiURL, zkQuorum, group, topic)
    processor.run(numThreads.toInt)

    //processor.shutdown()

  }

}

1 answer:

Answer 0 (score: 1)

It looks like the normalize method is part of some class. On the line where you use it inside a map operation, Spark needs to serialize not only the method itself but the entire enclosing instance. The simplest solution is to move normalize into a standalone singleton object:

object JsonUtils {
  def normalize(json: String): String = ???
}

and call it with:

val callRDD = JSONstrings.map(JsonUtils.normalize(_))
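
Applied to the complete code from the update, a minimal sketch of that fix (keeping the normalize(json: JsValue) signature used there) is to pull JsonUtils out of KafkaJsonConsumer and declare it as a top-level object, so the closure no longer captures the whole consumer instance:

package org.consumer.kafka

import play.api.libs.json.{JsString, JsValue}

// Top-level singleton: referencing normalize no longer drags the enclosing KafkaJsonConsumer along
object JsonUtils {
  def normalize(json: JsValue): String =
    (json \ "id").as[JsString].value
}

The call inside run, val callRDD = result.map(JsonUtils.normalize(_)), can then stay as it is.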