Accessing a collection of DStreams

Date: 2017-07-04 12:02:38

Tags: scala apache-spark apache-kafka spark-streaming

I am trying to access a collection of filtered DStreams obtained as in the solution to this question: Spark Streaming - Best way to Split Input Stream based on filter Param

I create the collection as follows:

val statuCodes = Set("200", "500", "404")
spanTagStream.cache()
val statusCodeStreams = statuCodes.map(key =>
  key -> spanTagStream.filter(x =>
    x._3.get("http.status_code").getOrElse("").asInstanceOf[String].equals(key)))

I try to access statusCodeStreams in the following way:

for (streamTuple <- statusCodeStreams) {
  streamTuple._2.foreachRDD(rdd =>
    rdd.foreachPartition(partitionOfRecords => {
      val props = new HashMap[String, Object]()
      props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, kafkaServers)
      props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
        "org.apache.kafka.common.serialization.StringSerializer")
      props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
        "org.apache.kafka.common.serialization.StringSerializer")
      val producer = new KafkaProducer[String, String](props)

      partitionOfRecords.foreach { x =>
        /* Code writing to Kafka using streamTuple._1 as the topic string */
      }
    })
  )
}

When I execute this, I get the following error:

java.io.NotSerializableException: Object of org.apache.spark.streaming.kafka010.DirectKafkaInputDStream is being serialized possibly as a part of closure of an RDD operation. This is because the DStream object is being referred to from within the closure. Please rewrite the RDD operation inside this DStream to avoid this. This has been enforced to avoid bloating of Spark tasks with unnecessary objects

How can I access the streams in a serializable way so that I can write to Kafka?

1 Answer:

Answer 0 (score: 1)

As the exception indicates, the closure is capturing the DStream definition. A simple option is to declare the DStream transient:

@transient val spanTagStream = // KafkaUtils.create...

@transient flags certain objects to be excluded from the Java serialization of another object's object graph. The key to this scenario is that a val declared in the same scope as the DStream (statusCodeStreams in this case) is used inside the closure. The actual reference to that val from within the closure is outer.statusCodeStreams, which causes the serialization process to "pull" the whole context of outer into the closure. By marking the DStream (and the StreamingContext) declarations with @transient we exclude them from serialization and avoid the issue. Depending on the code structure (for example, if everything is written linearly in one main function, which is bad practice, by the way), it might be necessary to mark ALL DStream declarations plus the StreamingContext instance as @transient.
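To illustrate, here is a minimal sketch of where those annotations would go, assuming everything lives in one main function as described above. The broker address, topic name, group id and batch interval below are placeholders, not values taken from the question:

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.{ConsumerStrategies, KafkaUtils, LocationStrategies}

object SpanRouter {
  def main(args: Array[String]): Unit = {
    // Mark the StreamingContext and the DStream declarations as @transient so they
    // are excluded from the serialized closures, as explained above.
    @transient lazy val conf = new SparkConf().setAppName("span-router")
    @transient lazy val ssc = new StreamingContext(conf, Seconds(5))

    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "localhost:9092",            // placeholder broker address
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "span-router"                         // placeholder group id
    )

    @transient lazy val spanTagStream = KafkaUtils.createDirectStream[String, String](
      ssc,
      LocationStrategies.PreferConsistent,
      ConsumerStrategies.Subscribe[String, String](Seq("spans"), kafkaParams)) // placeholder topic

    // ... filtering, foreachRDD and the Kafka output logic as in the question ...

    ssc.start()
    ssc.awaitTermination()
  }
}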

If the only purpose of the initial filtering is to "route" the content to separate Kafka topics, it may be worth moving the filtering inside the foreachRDD. That makes for a simpler program structure:

spanTagStream.foreachRDD { rdd =>
  rdd.cache()
  statuCodes.foreach { code =>
    val matchingCodes = rdd.filter(/* same status-code predicate as above */)
    matchingCodes.foreachPartition { partition =>
      /* write the partition to Kafka, using `code` as the topic */
    }
  }
  rdd.unpersist(true)
}