Spark Streaming join of Kafka topics comparison

Date: 2019-01-07 08:39:46

Tags: apache-spark spark-streaming

We need to implement a join of Kafka topics that takes late or "not in the join" data into account, meaning that late or un-joined data arriving on the stream is not dropped/lost but marked as timed out,

and the result of the join is produced to an output Kafka topic (flagged if a timeout occurred).

(Spark 2.1.1 in a standalone deployment, Kafka 0.10)

Kafka topics X, Y, ... The result on the output topic looks like:

{
    "keyJoinFiled": 123456,
    "xTopicData": {},
    "yTopicData": {},
    "isTimeOutFlag": true
}

I found three solutions, 1 and 2 from the official Spark Streaming documentation, which are not relevant for us (data that arrives later than the "business time" is not joined into the DStream and is dropped/lost), but I describe them for comparison.

From what we have seen, there are not many examples of joining Kafka topics with stateful operations, so I am adding some code here for review:

1) According to the Spark Streaming documentation,

https://spark.apache.org/docs/2.1.1/streaming-programming-guide.html:

 val stream1: DStream[(String, String)] = ...
 val stream2: DStream[(String, String)] = ...
 val joinedStream = stream1.join(stream2)

This joins data from the two streams within the same batch duration, but data that arrives late / is not joined within the "business time" is dropped/lost.
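
For completeness, a minimal runnable sketch of this per-batch join; the broker address, topic names, and 30-second batch interval are assumptions for illustration, not from our setup:

 import org.apache.kafka.common.serialization.StringDeserializer
 import org.apache.spark.SparkConf
 import org.apache.spark.streaming.{Seconds, StreamingContext}
 import org.apache.spark.streaming.kafka010.KafkaUtils
 import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
 import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

 val ssc = new StreamingContext(new SparkConf().setAppName("basic-join"), Seconds(30))
 val params = Map[String, Object](
   "bootstrap.servers" -> "broker:9092",
   "key.deserializer" -> classOf[StringDeserializer],
   "value.deserializer" -> classOf[StringDeserializer],
   "group.id" -> "join-example")

 //key each stream by the Kafka record key so join() can match records within the same batch
 val stream1 = KafkaUtils.createDirectStream[String, String](
   ssc, PreferConsistent, Subscribe[String, String](Set("xTopic"), params))
   .map(r => (r.key(), r.value()))
 val stream2 = KafkaUtils.createDirectStream[String, String](
   ssc, PreferConsistent, Subscribe[String, String](Set("yTopic"), params))
   .map(r => (r.key(), r.value()))

 //inner join per batch: a record whose partner arrives in a later batch is simply lost
 val joinedStream = stream1.join(stream2)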

2)窗口连接:

val leftWindowDF = kafkaStreamLeft.window(Minutes(input_parameter_time))
val rightWindowDF = kafkaStreamRight.window(Minutes(input_parameter_time))
leftWindowDF.join(rightWindowDF).foreachRDD...

2.1) In our case we need to consider using a tumbling window of the Spark Streaming batch interval.
2.2) A lot of data has to be kept in memory/on disk, e.g. for a 30-60 minute window.
2.3) Again, data that arrives late / is not in the window / is not in the join is dropped/lost.

*Since Spark 2.3.1 Structured Streaming supports stream-to-stream joins (a sketch follows this list), but we hit a bug where the HDFS state store was not cleaned up; as a result the job went down with OOM every few hours. This was resolved in 2.4, https://issues.apache.org/jira/browse/SPARK-23682 (use RocksDB or a CustomStateStoreProvider HDFS state store).
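
For reference, a minimal sketch of that Structured Streaming stream-to-stream join (Spark 2.3.1+); the broker address, topic names, and the 30-minute watermark/time bound are assumptions, and the Kafka record key is assumed to carry the join key (session id):

 import org.apache.spark.sql.SparkSession
 import org.apache.spark.sql.functions.expr

 val spark = SparkSession.builder.appName("stream-stream-join").getOrCreate()

 //read both topics; keep the Kafka ingestion timestamp for watermarking
 val left = spark.readStream.format("kafka")
   .option("kafka.bootstrap.servers", "broker:9092")
   .option("subscribe", "xTopic").load()
   .selectExpr("CAST(key AS STRING) AS sessionId", "CAST(value AS STRING) AS xJson", "timestamp AS xTs")
   .withWatermark("xTs", "30 minutes")

 val right = spark.readStream.format("kafka")
   .option("kafka.bootstrap.servers", "broker:9092")
   .option("subscribe", "yTopic").load()
   .selectExpr("CAST(key AS STRING) AS rSessionId", "CAST(value AS STRING) AS yJson", "timestamp AS yTs")
   .withWatermark("yTs", "30 minutes")

 //inner join with a time-range condition; the watermarks let Spark clean up per-side state
 val joined = left.join(right, expr(
   "sessionId = rSessionId AND yTs BETWEEN xTs - interval 30 minutes AND xTs + interval 30 minutes"))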

3) Use the stateful operation mapWithState to join the Kafka topic DStreams with a tumbling window and a 30-minute timeout on the latest data; everything produced to the output topic contains the data from all topics if a join occurred, or only part of the topic data if no join occurred within 30 minutes (marked with an is_time_out flag).

3.1) Create 1..n DStreams per topic, convert them to key/value records with the join key as the key, plus a tumbling window. Create a catch-all schema.
3.2) Union all the streams.
3.3) Run mapWithState on the union stream with a function that actually performs the join / marks the timeout.

A good example of a stateful join from Databricks (Spark 2.2.0): https://www.youtube.com/watch?time_continue=1858&v=JAb4FIheP28

Adding the sample code that is running/being tested.

 val kafkaParams = Map[String, Object](
    "bootstrap.servers" -> brokers,
    "key.deserializer" -> classOf[StringDeserializer],
    "value.deserializer" -> classOf[StringDeserializer],
    "group.id" -> groupId,
    "session.timeout.ms" -> "30000"
  )

  //Kafka xTopic DStream
  val kafkaStreamLeft = KafkaUtils.createDirectStream[String, String](
    ssc,
    PreferConsistent,
    Subscribe[String, String](leftTopic.split(",").toSet, kafkaParams)
  ).map(record => {
    val msg: xTopic = gson.fromJson(record.value(), classOf[xTopic])
    Unioned(Some(msg), None, if (msg.sessionId != null) msg.sessionId.toString else "")
  }).window(Minutes(leftWindow), Minutes(leftWindow))

  //Kafka yTopic DStream
  val kafkaStreamRight = KafkaUtils.createDirectStream[String, String](
    ssc,
    PreferConsistent,
    Subscribe[String, String](rightTopic.split(",").toSet, kafkaParams)
  ).map(record => {
    val msg: yTopic = gson.fromJson(record.value(), classOf[yTopic])
    Unioned(None, Some(msg), if (msg.sessionId != null) msg.sessionId.toString else "")
  }).window(Minutes(rightWindow), Minutes(rightWindow))

  //convert the streams to (key, value) pairs and filter out empty session ids.
  val unionStream = kafkaStreamLeft.union(kafkaStreamRight).map(record =>(record.sessionId,record))
    .filter(record => !record._1.toString.isEmpty)
  val stateSpec = StateSpec.function(stateUpdateF).timeout(Minutes(timeout.toInt))

  unionStream.mapWithState(stateSpec).foreachRDD(rdd => {
    try{
      if(!rdd.isEmpty()) rdd.foreachPartition(partition =>{
        val props = new util.HashMap[String, Object]()
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, brokers)
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringSerializer")
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringSerializer")

        val producer = new KafkaProducer[String, String](props)
        //send to kafka result JSON.
        partition.foreach(record => {
          if(record!=null && !"".equals(record) && !"()".equals(record.toString) && !"None".equals(record.toString) ){
            producer.send(new ProducerRecord[String, String](outTopic, null, gson.toJson(record)))
          }
        })
        producer.close()
      })
    }catch {
      case e: Exception  => {
        logger.error(s""""error join topics :${leftTopic} ${rightTopic} to out topic ${outTopic}""")
        logger.info(e.printStackTrace())
      }
    }})

//mapWithState function that will be called for each key occurrence, with new items in newItemValues and the state item if it exists.

def stateUpdateF = (keySessionId:String,newItemValues:Option[Unioned],state:State[Unioned])=> {
    val currentState = state.getOption().getOrElse(Unioned(None,None,keySessionId))

    val newVal:Unioned = newItemValues match {
      case Some(newItemValue) => {
        if (newItemValue.yTopic.isDefined)
          Unioned(if(newItemValue.xTopic.isDefined) newItemValue.xTopic else currentState.xTopic,newItemValue.yTopic,keySessionId)
        else if (newItemValue.xTopic.isDefined)
          Unioned(newItemValue.xTopic, if(currentState.yTopic.isDefined)currentState.yTopic else newItemValue.yTopic,keySessionId)
        else newItemValue
      }
      case _ => currentState //if None = timeout => currentState
    }

    val processTs = LocalDateTime.now()
    val processDate = dtf.format(processTs)
    if(newVal.xTopic.isDefined && newVal.yTopic.isDefined){//if we have a join remove from state
      state.remove()
      JoinState(newVal.sessionId,newVal.xTopic,newVal.yTopic,false,processTs.toInstant(ZoneOffset.UTC).toEpochMilli,processDate)
    }else if(state.isTimingOut()){//on timeout do not try to remove the state manually, it is removed automatically
        JoinState(newVal.sessionId, newVal.xTopic, newVal.yTopic,true,processTs.toInstant(ZoneOffset.UTC).toEpochMilli,processDate)
    }else{//no join yet: keep the state; update() returns Unit, hence the "()" filter before producing to Kafka
      state.update(newVal)
    }
  }

  //case classes for the Kafka topic data (x, y topics); the join will be on the session id field.
  case class xTopic(sessionId:String,param1:String,param2:String,sessionCreationDate:String)
  case class yTopic(sessionId:Long,clientTimestamp:String)
  //catch-all schema: object that contains the fields of both Kafka input topics and the key value for the join.
  case class Unioned(xTopic:Option[xTopic],yTopic:Option[yTopic],sessionId:String)
  //class for the output result of the stateful join function.
  case class JoinState(sessionId:String, xTopic:Option[xTopic],yTopic:Option[yTopic],isTimeOut:Boolean,processTs:Long,processDate:String)

I would be glad to get some comments. Sorry for the long post.

1 Answer:

Answer 0 (score: 1):

I am under the impression that this kind of use case is solved by the sessionization API:

StructuredSessionization.scala

And also Stateful Operations in Structured Streaming.

Or am I missing something?
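
For illustration, a minimal sketch of that approach with flatMapGroupsWithState; the Event/SessionState/JoinResult classes here are hypothetical stand-ins loosely mirroring the question's Unioned/JoinState classes:

 import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout, OutputMode}

 //hypothetical types loosely mirroring the question's catch-all schema
 case class Event(sessionId: String, xData: Option[String], yData: Option[String])
 case class SessionState(xData: Option[String], yData: Option[String])
 case class JoinResult(sessionId: String, xData: Option[String], yData: Option[String], isTimeOut: Boolean)

 def joinOrTimeout(sessionId: String, events: Iterator[Event],
                   state: GroupState[SessionState]): Iterator[JoinResult] = {
   if (state.hasTimedOut) {
     //emit whatever half of the join arrived, flagged as timed out
     val s = state.get
     state.remove()
     Iterator(JoinResult(sessionId, s.xData, s.yData, isTimeOut = true))
   } else {
     val old = state.getOption.getOrElse(SessionState(None, None))
     val merged = events.foldLeft(old)((s, e) => SessionState(e.xData.orElse(s.xData), e.yData.orElse(s.yData)))
     if (merged.xData.isDefined && merged.yData.isDefined) {
       state.remove() //both sides arrived: emit the joined record
       Iterator(JoinResult(sessionId, merged.xData, merged.yData, isTimeOut = false))
     } else {
       state.update(merged)
       state.setTimeoutDuration("30 minutes")
       Iterator.empty
     }
   }
 }

 //usage on a keyed Dataset[Event] (eventsDs), e.g. parsed from the two Kafka topics:
 //eventsDs.groupByKey(_.sessionId)
 //  .flatMapGroupsWithState(OutputMode.Append, GroupStateTimeout.ProcessingTimeTimeout)(joinOrTimeout)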