Spark: output operation after mapWithState

Date: 2017-03-16 04:23:17

Tags: apache-spark spark-streaming

We have a Spark Streaming application that consumes events from Kafka. We want to aggregate events by traceid over a period of time, build one aggregated event per traceid, and write that aggregated event to a database.

Our events look like this:

traceid: 123
{
  info: abc;
}

traceid: 123
{
  info: bcd;
}

What we want to achieve is to build one aggregated event over a period of time, say 2 minutes, and write that aggregated event to the database instead of the individual events:

traceid: 123
{
   info: abc,bcd
}

We are using mapWithState and came up with this code:

    def trackStateFunc(batchTime: Time, id: String, url: Option[MetricTypes.EnrichedKeyType],
                       state: State[SessionData]): Option[(String, String, Long, immutable.Map[String, String])] = {

      if (url.isDefined) {
        val enrichedId  = id
        val accountId   = url.get._1.asInstanceOf[String]
        val reducedText = url.get._2.asInstanceOf[String]
        val commonIDS   = url.get._3.asInstanceOf[String]
        val deviceId    = url.get._4.asInstanceOf[String]
        val ets         = url.get._5.toString.toLong
        val eventId     = url.get._6.asInstanceOf[String]

        val attributeMap = Map(
          eventId -> reducedText,
          "common_ids" -> commonIDS,
          "common_enriched_physicalDeviceId" -> deviceId
        )

        if (state.exists) {
          // Merge the new attributes into the state already held for this traceid
          val newState = state.get.attributeMap ++ attributeMap
          state.update(SessionData(newState))
          Some((accountId, enrichedId, ets, newState))
        } else {
          // First event for this traceid: initialize the state
          state.update(SessionData(attributeMap))
          Some((accountId, enrichedId, ets, attributeMap))
        }
      } else {
        None
      }
    }

    val stateSpec = StateSpec.function(trackStateFunc _).timeout(Minutes(2))

    val requestsWithState = tempLines.mapWithState(stateSpec)

    requestsWithState.foreachRDD { rdd =>
      rdd.foreachPartition { partitionOfRecords =>
        // One connection per partition, so it is created on the executor
        val connection = createNewConnection()
        partitionOfRecords.foreach {
          case (accountId, enrichedId, ets, attributeMap) =>
            if (validateRecordForStorage(accountId, enrichedId, ets, attributeMap)) {
              val ds = new DBDataStore(connection)
              ds.saveEnrichedEvent(accountId, enrichedId, ets, attributeMap)
            } else {
              println("Discarded record [enrichedId=" + enrichedId
                + ", accountId=" + accountId
                + ", ets=" + ets + "]")
            }
          case default =>
            logInfo("You gave me: " + default)
        }
      }
    }
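
To confirm the aggregation itself works, we also printed the state snapshots while debugging. As we understand it, stateSnapshots() returns a DStream of (key, state) pairs for every tracked key at each batch interval, rather than one record per input event. A minimal sketch (the println is just for illustration):

    val snapshots = requestsWithState.stateSnapshots()
    snapshots.foreachRDD { rdd =>
      rdd.foreach { case (enrichedId, sessionData) =>
        // sessionData is the full SessionData currently held for this traceid
        println(s"state for $enrichedId: ${sessionData.attributeMap}")
      }
    }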

The mapWithState aggregation works fine, but our understanding was that writing to the database should only start after the 2 minutes are up. Instead we see it writing to the database immediately, without waiting 2 minutes, so our understanding must be wrong. If anyone can guide us toward our goal of writing to the database only after aggregating for 2 minutes, it would be a great help.
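
Update: re-reading the docs, our current guess is that mapWithState emits one mapped record for every input event in every batch, and that timeout(Minutes(2)) only controls when idle state is evicted; when the timeout fires, the function is invoked one last time with state.isTimingOut() returning true and no new value. If that is right, a sketch along these lines (untested, and assuming SessionData is extended to also carry accountId and ets so the timeout path can rebuild the output tuple) would emit only the final aggregate and nothing for intermediate events:

    // Hypothetical: SessionData extended to carry everything the output tuple needs
    case class SessionData(accountId: String, ets: Long, attributeMap: immutable.Map[String, String])

    def trackStateFunc(batchTime: Time, id: String, url: Option[MetricTypes.EnrichedKeyType],
                       state: State[SessionData]): Option[(String, String, Long, immutable.Map[String, String])] = {
      if (state.isTimingOut()) {
        // Invoked once when the 2-minute timeout fires; url is None and the state
        // is about to be evicted, so this is the only record emitted per traceid.
        state.getOption().map(s => (s.accountId, id, s.ets, s.attributeMap))
      } else {
        url.foreach { u =>
          val attributeMap = Map(
            u._6.asInstanceOf[String] -> u._2.asInstanceOf[String],
            "common_ids" -> u._3.asInstanceOf[String],
            "common_enriched_physicalDeviceId" -> u._4.asInstanceOf[String]
          )
          val merged = state.getOption()
            .map(s => s.copy(attributeMap = s.attributeMap ++ attributeMap))
            .getOrElse(SessionData(u._1.asInstanceOf[String], u._5.toString.toLong, attributeMap))
          state.update(merged) // keep aggregating, emit nothing yet
        }
        None
      }
    }

But we are not sure this is the intended pattern, so confirmation or a better approach would be appreciated.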

0 answers:

No answers yet.