Getting the previous window's value when processing late events

Asked: 2018-05-17 18:01:17

Tags: apache-flink flink-streaming windowing

I am looking for a way to set up a window that allows lateness and lets me calculate a value based on the value previously calculated for the session.

My session values are, for all practical purposes, unique identifiers and should never collide, but a session can technically come in at any time. In most sessions the events are processed within 5 minutes, so allowing 1 day of lateness should account for any late events.

  stream
    .keyBy { jsonEvent => jsonEvent.findValue("session").toString }
    .window(ProcessingTimeSessionWindows.withGap(Time.minutes(5)))
    .allowedLateness(Time.days(1))
    .process { new SessionProcessor }
    .addSink { new HttpSink }

For each session I find the maximum value of one field and check that certain events did not occur (if they did, they zero out the max-value field). I decided to create a ProcessWindowFunction to do this.

class SessionProcessor extends ProcessWindowFunction[ObjectNode, (String, String, String, Long), String, TimeWindow] {

   override def process(key: String, context: Context, elements: Iterable[ObjectNode], out: Collector[(String, String, String, Long)]): Unit = {
      // Parse the elements and calculate maxValue (details elided)
      maxValue = if (badEvent1) 0 else maxValue
      maxValue = if (badEvent2) 0 else maxValue
      out.collect((string1, string2, string3, maxValue))
   }
}

This works fine until late events are allowed. When a late event arrives, maxValue is recalculated and emitted to the HttpSink again. I am looking for a way to calculate the delta between the previous maxValue and the late maxValue.

I am looking for a way to determine:

  1. whether the call to the function came from a late event (I don't want to double count the session totals), and
  2. what the new data is, or whether there is a way to store the previously calculated value.

Any help on this would be greatly appreciated.

    EDIT: New code using ValueState

    KafkaConsumer.scala

    import org.apache.flink.streaming.api.TimeCharacteristic
    import org.apache.flink.streaming.connectors.kafka._
    import org.apache.flink.streaming.util.serialization.JSONDeserializationSchema
    import org.apache.flink.streaming.api.scala._
    import com.fasterxml.jackson.databind.node.ObjectNode
    import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows
    import org.apache.flink.streaming.api.windowing.time.Time
    
    
    object KafkaConsumer {
       def main(args: Array[String]) {
          val env = StreamExecutionEnvironment.getExecutionEnvironment
          env.setStreamTimeCharacteristic(TimeCharacteristic.IngestionTime)
          val properties = getServerProperties
          val consumer = new FlinkKafkaConsumer010[ObjectNode]("test-topic", new JSONDeserializationSchema, properties)
          consumer.setStartFromLatest()
          val stream = env.addSource(consumer)
    
          stream
            .keyBy { jsonEvent => jsonEvent.findValue("data").findValue("query").findValue("session").toString }
            .window(TumblingProcessingTimeWindows.of(Time.seconds(5)))
            .allowedLateness(Time.days(1))
            .process { new SessionProcessor }
            .print
          env.execute("Kafka APN Consumer")
       }
    }
    

    SessionProcessor.scala

    import org.apache.flink.util.Collector
    import com.fasterxml.jackson.databind.node.ObjectNode
    import org.apache.flink.api.common.state.{ValueState, ValueStateDescriptor}
    import org.apache.flink.streaming.api.scala.function.ProcessWindowFunction
    import org.apache.flink.streaming.api.windowing.windows.TimeWindow
    
    class SessionProcessor extends ProcessWindowFunction[ObjectNode, (String, String, String, Long), String, TimeWindow] {
    
      final val previousValue = new ValueStateDescriptor("previousValue", classOf[Long])
    
      override def process(key: String, context: Context, elements: Iterable[ObjectNode], out: Collector[(String, String, String, Long)]): Unit = {
    
        val previousVal: ValueState[Long] = context.windowState.getState(previousValue)
        val pVal: Long = previousVal.value match {
          case i: Long => i
        }
        var session = ""
        var user = ""
        var department = ""
        var lVal: Long = 0
    
        elements.foreach( value => {
          var jVal: String = "0"
          if (value.findValue("data").findValue("query").has("value")) {
            jVal = value.findValue("data").findValue("query").findValue("value").toString replaceAll("\"", "")
          }
          session = value.findValue("data").findValue("query").findValue("session").toString replaceAll("\"", "")
          user = value.findValue("data").findValue("query").findValue("user").toString replaceAll("\"", "")
          department = value.findValue("data").findValue("query").findValue("department").toString replaceAll("\"", "")
          lVal = if (jVal.toLong > lVal) jVal.toLong else lVal
        })
    
        val increaseTime = lVal - pVal
        previousVal.update(increaseTime)
        out.collect((session, user, department, increaseTime))
      }
    }
    

1 Answer:

Answer (score: 2):

Here is an example that does something similar. Hopefully it is reasonably self-explanatory, and it should be easy enough to adapt to your needs.

The basic idea here is that you can use context.windowState(), which is per-window state made available through the Context passed to the ProcessWindowFunction. The windowState is really only useful for windows that fire multiple times, since every new window instance gets a freshly initialized (and empty) windowState store. For state that is shared across all windows, but is still keyed, use context.globalState().

private static class DifferentialWindowFunction
  extends ProcessWindowFunction<Long, Tuple2<Long, Long>, String, TimeWindow> {

  private final static ValueStateDescriptor<Long> previousFiringState =
    new ValueStateDescriptor<>("previous-firing", LongSerializer.INSTANCE);

  private final static ReducingStateDescriptor<Long> firingCounterState =
    new ReducingStateDescriptor<>("firing-counter", new Sum(), LongSerializer.INSTANCE);

  @Override
  public void process(
      String key, 
      Context context, 
      Iterable<Long> values, 
      Collector<Tuple2<Long, Long>> out) {

    ValueState<Long> previousFiring = context.windowState().getState(previousFiringState);
    ReducingState<Long> firingCounter = context.windowState().getState(firingCounterState);

    Long output = Iterables.getOnlyElement(values);
    if (firingCounter.get() == null) {
      // first firing
      out.collect(Tuple2.of(0L, output));
    } else {
      // subsequent firing
      out.collect(Tuple2.of(firingCounter.get(), output - previousFiring.value()));    
    } 
    firingCounter.add(1L);
    previousFiring.update(output);
  }

  @Override
  public void clear(Context context) {
    ValueState<Long> previousFiring = context.windowState().getState(previousFiringState);
    ReducingState<Long> firingCounter = context.windowState().getState(firingCounterState);

    previousFiring.clear();
    firingCounter.clear();
  }
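
  // The answer does not show the Sum class referenced above; presumably it is just a
  // ReduceFunction<Long> that adds its two arguments. Something like the following
  // (an assumption, not part of the original answer; requires
  // org.apache.flink.api.common.functions.ReduceFunction):
  private static class Sum implements ReduceFunction<Long> {
    @Override
    public Long reduce(Long a, Long b) {
      return a + b;
    }
  }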
}
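
For reference, here is a minimal Scala sketch of the same firing-counter idea applied back to the question's pipeline. It is an assumption-laden adaptation, not part of the original answer: the element type is simplified to Long, the class name DeltaSessionProcessor is made up, and only the context.windowState usage mirrors the Java example above.

    import org.apache.flink.api.common.state.{ValueState, ValueStateDescriptor}
    import org.apache.flink.streaming.api.scala.function.ProcessWindowFunction
    import org.apache.flink.streaming.api.windowing.windows.TimeWindow
    import org.apache.flink.util.Collector

    // Hypothetical adaptation: emits (key, delta since the previous firing) instead of the raw max.
    class DeltaSessionProcessor extends ProcessWindowFunction[Long, (String, Long), String, TimeWindow] {

      // Per-window state: survives across firings of the *same* window instance only.
      private val previousFiringDesc =
        new ValueStateDescriptor[java.lang.Long]("previous-firing", classOf[java.lang.Long])

      override def process(key: String, context: Context, elements: Iterable[Long],
                           out: Collector[(String, Long)]): Unit = {
        val previousFiring: ValueState[java.lang.Long] = context.windowState.getState(previousFiringDesc)

        val currentMax = elements.max   // value computed for this firing
        Option(previousFiring.value) match {
          case None       => out.collect((key, currentMax))                   // first firing: emit the full value
          case Some(prev) => out.collect((key, currentMax - prev.longValue))  // late firing: emit only the delta
        }
        previousFiring.update(currentMax)
      }

      // Clean up per-window state once the window is finally purged (end of allowed lateness).
      override def clear(context: Context): Unit =
        context.windowState.getState(previousFiringDesc).clear()
    }

Swapping context.windowState for context.globalState in the same sketch would instead keep the previous value across different window instances for the same key.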