Flink CEP pattern does not match the first events after the job starts, and always matches the previously sent set of events

Asked: 2018-02-08 21:16:34

Tags: java apache-flink flink-streaming flink-cep

I want to match a CEP pattern in Flink 1.4.0 Streaming with the following code:

    DataStream<Event> input = inputFromSocket.map(new IncomingMessageProcessor()).filter(new FilterEmptyAndInvalidEvents());

    DataStream<Event> inputFiltered = input.assignTimestampsAndWatermarks(new BoundedOutOfOrdernessGenerator());
    KeyedStream<Event, String> partitionedInput = inputFiltered.keyBy(new MyKeySelector());

    Pattern<Event, ?> pattern = Pattern.<Event>begin("start")
    .where(new ActionCondition("action1"))
    .followedBy("middle").where(new ActionCondition("action2"))
    .followedBy("end").where(new ActionCondition("action3"));

    pattern = pattern.within(Time.seconds(30));

    PatternStream<Event> patternStream = CEP.pattern(partitionedInput, pattern);

Event is just a POJO:

public class Event {
    private UUID id;
    private String action;
    private String senderID;
    private long occurrenceTimeStamp;
    ......
}

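Events are pulled from my custom source (Google PubSub). The first filter, FilterEmptyAndInvalidEvents(), only filters out malformed events and the like, but that does not happen in this case; I can verify this from the logging output. So every event runs through the MyKeySelector.getKey() method.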

BoundedOutOfOrdernessGenerator only extracts the timestamp from one field:

public class BoundedOutOfOrdernessGenerator implements AssignerWithPeriodicWatermarks<Event> {
    private static Logger LOG = LoggerFactory.getLogger(BoundedOutOfOrdernessGenerator.class);
    private final long maxOutOfOrderness = 5500; // 5.5 seconds

    private long currentMaxTimestamp;

    @Override
    public long extractTimestamp(Event element, long previousElementTimestamp) {
        long timestamp = element.getOccurrenceTimeStamp();
        currentMaxTimestamp = Math.max(timestamp, currentMaxTimestamp);
        return timestamp;
    }

    @Override
    public Watermark getCurrentWatermark() {
        // return the watermark as current highest timestamp minus the out-of-orderness bound
        Watermark newWatermark = new Watermark(currentMaxTimestamp - maxOutOfOrderness);
        return newWatermark;
    }
}

MyKeySelector just extracts a String value from a field:

public class MyKeySelector implements KeySelector<Event, String> {
    private static Logger LOG = LoggerFactory.getLogger(MyKeySelector.class);

    @Override
    public String getKey(Event value) throws Exception {
        String senderID = value.getSenderID();
        LOG.info("Partioning event {} by key {}", value, senderID);
        return senderID;
    }
}

ActionCondition just compares one field of the event and looks like this:

public class ActionCondition extends SimpleCondition<Event> {
    private static Logger LOG = LoggerFactory.getLogger(ActionCondition.class);

    private String filterForCommand = "";

    public ActionCondition(String filterForCommand) {
        this.filterForCommand = filterForCommand;
    }

    @Override
    public boolean filter(Event value) throws Exception {
        LOG.info("Filtering event for {} action: {}", filterForCommand, value);

        if (value == null) {
            return false;
        }

        if (value.getAction() == null) {
            return false;
        }

        if (value.getAction().equals(filterForCommand)) {
            LOG.info("It's a hit for the {} action for event {}", filterForCommand, value);
            return true;
        } else {
            LOG.info("It's a miss for the {} action for event {}", filterForCommand, value);
            return false;
        }
    }
}

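Unfortunately, when I start the job and send events that should match the pattern, they are received and partitioned correctly, but the CEP pattern does not match.

As an example, I send the following events:

  1. action1
  2. action2
  3. action3

In the log output of the Flink job I can see that the events run through the MyKeySelector.getKey() method correctly, because I added logging output there. So the events seem to show up correctly in the stream, but unfortunately they do not match the pattern.

The logging output looks like this: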

    FilterEmptyAndInvalidEvents   - Letting event Event::27ef8d25-8c3b-43fc-a228-fa0dda8e564d --- action: start, sender: RHHLWUi8sXH33AJIAAAA, timestamp: 1518194448701 through
    MyKeySelector  - Partioning event Event::27ef8d25-8c3b-43fc-a228-fa0dda8e564d --- action: start, sender: RHHLWUi8sXH33AJIAAAA, timestamp: 1518194448701 by key RHHLWUi8sXH33AJIAAAA
    FilterEmptyAndInvalidEvents   - Letting event Event::18b45a9c-b837-4b61-acf3-0b545a097203 --- action: click, sender: RHHLWUi8sXH33AJIAAAA, timestamp: 1518194448702 through
    MyKeySelector  - Partioning event Event::18b45a9c-b837-4b61-acf3-0b545a097203 --- action: click, sender: RHHLWUi8sXH33AJIAAAA, timestamp: 1518194448702 by key RHHLWUi8sXH33AJIAAAA
    FilterEmptyAndInvalidEvents   - Letting event Event::fe1486ab-d702-421d-be32-98dd38a1d306 --- action: connect, sender: RHHLWUi8sXH33AJIAAAA, timestamp: 1518194448703 through
    MyKeySelector  - Partioning event Event::fe1486ab-d702-421d-be32-98dd38a1d306 --- action: connect, sender: RHHLWUi8sXH33AJIAAAA, timestamp: 1518194448703 by key RHHLWUi8sXH33AJIAAAA
    MyKeySelector  - Partioning event Event::27ef8d25-8c3b-43fc-a228-fa0dda8e564d --- action: start, sender: RHHLWUi8sXH33AJIAAAA, timestamp: 1518194448701 by key RHHLWUi8sXH33AJIAAAA
    MyKeySelector  - Partioning event Event::18b45a9c-b837-4b61-acf3-0b545a097203 --- action: click, sender: RHHLWUi8sXH33AJIAAAA, timestamp: 1518194448702 by key RHHLWUi8sXH33AJIAAAA
    MyKeySelector  - Partioning event Event::fe1486ab-d702-421d-be32-98dd38a1d306 --- action: connect, sender: RHHLWUi8sXH33AJIAAAA, timestamp: 1518194448703 by key RHHLWUi8sXH33AJIAAAA
    

The TimeCharacteristic is set to EventTime via

    env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);

and the events contain correct timestamps.

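If I now send another 3 events with the same actions (but with new timestamps etc.)

  1. action1
  2. action2
  3. action3

the pattern matches for the first set of events. I know that it matches the first set because, for debugging purposes, I tag each event with a GUID and print out the matched events.

When I send a 3rd, 4th, ... set of these 3 events, it is always the previous set of events that matches. So there seems to be a kind of "offset" in the pattern detection. It does not seem to be a timing issue, because the first set of events is also not matched if I wait a long time after sending it (and see the events being partitioned by Flink).

What is wrong with my code, or why does Flink always match only the previous set of events against the pattern?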

1 answer:

Answer 0 (score: 1)

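I solved it. I had been looking for the problem in the streaming source, but my event handling was actually completely fine. The problem was that my watermark generation was not happening continuously. As you can see in the code above, the watermark only advanced when an event was received.

But after sending the first 3 events, no further events arrived in my setup. Therefore, no new watermark was generated.

Because no new watermark was created with a timestamp greater than that of the last received event of the sequence, Flink never processed the elements. The reason can be found here: Flink CEP - Handling Lateness in Event Time

The important sentence is:

"...when a watermark arrives, all the elements in this buffer with timestamps smaller than that of the watermark are processed."

Because BoundedOutOfOrdernessGenerator produces a watermark that lags by 5.5 seconds, the latest watermark was always 5.5 seconds before the timestamp of the last event. Hence the events were never processed.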
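To see why this produces the one-set "offset" described in the question, here is a minimal sketch in plain Java with no Flink dependency. It is a toy model, not Flink's actual CEP operator: the class WatermarkOffsetDemo and its onEvent method are made up for illustration, the buffer is released strictly by the rule quoted above, periodic watermark emission is collapsed into a per-event update, and the 60-second gap before the second set is an assumed value.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model (NOT Flink code) of an event-time buffer behind an event-driven watermark. */
final class WatermarkOffsetDemo {
    // Same out-of-orderness bound as the generator in the question (5.5 seconds).
    static final long MAX_OUT_OF_ORDERNESS = 5_500L;

    private final List<Long> buffer = new ArrayList<>();
    private long currentMaxTimestamp; // starts at 0, like the field in the question

    /**
     * Buffers one event timestamp, advances the watermark from it, and returns
     * the buffered timestamps that are strictly smaller than the new watermark.
     */
    List<Long> onEvent(long timestamp) {
        buffer.add(timestamp);
        currentMaxTimestamp = Math.max(timestamp, currentMaxTimestamp);
        long watermark = currentMaxTimestamp - MAX_OUT_OF_ORDERNESS;

        List<Long> released = new ArrayList<>();
        for (long ts : buffer) {
            if (ts < watermark) {
                released.add(ts);
            }
        }
        buffer.removeAll(released);
        return released;
    }

    public static void main(String[] args) {
        WatermarkOffsetDemo demo = new WatermarkOffsetDemo();
        // First set: timestamps taken from the log output in the question.
        long[] firstSet = {1518194448701L, 1518194448702L, 1518194448703L};
        for (long ts : firstSet) {
            System.out.println("released after " + ts + ": " + demo.onEvent(ts));
        }
        // Second set: assumed to start 60 seconds later (hypothetical value).
        long laterTs = firstSet[0] + 60_000L;
        System.out.println("released after " + laterTs + ": " + demo.onEvent(laterTs));
    }
}
```

Feeding in the three timestamps of the first set releases nothing, because the watermark stays 5.5 seconds behind the newest of them. Only the first event of a later set pushes the watermark past the first set, which is then released for matching while the new set waits in turn.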

So one solution is to generate watermarks periodically, assuming a specific lateness of events. To do this, we need to call setAutoWatermarkInterval on the ExecutionConfig:

final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
..
ExecutionConfig executionConfig = env.getConfig();
executionConfig.setAutoWatermarkInterval(1000L);

This lets Flink call the watermark generator periodically at the given interval (every second in this case) and pull a new watermark.
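Periodic polling by itself only helps if the generator's return value can change between events. As a plain-Java illustration (no Flink dependency; the class EventDrivenWatermark and its methods are hypothetical stand-ins that mirror the arithmetic of the question's generator), repeated calls without new events keep returning the same value:

```java
/** Toy stand-in (NOT Flink code) for a generator whose watermark depends only on seen events. */
final class EventDrivenWatermark {
    private final long maxOutOfOrderness;
    private long currentMaxTimestamp; // starts at 0, like the field in the question

    EventDrivenWatermark(long maxOutOfOrdernessMillis) {
        this.maxOutOfOrderness = maxOutOfOrdernessMillis;
    }

    /** Records the timestamp of an incoming event. */
    void onEvent(long timestamp) {
        currentMaxTimestamp = Math.max(timestamp, currentMaxTimestamp);
    }

    /** Highest timestamp seen so far minus the out-of-orderness bound. */
    long getCurrentWatermark() {
        return currentMaxTimestamp - maxOutOfOrderness;
    }

    /** Polls the generator repeatedly after one event and reports whether the value ever changes. */
    static boolean advancesWhileIdle(long lastEventTs, long boundMillis, int polls) {
        EventDrivenWatermark generator = new EventDrivenWatermark(boundMillis);
        generator.onEvent(lastEventTs);
        long first = generator.getCurrentWatermark();
        for (int i = 0; i < polls; i++) {
            if (generator.getCurrentWatermark() != first) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Last timestamp from the question's log output, 5.5 s bound, 60 idle polls.
        System.out.println("advances while idle: "
                + advancesWhileIdle(1518194448703L, 5_500L, 60));
    }
}
```

However often such a generator is polled, its watermark stays 5.5 seconds behind the last event until another event arrives.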

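In addition, we need to adjust the timestamp/watermark generator so that it emits a new watermark even when no new events are flowing in. For this, I adapted the BoundedOutOfOrdernessTimestampExtractor.java that ships with Flink: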

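import java.time.Instant;

import org.apache.flink.streaming.api.functions.AssignerWithPeriodicWatermarks;
import org.apache.flink.streaming.api.watermark.Watermark;
import org.apache.flink.streaming.api.windowing.time.Time;

public class BoundedOutOfOrdernessGenerator implements AssignerWithPeriodicWatermarks<Event> {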

    private static final long serialVersionUID = 1L;

    /** The current maximum timestamp seen so far. */
    private long currentMaxTimestamp;

    /** The timestamp of the last emitted watermark. */
    private long lastEmittedWatermark = Long.MIN_VALUE;

    /**
     * The (fixed) interval between the maximum seen timestamp seen in the records
     * and that of the watermark to be emitted.
     */
    private final long maxOutOfOrderness;

    public BoundedOutOfOrdernessGenerator() {
        Time maxOutOfOrderness = Time.seconds(5);

        if (maxOutOfOrderness.toMilliseconds() < 0) {
            throw new RuntimeException("Tried to set the maximum allowed " + "lateness to " + maxOutOfOrderness
                    + ". This parameter cannot be negative.");
        }
        this.maxOutOfOrderness = maxOutOfOrderness.toMilliseconds();
        this.currentMaxTimestamp = Long.MIN_VALUE + this.maxOutOfOrderness;
    }

    public long getMaxOutOfOrdernessInMillis() {
        return maxOutOfOrderness;
    }

    /**
     * Extracts the timestamp from the given element.
     *
     * @param element The element that the timestamp is extracted from.
     * @return The new timestamp.
     */
    public long extractTimestamp(Event element) {
        long timestamp = element.getOccurrenceTimeStamp();
        return timestamp;
    }

    @Override
    public final Watermark getCurrentWatermark() {
        Instant instant = Instant.now();
        long nowTimestampMillis = instant.toEpochMilli();
        long latenessTimestamp = nowTimestampMillis - maxOutOfOrderness;

        if (latenessTimestamp >= currentMaxTimestamp) {
            currentMaxTimestamp = latenessTimestamp;
        }

        // this guarantees that the watermark never goes backwards.
        long potentialWM = currentMaxTimestamp - maxOutOfOrderness;
        if (potentialWM >= lastEmittedWatermark) {
            lastEmittedWatermark = potentialWM;
        }
        return new Watermark(lastEmittedWatermark);
    }

    @Override
    public final long extractTimestamp(Event element, long previousElementTimestamp) {
        long timestamp = extractTimestamp(element);
        if (timestamp > currentMaxTimestamp) {
            currentMaxTimestamp = timestamp;
        }
        return timestamp;
    }
}

As you can see in getCurrentWatermark(), I take the current epoch timestamp, subtract the maximum lateness we expect, and then create the watermark from this timestamp.

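In summary, Flink now pulls a new watermark every second, and the watermark always lags "behind" by the configured 5 seconds. This allows events to be matched against the defined pattern shortly after the last event has been received, instead of only when the next event arrives. Note that getCurrentWatermark() as written subtracts maxOutOfOrderness twice while the stream is idle (once for latenessTimestamp and once for potentialWM), so, assuming event timestamps are close to wall-clock time, the idle watermark trails the wall clock by about 10 seconds and a finished sequence is matched roughly 10 seconds after its last event rather than within 5.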

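Whether this is suitable depends on your scenario, because it also means that events which reach Flink with a timestamp already behind the watermark (that is, delayed beyond the allowed lateness) are discarded and no longer processed.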
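The timing of the adapted generator can be traced with a small plain-Java sketch that passes the clock in as a parameter instead of calling Instant.now(). This is a toy model with no Flink dependency: the class IdleAwareWatermarkModel and the helper millisUntilReleased are hypothetical names, the arithmetic is copied from getCurrentWatermark() above, and it assumes the last event arrives at the same wall-clock time as its timestamp and is released once its timestamp is strictly smaller than the watermark.

```java
/** Toy model (NOT Flink code) of the adapted generator's arithmetic with an injected clock. */
final class IdleAwareWatermarkModel {
    private final long maxOutOfOrderness;
    private long currentMaxTimestamp;
    private long lastEmittedWatermark = Long.MIN_VALUE;

    IdleAwareWatermarkModel(long maxOutOfOrdernessMillis) {
        this.maxOutOfOrderness = maxOutOfOrdernessMillis;
        this.currentMaxTimestamp = Long.MIN_VALUE + maxOutOfOrdernessMillis;
    }

    /** Mirrors extractTimestamp(): remembers the highest event timestamp seen. */
    void onEvent(long timestamp) {
        if (timestamp > currentMaxTimestamp) {
            currentMaxTimestamp = timestamp;
        }
    }

    /** Mirrors getCurrentWatermark(), with nowMillis standing in for Instant.now(). */
    long currentWatermark(long nowMillis) {
        long latenessTimestamp = nowMillis - maxOutOfOrderness;
        if (latenessTimestamp >= currentMaxTimestamp) {
            currentMaxTimestamp = latenessTimestamp;
        }
        long potentialWM = currentMaxTimestamp - maxOutOfOrderness;
        if (potentialWM >= lastEmittedWatermark) {
            lastEmittedWatermark = potentialWM;
        }
        return lastEmittedWatermark;
    }

    /**
     * Polls once per intervalMillis, starting at the event's own timestamp, and returns
     * how many milliseconds pass until the watermark is strictly greater than lastEventTs.
     */
    static long millisUntilReleased(long lastEventTs, long boundMillis, long intervalMillis) {
        IdleAwareWatermarkModel model = new IdleAwareWatermarkModel(boundMillis);
        model.onEvent(lastEventTs);
        long now = lastEventTs; // assumption: arrival time equals event time
        while (model.currentWatermark(now) <= lastEventTs) {
            now += intervalMillis;
        }
        return now - lastEventTs;
    }

    public static void main(String[] args) {
        // Last timestamp from the question's log, 5 s bound, 1 s watermark interval.
        System.out.println(millisUntilReleased(1518194448703L, 5_000L, 1_000L) + " ms");
    }
}
```

Under these assumptions the watermark keeps advancing without further events, so the last set is eventually released on its own. Because the bound is applied both to the clock and to currentMaxTimestamp, the release happens after roughly twice the bound plus up to one polling interval.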