How do I use GroupState timeouts in Spark Structured Streaming to form a time window?

Time: 2017-09-20 08:03:57

Tags: apache-spark spark-structured-streaming

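I want to use Spark Structured Streaming to detect a pattern that occurs continuously over X minutes, and I am wondering whether I can use GroupState timeouts to form the time window. What I want to do is this: once I detect the first occurrence of the pattern in an object (EntityMetric), check whether all subsequent EntityMetric objects in the stream keep exhibiting the pattern for the next X minutes. Once the X minutes have passed, return an Alert object.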
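The intended behavior can be illustrated with a minimal, Spark-free sketch. It is not the code from this question and does not use the Spark API: the class ContinuousPatternWindow and its methods onEvent and onTimeout are hypothetical names chosen for illustration, and timestamps are assumed to be plain milliseconds. Roughly, onEvent stands in for a call that receives new data for an entity, and onTimeout stands in for a call made because the timeout expired.

```java
/**
 * Spark-free sketch of the per-entity window logic described above.
 * All names here are hypothetical and are not part of the Spark API.
 */
final class ContinuousPatternWindow {
    private final long windowMs;     // X minutes, expressed in milliseconds
    private long deadlineMs = -1L;   // -1 means no window is currently open

    ContinuousPatternWindow(long windowMs) {
        this.windowMs = windowMs;
    }

    /** Handles one event: the first match opens a window, a non-match cancels it. */
    void onEvent(long timestampMs, boolean matchesPattern) {
        if (deadlineMs < 0 && matchesPattern) {
            deadlineMs = timestampMs + windowMs;   // first occurrence starts the timer
        } else if (deadlineMs >= 0 && !matchesPattern) {
            deadlineMs = -1L;                      // pattern broke, so discard the window
        }
    }

    /** Plays the role of the timeout: returns true (raise an alert) once the window has elapsed. */
    boolean onTimeout(long nowMs) {
        if (deadlineMs >= 0 && nowMs >= deadlineMs) {
            deadlineMs = -1L;                      // clear the state once the alert is raised
            return true;
        }
        return false;
    }

    public static void main(String[] args) {
        ContinuousPatternWindow window = new ContinuousPatternWindow(60_000L);
        window.onEvent(0L, true);        // first occurrence opens the window
        window.onEvent(30_000L, true);   // pattern still holds
        System.out.println(window.onTimeout(60_000L));  // prints true
    }
}
```

In this sketch the deadline is fixed when the first occurrence is seen and is not extended by later matching events, which is one possible reading of "X minutes" above.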

Here is my code, but somehow I never see state.hasTimedOut() become true. What am I missing here? Any help is greatly appreciated. Thanks!

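import java.util.Iterator;

import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.api.java.function.MapGroupsWithStateFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.KeyValueGroupedDataset;
import org.apache.spark.sql.streaming.GroupState;
import org.apache.spark.sql.streaming.GroupStateTimeout;
import org.apache.spark.sql.streaming.OutputMode;
import org.apache.spark.sql.streaming.StreamingQuery;

Dataset<EntityMetric> ems = spark
        .readStream()...

// Group the stream by entity id so that state is tracked per entity.
KeyValueGroupedDataset<Integer, EntityMetric> groupedEm = ems.groupByKey(
        (MapFunction<EntityMetric, Integer>) m -> m.getEntityId(),
        Encoders.INT());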

MapGroupsWithStateFunction<Integer, EntityMetric, Alert, Alert> continuousViolationsFunc =
        new MapGroupsWithStateFunction<Integer, EntityMetric, Alert, Alert>() {
    @Override
    public Alert call(Integer entityId, Iterator<EntityMetric> events, GroupState<Alert> state)
            throws Exception {
        Alert currentAlert = null;
        Alert newAlert = null;
        …
        …
        if (state.hasTimedOut()) {
            // How come state.hasTimedOut() is never true?
            state.remove();
        } else if (state.exists()) {
            currentAlert = state.get();
            while (events.hasNext()) {
                EntityMetric e = events.next();
                // Pattern matching logic that instantiates and populates newAlert…
            }
            if (newAlert != null) {
                state.update(newAlert);
            }
        } else {
            boolean startTimer = false;
            // For the first occurrence…
            while (events.hasNext()) {
                EntityMetric e = events.next();
                // Pattern matching logic that sets startTimer to true…
            }
            if (startTimer) {
                state.update(newAlert);
                state.setTimeoutDuration("1 minutes");
            }
        }
        return newAlert;
    }
};

Dataset<Alert> alerts = groupedEm.mapGroupsWithState(
        continuousViolationsFunc,
        Encoders.bean(Alert.class),
        Encoders.bean(Alert.class),
        GroupStateTimeout.ProcessingTimeTimeout());

StreamingQuery query = alerts
        .writeStream()
        .format("console")
        .outputMode(OutputMode.Update())
        .start();

0 answers:

No answers