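I want to use Spark Structured Streaming to detect a continuous pattern over X minutes, and I'm wondering whether I can use the GroupState timeout to form the time window. What I want to do is this: once I detect the first occurrence of the pattern in an object (EntityMetric), check whether all subsequent EntityMetric objects in the stream keep showing the pattern for X minutes. After X minutes have elapsed, return an Alert object.
Here is my code, but somehow I never see state.hasTimedOut() become true. What am I missing here? Any help is much appreciated. Thanks!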
Dataset<EntityMetric> ems = spark
    .readStream()...
KeyValueGroupedDataset<Integer, EntityMetric> groupedEm = ems.groupByKey(
    (MapFunction<EntityMetric, Integer>) m -> m.<Integer>getEntityId(), Encoders.INT());
MapGroupsWithStateFunction<Integer, EntityMetric, Alert, Alert> continuousViolationsFunc =
    new MapGroupsWithStateFunction<Integer, EntityMetric, Alert, Alert>() {
        @Override
        public Alert call(Integer entityId, Iterator<EntityMetric> events, GroupState<Alert> state)
                throws Exception {
            Alert currentAlert = null;
            Alert newAlert = null;
            …
            …
            if (state.hasTimedOut()) {
                // How come state.hasTimedOut() is never true?
                state.remove();
            } else if (state.exists()) {
                currentAlert = state.get();
                while (events.hasNext()) {
                    EntityMetric e = events.next();
                    // Pattern matching logic that instantiates and populates newAlert…
                }
                if (newAlert != null) {
                    state.update(newAlert);
                }
            } else {
                boolean startTimer = false;
                // For the first occurrence…
                while (events.hasNext()) {
                    EntityMetric e = events.next();
                    // Pattern matching logic that set startTimer to true…
                }
                if (startTimer) {
                    state.update(newAlert);
                    state.setTimeoutDuration("1 minutes");
                }
            }
            return newAlert;
        }
    };
Dataset<Alert> alerts = groupedEm.mapGroupsWithState(
    continuousViolationsFunc,
    Encoders.bean(Alert.class),
    Encoders.bean(Alert.class),
    GroupStateTimeout.ProcessingTimeTimeout());
StreamingQuery query = alerts
    .writeStream()
    .format("console")
    .outputMode(OutputMode.Update())
    .start();
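
One possible explanation, offered as a guess rather than a confirmed diagnosis: as far as I can tell from the GroupState documentation, the timeout is reset every time the function is called for a key, so it has to be set again on every call. In the code above, the `else if (state.exists())` branch updates the state without calling `setTimeoutDuration` again, so any follow-up event for the same entity would leave that key with no timeout armed. The plain-Java sketch below models only that rule. It does not use Spark, and `TimeoutModel`, `ToyGroupState`, `onData`, `onTrigger` and `simulate` are hypothetical names made up for illustration.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model of a processing-time timeout that must be re-armed on every call. Not Spark code. */
public class TimeoutModel {

    /** Hypothetical stand-in for one key's state; not the Spark GroupState class. */
    static final class ToyGroupState {
        Long timeoutAtMs = null; // null means no timeout is currently armed
        boolean exists = false;
    }

    private final Map<Integer, ToyGroupState> states = new HashMap<>();
    private final boolean rearmOnEveryCall;
    private final long timeoutMs;
    private int timeoutsFired = 0;

    TimeoutModel(boolean rearmOnEveryCall, long timeoutMs) {
        this.rearmOnEveryCall = rearmOnEveryCall;
        this.timeoutMs = timeoutMs;
    }

    /** Simulates one function call for a key that received new data at nowMs. */
    void onData(int key, long nowMs) {
        ToyGroupState s = states.computeIfAbsent(key, k -> new ToyGroupState());
        // Modeled rule (assumption): calling the function clears any armed timeout.
        s.timeoutAtMs = null;
        boolean firstOccurrence = !s.exists;
        s.exists = true;
        // The question's code arms the timeout only on the first occurrence.
        if (firstOccurrence || rearmOnEveryCall) {
            s.timeoutAtMs = nowMs + timeoutMs;
        }
    }

    /** Simulates a trigger at nowMs: fires and removes every state whose timeout expired. */
    void onTrigger(long nowMs) {
        states.values().removeIf(s -> {
            boolean expired = s.timeoutAtMs != null && nowMs >= s.timeoutAtMs;
            if (expired) {
                timeoutsFired++;
            }
            return expired;
        });
    }

    /** Runs a fixed scenario and returns how many timeouts fired. */
    static int simulate(boolean rearmOnEveryCall) {
        TimeoutModel m = new TimeoutModel(rearmOnEveryCall, 60_000L);
        m.onData(1, 0L);       // first occurrence arms a 60 s timeout
        m.onData(1, 30_000L);  // follow-up data arrives before the timeout
        m.onTrigger(120_000L); // a later trigger checks for expired timeouts
        return m.timeoutsFired;
    }

    public static void main(String[] args) {
        System.out.println("re-armed on every call: " + simulate(true));
        System.out.println("armed only once:        " + simulate(false));
    }
}
```

If this is what is happening, calling `state.setTimeoutDuration(...)` in the `state.exists()` branch as well might be worth trying. Note that re-arming with the full duration on every call extends the window; to keep a fixed X-minute window, one option would be to store the window start time in the state and set only the remaining duration. The documentation also says timeouts are only checked when the query triggers, so the timeout call can arrive later than the configured duration.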