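I have a simple program that uses the Flink CEP library to detect multiple failed logins from a log file. My application uses event time, and I am doing a keyBy on the login 'user'.
When I set the StreamExecutionEnvironment parallelism to 1, the program works fine. With any other parallelism it fails, and I can't understand why.
I can see that all records belonging to a particular user go to the same thread, so why does this problem occur? Also note that in many cases the records are not in event-time order (I'm not sure whether that is a problem), but I can't find anything in the API that would let me sort records by event time within a window.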
final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);
env.getConfig().setAutoWatermarkInterval(1000);
env.setParallelism(1); //tried with 1 & 4
.....
DataStream<LogEvent> inputLogEventStream = env
    .readFile(format, FILEPATH, FileProcessingMode.PROCESS_CONTINUOUSLY, 1000)
    .map(new MapToLogEvents())
    .assignTimestampsAndWatermarks(new BoundedOutOfOrdernessTimestampExtractor<LogEvent>(Time.seconds(0)) {
        public long extractTimestamp(LogEvent element) {
            return element.getTimeLong();
        }
    })
    .keyBy(new KeySelector<LogEvent, String>() {
        public String getKey(LogEvent le) throws Exception {
            return le.getUser();
        }
    });
inputLogEventStream.print();
Pattern<LogEvent, ?> mflPattern = Pattern.<LogEvent> begin("mfl")
    .subtype(LogEvent.class).where(
        new SimpleCondition<LogEvent>() {
            public boolean filter(LogEvent logEvent) {
                if (logEvent.getResult().equalsIgnoreCase("failed")) { return true; }
                return false;
            }
        })
    .timesOrMore(3).within(Time.seconds(60));
PatternStream<LogEvent> mflPatternStream = CEP.pattern(inputLogEventStream, mflPattern);
DataStream<Threat> outputMflStream = mflPatternStream.select(
    new PatternSelectFunction<LogEvent, Threat>() {
        public Threat select(Map<String, List<LogEvent>> logEventsMap) {
            return new Threat("MULTIPLE FAILED LOGINS detected!");
        }
    });
outputMflStream.print();
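On the ordering concern raised in the question: in event time, the CEP operator generally buffers incoming records sorted by timestamp and only processes them once a watermark has passed them, so an explicit sort in the job is usually not needed. The following is a minimal plain-Java sketch of that buffering idea, not Flink code. The class `EventTimeBufferSketch`, its `Event` type, and its methods are hypothetical names made up for illustration; the timestamps are the `timeLong` values of user deb's records.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

/** Plain-Java illustration only; hypothetical names, not part of Flink. */
public class EventTimeBufferSketch {

    /** Hypothetical stand-in for LogEvent: a label plus an event timestamp. */
    public static final class Event {
        final String label;
        final long timestamp;

        public Event(String label, long timestamp) {
            this.label = label;
            this.timestamp = timestamp;
        }
    }

    // Events wait here, ordered by timestamp, until a watermark releases them.
    private final PriorityQueue<Event> buffer =
            new PriorityQueue<>(Comparator.comparingLong((Event e) -> e.timestamp));

    /** Buffers an event regardless of the order in which it arrives. */
    public void onEvent(Event e) {
        buffer.add(e);
    }

    /** Releases, in timestamp order, every buffered event at or before the watermark. */
    public List<String> onWatermark(long watermark) {
        List<String> released = new ArrayList<>();
        while (!buffer.isEmpty() && buffer.peek().timestamp <= watermark) {
            released.add(buffer.poll().label);
        }
        return released;
    }

    public static void main(String[] args) {
        EventTimeBufferSketch sketch = new EventTimeBufferSketch();
        // Arrival order as printed for user deb with parallelism = 4.
        sketch.onEvent(new Event("base21", 1522103406000L));
        sketch.onEvent(new Event("base20", 1522103405000L));
        sketch.onEvent(new Event("base19", 1522103403000L));
        System.out.println(sketch.onWatermark(1522103405000L)); // [base19, base20]
        System.out.println(sketch.onWatermark(1522103406000L)); // [base21]
    }
}
```

The point of the sketch is that nothing is released until a watermark arrives: if the watermark never advances, the buffered events are never evaluated against the pattern.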
The printed output is reproduced below:
parallelism = 1 (pattern detected successfully)
04/03/2018 12:03:53 Source: Custom File Source(1/1) switched to RUNNING
04/03/2018 12:03:53 SelectCepOperator -> Sink: Unnamed(1/1) switched to RUNNING
04/03/2018 12:03:53 Split Reader: Custom File Source -> Map -> Timestamps/Watermarks(1/1) switched to RUNNING
04/03/2018 12:03:53 Sink: Unnamed(1/1) switched to RUNNING
LogEvent [recordType=base18, eventCategory=login, user=paul, machine=laptop1, result=failed, eventCount=1, dataBytes=100, time=2018-03-26T22:30:08Z, timeLong=1522103408000]
LogEvent [recordType=base19, eventCategory=login, user=deb, machine=desktop1, result=failed, eventCount=1, dataBytes=100, time=2018-03-26T22:30:03Z, timeLong=1522103403000]
LogEvent [recordType=base20, eventCategory=login, user=deb, machine=desktop1, result=failed, eventCount=1, dataBytes=100, time=2018-03-26T22:30:05Z, timeLong=1522103405000]
LogEvent [recordType=base21, eventCategory=login, user=deb, machine=desktop1, result=failed, eventCount=1, dataBytes=100, time=2018-03-26T22:30:06Z, timeLong=1522103406000]
**THREAT** ==> MULTIPLE FAILED LOGINS detected!
parallelism = 4 (pattern not detected)
04/03/2018 12:05:33 Split Reader: Custom File Source -> Map -> Timestamps/Watermarks(3/4) switched to RUNNING
04/03/2018 12:05:33 Split Reader: Custom File Source -> Map -> Timestamps/Watermarks(2/4) switched to RUNNING
04/03/2018 12:05:33 Sink: Unnamed(2/4) switched to RUNNING
04/03/2018 12:05:33 SelectCepOperator -> Sink: Unnamed(2/4) switched to RUNNING
04/03/2018 12:05:33 Sink: Unnamed(3/4) switched to RUNNING
04/03/2018 12:05:33 SelectCepOperator -> Sink: Unnamed(3/4) switched to RUNNING
2> LogEvent [recordType=base18, eventCategory=login, user=paul, machine=laptop1, result=failed, eventCount=1, dataBytes=100, time=2018-03-26T22:30:08Z, timeLong=1522103408000]
3> LogEvent [recordType=base21, eventCategory=login, user=deb, machine=desktop1, result=failed, eventCount=1, dataBytes=100, time=2018-03-26T22:30:06Z, timeLong=1522103406000]
3> LogEvent [recordType=base20, eventCategory=login, user=deb, machine=desktop1, result=failed, eventCount=1, dataBytes=100, time=2018-03-26T22:30:05Z, timeLong=1522103405000]
3> LogEvent [recordType=base19, eventCategory=login, user=deb, machine=desktop1, result=failed, eventCount=1, dataBytes=100, time=2018-03-26T22:30:03Z, timeLong=1522103403000]
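Answer 0 (score: 0)
I think what is happening here is that different partitions are receiving these events; using .keyBy() is very important when working with CEP.
Your code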
PatternStream<LogEvent> mflPatternStream = CEP.pattern(inputLogEventStream, mflPattern);
should, I believe, be
PatternStream<LogEvent> mflPatternStream = CEP.pattern(inputLogEventStream.keyBy("eventCategory","user"), mflPattern);
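To illustrate why keying matters for partitioning, here is a rough plain-Java sketch of key-based routing: records with the same key fields always land on the same partition. This is a deliberate simplification and an assumption for illustration; Flink's real assignment goes through key groups rather than a plain hash modulo, and `KeyPartitionSketch` and `partitionFor` are hypothetical names.

```java
import java.util.Objects;

/** Plain-Java illustration only; hypothetical names, not Flink's actual routing. */
public class KeyPartitionSketch {

    /** Simplified routing: hash of the key fields, reduced to a partition index. */
    public static int partitionFor(String eventCategory, String user, int parallelism) {
        // floorMod keeps the index non-negative even when the hash is negative.
        return Math.floorMod(Objects.hash(eventCategory, user), parallelism);
    }

    public static void main(String[] args) {
        int parallelism = 4;
        // Two records with the same (eventCategory, user) key, as for user deb.
        int first = partitionFor("login", "deb", parallelism);
        int second = partitionFor("login", "deb", parallelism);
        System.out.println(first == second); // true
    }
}
```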
You may want to take a look at https://cwiki.apache.org/confluence/display/FLINK/Streams+and+Operations+on+Streams
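A separate possibility that may be worth ruling out, since the question's stream is already keyed by user: an operator with several parallel inputs advances its event-time clock to the minimum of the watermarks on those inputs. If some of the parallel Timestamps/Watermarks subtasks receive no records, their watermark stays at its initial value and the downstream CEP operator's watermark may never advance. Whether that is what happens in this job is an assumption, not something the output confirms. A minimal plain-Java sketch of the minimum rule, with the hypothetical names `MinWatermarkSketch` and `combinedWatermark`:

```java
import java.util.Arrays;

/** Plain-Java illustration only; hypothetical names, not part of Flink. */
public class MinWatermarkSketch {

    /** Event-time clock of an operator with several inputs: the minimum input watermark. */
    public static long combinedWatermark(long[] inputWatermarks) {
        return Arrays.stream(inputWatermarks).min().orElse(Long.MIN_VALUE);
    }

    public static void main(String[] args) {
        // An upstream subtask that has seen no records has not advanced its watermark.
        long idle = Long.MIN_VALUE;
        long[] oneInput = {1522103408000L};
        // Assumed scenario: two of four upstream subtasks received no data.
        long[] fourInputs = {idle, 1522103408000L, 1522103406000L, idle};
        System.out.println(combinedWatermark(oneInput));   // 1522103408000
        System.out.println(combinedWatermark(fourInputs)); // -9223372036854775808
    }
}
```

With a single input the clock follows the data; with an idle input among four it stays at the initial value, which would be consistent with the pattern firing only at parallelism 1.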