I receive a stream of data from a UDP socket (live nginx logs). The data structure is:
date | ip | mac | objectName | rate | size
2016-04-05 11:17:34 | 10.0.0.1 | e1:e2 | book1 | 10 | 121
2016-04-05 11:17:34 | 10.0.0.2 | a5:a8 | book2351 | 8 | 2342
2016-04-05 11:17:34 | 10.0.0.3 | d1:b56 | bookA5 | 10 | 12
2016-04-05 11:17:35 | 10.0.0.1 | e1:e2 | book67 | 10 | 768
2016-04-05 11:17:35 | 10.0.0.2 | a5:a8 | book2351 | 8 | 897
2016-04-05 11:17:35 | 10.0.0.3 | d1:b56 | bookA5 | 9 | 34
2016-04-05 11:17:35 | 10.0.0.4 | c7:c2 | book99 | 9 | 924
...
2016-04-05 11:18:01 | 10.0.0.1 | e1:e2 | book-10 | 8 | 547547
2016-04-05 11:18:17 | 10.0.0.4 | c7:c2 | book99 | 10 | 23423
2016-04-05 11:18:18 | 10.0.0.3 | d1:b56 | bookA5 | 10 | 1138
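For reference, here is a minimal sketch of how one such line could be parsed into the Raw holder used in the code further down. The actual parsing inside NginxUDPReceiver is not shown in this question, so the field handling, the date pattern and the all-args Raw constructor are assumptions (plain Java 8, java.time):

// Assumed parser for one log line such as
// "2016-04-05 11:17:34 | 10.0.0.1 | e1:e2 | book1 | 10 | 121"
static Raw parseLine(String line) {
    String[] f = line.split("\\|");
    DateTimeFormatter fmt = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss");
    return new Raw(
            LocalDateTime.parse(f[0].trim(), fmt),  // date
            f[1].trim(),                            // ip
            f[2].trim(),                            // mac
            f[3].trim(),                            // objectName
            Integer.parseInt(f[4].trim()),          // rate
            Integer.parseInt(f[5].trim()));         // size
}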
I have to group the records by minute and by key (ip, mac) and produce one row per key and minute. Within a minute the objectName of a key may change: for example, for 2016-04-05 11:17:34 | 10.0.0.1 | e1:e2 book1 changed to book67, therefore it must be book1 (the first one seen in that minute). The size is taken as the difference within the minute, e.g. = ... 768 - 121. So, the result (with the size left uncomputed; a small plain-Java sketch of this per-minute reduction follows the table):
date | ip | mac | objectName | changes | size
2016-04-05 11:17:00 | 10.0.0.1 | e1:e2 | book1 | 0 | 768 - 121
2016-04-05 11:17:00 | 10.0.0.2 | a5:a8 | book2351 | 0 | 897 - 2342
2016-04-05 11:17:00 | 10.0.0.3 | d1:b56 | bookA5 | 1 | 34 - 12
2016-04-05 11:17:00 | 10.0.0.4 | c7:c2 | book99 | 0 | 924
...
2016-04-05 11:18:00 | 10.0.0.1 | e1:e2 | book-10 | 0 | 547547
2016-04-05 11:18:00 | 10.0.0.4 | c7:c2 | book99 | 0 | 23423
2016-04-05 11:18:00 | 10.0.0.3 | d1:b56 | bookA5 | 0 | 1138
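To make the intended per-minute reduction concrete, here is a minimal plain-Java sketch (not Spark code) that turns the RawValue records of one key and minute into one Value row, using the holder classes from the snippet below and an all-args Value constructor. Two rules in it are only my reading of the sample output, not stated requirements: changes counts how many times rate changes within the minute, and size is the last size minus the first one when the minute has several records (otherwise the single record's size):

// Plain-Java sketch of the per-minute reduction (assumptions noted above).
// Input: all RawValue records of one (minute, ip, mac) key, sorted by time.
static Value reduceMinute(List<RawValue> records) {
    RawValue first = records.get(0);
    RawValue last = records.get(records.size() - 1);

    int changes = 0;
    int previousRate = first.rate;
    for (RawValue r : records) {
        if (r.rate != previousRate) {   // assumption: "changes" = number of rate changes
            changes++;
            previousRate = r.rate;
        }
    }

    int size = records.size() > 1
            ? last.size - first.size    // e.g. 768 - 121
            : first.size;               // single record in the minute, e.g. 924
    return new Value(first.objectName, changes, size);  // keep the first objectName: book1, not book67
}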
Here is a snapshot of my code. I know about updateStateByKey and window, but I cannot figure out how to flush the data to a database or to the file system once the (minute) key has changed:
private static final Duration SLIDE_INTERVAL = Durations.seconds(10);
private static final String nginxLogHost = "localhost";
private static final int nginxLogPort = 9999;

// Constructors and getters of the holder classes are omitted in this snapshot.
private static class Raw {
    LocalDateTime time; // full time with seconds
    String ip;
    String mac;
    String objectName;
    int rate;
    int size;
}

private static class Key {
    LocalDateTime time; // time with 00 seconds
    String ip;
    String mac;
}

private static class RawValue {
    LocalDateTime time; // full time with seconds
    String objectName;
    int rate;
    int size;
}

private static class Value {
    String objectName;
    int changes;
    int size;
}

public static void main(String[] args) {
    SparkConf conf = new SparkConf().setMaster("local[4]").setAppName("TestNginxLog");
    conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer");
    JavaStreamingContext jssc = new JavaStreamingContext(conf, SLIDE_INTERVAL);
    jssc.checkpoint("/tmp");

    JavaReceiverInputDStream<Raw> logRecords =
            jssc.receiverStream(new NginxUDPReceiver(nginxLogHost, nginxLogPort));

    // key = (timestamp truncated to the minute, ip, mac); value = the rest of the record
    PairFunction<Raw, Key, RawValue> pairFunction = rawLine -> {
        LocalDateTime time = rawLine.getDateTime();
        Key k = new Key(time.withSecond(0).withNano(0), rawLine.getIp(), rawLine.getMac());
        RawValue v = new RawValue(time, rawLine.getObjectName(), rawLine.getRate(), rawLine.getSize());
        return new Tuple2<>(k, v);
    };

    JavaPairDStream<Key, RawValue> logDStream = logRecords.mapToPair(pairFunction);
Answer 0 (score: 0)
This is a partial answer, and the problem is not fully solved yet. After mapToPair I use:
// 1 key - N values
// (note: logDStream above holds RawValue; a RawValue -> Value transformation is assumed here but not shown)
JavaPairDStream<Key, Iterable<Value>> abonentConnects = logDStream.groupByKey();

// Accumulate data across batches
Function2<List<Iterable<Value>>, Optional<List<Value>>, Optional<List<Value>>> updateFunc =
        (values, previousState) -> {
    List<Value> sum = previousState.or(new ArrayList<>());
    for (Iterable<Value> v : values) {
        v.forEach(sum::add);
    }
    return Optional.of(sum);
};
JavaPairDStream<Key, List<Value>> state = abonentConnects.updateStateByKey(updateFunc);

// Filter data belonging to the previous minute
Function<Tuple2<Key, List<Value>>, Boolean> filterFunc = v1 -> {
    LocalDateTime previousTime = LocalDateTime.now().minusMinutes(1).withSecond(0).withNano(0);
    LocalDateTime valueTime = v1._1().getTime();
    return valueTime.compareTo(previousTime) == 0;
};
JavaPairDStream<Key, List<Value>> filteredRecords = state.filter(filterFunc);

// Save data
filteredRecords.foreachRDD(x -> {
    if (x.count() > 0) {
        x.saveAsTextFile("/tmp/xxx/grouped/" + LocalDateTime.now().toString().replace(":", "-").replace(".", "-"));
    }
});

jssc.start();
jssc.awaitTermination();
The resulting data is produced, but since the output operation runs every 5 seconds, I get the same data duplicated every 5 seconds. I know that I have to use Optional.absent() to clear the data saved in the streaming state. I tried to use it, but I could not combine the two steps in one snippet: saving the data to the file system or to a HashMap and immediately clearing the saved data. Question: how can I do that?
Answer 1 (score: 0)
So, I am closing this question with my own answer. You can use a function of the shape sketched below as the parameter of updateStateByKey. The clue words in this code are: Optional.absent() to eliminate the saved data, Optional.of(...) to group the data, and setAggregateReady(true). The last one is used to save the data to an external target (a DB or the file system) through a filter on getAggregateReady(true) plus some Spark Streaming output operation (for example foreachRDD). After that, the data of the next batch falls into updateStateByKey and is removed by the code removeIf(T::isAggregateReady).
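The answer's original code example is not reproduced on this page, so here is only a minimal sketch of the described pattern. The state element type (StateValue), its aggregateReady flag with setAggregateReady / getAggregateReady / isAggregateReady, the eventTime field used to decide that a minute is finished, and the pairs stream it is wired to are all assumed names; deciding readiness by comparing event time with the current wall-clock minute is also an assumption:

// Sketch only; uses the same APIs as the snippets above (Spark Streaming Java API,
// Optional with or()/of()/absent(), java.time.LocalDateTime, java.util collections).
static class StateValue implements java.io.Serializable {
    String objectName;
    int changes;
    int size;
    LocalDateTime eventTime;                 // when this value was observed (assumed field)
    private boolean aggregateReady = false;  // true once the minute is complete and ready to flush

    boolean isAggregateReady()  { return aggregateReady; }
    boolean getAggregateReady() { return aggregateReady; }
    void setAggregateReady(boolean ready) { this.aggregateReady = ready; }
}

// Update function for updateStateByKey over a JavaPairDStream<Key, StateValue>.
Function2<List<StateValue>, Optional<List<StateValue>>, Optional<List<StateValue>>> updateFunc =
        (newValues, previousState) -> {
    List<StateValue> state = previousState.or(new ArrayList<>());

    // 1. Drop whatever was already flushed by the output operation in the previous batch.
    state.removeIf(StateValue::isAggregateReady);

    // 2. Add the values that arrived in this batch.
    state.addAll(newValues);

    // 3. Nothing left and nothing new: eliminate this key from the saved state.
    if (state.isEmpty()) {
        return Optional.absent();
    }

    // 4. If every value belongs to an already finished minute, mark the group as ready;
    //    the filter + foreachRDD below persist it, and step 1 removes it on the next batch.
    LocalDateTime currentMinute = LocalDateTime.now().withSecond(0).withNano(0);
    if (state.stream().allMatch(v -> v.eventTime.isBefore(currentMinute))) {
        state.forEach(v -> v.setAggregateReady(true));
    }
    return Optional.of(state);
};

// Wiring, following the same shape as the code in the first answer
// ("pairs" is an assumed JavaPairDStream<Key, StateValue>):
JavaPairDStream<Key, List<StateValue>> stateStream = pairs.updateStateByKey(updateFunc);
JavaPairDStream<Key, List<StateValue>> ready =
        stateStream.filter(t -> !t._2().isEmpty() && t._2().get(0).getAggregateReady());
ready.foreachRDD(rdd -> {
    if (rdd.count() > 0) {
        // save to a DB or the file system here, e.g. rdd.saveAsTextFile(...)
    }
});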