Question

I receive a data stream (a live nginx log) from a UDP socket. The data looks like this:

date                | ip       | mac   | objectName | rate | size
2016-04-05 11:17:34 | 10.0.0.1 | e1:e2 | book1      | 10   | 121
2016-04-05 11:17:34 | 10.0.0.2 | a5:a8 | book2351   | 8    | 2342
2016-04-05 11:17:34 | 10.0.0.3 | d1:b56| bookA5     | 10   | 12

2016-04-05 11:17:35 | 10.0.0.1 | e1:e2 | book67     | 10   | 768
2016-04-05 11:17:35 | 10.0.0.2 | a5:a8 | book2351   | 8    | 897
2016-04-05 11:17:35 | 10.0.0.3 | d1:b56| bookA5     | 9    | 34
2016-04-05 11:17:35 | 10.0.0.4 | c7:c2 | book99     | 9    | 924
...
2016-04-05 11:18:01 | 10.0.0.1 | e1:e2 | book-10    | 8    | 547547
2016-04-05 11:18:17 | 10.0.0.4 | c7:c2 | book99     | 10   | 23423
2016-04-05 11:18:18 | 10.0.0.3 | d1:b56| bookA5     | 10   | 1138

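I need to:

aggregate the data, partitioned by minute: the key is (minute, ip, mac)
objectName: it can change within a minute, and I must take the first value; e.g. for 2016-04-05 11:17:34 | 10.0.0.1 | e1:e2, book1 changes to book67, so the result must be book1
changes: the number of times rate changes during the minute
size: the difference between sizes (previous time within the partition, current time within the partition), e.g. for 2016-04-05 11:17:34 | 10.0.0.1 | e1:e2 it is 768 - 121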
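
To make the four rules above concrete, here is a minimal plain-Java sketch (no Spark) that applies them to the 11:17 sample rows. The class names Rec and Agg and the string key are hypothetical, made up for illustration. It assumes records arrive in time order, that "size" means last size minus first size within the minute, and that a partition with a single record reports that record's size unchanged, as in the expected result shown in the question.

```java
import java.time.LocalDateTime;
import java.time.temporal.ChronoUnit;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class MinuteAggregationSketch {

    // Hypothetical holder for one parsed log line (same columns as the table above).
    static final class Rec {
        final LocalDateTime time;
        final String ip, mac, objectName;
        final int rate, size;

        Rec(String isoTime, String ip, String mac, String objectName, int rate, int size) {
            this.time = LocalDateTime.parse(isoTime);
            this.ip = ip;
            this.mac = mac;
            this.objectName = objectName;
            this.rate = rate;
            this.size = size;
        }
    }

    // Running aggregate for one (minute, ip, mac) partition.
    static final class Agg {
        final String objectName; // first objectName seen in the minute
        final int firstSize;
        int lastSize, lastRate, changes, count;

        Agg(Rec first) {
            objectName = first.objectName;
            firstSize = first.size;
            lastSize = first.size;
            lastRate = first.rate;
            count = 1;
        }

        void add(Rec r) {
            if (r.rate != lastRate) {
                changes++; // rate differs from the previous record in this minute
            }
            lastRate = r.rate;
            lastSize = r.size;
            count++;
        }

        // Assumption: a single record reports its own size, otherwise last - first.
        int size() {
            return count == 1 ? firstSize : lastSize - firstSize;
        }
    }

    // Groups records by (minute, ip, mac); assumes the input is in time order.
    static Map<String, Agg> aggregate(List<Rec> recs) {
        Map<String, Agg> out = new LinkedHashMap<>();
        for (Rec r : recs) {
            String key = r.time.truncatedTo(ChronoUnit.MINUTES) + "|" + r.ip + "|" + r.mac;
            Agg agg = out.get(key);
            if (agg == null) {
                out.put(key, new Agg(r));
            } else {
                agg.add(r);
            }
        }
        return out;
    }

    // The 11:17 rows from the sample data in the question.
    static List<Rec> sample() {
        return Arrays.asList(
            new Rec("2016-04-05T11:17:34", "10.0.0.1", "e1:e2", "book1", 10, 121),
            new Rec("2016-04-05T11:17:34", "10.0.0.2", "a5:a8", "book2351", 8, 2342),
            new Rec("2016-04-05T11:17:34", "10.0.0.3", "d1:b56", "bookA5", 10, 12),
            new Rec("2016-04-05T11:17:35", "10.0.0.1", "e1:e2", "book67", 10, 768),
            new Rec("2016-04-05T11:17:35", "10.0.0.2", "a5:a8", "book2351", 8, 897),
            new Rec("2016-04-05T11:17:35", "10.0.0.3", "d1:b56", "bookA5", 9, 34),
            new Rec("2016-04-05T11:17:35", "10.0.0.4", "c7:c2", "book99", 9, 924));
    }

    public static void main(String[] args) {
        aggregate(sample()).forEach((key, a) ->
            System.out.println(key + " | " + a.objectName + " | " + a.changes + " | " + a.size()));
    }
}
```

The same per-key logic is what a streaming job would have to apply to each (minute, ip, mac) group; the sketch only pins down the arithmetic, not how or when the result is flushed.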

So the result is (sizes left uncalculated):

date                | ip       | mac   | objectName | changes | size
2016-04-05 11:17:00 | 10.0.0.1 | e1:e2 | book1      | 0       | 768 - 121
2016-04-05 11:17:00 | 10.0.0.2 | a5:a8 | book2351   | 0       | 897 - 2342
2016-04-05 11:17:00 | 10.0.0.3 | d1:b56| bookA5     | 1       | 34 - 12    
2016-04-05 11:17:00 | 10.0.0.4 | c7:c2 | book99     | 0       | 924
...
2016-04-05 11:18:00 | 10.0.0.1 | e1:e2 | book-10    | 0       | 547547
2016-04-05 11:18:00 | 10.0.0.4 | c7:c2 | book99     | 0       | 23423
2016-04-05 11:18:00 | 10.0.0.3 | d1:b56| bookA5     | 0       | 1138

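Here is a snapshot of my code. I know about updateStateByKey and window, but I cannot figure out how to flush the data to a database or the file system once the period (minute) has changed: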

private static final Duration SLIDE_INTERVAL = Durations.seconds(10);
private static final String nginxLogHost = "localhost";
private static final int nginxLogPort = 9999;
// Constructors and getters are omitted for brevity.
// The classes are static so they can be instantiated from the static main method.
private static class Raw {
  LocalDateTime time; // full time with seconds
  String ip;
  String mac;
  String objectName;
  int rate;
  int size;
}
// Used as the grouping key, so it needs equals() and hashCode().
private static class Key {
  LocalDateTime time; // time with 00 seconds
  String ip;
  String mac;
}
private static class RawValue {
  LocalDateTime time; // full time with seconds
  String objectName;
  int rate;
  int size;
}
private static class Value {
  String objectName;
  int changes;
  int size;
}
public static void main(String[] args) {
    SparkConf conf = new SparkConf().setMaster("local[4]").setAppName("TestNginxLog");
    conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer");
    JavaStreamingContext jssc = new JavaStreamingContext(conf, SLIDE_INTERVAL);
    jssc.checkpoint("/tmp");
    JavaReceiverInputDStream<Raw> logRecords = jssc.receiverStream(new NginxUDPReceiver(nginxLogHost, nginxLogPort));
    // Key each record by (minute, ip, mac); the minute is the timestamp truncated to 00 seconds.
    PairFunction<Raw, Key, RawValue> pairFunction = rawLine -> {
        LocalDateTime time = rawLine.getDateTime();
        Key k = new Key(time.truncatedTo(ChronoUnit.MINUTES), rawLine.getIp(), rawLine.getMac());
        RawValue v = new RawValue(time, rawLine.getObjectName(), rawLine.getRate(), rawLine.getSize());
        return new Tuple2<>(k, v);
    };
    JavaPairDStream<Key, RawValue> logDStream = logRecords.mapToPair(pairFunction);

Answer 1

This is a partial answer; the problem is not fully solved yet. After mapToPair I use:

    // 1 key - N values (the values of logDStream are RawValue)
    JavaPairDStream<Key, Iterable<RawValue>> abonentConnects = logDStream.groupByKey();

    // Accumulate data: append the values of each batch to the saved state
    Function2<List<Iterable<RawValue>>, Optional<List<RawValue>>, Optional<List<RawValue>>> updateFunc = (values, previousState) -> {
        List<RawValue> sum = previousState.or(new ArrayList<>());
        for (Iterable<RawValue> v : values) {
            v.forEach(sum::add);
        }
        return Optional.of(sum);
    };
    JavaPairDStream<Key, List<RawValue>> state = abonentConnects.updateStateByKey(updateFunc);

    // filter data (keep only keys that belong to the previous minute)
    Function<Tuple2<Key, List<RawValue>>, Boolean> filterFunc = v1 -> {
        LocalDateTime previousTime = LocalDateTime.now().minusMinutes(1).withSecond(0).withNano(0);
        LocalDateTime valueTime = v1._1().getTime();
        return valueTime.equals(previousTime);
    };
    JavaPairDStream<Key, List<RawValue>> filteredRecords = state.filter(filterFunc);

    // save data
    filteredRecords.foreachRDD(x -> {
        if (x.count() > 0) {
            x.saveAsTextFile("/tmp/xxx/grouped/" + LocalDateTime.now().toString().replace(":", "-").replace(".", "-"));
        }
    });

    jssc.start();
    jssc.awaitTermination();

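The result data is produced, but because the operation runs every 5 seconds, I get the same data duplicated every 5 seconds.
I know I have to use Optional.absent() to clear the data saved in the stream. I tried to use it, but I could not combine both steps in one snippet: saving the data to the file system or a HashMap, and immediately clearing the saved data. Question: how can I do this?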
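
The duplication can be reproduced without Spark. The following plain-Java sketch (the class and method names are hypothetical) simulates batch times 5 seconds apart and counts how many batches pass the same "key minute equals previous minute" test used by the filter above. Because the state for a key is never cleared, every batch that falls in the following minute matches it again.

```java
import java.time.LocalDateTime;
import java.time.temporal.ChronoUnit;

public class DuplicateEmitSketch {

    // Counts the simulated batches in which the "previous minute" filter accepts keyMinute.
    static int matchingBatches(LocalDateTime keyMinute, LocalDateTime firstBatch,
                               int batchSeconds, int batches) {
        int matches = 0;
        for (int i = 0; i < batches; i++) {
            LocalDateTime now = firstBatch.plusSeconds((long) i * batchSeconds);
            LocalDateTime previousMinute = now.minusMinutes(1).truncatedTo(ChronoUnit.MINUTES);
            if (previousMinute.equals(keyMinute)) {
                matches++; // this batch would save the same key again
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        LocalDateTime keyMinute = LocalDateTime.parse("2016-04-05T11:17:00");
        LocalDateTime firstBatch = LocalDateTime.parse("2016-04-05T11:17:40");
        // 24 batches, 5 seconds apart: 11:17:40 through 11:19:35
        System.out.println(matchingBatches(keyMinute, firstBatch, 5, 24));
    }
}
```

With a 5-second batch interval, the twelve batches between 11:18:00 and 11:18:55 all accept the 11:17 key, which is why the saved output repeats until the minute rolls over.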

Answer 2

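So, I am closing this question with my own answer. You can use a function like the one described here as the argument to updateStateByKey. The key pieces in the code are: Optional.absent() to eliminate data that has already been saved, Optional.of(...) to group the data, and setAggregateReady(true). The last one is what lets you save the data to an external target (DB or file system): filter on getAggregateReady(true) and apply a Spark Streaming output operation such as foreachRDD. After that, the data falls into updateStateByKey again in the next batch and is removed by the following code: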

removeIf(T::isAggregateReady)
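
As a rough illustration of that flow, here is a plain-Java sketch of an update function with a "ready" flag. It is not the original answer's code: the classes Sample and Aggregate, the per-key list state, and the explicit now parameter are all assumptions made for the example, and java.util.Optional stands in for the Optional type that updateStateByKey uses (Optional.empty() plays the role of Optional.absent()).

```java
import java.time.LocalDateTime;
import java.time.temporal.ChronoUnit;
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

public class ReadyFlagStateSketch {

    // Hypothetical incoming value: a timestamp and a size.
    static final class Sample {
        final LocalDateTime time;
        final int size;
        Sample(LocalDateTime time, int size) { this.time = time; this.size = size; }
    }

    // Hypothetical per-minute aggregate carrying the "ready" flag.
    static final class Aggregate {
        final LocalDateTime minute;
        final int firstSize;
        int lastSize;
        private boolean aggregateReady;

        Aggregate(LocalDateTime minute, int firstSize) {
            this.minute = minute;
            this.firstSize = firstSize;
            this.lastSize = firstSize;
        }
        boolean isAggregateReady() { return aggregateReady; }
        void setAggregateReady(boolean ready) { this.aggregateReady = ready; }
    }

    // Same shape as an updateStateByKey function: (new values, old state) -> new state.
    // "now" is passed in explicitly so the sketch is deterministic.
    static Optional<List<Aggregate>> update(List<Sample> newValues,
                                            Optional<List<Aggregate>> previous,
                                            LocalDateTime now) {
        List<Aggregate> state = new ArrayList<>(previous.orElseGet(ArrayList::new));

        // 1. Aggregates flagged in the previous batch were already written out: drop them.
        state.removeIf(Aggregate::isAggregateReady);

        // 2. Fold the new values into the aggregate of their minute.
        for (Sample s : newValues) {
            LocalDateTime minute = s.time.truncatedTo(ChronoUnit.MINUTES);
            Aggregate agg = null;
            for (Aggregate a : state) {
                if (a.minute.equals(minute)) {
                    agg = a;
                }
            }
            if (agg == null) {
                agg = new Aggregate(minute, s.size);
                state.add(agg);
            }
            agg.lastSize = s.size;
        }

        // 3. Flag every aggregate whose minute is over; an output step can filter on the flag.
        LocalDateTime currentMinute = now.truncatedTo(ChronoUnit.MINUTES);
        for (Aggregate a : state) {
            if (a.minute.isBefore(currentMinute)) {
                a.setAggregateReady(true);
            }
        }

        // 4. An empty Optional removes the key from the state altogether.
        return state.isEmpty() ? Optional.empty() : Optional.of(state);
    }
}
```

In this arrangement a flagged aggregate is visible to the output operation for exactly one batch: it is flagged in the batch where its minute ends, saved by the filter plus foreachRDD step, and removed at the start of the next update, so nothing is written twice.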

Spark Streaming for time series processing (dividing data by time interval)
