Spark watermark and windowing in append mode

Time: 2018-11-23 17:35:12

Tags: apache-spark spark-structured-streaming

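The Structured Streaming code below watermarks the data and groups it into 24-hour windows that slide every 15 minutes. In append mode the code produces only an empty batch 0. In update mode the results are displayed correctly. Append mode is required because the S3 sink only works in append mode.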

String windowDuration = "24 hours";
String slideDuration = "15 minutes";
Dataset<Row> sliding24h = rowData
        .withWatermark(eventTimeCol, slideDuration)
        .groupBy(functions.window(col(eventTimeCol), windowDuration, slideDuration),
                col(nameCol)).count();

sliding24h
        .writeStream()
        .format("console")
        .option("truncate", false)
        .option("numRows", 1000)
        .outputMode(OutputMode.Append())
        //.outputMode(OutputMode.Complete())
        .start()
        .awaitTermination();
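
As a side note on what the window expression above does: with a 24-hour window and a 15-minute slide, every event belongs to 24 h / 15 min = 96 overlapping windows. The plain-Java sketch below needs no Spark and is only an illustration. The class and method names are hypothetical, and it assumes windows are aligned to the Unix epoch, which is the default when no start offset is passed to functions.window.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the original post: enumerates the sliding
// windows that contain a single event time (all values in epoch seconds).
public class SlidingWindowSketch {
    static final long WINDOW_SECONDS = 24 * 60 * 60; // "24 hours"
    static final long SLIDE_SECONDS = 15 * 60;       // "15 minutes"

    /** Start times of every window [start, start + WINDOW_SECONDS) that contains eventTime. */
    static List<Long> windowStarts(long eventTime) {
        List<Long> starts = new ArrayList<>();
        // Latest window start at or before the event, aligned to the slide interval.
        long lastStart = eventTime - Math.floorMod(eventTime, SLIDE_SECONDS);
        // Walk backwards while the window still covers the event (window end is exclusive).
        for (long start = lastStart; start > eventTime - WINDOW_SECONDS; start -= SLIDE_SECONDS) {
            starts.add(start);
        }
        return starts;
    }

    public static void main(String[] args) {
        long eventTime = 1512164314L; // an arbitrary sample timestamp
        List<Long> starts = windowStarts(eventTime);
        long earliestEnd = starts.get(starts.size() - 1) + WINDOW_SECONDS;
        System.out.println("windows per event:   " + starts.size());
        System.out.println("latest window start: " + starts.get(0));
        System.out.println("earliest window end: " + earliestEnd);
    }
}
```

Each of those 96 windows is a separate aggregation group, and in append mode each one is emitted on its own once it is considered final.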

Here is the complete test code:

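// Imports needed to compile the test (Spark 2.x Java API, current at the time of the post).
import static org.apache.spark.sql.functions.col;

import java.util.ArrayList;

import org.apache.spark.sql.Column;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.execution.streaming.MemoryStream;
import org.apache.spark.sql.functions;
import org.apache.spark.sql.streaming.OutputMode;
import org.apache.spark.sql.streaming.StreamingQueryException;

import scala.collection.JavaConversions;

// The class name is a placeholder; the original post shows only the main method.
public class WatermarkAppendTest {

    public static void main(String[] args) throws StreamingQueryException {
        SparkSession spark = SparkSession.builder().master("local[*]").getOrCreate();

        // 200 records of the form "<epoch seconds>,qwer", spaced 5 minutes apart
        // (roughly 16.6 hours of event time in total).
        ArrayList<String> rl = new ArrayList<>();
        for (int i = 0; i < 200; ++i) {
            long t = 1512164314L + i * 5 * 60;
            rl.add(t + ",qwer");
        }

        String nameCol = "name";
        String eventTimeCol = "eventTime";
        String eventTimestampCol = "eventTimestamp";

        // In-memory streaming source; all records are added before the query starts.
        MemoryStream<String> input = new MemoryStream<>(42, spark.sqlContext(), Encoders.STRING());
        input.addData(JavaConversions.asScalaBuffer(rl).toSeq());
        Dataset<Row> stream = input.toDF().selectExpr(
                "cast(split(value,'[,]')[0] as long) as " + eventTimestampCol,
                "cast(split(value,'[,]')[1] as String) as " + nameCol);

        System.out.println("isStreaming: " + stream.isStreaming());

        // Convert the epoch-seconds column into a timestamp column for event-time processing.
        Column eventTime = functions.to_timestamp(col(eventTimestampCol));
        Dataset<Row> rowData = stream.withColumn(eventTimeCol, eventTime);

        // 24-hour windows sliding every 15 minutes; the watermark delay is also 15 minutes.
        String windowDuration = "24 hours";
        String slideDuration = "15 minutes";
        Dataset<Row> sliding24h = rowData
                .withWatermark(eventTimeCol, slideDuration)
                .groupBy(functions.window(col(eventTimeCol), windowDuration, slideDuration),
                        col(nameCol)).count();

        sliding24h
                .writeStream()
                .format("console")
                .option("truncate", false)
                .option("numRows", 1000)
                .outputMode(OutputMode.Append())
                //.outputMode(OutputMode.Complete())
                .start()
                .awaitTermination();
    }
}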
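
To reason about what append mode can emit for this test data, it may help to work out the numbers. In append mode a windowed aggregate is written only after the watermark has passed the end of its window, and the watermark (maximum event time seen so far minus the delay) is updated at the end of a micro-batch, so it takes effect in the following batch. The plain-Java sketch below uses hypothetical names and needs no Spark. It reproduces the test timestamps and compares the watermark that would result from the first batch with the window ends, assuming epoch-aligned windows. It illustrates the documented semantics and is not a confirmed diagnosis of the problem.

```java
// Hypothetical helper, not part of the original post: arithmetic on the test
// data's event times (all values in epoch seconds).
public class WatermarkArithmeticSketch {
    static final long SLIDE_SECONDS = 15 * 60;
    static final long WATERMARK_DELAY_SECONDS = 15 * 60; // withWatermark(eventTimeCol, slideDuration)
    static final int RECORD_COUNT = 200;

    /** Event time of the i-th test record, as generated in the test's for loop. */
    static long eventTime(int i) {
        return 1512164314L + i * 5L * 60L;
    }

    /** Watermark computed once all records have been seen: max event time minus the delay. */
    static long watermarkAfterFirstBatch() {
        return eventTime(RECORD_COUNT - 1) - WATERMARK_DELAY_SECONDS;
    }

    /** End of the earliest non-empty window: the first slide boundary after the first event. */
    static long earliestWindowEnd() {
        long first = eventTime(0);
        return first - Math.floorMod(first, SLIDE_SECONDS) + SLIDE_SECONDS;
    }

    /** Number of non-empty windows whose end is at or before the watermark. */
    static long windowsAtOrBelowWatermark() {
        long watermark = watermarkAfterFirstBatch();
        long earliest = earliestWindowEnd();
        if (watermark < earliest) {
            return 0;
        }
        return (watermark - earliest) / SLIDE_SECONDS + 1;
    }

    public static void main(String[] args) {
        System.out.println("max event time:          " + eventTime(RECORD_COUNT - 1));
        System.out.println("watermark after batch 0: " + watermarkAfterFirstBatch());
        System.out.println("earliest window end:     " + earliestWindowEnd());
        System.out.println("windows below watermark: " + windowsAtOrBelowWatermark());
    }
}
```

Under these assumptions, the first batch itself runs before any watermark has been computed from the data, so no window can be finalized in it. The windows counted above would only be written by a later batch that runs with the updated watermark.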

1 answer:

Answer 0: (score: 1)