Windowing hourly (clock-aligned) data in Apache Beam

Time: 2018-06-06 07:38:51

Tags: apache apache-beam dataflow beam

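I am trying to aggregate streaming data per clock hour (for example 12:00 to 12:59 and 01:00 to 01:59) in a Dataflow / Apache Beam job.

Here is my use case.

The data streams in from Pub/Sub and carries a timestamp (the order date). I want the number of orders in each hour, and I also want to allow for data arriving up to 5 hours late. Below is the sample code I am using: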

    LOG.info("Start Running Pipeline");
    DataflowPipelineOptions options = PipelineOptionsFactory.fromArgs(args).withValidation().as(DataflowPipelineOptions.class);

    Pipeline pipeline = Pipeline.create(options);
    PCollection<String>  directShipmentFeedData = pipeline.apply("Get Direct Shipment Feed Data", PubsubIO.readStrings().fromSubscription(directShipmentFeedSubscription));
    PCollection<String>  tibcoRetailOrderConfirmationFeedData = pipeline.apply("Get Tibco Retail Order Confirmation Feed Data", PubsubIO.readStrings().fromSubscription(tibcoRetailOrderConfirmationFeedSubscription));

    PCollection<String> flattenData = PCollectionList.of(directShipmentFeedData).and(tibcoRetailOrderConfirmationFeedData)
            .apply("Flatten Data from PubSub", Flatten.<String>pCollections());

    flattenData
        .apply(ParDo.of(new DataParse())).setCoder(SerializableCoder.of(SalesAndUnits.class))

        // Adding Window

        .apply(
                Window.<SalesAndUnits>into(
                            SlidingWindows.of(Duration.standardMinutes(15))
                            .every(Duration.standardMinutes(1)))
                            )

        // Data Enrich with Dimensions
        .apply(ParDo.of(new DataEnrichWithDimentions()))

        // Group And Hourly Sum
        .apply(new GroupAndSumSales())

        .apply(ParDo.of(new SQLWrite())).setCoder(SerializableCoder.of(SalesAndUnits.class));
    pipeline.run();
    LOG.info("Finish Running Pipeline");

1 answer:

Answer 0: (score: 0)

The window I use matches your requirement.

Something like:
    // One-hour fixed windows; accept data arriving up to five hours late
    Window.<SalesAndUnits>into(
        FixedWindows.of(Duration.standardHours(1))
    ).withAllowedLateness(Duration.standardHours(5))

Possibly followed by a count, which is what I understand you need.
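To make the clock alignment concrete, here is a minimal plain-Java sketch (no Beam dependency) of how I understand one-hour fixed windows with the default offset to behave: each timestamp falls into the window that starts at the top of its hour in UTC, and the count is kept per window. The class `HourlyWindowSketch` and its methods are hypothetical names for illustration only, and `isAccepted` is a simplified model of allowed lateness, not Beam's actual implementation.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration class; not part of Apache Beam.
class HourlyWindowSketch {
    // Window size and allowed lateness taken from the snippet above.
    static final Duration WINDOW_SIZE = Duration.ofHours(1);
    static final Duration ALLOWED_LATENESS = Duration.ofHours(5);

    // Start of the epoch-aligned window (offset 0) that contains the timestamp.
    // With a one-hour size this is the top of the hour in UTC.
    static Instant windowStart(Instant timestamp) {
        long size = WINDOW_SIZE.toMillis();
        long millis = timestamp.toEpochMilli();
        return Instant.ofEpochMilli(millis - Math.floorMod(millis, size));
    }

    // Number of orders per hourly window, keyed by window start.
    static Map<Instant, Long> countPerHour(List<Instant> orderTimestamps) {
        Map<Instant, Long> counts = new TreeMap<>();
        for (Instant ts : orderTimestamps) {
            counts.merge(windowStart(ts), 1L, Long::sum);
        }
        return counts;
    }

    // Simplified assumption: an element is still accepted as long as the
    // watermark has not passed the end of its window plus the allowed lateness.
    static boolean isAccepted(Instant timestamp, Instant watermark) {
        Instant windowEnd = windowStart(timestamp).plus(WINDOW_SIZE);
        return !watermark.isAfter(windowEnd.plus(ALLOWED_LATENESS));
    }

    public static void main(String[] args) {
        List<Instant> orders = Arrays.asList(
            Instant.parse("2018-06-06T12:00:00Z"),
            Instant.parse("2018-06-06T12:59:59Z"),
            Instant.parse("2018-06-06T13:00:00Z"));
        // Print one line per hourly window: start, end, and order count.
        countPerHour(orders).forEach((start, n) ->
            System.out.println(start + " .. " + start.plus(WINDOW_SIZE) + " -> " + n));
    }
}
```

In the pipeline itself the counting would be done by a Beam transform applied after the window (for example `Count.perElement()`, or the existing `GroupAndSumSales`), not by hand as above. Depending on the Beam version, a non-zero allowed lateness may also require choosing an accumulation mode such as `.discardingFiredPanes()`; I have not verified that against your setup.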

Hope that helps.