Question

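I have a bunch of input streams that may send updates over time. When an update arrives, I need to compute the delta so that I can process it further. In short: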

Input: 10 -> State: 10, Output: 10
Input: 12 -> State: 12, Output:  2
Input:  5 -> State:  5, Output: -7
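
The three steps above can be traced with a minimal plain-Java sketch that has no Beam dependency. The class name DeltaSketch, the update method, and the key "a" are hypothetical and only for illustration. A HashMap stands in for the per-key state, and a missing previous value is treated as 0:

```java
import java.util.HashMap;
import java.util.Map;

// Plain-Java simulation of per-key delta tracking (not Beam code).
// DeltaSketch and update() are hypothetical names used for illustration.
class DeltaSketch {
    // Last value seen per key; plays the role of the per-key state.
    private final Map<String, Integer> lastValue = new HashMap<>();

    // Returns the change relative to the previous value for this key,
    // treating a missing previous value as 0, then stores the new value.
    int update(String key, int value) {
        int previous = lastValue.getOrDefault(key, 0);
        lastValue.put(key, value);
        return value - previous;
    }

    public static void main(String[] args) {
        DeltaSketch sketch = new DeltaSketch();
        System.out.println(sketch.update("a", 10)); // 10
        System.out.println(sketch.update("a", 12)); // 2
        System.out.println(sketch.update("a", 5));  // -7
    }
}
```

Each key keeps its own last value, so updates for a different key would start again from 0.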

I have read about stateful processing and timely processing to understand how to use this kind of state in an Apache Beam application, but what I don't know is:

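1. Is it 100% guaranteed that my stateful DoFn will never process elements with the same key in parallel?
2. I want to make sure the state survives when my application restarts or fails, so that it can start again with the correct initial value. How can I ensure that my DoFn "cleans up" (persists its state to a datastore) before it shuts down?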

For #2, I am wondering whether this would work when using the global window:

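import org.apache.beam.sdk.state.StateSpec;
import org.apache.beam.sdk.state.StateSpecs;
import org.apache.beam.sdk.state.TimeDomain;
import org.apache.beam.sdk.state.Timer;
import org.apache.beam.sdk.state.TimerSpec;
import org.apache.beam.sdk.state.TimerSpecs;
import org.apache.beam.sdk.state.ValueState;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.windowing.BoundedWindow;
import org.apache.beam.sdk.values.KV;

public class Delta extends DoFn<KV<String, Integer>, Integer> {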
    @StateId("state")
    private final StateSpec<ValueState<Integer>> stateSpec = StateSpecs.value();

    @TimerId("timer")
    private final TimerSpec timerSpec = TimerSpecs.timer(TimeDomain.EVENT_TIME);

    @ProcessElement
    public void process(ProcessContext context,
                        BoundedWindow window,
                        @StateId("state") ValueState<Integer> state,
                        @TimerId("timer") Timer myTimer) {
        // Assign the timer to the end of the current window, which is a global window
        // Not sure if this always triggers when the application stops...
        myTimer.set(window.maxTimestamp());

        int value = context.element().getValue();
        int acc = getOrInitialize(state.read());
        int delta = value - acc;
        state.write(value);
        context.output(delta);
    }

    @OnTimer("timer")
    public void onTimer(OnTimerContext context,
                        @StateId("state") ValueState<Integer> state) {
        // Persist value of state here
    }

    private int getOrInitialize(Integer a) {
        // Get initial value of state here
        return (a != null) ? a : 0;
    }
}

Answer 1

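1. Yes.
2. If you don't configure any BoundedWindow, I don't think your timer approach will work. @StartBundle / @Setup and @FinishBundle should be better places for restoring and checkpointing. I don't recommend @Teardown, because it is not guaranteed to be called.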

Answer 2

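1. Stateful processing is parallelized per key and window. Two elements with the same key but in different windows can be processed in parallel. Your elements are all in the global window, though, so in your case this amounts to per-key parallelism.
2. Using a timer is the right way to flush state. An event-time timer set to the end of the global window will not fire during normal operation of the pipeline, but it will fire in a "drain" scenario, in which all watermarks advance to infinity. Drain is a feature of Cloud Dataflow and has been proposed as a portable concept for Beam, but you should check whether your chosen runner has such a feature.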
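
To make point 1 concrete, here is a minimal plain-Java sketch of state scoped per key and window. It is not Beam code and says nothing about how a runner schedules work; the class WindowedDeltaSketch and the window labels "w1", "w2", and "global" are hypothetical. It only shows that each (key, window) pair has its own state cell, and that with a single global window this collapses to one cell per key:

```java
import java.util.HashMap;
import java.util.Map;

// Plain-Java illustration of state scoped per (key, window); not Beam code.
// WindowedDeltaSketch and the window labels are hypothetical.
class WindowedDeltaSketch {
    // Outer map: key. Inner map: window label -> last value in that window.
    private final Map<String, Map<String, Integer>> cells = new HashMap<>();

    // Computes the delta against the last value stored for this
    // (key, window) pair, treating a missing value as 0.
    int update(String key, String window, int value) {
        Map<String, Integer> perWindow =
            cells.computeIfAbsent(key, k -> new HashMap<>());
        int previous = perWindow.getOrDefault(window, 0);
        perWindow.put(window, value);
        return value - previous;
    }

    public static void main(String[] args) {
        WindowedDeltaSketch windowed = new WindowedDeltaSketch();
        System.out.println(windowed.update("k", "w1", 10)); // 10
        System.out.println(windowed.update("k", "w2", 12)); // 12, separate cell
        System.out.println(windowed.update("k", "w1", 5));  // -5

        WindowedDeltaSketch global = new WindowedDeltaSketch();
        System.out.println(global.update("k", "global", 10)); // 10
        System.out.println(global.update("k", "global", 12)); // 2
        System.out.println(global.update("k", "global", 5));  // -7
    }
}
```

Because the two windows in the first run never share a cell, nothing about the state itself forces their elements to be handled one after the other; with the global window every element for a key touches the same cell.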

Computing deltas in Apache Beam using stateful processing
