Writing with TextIO.write under Sessions windowing raises a "GroupByKey consumed" exception

Date: 2017-10-27 20:57:41

Tags: apache-beam

In Apache Beam 2.0.0, when I use session windows and write the results to files with TextIO.write, applying TextIO.write() throws the following exception:

java.lang.IllegalStateException: GroupByKey must have a valid Window merge function. Invalid because: WindowFn has already been consumed by previous GroupByKey

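The exception occurs even though there is no intervening GroupByKey that could have consumed the window. I have included the code below: the main function demonstrates the problem, and the listing also contains the helper filename policy class needed for 2.0.0.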

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.FileBasedSink;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.io.fs.ResolveOptions;
import org.apache.beam.sdk.io.fs.ResourceId;
import org.apache.beam.sdk.transforms.*;
import org.apache.beam.sdk.transforms.windowing.*;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.TimestampedValue;
import org.joda.time.Duration;
import org.joda.time.Instant;
import org.joda.time.format.DateTimeFormatter;
import org.joda.time.format.ISODateTimeFormat;


public class TestSessionWindowToFile {
    /**
     * Support class: a filename policy for getting one file per window.
     * See https://github.com/apache/beam/blob/release-2.0.0/examples/java/src/main/java/org/apache/beam/examples/common/WriteOneFilePerWindow.java
     */
    public static class LocalPerWindowFiles extends FileBasedSink.FilenamePolicy {
        private static final DateTimeFormatter FORMATTER = ISODateTimeFormat.hourMinute();
        private final String prefix;

        public LocalPerWindowFiles(String prefix) {
            this.prefix = prefix;
        }

        public String filenamePrefixForWindow(IntervalWindow window) {
            return String.format("%s-%s-%s",
                    prefix, FORMATTER.print(window.start()), FORMATTER.print(window.end()));
        }

        @Override
        public ResourceId windowedFilename(
                ResourceId outputDirectory, WindowedContext context, String extension) {
            IntervalWindow window = (IntervalWindow) context.getWindow();
            String filename = String.format(
                    "%s-%s-of-%s%s",
                    filenamePrefixForWindow(window), context.getShardNumber(), context.getNumShards(),
                    extension);
            return outputDirectory.resolve(filename, ResolveOptions.StandardResolveOptions.RESOLVE_FILE);
        }

        @Override
        public ResourceId unwindowedFilename(
                ResourceId outputDirectory, Context context, String extension) {
            throw new UnsupportedOperationException("Unsupported.");
        }
    }


    /**
     * Creating session windows and then asking TextIO to write the results leads to
     * "java.lang.IllegalStateException: GroupByKey must have a valid Window merge function.
     * Invalid because: WindowFn has already been consumed by previous GroupByKey"
     */
    public static void main(String[] args) {
        Pipeline p = Pipeline.create();

        PCollection<String> input = p.apply(
                Create.timestamped(
                        TimestampedValue.of("this", new Instant(1)),
                        TimestampedValue.of("is", new Instant(2)),
                        TimestampedValue.of("a", new Instant(3)),
                        TimestampedValue.of("test", new Instant(4)),
                        TimestampedValue.of("test", new Instant(5)),
                        TimestampedValue.of("test", new Instant(50)),
                        TimestampedValue.of("test", new Instant(51)),
                        TimestampedValue.of("test", new Instant(52))
                )
        );

        PCollection<String> windowedInputs = input
                // session windowing fails:
                .apply(Window.into(Sessions.withGapDuration(new Duration(10))));
                // sliding windowing succeeds:
                //.apply(Window.into(SlidingWindows.of(new Duration(30)).every(new Duration(10))));

        // Invoke counting of elements so that sessioning is more clear
        PCollection<KV<String, Long>> counts =
                windowedInputs.apply(Count.perElement());
        PCollection<String> writeableStrings = counts.apply("Convert to text",
                ParDo.of(new DoFn<KV<String, Long>, String>() {
                    @ProcessElement
                    public void processElement(ProcessContext c) {
                        String word = c.element().getKey();
                        Long count = c.element().getValue();
                        c.output(String.format("%s,%d", word, count));
                    }
                }));

        writeableStrings
                .apply(TextIO.write()
                        .to("i_am_ignored_given_filename_policy")
                        .withFilenamePolicy(new LocalPerWindowFiles("results/testSessionWindow"))
                        .withWindowedWrites()
                        .withNumShards(1)
        );
        p.run();
    }
}

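None of my attempts to clarify the watermark/triggering, the timestamp combiner, or Window.remerge() had any effect, and neither did switching to Beam 2.1.0 (which includes a default filename policy that knows how to write both windowed and unwindowed files).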

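Logging indicates that the sessions are constructed correctly, and SlidingWindows successfully produces output files (using variants such as .apply(Window.into(SlidingWindows.of(new Duration(30)).every(new Duration(10)))) in place of Sessions). This suggests either a misconfiguration or a misbehavior in the combination of Sessions windowing and TextIO.write.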
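For reference, the session grouping I expect from the test input can be sketched without Beam at all. This is only an illustration of the Sessions rule (each element gets a proto-window [t, t + gap), and overlapping windows for the same key are merged); the class and method names (SessionSketch, sessionsFor) are hypothetical and not part of the Beam API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SessionSketch {
    /**
     * Hypothetical helper: merges the proto-windows [t, t + gap) of one key's
     * timestamps into sessions and reports each as "[start,end) count=n".
     * Windows are half-open, so a timestamp equal to the current end
     * starts a new session rather than extending the old one.
     */
    public static List<String> sessionsFor(long[] timestamps, long gapMillis) {
        List<String> sessions = new ArrayList<>();
        if (timestamps.length == 0) {
            return sessions;
        }
        long[] sorted = timestamps.clone();
        Arrays.sort(sorted);

        long start = sorted[0];
        long end = start + gapMillis;
        long count = 1;
        for (int i = 1; i < sorted.length; i++) {
            long t = sorted[i];
            if (t < end) {
                // Proto-window [t, t + gap) overlaps the current session: merge.
                end = Math.max(end, t + gapMillis);
                count++;
            } else {
                // No overlap: close the current session and open a new one.
                sessions.add(String.format("[%d,%d) count=%d", start, end, count));
                start = t;
                end = t + gapMillis;
                count = 1;
            }
        }
        sessions.add(String.format("[%d,%d) count=%d", start, end, count));
        return sessions;
    }

    public static void main(String[] args) {
        // Timestamps of the key "test" in the pipeline above, gap of 10 ms.
        long[] testTimestamps = {4, 5, 50, 51, 52};
        for (String session : sessionsFor(testTimestamps, 10)) {
            System.out.println("test " + session);
        }
        // prints:
        // test [4,15) count=2
        // test [50,62) count=3
    }
}
```

Under this reading, the key "test" should yield two output files (one per session), while "this", "is" and "a" each form a single one-element session.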

How can I modify this code to write one text file per key + window start + window end grouping?

1 answer:

Answer 0 (score: 1)

This is a bug in the WriteFiles transform. I filed https://issues.apache.org/jira/browse/BEAM-3122. Unfortunately, I can't think of a workaround other than fixing the bug.