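When using session windows together with TextIO.write in Apache Beam 2.0.0, applying TextIO.write() throws the following exception:
java.lang.IllegalStateException: GroupByKey must have a valid Window merge function. Invalid because: WindowFn has already been consumed by previous GroupByKey
The exception occurs even though there is no intervening GroupByKey that might consume the windows. I have included the code below: the main method demonstrates the problem, and the file also contains the helper filename policy class for 2.0.0.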
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.FileBasedSink;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.io.fs.ResolveOptions;
import org.apache.beam.sdk.io.fs.ResourceId;
import org.apache.beam.sdk.transforms.*;
import org.apache.beam.sdk.transforms.windowing.*;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.TimestampedValue;
import org.joda.time.Duration;
import org.joda.time.Instant;
import org.joda.time.format.DateTimeFormatter;
import org.joda.time.format.ISODateTimeFormat;
public class TestSessionWindowToFile {
/**
* Support class: a filename policy for getting one file per window.
* See https://github.com/apache/beam/blob/release-2.0.0/examples/java/src/main/java/org/apache/beam/examples/common/WriteOneFilePerWindow.java
*/
public static class LocalPerWindowFiles extends FileBasedSink.FilenamePolicy {
private static final DateTimeFormatter FORMATTER = ISODateTimeFormat.hourMinute();
private final String prefix;
public LocalPerWindowFiles(String prefix) {
this.prefix = prefix;
}
public String filenamePrefixForWindow(IntervalWindow window) {
return String.format("%s-%s-%s",
prefix, FORMATTER.print(window.start()), FORMATTER.print(window.end()));
}
@Override
public ResourceId windowedFilename(
ResourceId outputDirectory, WindowedContext context, String extension) {
IntervalWindow window = (IntervalWindow) context.getWindow();
String filename = String.format(
"%s-%s-of-%s%s",
filenamePrefixForWindow(window), context.getShardNumber(), context.getNumShards(),
extension);
return outputDirectory.resolve(filename, ResolveOptions.StandardResolveOptions.RESOLVE_FILE);
}
@Override
public ResourceId unwindowedFilename(
ResourceId outputDirectory, Context context, String extension) {
throw new UnsupportedOperationException("Unsupported.");
}
}
/**
* Creating session windows and then asking TextIO to write the results leads to
* "java.lang.IllegalStateException: GroupByKey must have a valid Window merge function.
* Invalid because: WindowFn has already been consumed by previous GroupByKey"
*/
public static void main(String[] args) {
Pipeline p = Pipeline.create();
PCollection<String> input = p.apply(
Create.timestamped(
TimestampedValue.of("this", new Instant(1)),
TimestampedValue.of("is", new Instant(2)),
TimestampedValue.of("a", new Instant(3)),
TimestampedValue.of("test", new Instant(4)),
TimestampedValue.of("test", new Instant(5)),
TimestampedValue.of("test", new Instant(50)),
TimestampedValue.of("test", new Instant(51)),
TimestampedValue.of("test", new Instant(52))
)
);
PCollection<String> windowedInputs = input
// session windowing fails:
.apply(Window.into(Sessions.withGapDuration(new org.joda.time.Duration(10))));
// sliding windowing succeeds:
//.apply(Window.into(SlidingWindows.of(new Duration(30)).every(new Duration(10))));
// Invoke counting of elements so that sessioning is more clear
PCollection<KV<String, Long>> counts =
windowedInputs.apply(Count.perElement());
PCollection<String> writeableStrings = counts.apply("Convert to text",
ParDo.of(new DoFn<KV<String, Long>, String>() {
@ProcessElement
public void processElement(ProcessContext c) {
String word = c.element().getKey();
Long count = c.element().getValue();
c.output(String.format("%s,%d", word, count));
}
}));
writeableStrings
.apply(TextIO.write()
.to("i_am_ignored_given_filename_policy")
.withFilenamePolicy(new LocalPerWindowFiles("results/testSessionWindow"))
.withWindowedWrites()
.withNumShards(1)
);
p.run();
}
}
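Specifying the watermark/triggering explicitly, setting a timestamp combiner, and applying Window.remerge() all have no effect, nor does switching to Beam 2.1.0 (which includes a default filename policy that knows how to write windowed as well as unwindowed files).
Logging indicates that the sessions are built correctly, and SlidingWindows successfully produces output files (using variants such as .apply(Window.into(SlidingWindows.of(new Duration(30)).every(new Duration(10)))) in place of Sessions). This suggests either a misconfiguration or incorrect behavior of Sessions windowing + TextIO.write.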
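For reference, the sessions I expect from the sample input can be worked out without Beam. The following is only a plain-Java sketch of the session rule as I understand it (each element gets a proto-window [timestamp, timestamp + gap), and overlapping windows for the same key are merged). The class and method names are hypothetical, and it assumes the timestamps for each key arrive in ascending order; it is not Beam's implementation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of Beam: simulates per-key session merging.
class SessionMergeSketch {

    /**
     * Returns, for each key, its merged sessions as {start, end, count},
     * where end is exclusive. Assumes each key's timestamps are ascending.
     */
    static Map<String, List<long[]>> sessions(String[] keys, long[] timestamps, long gapMillis) {
        Map<String, List<long[]>> result = new TreeMap<>();
        for (int i = 0; i < keys.length; i++) {
            List<long[]> windows = result.computeIfAbsent(keys[i], k -> new ArrayList<>());
            long start = timestamps[i];
            long end = start + gapMillis; // proto-window [start, start + gap)
            long[] last = windows.isEmpty() ? null : windows.get(windows.size() - 1);
            if (last != null && start < last[1]) {
                // Overlaps the previous session for this key: extend it.
                last[1] = Math.max(last[1], end);
                last[2]++;
            } else {
                // No overlap: start a new session.
                windows.add(new long[] {start, end, 1});
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Same elements and timestamps as the Create.timestamped(...) call above.
        String[] keys = {"this", "is", "a", "test", "test", "test", "test", "test"};
        long[] timestamps = {1, 2, 3, 4, 5, 50, 51, 52};
        sessions(keys, timestamps, 10).forEach((key, windows) -> {
            for (long[] w : windows) {
                System.out.printf("%s [%d, %d) count=%d%n", key, w[0], w[1], w[2]);
            }
        });
    }
}
```

With a gap of 10 ms, "test" should fall into two sessions, [4, 15) with 2 elements and [50, 62) with 3 elements, while "this", "is" and "a" each get a single one-element session. That is the grouping I would expect to see reflected in the output files.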
How can this code be modified to write one text file per key + window start + window end grouping?
Answer 0 (score: 1)
This is a bug in the WriteFiles transform. I have filed https://issues.apache.org/jira/browse/BEAM-3122. Unfortunately, I can't think of a workaround short of fixing the bug.