Creating a custom windowing function in Apache Beam

Time: 2018-09-12 16:59:06

Tags: google-cloud-dataflow apache-beam dataflow

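I have a Beam pipeline that starts by reading multiple text files. Each line in a file represents a row that the same pipeline later inserts into Bigtable. The scenario requires confirming that the number of rows extracted from each file matches the number of rows later inserted into Bigtable. To do this, I plan to develop a custom windowing strategy in which the file name serves as a key passed to the windowing function, so that all lines from a single file are assigned to a single window.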

Is there a code example for creating a custom windowing function?

1 answer:

Answer 0 (score: 0)

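Although I ended up changing my strategy for confirming the number of inserted rows, here is the code for a custom windowing strategy, for anyone interested in windowing elements read from a batch source (for example, FileIO in a batch job):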

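import org.apache.beam.sdk.coders.Coder;
import org.apache.beam.sdk.transforms.windowing.IncompatibleWindowException;
import org.apache.beam.sdk.transforms.windowing.IntervalWindow;
import org.apache.beam.sdk.transforms.windowing.PartitioningWindowFn;
import org.apache.beam.sdk.transforms.windowing.WindowFn;
import org.joda.time.Instant;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Assigns each element to a one-millisecond window that starts at the element's timestamp,
// so elements that share a timestamp land in the same window.
public class FileWindows extends PartitioningWindowFn<Object, IntervalWindow> {

    private static final long serialVersionUID = -476922142925927415L;
    private static final Logger LOG = LoggerFactory.getLogger(FileWindows.class);

    @Override
    public IntervalWindow assignWindow(Instant timestamp) {
        // The window is the half-open interval [timestamp, timestamp + 1 ms).
        Instant end = new Instant(timestamp.getMillis() + 1);
        IntervalWindow interval = new IntervalWindow(timestamp, end);
        LOG.info("FileWindows >> assignWindow(): Window assigned with Start: {}, End: {}", timestamp, end);
        return interval;
    }

    @Override
    public boolean isCompatible(WindowFn<?, ?> other) {
        // The original compared with this.equals(other); since equals() is not overridden,
        // that is reference equality and would reject a second FileWindows instance.
        return other instanceof FileWindows;
    }

    @Override
    public void verifyCompatibility(WindowFn<?, ?> other) throws IncompatibleWindowException {
        if (!this.isCompatible(other)) {
            throw new IncompatibleWindowException(other, String.format("Only %s objects are compatible.", FileWindows.class.getSimpleName()));
        }
    }

    @Override
    public Coder<IntervalWindow> windowCoder() {
        return IntervalWindow.getCoder();
    }

}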

It can then be used in the pipeline as follows:

p
 .apply("Assign_Timestamp_to_Each_Message", ParDo.of(new AssignTimestampFn()))
 .apply("Assign_Window_to_Each_Message", Window.<KV<String,String>>into(new FileWindows())
  .withAllowedLateness(Duration.standardMinutes(1))
  .discardingFiredPanes());

Keep in mind that you need to write AssignTimestampFn() yourself so that every message carries a timestamp.
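The answer does not show AssignTimestampFn. One plausible reading is that every line from a given file receives the same timestamp, and different files receive different timestamps, so that FileWindows places each file in its own one-millisecond window. The following is only a plain-Java simulation of that idea, with no Beam dependency. The names FileWindowSketch, Window, timestampPerFile and countPerWindow are hypothetical, and using the file's position in the input list as its timestamp is an assumption rather than the author's method.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Plain-Java illustration (not Beam code) of one window per file. */
public class FileWindowSketch {

    /** Hypothetical stand-in for Beam's IntervalWindow: the range [startMillis, endMillis). */
    record Window(long startMillis, long endMillis) {}

    /** Mirrors FileWindows.assignWindow: the window spans exactly one millisecond. */
    static Window assignWindow(long timestampMillis) {
        return new Window(timestampMillis, timestampMillis + 1);
    }

    /**
     * Assumption: each distinct file name gets its own timestamp, here simply
     * its position (0, 1, 2, ...) in the order the files are first seen.
     */
    static Map<String, Long> timestampPerFile(List<String> fileNames) {
        Map<String, Long> timestamps = new LinkedHashMap<>();
        for (String name : fileNames) {
            timestamps.putIfAbsent(name, (long) timestamps.size());
        }
        return timestamps;
    }

    /**
     * Counts lines per window. Each entry is a {fileName, line} pair, and every
     * file name must be present in the timestamps map.
     */
    static Map<Window, Integer> countPerWindow(List<String[]> lines, Map<String, Long> timestamps) {
        Map<Window, Integer> counts = new LinkedHashMap<>();
        for (String[] entry : lines) {
            Window window = assignWindow(timestamps.get(entry[0]));
            counts.merge(window, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Long> timestamps = timestampPerFile(List.of("a.txt", "b.txt"));
        List<String[]> lines = List.of(
                new String[] {"a.txt", "row-1"},
                new String[] {"a.txt", "row-2"},
                new String[] {"b.txt", "row-3"});
        // Prints one line count per window, i.e. one per file.
        countPerWindow(lines, timestamps)
                .forEach((window, count) -> System.out.println(window + " -> " + count));
    }
}
```

In a real pipeline the per-window count would come from a Beam aggregation applied after the windowing step, and the timestamp would be attached inside the DoFn. The sketch only shows why a shared per-file timestamp is enough for a one-millisecond window to isolate a file.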