Google Dataflow to Google Cloud Storage with windowed dynamic destinations

Time: 2018-08-07 10:08:17

Tags: google-cloud-dataflow apache-beam

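I am trying to archive all events from various Google Pub/Sub topics into Google Cloud Storage. I currently have 10 topics, and that number is growing quickly.

I chose Google Dataflow for its scalability and its integration with other Google services.

So far I have a Dataflow pipeline that consumes all of the topics.

When I write to a single output location, I can use windowing and the files are written out successfully.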
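
To make the windowing step concrete: a fixed window of a given size assigns each timestamp to the half-open interval [start, start + size) whose start is a multiple of the size. The following is a minimal plain-Java sketch of that arithmetic, assuming windows aligned to the Unix epoch with no offset. `FixedWindowSketch` and its methods are hypothetical names used only for illustration; they are not part of the Beam API.

```java
import java.time.Duration;
import java.time.Instant;

// Plain-Java illustration of fixed-window assignment; this is not Beam's implementation.
final class FixedWindowSketch {

    // Returns the start of the epoch-aligned fixed window that contains the timestamp.
    static Instant windowStart(Instant timestamp, Duration size) {
        long sizeMillis = size.toMillis();
        long ts = timestamp.toEpochMilli();
        // floorMod keeps the result correct for timestamps before the epoch.
        return Instant.ofEpochMilli(ts - Math.floorMod(ts, sizeMillis));
    }

    // The window is half-open: [start, start + size).
    static Instant windowEnd(Instant timestamp, Duration size) {
        return windowStart(timestamp, size).plus(size);
    }
}
```

With a 5-minute window, for example, the timestamp 2018-08-07T10:08:17Z falls into the window that starts at 10:05:00Z and ends at 10:10:00Z.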

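I am now trying to write the messages to different subfolders depending on the topic each message came from (that information is available in the message).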
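
The routing idea can be sketched in plain Java: map a topic name to a subfolder under the output directory. This is only a sketch under stated assumptions. It assumes the topic is available as a full Pub/Sub name of the form `projects/<project>/topics/<name>`; how the topic is read out of the message is not shown here, so the sketch takes the topic name as input. `TopicFolderSketch`, `folderFor`, `outputPrefix`, and the `unknown-topic` fallback are hypothetical names.

```java
// Plain-Java illustration of topic-to-subfolder routing; all names here are hypothetical.
final class TopicFolderSketch {

    // Fallback folder for messages whose topic cannot be determined (an assumption).
    static final String DEFAULT_FOLDER = "unknown-topic";

    // Derives a folder name from a topic such as "projects/my-project/topics/orders".
    static String folderFor(String topicName) {
        if (topicName == null || topicName.isEmpty()) {
            return DEFAULT_FOLDER;
        }
        int slash = topicName.lastIndexOf('/');
        String shortName = slash >= 0 ? topicName.substring(slash + 1) : topicName;
        // Replace anything outside a conservative character set with an underscore.
        String safe = shortName.replaceAll("[^A-Za-z0-9._-]", "_");
        return safe.isEmpty() ? DEFAULT_FOLDER : safe;
    }

    // Joins the output directory and the per-topic folder into a path prefix.
    static String outputPrefix(String outputDirectory, String topicName) {
        String base = outputDirectory.endsWith("/") ? outputDirectory : outputDirectory + "/";
        return base + folderFor(topicName) + "/";
    }
}
```

A value like the one returned by `folderFor` is the kind of destination key that `getDestination` could return, with the filename policy for that destination then writing under the matching prefix.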

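When I debug the pipeline, it correctly enters the getDestination method, but it never seems to enter getFilenamePolicy, and so nothing ever shows up in my Google Cloud Storage bucket.

Am I missing something? Should I take a different approach?

I realize I could solve this with a separate Dataflow job per topic, but I think that would be hard to maintain given the number of topics.

Pipeline code: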

PCollectionList.of(pcollections).apply(Flatten.pCollections())
    .apply(
            options.getWindowDuration() + " Window",
            Window.into(FixedWindows.of(DurationUtils.parseDuration(options.getWindowDuration()))))
    // Apply windowed file writes. Use a NestedValueProvider because the filename
    // policy requires a resourceId generated from the input value at runtime.
    .apply(
            "Write File(s)",
            TextIO.write().withWindowedWrites()
                    .withNumShards(options.getNumShards())
                    .to(
                            new DynamicWindowedFilenamePolicy(
                                    options.getOutputDirectory(),
                                    options.getOutputFilenamePrefix(),
                                    options.getOutputShardTemplate(),
                                    options.getOutputFilenameSuffix()))
                    .withTempDirectory(NestedValueProvider.of(
                            options.getOutputDirectory(),
                            (SerializableFunction<String, ResourceId>) input ->
                                    FileBasedSink.convertToFileResourceIfPossible(input))));

The DynamicWindowedFilenamePolicy class:

public class DynamicWindowedFilenamePolicy extends FileBasedSink.DynamicDestinations<String,String,String> {

private final ValueProvider<String> outputDirectory;
private final ValueProvider<String> outputFilenamePrefix;
private final ValueProvider<String> suffix;
private final ValueProvider<String> shardTemplate;

public DynamicWindowedFilenamePolicy(
        ValueProvider<String> outputDirectory,
        ValueProvider<String> outputFilenamePrefix,
        ValueProvider<String> shardTemplate,
        ValueProvider<String> suffix) {
    this.outputDirectory = outputDirectory;
    this.outputFilenamePrefix = outputFilenamePrefix;
    this.shardTemplate = shardTemplate;
    this.suffix = suffix;
}

public ResourceId windowedFilename(
        int shardNumber,
        int numShards,
        BoundedWindow window,
        PaneInfo paneInfo,
        OutputFileHints outputFileHints) {
...
}

private ResourceId resolveWithDateTemplates(
        ValueProvider<String> outputDirectoryStr, BoundedWindow window) {
...
}

@Override
public String formatRecord(String record) {
    return record;
}

@Override
public String getDestination(String element) {
    return "folder-determined-from-element";
}

@Override
public String getDefaultDestination() {
    return "default-destination";
}

@Override
public FilenamePolicy getFilenamePolicy(String destination) {
    return new FilenamePolicy() {
        @Override
        public ResourceId windowedFilename(int shardNumber, int numShards, BoundedWindow window, PaneInfo paneInfo, OutputFileHints outputFileHints) {
            // Qualify with the outer class; an unqualified call here would invoke this same method and recurse.
            return DynamicWindowedFilenamePolicy.this.windowedFilename(shardNumber, numShards, window, paneInfo, outputFileHints);
        }

        @Nullable
        @Override
        public ResourceId unwindowedFilename(int shardNumber, int numShards, OutputFileHints outputFileHints) {
            // The outer class defines no unwindowed variant to delegate to; only windowed writes are used.
            throw new UnsupportedOperationException("Only windowed writes are supported");
        }
    };
}

}
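
For the delegation inside `getFilenamePolicy`, one Java scoping rule matters: within an anonymous class, an unqualified call to a method that the anonymous class itself declares resolves to that inner method, so delegating to a same-named method of the enclosing class needs the `Outer.this.` qualifier. Below is a minimal plain-Java sketch of the rule. `Namer` and `OuterPolicy` are hypothetical stand-ins for illustration, not Beam types.

```java
// Plain-Java illustration of delegating from an anonymous class to its enclosing instance.
interface Namer {
    String name(int shard);
}

final class OuterPolicy {
    private final String prefix;

    OuterPolicy(String prefix) {
        this.prefix = prefix;
    }

    // Enclosing-class method that the anonymous Namer delegates to.
    String name(int shard) {
        return prefix + "-" + shard;
    }

    Namer delegating() {
        return new Namer() {
            @Override
            public String name(int shard) {
                // An unqualified name(shard) here would call this same method
                // again and recurse until a StackOverflowError is thrown.
                return OuterPolicy.this.name(shard);
            }
        };
    }
}
```

Calling `new OuterPolicy("events").delegating().name(3)` reaches the enclosing instance's method and returns `events-3`.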

0 answers:

No answers