Unable to write to a Google Storage bucket using Cloud Dataflow

Asked: 2019-09-25 10:44:08

Tags: google-cloud-platform google-cloud-storage google-cloud-dataflow google-cloud-pubsub

Basically, I am reading data from Pub/Sub and writing it to Google Cloud Storage. The code snippet is below.

public class WriteWindowedFile extends PTransform<PCollection<String>, PDone> {

    private String bucketLocation;

    private LogTypeEnum logTypeEnum;

    private int shards;

    public WriteWindowedFile(String bucketLocation, LogTypeEnum logTypeEnum, int shards) {
        this.bucketLocation = bucketLocation;
        this.logTypeEnum = logTypeEnum;
        this.shards = shards;
    }

    @Override
    public PDone expand(PCollection<String> input) {
        checkArgument(input.getWindowingStrategy().getWindowFn().windowCoder() == IntervalWindow.getCoder());

        ResourceId resource = FileBasedSink.convertToFileResourceIfPossible(bucketLocation);

        return input.apply(
                TextIO.write()
                .to(new FileStorageFileNamePolicy(logTypeEnum))
                .withTempDirectory(resource.getCurrentDirectory())
                .withWindowedWrites()
                .withNumShards(shards)
        );
    }
}
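For context, a minimal sketch of how this transform might be wired into a streaming pipeline is shown below. The subscription name, window size, bucket path and the LogTypeEnum constant are assumptions for illustration only, not taken from the question.

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.joda.time.Duration;

public class PubSubToGcsPipeline {

    public static void main(String[] args) {
        Pipeline pipeline = Pipeline.create(
                PipelineOptionsFactory.fromArgs(args).withValidation().create());

        pipeline
                // Read raw strings from a Pub/Sub subscription (placeholder name).
                .apply("ReadFromPubSub", PubsubIO.readStrings()
                        .fromSubscription("projects/my-project/subscriptions/my-subscription"))
                // Window the unbounded stream so that windowed writes are possible.
                .apply("FixedWindows", Window.into(FixedWindows.of(Duration.standardMinutes(5))))
                // The transform from the question; LogTypeEnum.ACCESS is a guessed constant.
                .apply("WriteWindowedFile",
                        new WriteWindowedFile("gs://my-bucket/output", LogTypeEnum.ACCESS, 1));

        pipeline.run();
    }
}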

The FilenamePolicy implementation is:

public class FileStorageFileNamePolicy extends FileBasedSink.FilenamePolicy {

    private static final long serialVersionUID = 1L;

    private static final Logger LOGGER = LoggerFactory.getLogger(FileStorageFileNamePolicy.class);

    private LogTypeEnum logTypeEnum;

    public FileStorageFileNamePolicy(LogTypeEnum logTypeEnum) {
        this.logTypeEnum = logTypeEnum;
    }

    @Override
    public ResourceId windowedFilename(int shardNumber,
                                       int numShards,
                                       BoundedWindow window,
                                       PaneInfo paneInfo,
                                       FileBasedSink.OutputFileHints outputFileHints) {
        IntervalWindow intervalWindow = (IntervalWindow) window;
        String startDate = intervalWindow.start().toString();
        String dateString = startDate.replace("T", CommonConstants.SPACE)
                .replaceAll(startDate.substring(startDate.indexOf('Z')), CommonConstants.EMPTY_STRING);
        try {
            startDate = DateUtil.getDateForFileStore(dateString, null);
        } catch (ParseException e) {
            LOGGER.error("Error converting date  : {}", e);
        }
        String filename = intervalWindow.start().toString() + ".txt";
        String dirName = startDate + CommonConstants.FORWARD_SLASH +
                logTypeEnum.getValue().toLowerCase() + CommonConstants.FORWARD_SLASH;
        LOGGER.info("Directory : {} and File Name : {}", dirName, filename);
        return FileBasedSink.convertToFileResourceIfPossible(filename).
                resolve(dirName, ResolveOptions.StandardResolveOptions.RESOLVE_DIRECTORY);
    }

    @Nullable
    @Override
    public ResourceId unwindowedFilename(
            int shardNumber, int numShards, FileBasedSink.OutputFileHints outputFileHints) {
        throw new UnsupportedOperationException("Unsupported");
    }
}

While writing to Google Cloud Storage I run into the following problem, even though I pass an actual directory path. When the directory is resolved inside the FileStorageFileNamePolicy class, the job fails with this stack trace:

Exception: java.lang.RuntimeException: org.apache.beam.sdk.util.UserCodeException: java.lang.IllegalStateException: Expected the path is a directory, but had [/2019-09-23T16:59:42.189Z.txt].
    at org.apache.beam.runners.dataflow.worker.GroupAlsoByWindowsParDoFn$1.output(GroupAlsoByWindowsParDoFn.java:184)
    at org.apache.beam.runners.dataflow.worker.GroupAlsoByWindowFnRunner$1.outputWindowedValue(GroupAlsoByWindowFnRunner.java:102)
    at org.apache.beam.runners.dataflow.worker.StreamingGroupAlsoByWindowReshuffleFn.processElement(StreamingGroupAlsoByWindowReshuffleFn.java:57)
    at org.apache.beam.runners.dataflow.worker.StreamingGroupAlsoByWindowReshuffleFn.processElement(StreamingGroupAlsoByWindowReshuffleFn.java:39)
    at org.apache.beam.runners.dataflow.worker.GroupAlsoByWindowFnRunner.invokeProcessElement(GroupAlsoByWindowFnRunner.java:115)
    at org.apache.beam.runners.dataflow.worker.GroupAlsoByWindowFnRunner.processElement(GroupAlsoByWindowFnRunner.java:73)
    at org.apache.beam.runners.dataflow.worker.GroupAlsoByWindowsParDoFn.processElement(GroupAlsoByWindowsParDoFn.java:134)
    at org.apache.beam.runners.dataflow.worker.util.common.worker.ParDoOperation.process(ParDoOperation.java:44)
    at org.apache.beam.runners.dataflow.worker.util.common.worker.OutputReceiver.process(OutputReceiver.java:49)
    at org.apache.beam.runners.dataflow.worker.util.common.worker.ReadOperation.runReadLoop(ReadOperation.java:201)
    at org.apache.beam.runners.dataflow.worker.util.common.worker.ReadOperation.start(ReadOperation.java:159)
    at org.apache.beam.runners.dataflow.worker.util.common.worker.MapTaskExecutor.execute(MapTaskExecutor.java:77)
    at org.apache.beam.runners.dataflow.worker.StreamingDataflowWorker.process(StreamingDataflowWorker.java:1295)
    at org.apache.beam.runners.dataflow.worker.StreamingDataflowWorker.access$1000(StreamingDataflowWorker.java:149)
    at org.apache.beam.runners.dataflow.worker.StreamingDataflowWorker$6.run(StreamingDataflowWorker.java:1028)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.beam.sdk.util.UserCodeException: java.lang.IllegalStateException: Expected the path is a directory, but had [/2019-09-23T16:59:42.189Z.txt].
    at org.apache.beam.sdk.util.UserCodeException.wrap(UserCodeException.java:34)
    at org.apache.beam.sdk.io.WriteFiles$FinalizeTempFileBundles$FinalizeFn$DoFnInvoker.invokeProcessElement(Unknown Source)
    at org.apache.beam.runners.dataflow.worker.repackaged.org.apache.beam.runners.core.SimpleDoFnRunner.invokeProcessElement(SimpleDoFnRunner.java:214)
    at org.apache.beam.runners.dataflow.worker.repackaged.org.apache.beam.runners.core.SimpleDoFnRunner.processElement(SimpleDoFnRunner.java:179)
    at org.apache.beam.runners.dataflow.worker.SimpleParDoFn.processElement(SimpleParDoFn.java:330)
    at org.apache.beam.runners.dataflow.worker.util.common.worker.ParDoOperation.process(ParDoOperation.java:44)
    at org.apache.beam.runners.dataflow.worker.util.common.worker.OutputReceiver.process(OutputReceiver.java:49)
    at org.apache.beam.runners.dataflow.worker.SimpleParDoFn$1.output(SimpleParDoFn.java:276)
    at org.apache.beam.runners.dataflow.worker.repackaged.org.apache.beam.runners.core.SimpleDoFnRunner.outputWindowedValue(SimpleDoFnRunner.java:248)
    at org.apache.beam.runners.dataflow.worker.repackaged.org.apache.beam.runners.core.SimpleDoFnRunner.access$700(SimpleDoFnRunner.java:74)
    at org.apache.beam.runners.dataflow.worker.repackaged.org.apache.beam.runners.core.SimpleDoFnRunner$DoFnProcessContext.output(SimpleDoFnRunner.java:560)
    at org.apache.beam.sdk.transforms.DoFnOutputReceivers$WindowedContextOutputReceiver.output(DoFnOutputReceivers.java:71)
    at org.apache.beam.sdk.transforms.MapElements$1.processElement(MapElements.java:139)
    at org.apache.beam.sdk.transforms.MapElements$1$DoFnInvoker.invokeProcessElement(Unknown Source)
    at org.apache.beam.runners.dataflow.worker.repackaged.org.apache.beam.runners.core.SimpleDoFnRunner.invokeProcessElement(SimpleDoFnRunner.java:214)
    at org.apache.beam.runners.dataflow.worker.repackaged.org.apache.beam.runners.core.SimpleDoFnRunner.processElement(SimpleDoFnRunner.java:179)
    at org.apache.beam.runners.dataflow.worker.SimpleParDoFn.processElement(SimpleParDoFn.java:330)
    at org.apache.beam.runners.dataflow.worker.util.common.worker.ParDoOperation.process(ParDoOperation.java:44)
    at org.apache.beam.runners.dataflow.worker.util.common.worker.OutputReceiver.process(OutputReceiver.java:49)
    at org.apache.beam.runners.dataflow.worker.SimpleParDoFn$1.output(SimpleParDoFn.java:276)
    at org.apache.beam.runners.dataflow.worker.repackaged.org.apache.beam.runners.core.SimpleDoFnRunner.outputWindowedValue(SimpleDoFnRunner.java:248)
    at org.apache.beam.runners.dataflow.worker.repackaged.org.apache.beam.runners.core.SimpleDoFnRunner.access$700(SimpleDoFnRunner.java:74)
    at org.apache.beam.runners.dataflow.worker.repackaged.org.apache.beam.runners.core.SimpleDoFnRunner$DoFnProcessContext.output(SimpleDoFnRunner.java:560)
    at org.apache.beam.runners.dataflow.worker.repackaged.org.apache.beam.runners.core.SimpleDoFnRunner$DoFnProcessContext.output(SimpleDoFnRunner.java:548)
    at org.apache.beam.runners.dataflow.ReshuffleOverrideFactory$ReshuffleWithOnlyTrigger$1.processElement(ReshuffleOverrideFactory.java:86)
    at org.apache.beam.runners.dataflow.ReshuffleOverrideFactory$ReshuffleWithOnlyTrigger$1$DoFnInvoker.invokeProcessElement(Unknown Source)
    at org.apache.beam.runners.dataflow.worker.repackaged.org.apache.beam.runners.core.SimpleDoFnRunner.invokeProcessElement(SimpleDoFnRunner.java:214)
    at org.apache.beam.runners.dataflow.worker.repackaged.org.apache.beam.runners.core.SimpleDoFnRunner.processElement(SimpleDoFnRunner.java:179)
    at org.apache.beam.runners.dataflow.worker.SimpleParDoFn.processElement(SimpleParDoFn.java:330)
    at org.apache.beam.runners.dataflow.worker.util.common.worker.ParDoOperation.process(ParDoOperation.java:44)
    at org.apache.beam.runners.dataflow.worker.util.common.worker.OutputReceiver.process(OutputReceiver.java:49)
    at org.apache.beam.runners.dataflow.worker.GroupAlsoByWindowsParDoFn$1.output(GroupAlsoByWindowsParDoFn.java:182)
    ... 17 more
Caused by: java.lang.IllegalStateException: Expected the path is a directory, but had [/2019-09-23T16:59:42.189Z.txt].
    at org.apache.beam.vendor.guava.v26_0_jre.com.google.common.base.Preconditions.checkState(Preconditions.java:588)
    at org.apache.beam.sdk.io.LocalResourceId.resolve(LocalResourceId.java:57)
    at org.apache.beam.sdk.io.LocalResourceId.resolve(LocalResourceId.java:36)
    at com.vuclip.dataflow.pipeline.helper.FileStorageFileNamePolicy.windowedFilename(FileStorageFileNamePolicy.java:54)
    at org.apache.beam.sdk.io.FileBasedSink$FileResult.getDestinationFile(FileBasedSink.java:1086)
    at org.apache.beam.sdk.io.FileBasedSink$WriteOperation.finalizeDestination(FileBasedSink.java:645)
    at org.apache.beam.sdk.io.WriteFiles.finalizeAllDestinations(WriteFiles.java:872)
    at org.apache.beam.sdk.io.WriteFiles.access$1600(WriteFiles.java:111)
    at org.apache.beam.sdk.io.WriteFiles$FinalizeTempFileBundles$FinalizeFn.process(WriteFiles.java:849)

Can anyone help? Thanks.

1 answer:

Answer 0 (score: 0):

Judging from the stack trace, it looks like you are passing the file name rather than the directory name ([/2019-09-23T16:59:42.189Z.txt]). If you are writing to a bucket on Google Cloud Storage, I would expect something like the following:

....
        return input.apply(
                TextIO.write()
                .to("gs://examplebucket/examplefolder/")
.....
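If the intent is to keep a custom FilenamePolicy rather than a plain path string, a minimal sketch of one possible shape is shown below, assuming the base output path is a gs:// directory. The class name GcsWindowedFileNamePolicy and the shard-suffix format are illustrative, not taken from the original post; the key change is to resolve the file name against a directory ResourceId with RESOLVE_FILE instead of calling resolve on a ResourceId that already points at a file.

import org.apache.beam.sdk.io.FileBasedSink;
import org.apache.beam.sdk.io.FileSystems;
import org.apache.beam.sdk.io.fs.ResolveOptions.StandardResolveOptions;
import org.apache.beam.sdk.io.fs.ResourceId;
import org.apache.beam.sdk.transforms.windowing.BoundedWindow;
import org.apache.beam.sdk.transforms.windowing.IntervalWindow;
import org.apache.beam.sdk.transforms.windowing.PaneInfo;

public class GcsWindowedFileNamePolicy extends FileBasedSink.FilenamePolicy {

    // e.g. "gs://examplebucket/examplefolder/" (the answer's example path)
    private final String baseDirectory;

    public GcsWindowedFileNamePolicy(String baseDirectory) {
        this.baseDirectory = baseDirectory;
    }

    @Override
    public ResourceId windowedFilename(int shardNumber,
                                       int numShards,
                                       BoundedWindow window,
                                       PaneInfo paneInfo,
                                       FileBasedSink.OutputFileHints outputFileHints) {
        IntervalWindow intervalWindow = (IntervalWindow) window;
        // Treat the base path as a directory, then resolve the file name under it,
        // so resolve() is never called on a ResourceId that is itself a file.
        ResourceId dir = FileSystems.matchNewResource(baseDirectory, true /* isDirectory */);
        String filename = intervalWindow.start().toString()
                + "-" + shardNumber + "-of-" + numShards + ".txt";
        return dir.resolve(filename, StandardResolveOptions.RESOLVE_FILE);
    }

    @Override
    public ResourceId unwindowedFilename(
            int shardNumber, int numShards, FileBasedSink.OutputFileHints outputFileHints) {
        throw new UnsupportedOperationException("Unsupported");
    }
}

With a policy shaped like this, TextIO.write().to(policy).withWindowedWrites().withTempDirectory(...) should finalize the output files directly under the gs:// directory instead of trying to treat the window-based file name as a directory.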