How do I write to multiple output paths with FileIO.writeDynamic() in Apache Beam 2.6?

Time: 2018-08-16 14:21:25

Tags: apache-beam apache-beam-io

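I am using Apache Beam 2.6 to read from a single Kafka topic and write the output to Google Cloud Storage (GCS). Now I want to change the pipeline so that it reads from multiple topics and writes each one to gs://bucket/topic/...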

When reading only one topic, I used TextIO in the last step of my pipeline:

TextIO.write()
    .to(
        new DateNamedFiles(
            String.format("gs://bucket/data%s/", suffix), currentMillisString))
    .withWindowedWrites()
    .withTempDirectory(
        FileBasedSink.convertToFileResourceIfPossible(
            String.format("gs://bucket/tmp%s/%s/", suffix, currentMillisString)))
    .withNumShards(1)

This is a similar question, and I tried to adapt its code:

FileIO.<EventType, Event>writeDynamic()
    .by(
        new SerializableFunction<Event, EventType>() {
          @Override
          public EventType apply(Event input) {
            return EventType.TRANSFER; // should return real type here, just a dummy
          }
        })
    .via(
        Contextful.fn(
            new SerializableFunction<Event, String>() {
              @Override
              public String apply(Event input) {
                return "Dummy"; // should return the Event converted to a String
              }
            }),
        TextIO.sink())
    .to(DynamicFileDestinations.constant(new DateNamedFiles("gs://bucket/tmp%s/%s/",
                                                            currentMillisString),
        new SerializableFunction<String, String>() {
          @Override
          public String apply(String input) {
            return null; // Not sure what this should return exactly, but it needs to
                         // include the EventType in the path
          }
        }))
    .withTempDirectory(
        FileBasedSink.convertToFileResourceIfPossible(
            String.format("gs://bucket/tmp%s/%s/", suffix, currentMillisString)))
    .withNumShards(1)

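The official JavaDoc contains example code that seems to have outdated method signatures (the .via method appears to have switched the order of its arguments). I also stumbled upon an example in FileIO that confused me: shouldn't TransactionType and Transaction in this line switch places?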

1 answer:

Answer 0: (score: 5)

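After a night's sleep and a fresh start, I figured out the solution. I used the Java 8 functional style, since it makes the code shorter (and more readable):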

  .apply(
    FileIO.<String, Event>writeDynamic()
        .by((SerializableFunction<Event, String>) input -> input.getTopic())
        .via(
            Contextful.fn(
                (SerializableFunction<Event, String>) input -> input.getPayload()),
            TextIO.sink())
        .to(String.format("gs://bucket/data%s/", suffix))
        .withNaming(type -> FileNaming.getNaming(type, "", currentMillisString))
        .withDestinationCoder(StringUtf8Coder.of())
        .withTempDirectory(
            String.format("gs://bucket/tmp%s/%s/", suffix, currentMillisString))
        .withNumShards(1));
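The snippet above relies on an Event type exposing getTopic() and getPayload(), whose definition is not shown. As a rough, Beam-free sketch, such a POJO might look like the class below; its fields and constructor are assumptions made for illustration. The hypothetical helper payloadsByTopic only mimics locally what .by(...) and .via(...) express: each element is routed by its topic, and only its payload is written.

```java
import java.io.Serializable;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Hypothetical POJO: the real Event class is not shown in the answer.
// It is Serializable because pipeline elements and functions get serialized.
class Event implements Serializable {
  private static final long serialVersionUID = 1L;

  private final String topic;
  private final String payload;

  Event(String topic, String payload) {
    this.topic = topic;
    this.payload = payload;
  }

  String getTopic() {
    return topic;
  }

  String getPayload() {
    return payload;
  }
}

class DynamicDestinationSketch {
  // Local stand-in for the routing idea: key = destination (topic),
  // value = the lines that would end up in that destination's files.
  static Map<String, List<String>> payloadsByTopic(List<Event> events) {
    return events.stream()
        .collect(
            Collectors.groupingBy(
                Event::getTopic,
                TreeMap::new,
                Collectors.mapping(Event::getPayload, Collectors.toList())));
  }

  public static void main(String[] args) {
    List<Event> events =
        Arrays.asList(new Event("a", "1"), new Event("b", "2"), new Event("a", "3"));
    System.out.println(payloadsByTopic(events)); // prints {a=[1, 3], b=[2]}
  }
}
```

In the actual pipeline the grouping is done by FileIO.writeDynamic() itself; this sketch is only meant to make the roles of the two lambdas easier to see.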

Explanation:

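  • Event is a Java POJO containing the payload of the Kafka message and the topic it belongs to; it is parsed in a ParDo step after the KafkaIO step
  • suffix is either dev or empty and is set via an environment variable
  • currentMillisString contains the timestamp at which the whole pipeline was started, so that new files do not overwrite old files on GCS when the pipeline is restarted
  • FileNaming implements a custom naming and receives the type (topic) of the event in its constructor; it uses a custom formatter to write into daily-partitioned "subfolders" on GCS: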

    class FileNaming implements FileIO.Write.FileNaming {
      static FileNaming getNaming(String topic, String suffix, String currentMillisString) {
        return new FileNaming(topic, suffix, currentMillisString);
      }
    
      private static final DateTimeFormatter FORMATTER = DateTimeFormat
          .forPattern("yyyy-MM-dd").withZone(DateTimeZone.forTimeZone(TimeZone.getTimeZone("Europe/Zurich")));
    
      private final String topic;
      private final String suffix;
      private final String currentMillisString;
    
      private String filenamePrefixForWindow(IntervalWindow window) {
        return String.format(
            "%s/%s/%s_", topic, FORMATTER.print(window.start()), currentMillisString);
      }
    
      private FileNaming(String topic, String suffix, String currentMillisString) {
        this.topic = topic;
        this.suffix = suffix;
        this.currentMillisString = currentMillisString;
      }
    
      @Override
      public String getFilename(
          BoundedWindow window,
          PaneInfo pane,
          int numShards,
          int shardIndex,
          Compression compression) {
    
        IntervalWindow intervalWindow = (IntervalWindow) window;
        String filenamePrefix = filenamePrefixForWindow(intervalWindow);
        String filename =
            String.format(
                "pane-%d-%s-%05d-of-%05d%s",
                pane.getIndex(),
                pane.getTiming().toString().toLowerCase(),
                shardIndex,
                numShards,
                suffix);
        String fullName = filenamePrefix + filename;
        return fullName;
      }
    }
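
To see what kind of relative path this naming yields, here is a small Beam-free sketch that reproduces the same two format strings with plain parameters. It swaps Joda-Time for java.time and replaces IntervalWindow and PaneInfo with an Instant, a long and a String, so the class name FileNamingSketch, the method signature and all sample values are assumptions for illustration only.

```java
import java.time.Instant;
import java.time.ZoneId;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

class FileNamingSketch {
  // Same pattern and time zone as FORMATTER above, expressed with java.time.
  private static final DateTimeFormatter FORMATTER =
      DateTimeFormatter.ofPattern("yyyy-MM-dd").withZone(ZoneId.of("Europe/Zurich"));

  // Plain-parameter stand-in for getFilename(window, pane, numShards, shardIndex, ...).
  static String filename(
      String topic,
      String suffix,
      String currentMillisString,
      Instant windowStart,
      long paneIndex,
      String paneTiming,
      int numShards,
      int shardIndex) {
    // "<topic>/<day>/<startMillis>_" gives one daily "subfolder" per topic.
    String prefix =
        String.format("%s/%s/%s_", topic, FORMATTER.format(windowStart), currentMillisString);
    String name =
        String.format(
            "pane-%d-%s-%05d-of-%05d%s",
            paneIndex, paneTiming.toLowerCase(Locale.ROOT), shardIndex, numShards, suffix);
    return prefix + name;
  }

  public static void main(String[] args) {
    // Sample values are made up for the demonstration.
    System.out.println(
        filename(
            "transfers", "", "1534429285000",
            Instant.parse("2018-08-16T12:00:00Z"), 0, "ON_TIME", 1, 0));
    // prints transfers/2018-08-16/1534429285000_pane-0-on_time-00000-of-00001
  }
}
```

Because the day is computed in Europe/Zurich, a window starting at 22:30 UTC already lands in the next day's folder. In the pipeline, the returned name is resolved against the directory given to .to(...), which is how the topic ends up as a path segment below gs://bucket/data.../.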