I am using Apache Beam 2.6 to read from a single Kafka topic and write the output to Google Cloud Storage (GCS). Now I want to change the pipeline so that it reads multiple topics and writes them as gs://bucket/topic/...
When reading only one topic, I used TextIO in the last step of the pipeline:
TextIO.write()
    .to(
        new DateNamedFiles(
            String.format("gs://bucket/data%s/", suffix), currentMillisString))
    .withWindowedWrites()
    .withTempDirectory(
        FileBasedSink.convertToFileResourceIfPossible(
            String.format("gs://bucket/tmp%s/%s/", suffix, currentMillisString)))
    .withNumShards(1));
This is a similar question, whose code I tried to adapt:
FileIO.<EventType, Event>writeDynamic()
    .by(
        new SerializableFunction<Event, EventType>() {
          @Override
          public EventType apply(Event input) {
            return EventType.TRANSFER; // should return real type here, just a dummy
          }
        })
    .via(
        Contextful.fn(
            new SerializableFunction<Event, String>() {
              @Override
              public String apply(Event input) {
                return "Dummy"; // should return the Event converted to a String
              }
            }),
        TextIO.sink())
    .to(
        DynamicFileDestinations.constant(
            new DateNamedFiles("gs://bucket/tmp%s/%s/", currentMillisString),
            new SerializableFunction<String, String>() {
              @Override
              public String apply(String input) {
                return null; // Not sure what this should be exactly, but it needs to
                             // include the EventType in the path
              }
            }))
    .withTempDirectory(
        FileBasedSink.convertToFileResourceIfPossible(
            String.format("gs://bucket/tmp%s/%s/", suffix, currentMillisString)))
    .withNumShards(1))
The official JavaDoc contains example code that seems to have outdated method signatures (the .via method appears to have switched the order of its arguments). I also stumbled upon an example in FileIO that confuses me: shouldn't TransactionType and Transaction in this line switch places?
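For reference, here is a minimal sketch of how the two type parameters of FileIO.writeDynamic() appear to line up, based on my reading of the FileIO API: the first parameter is the destination type (what .by(...) returns) and the second is the element type of the input PCollection. The LogLine class and the WriteDynamicSketch wrapper are hypothetical stand-ins, not part of the question's code.

import org.apache.beam.sdk.coders.StringUtf8Coder;
import org.apache.beam.sdk.io.FileIO;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.transforms.Contextful;
import org.apache.beam.sdk.transforms.SerializableFunction;

public class WriteDynamicSketch {

  // Hypothetical element type, standing in for Transaction/Event.
  static class LogLine implements java.io.Serializable {
    String topic;   // plays the role of TransactionType / EventType
    String payload; // the text that should end up in the file
  }

  // FileIO.<DestinationT, UserT>writeDynamic():
  //   DestinationT = String (the topic), UserT = LogLine (the elements being written).
  static FileIO.Write<String, LogLine> write() {
    return FileIO.<String, LogLine>writeDynamic()
        // .by(...) maps each element to its destination (UserT -> DestinationT).
        .by((SerializableFunction<LogLine, String>) line -> line.topic)
        // .via(outputFn, sink): outputFn converts UserT to the sink's element type.
        .via(
            Contextful.fn((SerializableFunction<LogLine, String>) line -> line.payload),
            TextIO.sink())
        .to("gs://bucket/data/") // base directory; a placeholder
        // .withNaming(...) receives the destination and builds a FileNaming for it.
        .withNaming(topic -> FileIO.Write.defaultNaming(topic, ".txt"))
        .withDestinationCoder(StringUtf8Coder.of())
        .withNumShards(1);
  }
}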
Answer (score: 5):
After a night of sleep and a fresh start, I figured out the solution myself. I used the Java 8 functional style because it makes the code shorter (and more readable):
.apply(
    FileIO.<String, Event>writeDynamic()
        .by((SerializableFunction<Event, String>) input -> input.getTopic())
        .via(
            Contextful.fn(
                (SerializableFunction<Event, String>) input -> input.getPayload()),
            TextIO.sink())
        .to(String.format("gs://bucket/data%s/", suffix))
        .withNaming(type -> FileNaming.getNaming(type, "", currentMillisString))
        .withDestinationCoder(StringUtf8Coder.of())
        .withTempDirectory(
            String.format("gs://bucket/tmp%s/%s/", suffix, currentMillisString))
        .withNumShards(1));
Explanation:

- Event is a Java POJO containing the payload of the Kafka message and the topic it belongs to; it is parsed in a ParDo step after the KafkaIO step (a sketch of that upstream part follows the FileNaming class below).
- suffix is either dev or empty and is set by an environment variable.
- currentMillisString contains the timestamp at which the whole pipeline was launched, so that new files do not overwrite old files on GCS when the pipeline gets restarted.
- FileNaming implements a custom naming and receives the type of the event (the topic) in its constructor; it uses a custom formatter to write into daily partitioned "sub-folders" on GCS:
class FileNaming implements FileIO.Write.FileNaming {

  static FileNaming getNaming(String topic, String suffix, String currentMillisString) {
    return new FileNaming(topic, suffix, currentMillisString);
  }

  private static final DateTimeFormatter FORMATTER = DateTimeFormat
      .forPattern("yyyy-MM-dd")
      .withZone(DateTimeZone.forTimeZone(TimeZone.getTimeZone("Europe/Zurich")));

  private final String topic;
  private final String suffix;
  private final String currentMillisString;

  private String filenamePrefixForWindow(IntervalWindow window) {
    return String.format(
        "%s/%s/%s_", topic, FORMATTER.print(window.start()), currentMillisString);
  }

  private FileNaming(String topic, String suffix, String currentMillisString) {
    this.topic = topic;
    this.suffix = suffix;
    this.currentMillisString = currentMillisString;
  }

  @Override
  public String getFilename(
      BoundedWindow window,
      PaneInfo pane,
      int numShards,
      int shardIndex,
      Compression compression) {
    IntervalWindow intervalWindow = (IntervalWindow) window;
    String filenamePrefix = filenamePrefixForWindow(intervalWindow);
    String filename =
        String.format(
            "pane-%d-%s-%05d-of-%05d%s",
            pane.getIndex(),
            pane.getTiming().toString().toLowerCase(),
            shardIndex,
            numShards,
            suffix);
    return filenamePrefix + filename;
  }
}
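For completeness, here is a rough sketch of how the upstream part of such a pipeline could be wired up: KafkaIO reading several topics, a ParDo turning each KafkaRecord into an Event, and a fixed window so that the cast to IntervalWindow in FileNaming.getFilename holds. Only getTopic() and getPayload() are implied by the write above; the rest of the Event class, the class names, the broker address, the topic list and the window duration are assumptions of this sketch.

import java.util.Arrays;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.coders.SerializableCoder;
import org.apache.beam.sdk.io.kafka.KafkaIO;
import org.apache.beam.sdk.io.kafka.KafkaRecord;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.PCollection;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.joda.time.Duration;

// Hypothetical POJO; only getTopic() and getPayload() are implied by the write above.
class Event implements java.io.Serializable {
  private final String topic;
  private final String payload;

  Event(String topic, String payload) {
    this.topic = topic;
    this.payload = payload;
  }

  public String getTopic() { return topic; }

  public String getPayload() { return payload; }
}

public class MultiTopicPipelineSketch {

  public static void main(String[] args) {
    Pipeline pipeline = Pipeline.create();

    PCollection<Event> events =
        pipeline
            .apply(
                "ReadFromKafka",
                KafkaIO.<String, String>read()
                    .withBootstrapServers("broker:9092") // placeholder address
                    .withTopics(Arrays.asList("topicA", "topicB")) // placeholder topic list
                    .withKeyDeserializer(StringDeserializer.class)
                    .withValueDeserializer(StringDeserializer.class))
            .apply(
                "ParseToEvent",
                ParDo.of(
                    new DoFn<KafkaRecord<String, String>, Event>() {
                      @ProcessElement
                      public void processElement(ProcessContext c) {
                        KafkaRecord<String, String> kafkaRecord = c.element();
                        // The KafkaRecord carries the topic it was read from.
                        c.output(new Event(kafkaRecord.getTopic(), kafkaRecord.getKV().getValue()));
                      }
                    }))
            .setCoder(SerializableCoder.of(Event.class))
            // A non-global window, so FileNaming can cast to IntervalWindow; the duration is a placeholder.
            .apply("Window", Window.<Event>into(FixedWindows.of(Duration.standardMinutes(5))));

    // events.apply(FileIO.<String, Event>writeDynamic() ... ) as shown in the answer above.

    pipeline.run();
  }
}

SerializableCoder is used here only because the hypothetical Event implements Serializable; in a real pipeline a more efficient coder (for example an Avro or custom coder) would be a reasonable choice.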