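I am trying to build a Google Dataflow pipeline that reads a Pub/Sub message containing a file name, reads that file from a Google Cloud Storage bucket line by line, and publishes each line to another topic.
My problem is that I cannot add the file name to the final output messages. Current implementation: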
ConnectorOptions options = PipelineOptionsFactory.fromArgs(args).withValidation().as(ConnectorOptions.class);
Pipeline p = Pipeline.create(options);
p.apply("ReadFromTopic", PubsubIO.readMessages().fromTopic(options.getInputTopic()))
    .apply("CollectFiles", ParDo.of(new DoFn<PubsubMessage, String>() {
        @ProcessElement
        public void processElement(ProcessContext c) {
            // The message payload carries the file name; decode it explicitly as UTF-8.
            String fileName = new String(c.element().getPayload(), StandardCharsets.UTF_8);
            c.output("gs://bucket-name/" + fileName);
        }
    }))
    // TextIO.readAll() emits only the lines, so the file name is lost at this step.
    .apply("ReadLines", TextIO.readAll())
    .apply("WriteItemsToTopic", PubsubIO.writeStrings().to(options.getOutputTopic()));
p.run().waitUntilFinish();
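I have seen a similar question before, but it is not really a working solution for me, because I need to attach the file name to every output message rather than just parse each line. Can anyone suggest a possible solution?
Update
Thanks @jkff, I followed your suggestion. Here is my current solution: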
ConnectorOptions options = PipelineOptionsFactory.fromArgs(args).withValidation().as(ConnectorOptions.class);
Pipeline p = Pipeline.create(options);
p.apply("ReadFromTopic", PubsubIO.readMessages().fromSubscription(options.getInputSubscription()))
    .apply("PrintMessages", ParDo.of(new DoFn<PubsubMessage, String>() {
        @ProcessElement
        public void processElement(ProcessContext c) {
            // The message payload carries the file name; decode it explicitly as UTF-8.
            String message = new String(c.element().getPayload(), StandardCharsets.UTF_8);
            c.output("gs://bucket/" + message);
        }
    }))
    .apply(FileIO.matchAll())
    .apply(FileIO.readMatches())
    .apply("ReadFile", ParDo.of(new DoFn<FileIO.ReadableFile, String>() {
        @ProcessElement
        public void processElement(ProcessContext c) throws IOException {
            FileIO.ReadableFile f = c.element();
            String filePath = f.getMetadata().resourceId().toString();
            String fileName = filePath.substring(filePath.lastIndexOf("/") + 1);
            // Read the file as text (UTF-8 assumed). readLine() accepts \n, \r and \r\n,
            // and also returns a last line that has no trailing terminator.
            try (BufferedReader reader = new BufferedReader(
                    new InputStreamReader(Channels.newInputStream(f.open()), StandardCharsets.UTF_8))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    // Append the file name to every line before publishing it.
                    c.output(line + " " + fileName);
                }
            }
        }
    }))
    .apply("WriteItemsToTopic", PubsubIO.writeStrings().to(options.getOutputTopic()));
p.run().waitUntilFinish();
Answer 0 (score: 2)
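You can use FileIO: apply FileIO.matchAll() followed by FileIO.readMatches() to get a PCollection<ReadableFile>, where each ReadableFile can be used both to obtain the file name and to read the file. Follow it with a DoFn that does whatever you need. To read the file, use standard Java library utilities on the channel returned by ReadableFile.open().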
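As a rough sketch of that last step, the channel returned by ReadableFile.open() can be wrapped in ordinary java.io readers. The snippet below uses only the JDK, so it runs without Beam on the classpath. FileLineTagger, fileNameOf and tagLines are hypothetical names introduced for illustration, the file is assumed to be UTF-8 text, and an in-memory channel stands in for the real file.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.channels.Channels;
import java.nio.channels.ReadableByteChannel;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not part of Beam): reads a channel line by line
// and appends the file name to every line.
final class FileLineTagger {

    // Returns the last path segment, e.g. "gs://bucket/dir/data.txt" -> "data.txt".
    static String fileNameOf(String path) {
        return path.substring(path.lastIndexOf('/') + 1);
    }

    // Reads text from the channel (UTF-8 is an assumption about the file encoding)
    // and returns one "<line> <fileName>" string per line.
    static List<String> tagLines(ReadableByteChannel channel, String fileName) throws IOException {
        List<String> tagged = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(Channels.newInputStream(channel), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                tagged.add(line + " " + fileName);
            }
        }
        return tagged;
    }

    public static void main(String[] args) throws IOException {
        // In-memory stand-in for the channel that ReadableFile.open() would return.
        byte[] content = "first\r\nsecond\nthird".getBytes(StandardCharsets.UTF_8);
        ReadableByteChannel channel = Channels.newChannel(new ByteArrayInputStream(content));
        for (String message : tagLines(channel, fileNameOf("gs://bucket/dir/data.txt"))) {
            System.out.println(message);
        }
    }
}
```

Inside a DoFn<FileIO.ReadableFile, String>, the same loop would take f.open() as the channel and call c.output(...) for each string instead of collecting them in a list. For large files, emitting each line as it is read avoids holding the whole file in memory.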