Question

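Let me simplify my case. I'm using Apache Beam 0.6.0. My final processed result is a PCollection<KV<String, String>>, and I want to write the values to different files according to their keys.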

For example, suppose the result consists of:

(key1, value1)
(key2, value2)
(key1, value3)
(key1, value4)

Then I want to write value1, value3 and value4 to key1.txt, and value2 to key2.txt.

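In my case:

The key set is determined while the pipeline is running, not when the pipeline is constructed.
The key set may be quite small, but the number of values for each key may be very large.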

Any ideas?

Answer 1

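Handily, I wrote a sample for exactly this case the other day.

This example is written in Dataflow 1.x style.

Basically, you group by key, and then you write each group with a custom transform that connects to Cloud Storage. The caveat is that the list of lines for each file shouldn't be massive: it has to fit in the memory of a single instance, although, given that you can run high-memory instances, that limit is fairly high.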

...
PCollection<KV<String, List<String>>> readyToWrite = groupedByFirstLetter
        .apply(Combine.perKey(AccumulatorOfWords.getCombineFn()));
readyToWrite.apply(
        new PTransformWriteToGCS("dataflow-experiment", TonyWordGrouper::derivePath));
...

And the transform that does most of the work is:

public class PTransformWriteToGCS
        extends PTransform<PCollection<KV<String, List<String>>>, PCollection<Void>> {

    private static final Logger LOG = Logging.getLogger(PTransformWriteToGCS.class);

    // Static, so the client is created once per JVM instead of being serialized with the transform.
    private static final Storage STORAGE = StorageOptions.getDefaultInstance().getService();

    private final String bucketName;

    // Maps a key to the object path inside the bucket.
    private final SerializableFunction<String, String> pathCreator;

    public PTransformWriteToGCS(final String bucketName,
            final SerializableFunction<String, String> pathCreator) {
        this.bucketName = bucketName;
        this.pathCreator = pathCreator;
    }

    @Override
    public PCollection<Void> apply(final PCollection<KV<String, List<String>>> input) {
        return input
                .apply(ParDo.of(new DoFn<KV<String, List<String>>, Void>() {

                    @Override
                    public void processElement(
                            final DoFn<KV<String, List<String>>, Void>.ProcessContext arg0)
                            throws Exception {
                        final String key = arg0.element().getKey();
                        final List<String> values = arg0.element().getValue();
                        // All values for the key are joined in memory before the upload.
                        final String toWrite = values.stream().collect(Collectors.joining("\n"));
                        final String path = pathCreator.apply(key);
                        BlobInfo blobInfo = BlobInfo.newBuilder(bucketName, path)
                                .setContentType(MimeTypes.TEXT)
                                .build();
                        LOG.info("blob writing to: {}", blobInfo);
                        // One object per key.
                        STORAGE.create(blobInfo, toWrite.getBytes(StandardCharsets.UTF_8));
                    }
                }));
    }
}

Answer 2

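Just write a loop in a ParDo function! In more detail: I had the same scenario today; the only difference is that in my case key = image_label and value = image_tf_record. Like you, I am trying to create separate TFRecord files, one per class, with each record file containing many images. However, I'm not sure whether there might be memory issues when the number of values per key is very high, as in your scenario. (Also, my code is in Python.)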

import apache_beam as beam
import tensorflow as tf


class WriteToSeparateTFRecordFiles(beam.DoFn):

    def __init__(self, outdir):
        self.outdir = outdir

    def process(self, element):
        # element is a (label, examples) pair produced by GroupByKey.
        l, image_list = element
        writer = tf.python_io.TFRecordWriter(self.outdir + "/tfr" + str(l) + '.tfrecord')
        for example in image_list:
            writer.write(example.SerializeToString())
        writer.close()

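Then, in your pipeline, right after the stage that produces the key-value pairs, add these two lines:

   (p
    | 'GroupByLabelId' >> beam.GroupByKey()
    # outdir is the output directory expected by the DoFn's constructor.
    | 'SaveToMultipleFiles' >> beam.ParDo(WriteToSeparateTFRecordFiles(outdir))
    )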
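
On the memory concern: the loop above writes one record at a time, so the writing step itself does not need all of a key's values in memory at once. Whether the runner materializes the group before handing it over is a separate matter and depends on the runner. Below is a minimal plain-Python sketch of that per-key write, with no Beam or TensorFlow dependency. The name write_group and the .txt naming are hypothetical, and the generator stands in for the iterable that GroupByKey passes to process:

```python
import os
import tempfile


def write_group(element, outdir):
    """Write one (key, values) pair to <outdir>/<key>.txt, one value per line.

    `values` may be any iterable, including a lazy one. It is consumed one
    item at a time, so the full list is never built inside this function.
    Returns the number of values written.
    """
    key, values = element
    path = os.path.join(outdir, str(key) + ".txt")
    count = 0
    with open(path, "w", encoding="utf-8") as out:
        for value in values:
            out.write(value + "\n")
            count += 1
    return count


if __name__ == "__main__":
    outdir = tempfile.mkdtemp()
    # A generator stands in for a large, lazily produced group of values.
    written = write_group(("key1", (f"value{i}" for i in (1, 3, 4))), outdir)
    print(written, sorted(os.listdir(outdir)))
```

In a DoFn, the body of process could call such a function with the element and the output directory.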

Answer 3

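In the Apache Beam 2.2 Java SDK, this is natively supported in TextIO and AvroIO, using TextIO.write().to(DynamicDestinations) and AvroIO.write().to(DynamicDestinations) respectively. See e.g. the to(DynamicDestinations) method of TextIO.Write.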

Update (2018): Prefer using FileIO.writeDynamic() together with TextIO.sink() and AvroIO.sink() instead of TextIO.write().to(DynamicDestinations) and AvroIO.write().to(DynamicDestinations).

Answer 4

You can use FileIO.writeDynamic() for this:

PCollection<KV<String, String>> readfile = (something you read..);

readfile.apply(FileIO.<String, KV<String, String>>writeDynamic()
    // Route each element to a destination named after its key.
    .by(KV::getKey)
    .withDestinationCoder(StringUtf8Coder.of())
    // Write only the value, as a line of text.
    .via(Contextful.fn(KV::getValue), TextIO.sink())
    .to("somefolder")
    // File names are prefixed with the key and end in .txt.
    .withNaming(key -> FileIO.Write.defaultNaming(key, ".txt")));

p.run();

Answer 5

Write the following lines in your ParDo class:

from apache_beam.io import filesystems

eventCSVFileWriter = filesystems.FileSystems.create(gcsFileName)
for record in list(Records):
    eventCSVFileWriter.write(record)
# Close the writer so the file is flushed and finalized.
eventCSVFileWriter.close()

If you need the full code, I can help with that too.
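
Here is a sketch of the same loop with two details made explicit, assuming (as I understand the API) that the object returned by FileSystems.create is a binary writer: each record is encoded to bytes with a trailing newline, and the writer is closed even if a write fails. write_records is a hypothetical helper, and a local file opened in 'wb' mode stands in for the Beam writer so that the snippet runs without Beam installed:

```python
import os
import tempfile


def write_records(writer, records):
    """Encode each record as one UTF-8 line, write it, then close the writer."""
    try:
        for record in records:
            writer.write((str(record) + "\n").encode("utf-8"))
    finally:
        # Always close, so partially written output is still flushed.
        writer.close()


if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "events.csv")
    # Stand-in for filesystems.FileSystems.create(gcsFileName).
    write_records(open(path, "wb"), ["a,1", "b,2"])
    with open(path, "rb") as f:
        print(f.read())
```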

How do I write to multiple files in Apache Beam?
