Dataflow streaming write to Google Cloud Storage based on element value

Time: 2017-07-18 21:11:17

Tags: google-cloud-dataflow apache-beam

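I am trying to build a Dataflow process that archives data by storing it in Google Cloud Storage. I have a PubSub stream of event data that contains a client_id and some metadata. The process should archive all incoming events, so it needs to be a streaming pipeline.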

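I would like to archive the events by putting each event I receive into a bucket path that looks like gs://archive/client_id/eventdata.json. Is this possible in Dataflow / Apache Beam? Specifically, can a different file name be assigned to each event in the PCollection?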
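To make the target layout concrete, here is a minimal plain-Java sketch of the mapping from a client_id to the archive path described above. It does not use any Beam API. The class name `ArchivePaths`, the method `destinationFor`, and the validation rules are assumptions made for illustration; only the `gs://archive/client_id/eventdata.json` shape comes from the question.

```java
final class ArchivePaths {
    // Root taken from the path pattern in the question.
    static final String ARCHIVE_ROOT = "gs://archive";

    /** Builds gs://archive/<client_id>/eventdata.json for one event's client_id. */
    static String destinationFor(String clientId) {
        // Assumed validation: reject ids that would change the directory layout.
        if (clientId == null || clientId.isEmpty()) {
            throw new IllegalArgumentException("client_id must be non-empty");
        }
        if (clientId.contains("/")) {
            throw new IllegalArgumentException("client_id must not contain '/'");
        }
        return ARCHIVE_ROOT + "/" + clientId + "/eventdata.json";
    }
}
```

Because the path depends only on a field of the element, it can be computed per event at runtime without knowing the set of clients when the pipeline is deployed.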

Edit: My code currently looks like this:

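public static class PerWindowFiles extends FileBasedSink.FilenamePolicy {

  private final String customerId;

  public PerWindowFiles(String customerId) {
    this.customerId = customerId;
  }

  @Override
  public ResourceId windowedFilename(ResourceId outputDirectory, WindowedContext context, String extension) {
    // NOTE: `bucket` is not declared anywhere in this snippet; it is presumably a
    // static field of the enclosing class.
    // The name ignores the window and shard information in `context`, so every
    // window and shard for a customer resolves to the same file.
    String filename = bucket + "/" + customerId;
    return outputDirectory.resolve(filename, ResolveOptions.StandardResolveOptions.RESOLVE_FILE);
  }

  @Override
  public ResourceId unwindowedFilename(
      ResourceId outputDirectory, Context context, String extension) {
    // Only windowed writes are used by this streaming pipeline.
    throw new UnsupportedOperationException("Unsupported.");
  }
}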


public static void main(String[] args) throws IOException {
  DataflowPipelineOptions options = PipelineOptionsFactory.fromArgs(args)
      .withValidation()
      .as(DataflowPipelineOptions.class);
  options.setRunner(DataflowRunner.class);
  options.setStreaming(true);
  Pipeline p = Pipeline.create(options);

  // Read raw messages from PubSub and convert them to Event objects.
  PCollection<Event> set = p.apply(PubsubIO.readStrings()
                                       .fromTopic("topic"))
      .apply(new ConvertToEvent());

  PCollection<KV<String, Event>> events = labelEvents(set);
  PCollection<KV<String, EventGroup>> sessions = groupEvents(events);

  // The customer list is fixed when the pipeline is constructed, so adding a
  // customer requires redeploying the job.
  String customers = System.getProperty("CUSTOMERS");
  JSONArray custList = new JSONArray(customers);
  for (Object cust : custList) {
    if (cust instanceof String) {
      String customerId = (String) cust;
      // One filter and one TextIO sink per customer.
      PCollection<KV<String, EventGroup>> custCol = sessions.apply(new FilterByCustomer(customerId));
      stringifyEvents(custCol)
          .apply(TextIO.write()
              .to("gs://archive/")
              .withFilenamePolicy(new PerWindowFiles(customerId))
              .withWindowedWrites()
              .withNumShards(3));
    } else {
      LOG.info("Failed to create TextIO: customerId was not String");
    }
  }

  p.run()
      .waitUntilFinish();
}

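This code is ugly because I have to redeploy every time a new client appears in order to save its data. I would prefer to be able to assign customer data to the appropriate bucket dynamically.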

1 Answer:

Answer 0: (score: 2)

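"Dynamic destinations", that is, choosing the file name based on the element being written, will be a new feature available in Beam 2.1.0, which has not been released yet.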
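The idea behind dynamic destinations can be illustrated outside of Beam. The following is a minimal plain-Java sketch, not the Beam 2.1.0 API: it computes a destination from each element and groups the payloads by that destination, which is roughly what a sink with per-element file names has to do. The `Event` class, `destinationFor`, and `route` are hypothetical names introduced for this example; the path layout is the one from the question. Later Beam releases also provide `FileIO.writeDynamic()` for this kind of per-element routing, but it is not used here.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class DynamicDestinationSketch {

    /** Hypothetical event type: a client id plus its JSON payload. */
    static final class Event {
        final String clientId;
        final String json;

        Event(String clientId, String json) {
            this.clientId = clientId;
            this.json = json;
        }
    }

    /** Destination chosen from the element itself, not from a fixed customer list. */
    static String destinationFor(Event event) {
        return "gs://archive/" + event.clientId + "/eventdata.json";
    }

    /** Groups payloads by destination, preserving first-seen order of destinations. */
    static Map<String, List<String>> route(List<Event> events) {
        Map<String, List<String>> byDestination = new LinkedHashMap<>();
        for (Event event : events) {
            byDestination
                .computeIfAbsent(destinationFor(event), d -> new ArrayList<>())
                .add(event.json);
        }
        return byDestination;
    }
}
```

With this shape, a previously unseen client_id simply produces a new key in the map, so no per-customer branch has to be added when a new client shows up.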