Question

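I am receiving messages through Pub/Sub. Each message should be stored as raw data in its own file in GCS, some processing should be run on the data, and the result should then be saved to BigQuery, with the file name included in the data.

The data should be visible in BQ immediately after it is received.

Example: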

data published to pubsub : {a:1, b:2} 
data saved to GCS file UUID: A1F432 
data processing :  {a:1, b:2} -> 
                   {a:11, b: 22} -> 
                   {fileName: A1F432, data: {a:11, b: 22}} 
data in BQ : {fileName: A1F432, data: {a:11, b: 22}}

The idea is that the processed data is stored in BQ together with a link to the raw data stored in GCS.

Here is my code.

public class BotPipline {

    public static void main(String[] args) {

        DataflowPipelineOptions options = PipelineOptionsFactory.as(DataflowPipelineOptions.class);
        options.setRunner(BlockingDataflowPipelineRunner.class);
        options.setProject(MY_PROJECT);
        options.setStagingLocation(MY_STAGING_LOCATION);
        options.setStreaming(true);

        Pipeline pipeline = Pipeline.create(options);

        PCollection<String> input = pipeline.apply(PubsubIO.Read.subscription(MY_SUBSCRIBTION));

        // Store the raw message in GCS (this is the step that fails on unbounded input).
        String uuid = ...;
        input.apply(TextIO.Write.to(MY_STORAGE_LOCATION + uuid));

        // Process the message and stream the result into BigQuery.
        input
            .apply(ParDo.of(new DoFn<String,String>(){..}).named("updateJsonAndInsertUUID"))
            .apply(convertToTableRow(...).named("convertJsonStringToTableRow"))
            .apply(BigQueryIO.Write.to(MY_BQ_TABLE).withSchema(tableSchema));

        pipeline.run();
    }
}

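My code does not run because writing unbounded collections is not supported in TextIO.Write. After some research, I found that I have a few options for solving this problem:

1. Creating a custom sink in Dataflow
2. Implementing the write to GCS as my own DoFn
3. Using the optional BoundedWindow

I have no idea how to get started. Can anyone provide code for one of the solutions above, or suggest a different solution that fits my case? (Please provide code.)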

Answer 1

The best option is #2: a simple DoFn that creates the files according to your data. Something like this:

    class CreateFileFn extends DoFn<String, Void> {
      @ProcessElement
      public void process(ProcessContext c) throws IOException {
        String filename = ...generate filename from element...;
        // try-with-resources closes the channel, which finalizes the file
        try (WritableByteChannel channel = FileSystems.create(
                FileSystems.matchNewResource(filename, false),
                "application/text-plain")) {
          OutputStream out = Channels.newOutputStream(channel);
          ...write the element to out...
        }
      }
    }
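To make the per-element flow from the question concrete (pick a unique name, write the raw message to its own file, then carry that name along with the processed data), here is a minimal sketch in plain Java. It is not Beam code: it uses java.nio against a local directory as a stand-in for GCS, and the names RawFileWriter, writeToOwnFile and withFileName are hypothetical helpers introduced only for illustration. Inside a real DoFn, the write would presumably go through FileSystems.create as in the answer above.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.UUID;

// Hypothetical helper, not part of any SDK: models what a DoFn would do per element.
class RawFileWriter {

    // Writes one raw message to its own file under dir and returns the generated file name.
    // Assumption: a random UUID is an acceptable file name; the local directory stands in for a GCS bucket.
    static String writeToOwnFile(Path dir, String rawMessage) throws IOException {
        String fileName = UUID.randomUUID().toString();
        Files.write(dir.resolve(fileName), rawMessage.getBytes(StandardCharsets.UTF_8));
        return fileName;
    }

    // Wraps the already-processed JSON payload together with the name of the raw file,
    // mirroring the {fileName: ..., data: ...} shape from the question.
    static String withFileName(String fileName, String processedJson) {
        return "{\"fileName\": \"" + fileName + "\", \"data\": " + processedJson + "}";
    }
}
```

Calling writeToOwnFile for each incoming message and passing the returned name to withFileName yields the record that would then be converted to a TableRow. Returning the file name from the write step, rather than generating it once in main as in the question's code, is what lets every message get its own file.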

Write each row received through PubSub to its own file on Cloud Storage