In a batch pipeline

Time: 2016-11-17 20:26:05

Tags: google-cloud-dataflow apache-beam

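I keep getting this message on a batch pipeline that runs daily on the Google Cloud Dataflow service. The pipeline has started failing with the following message: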

(88b342a0e3852af3): java.io.IOException: INVALID_ARGUMENT: Received message larger than max (21824326 vs. 4194304) dataflow-batch-jetty-11171129-7ea5-harness-waia talking to localhost:12346
    at com.google.cloud.dataflow.sdk.runners.worker.ApplianceShuffleWriter.close(Native Method)
    at com.google.cloud.dataflow.sdk.runners.worker.ChunkingShuffleEntryWriter.close(ChunkingShuffleEntryWriter.java:67)
    at com.google.cloud.dataflow.sdk.runners.worker.ShuffleSink$ShuffleSinkWriter.close(ShuffleSink.java:286)
    at com.google.cloud.dataflow.sdk.util.common.worker.WriteOperation.finish(WriteOperation.java:100)
    at com.google.cloud.dataflow.sdk.util.common.worker.MapTaskExecutor.execute(MapTaskExecutor.java:77)
    at com.google.cloud.dataflow.sdk.runners.worker.DataflowWorker.executeWork(DataflowWorker.java:264)
    at com.google.cloud.dataflow.sdk.runners.worker.DataflowWorker.doWork(DataflowWorker.java:197)
    at com.google.cloud.dataflow.sdk.runners.worker.DataflowWorker.getAndPerformWork(DataflowWorker.java:149)
    at com.google.cloud.dataflow.sdk.runners.worker.DataflowWorkerHarness$WorkerThread.doWork(DataflowWorkerHarness.java:192)
    at com.google.cloud.dataflow.sdk.runners.worker.DataflowWorkerHarness$WorkerThread.call(DataflowWorkerHarness.java:173)
    at com.google.cloud.dataflow.sdk.runners.worker.DataflowWorkerHarness$WorkerThread.call(DataflowWorkerHarness.java:160)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

I am still using the old workaround to output a CSV file with a header, for example:

PCollection<String> output = data.apply(ParDo.of(new DoFn<String, String>() {
    String new_line = System.getProperty("line.separator");
    String csv_header = "id, stuff_1, stuff_2" + new_line;
    StringBuilder csv_body = new StringBuilder().append(csv_header);

    @Override
    public void processElement(ProcessContext c) {
        csv_body.append(c.element()).append(new_line);
    }

    @Override
    public void finishBundle(Context c) throws Exception {
        c.output(csv_body.toString());
    }

})).apply(TextIO.Write.named("WriteData").to(options.getOutput()));

What is causing this? Is the output of this DoFn now too large? The size of the dataset being processed has not increased.

1 answer:

Answer 0 (score: 1)

This looks like it may be a bug on our side, and we are investigating it. In general, though, the code probably does not do what you intend.

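As written, you will end up with an unspecified number of output files whose names start with the given prefix. Each file will contain a concatenation of your expected CSV-like output (headers included) for different chunks of the data, in an unspecified order.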
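To make this concrete, here is a minimal plain-Java sketch (no Dataflow SDK involved) of what buffering per bundle produces. The class name, the `bufferPerBundle` helper, and the split of rows into bundles are hypothetical; in a real job the runner decides how elements are grouped into bundles. Unlike the DoFn in the question, the sketch starts a fresh buffer for each bundle.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class BundleBufferingSketch {
    static final String HEADER = "id, stuff_1, stuff_2";

    // Mimics the question's DoFn: collect every row of a bundle into one
    // string (header first) and emit that string as a single element.
    static List<String> bufferPerBundle(List<List<String>> bundles) {
        List<String> outputs = new ArrayList<>();
        for (List<String> bundle : bundles) {
            StringBuilder body = new StringBuilder().append(HEADER).append('\n');
            for (String row : bundle) {
                body.append(row).append('\n');
            }
            outputs.add(body.toString()); // one output element per bundle
        }
        return outputs;
    }

    public static void main(String[] args) {
        // Hypothetical split of five rows into two bundles.
        List<List<String>> bundles = Arrays.asList(
            Arrays.asList("1, a, b", "2, c, d", "3, e, f"),
            Arrays.asList("4, g, h", "5, i, j"));

        List<String> outputs = bufferPerBundle(bundles);
        System.out.println("output elements: " + outputs.size());
        for (String element : outputs) {
            System.out.println("element length: " + element.length());
            System.out.print(element);
        }
    }
}
```

Each bundle yields one output element that repeats the header, and the size of each element grows with the number of rows in its bundle. Whether such large elements are what trips the size limit in the error message is not confirmed by the answer.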

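To write the CSV file correctly, simply specify the header with TextIO.Write.withHeader() and remove the ParDo that builds the CSV entirely. This will also avoid triggering the bug.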
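A minimal sketch of that suggestion, assuming the Dataflow SDK 1.x for Java in a version that provides `withHeader()` on `TextIO.Write`, as the answer states. The class name, the sample rows, and the `/tmp/output` prefix are placeholders and do not come from the original pipeline. The block needs the SDK on the classpath and will not compile without it.

```java
import com.google.cloud.dataflow.sdk.Pipeline;
import com.google.cloud.dataflow.sdk.io.TextIO;
import com.google.cloud.dataflow.sdk.options.PipelineOptions;
import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;
import com.google.cloud.dataflow.sdk.transforms.Create;
import com.google.cloud.dataflow.sdk.values.PCollection;

public class CsvWithHeaderSketch {
    public static void main(String[] args) {
        PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
        Pipeline p = Pipeline.create(options);

        // Placeholder input: each element is already one CSV row.
        PCollection<String> data = p.apply(Create.of("1, a, b", "2, c, d"));

        // No CSV-building ParDo: rows stay as individual elements and
        // the sink is asked to add the header line itself.
        data.apply(TextIO.Write.named("WriteData")
            .to("/tmp/output") // placeholder output prefix
            .withHeader("id, stuff_1, stuff_2"));

        p.run();
    }
}
```

In the question's pipeline, the equivalent change would be to apply the write directly to `data` with `options.getOutput()` as the prefix.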