I am trying to process 2.5 TB of data from BigQuery. The pipeline starts with this code:
Pipeline p = Pipeline.create(options);
p.apply(BigQueryIO.Read.fromQuery(
        "select * from table_query(Events, 'table_id contains \"20150601\"') where id is not null"))
 // Key each row by its "id" field
 .apply(ParDo.of(new DoFn<TableRow, KV<String, TableRow>>() {
     @Override
     public void processElement(ProcessContext c) throws Exception {
         c.output(KV.of((String) c.element().get("id"), c.element()));
     }
 }))
 // Group all rows that share the same id
 .apply(GroupByKey.<String, TableRow>create())
For DataflowPipelineOptions I only set the staging location (a folder on GCS) and the project.
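For reference, a minimal sketch of such an options setup in SDK 1.x (the project ID and bucket name are placeholders, not values from the original job):

import com.google.cloud.dataflow.sdk.options.DataflowPipelineOptions;
import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;
import com.google.cloud.dataflow.sdk.runners.DataflowPipelineRunner;

DataflowPipelineOptions options = PipelineOptionsFactory.create()
        .as(DataflowPipelineOptions.class);
options.setRunner(DataflowPipelineRunner.class);      // run on the Dataflow service
options.setProject("my-project-id");                  // placeholder project ID
options.setStagingLocation("gs://my-bucket/staging"); // placeholder GCS staging folder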
The job launched successfully on GCP and ran for a while, but eventually failed with an internal IO error:
Jul 16, 2015, 8:45:47 PM (297a156f6f2a50b2): java.lang.RuntimeException: java.io.IOException: INTERNAL: IO error: /var/shuffle/sorted-dataset-4/1011: No space left on device when talking to tcp://localhost:12345
at com.google.cloud.dataflow.sdk.repackaged.com.google.common.base.Throwables.propagate(Throwables.java:160)
at com.google.cloud.dataflow.sdk.runners.worker.ParDoFnBase$1.output(ParDoFnBase.java:154)
at com.google.cloud.dataflow.sdk.runners.worker.ParDoFnBase$1.output(ParDoFnBase.java:117)
at com.google.cloud.dataflow.sdk.util.DoFnRunner$DoFnContext.outputWindowedValue(DoFnRunner.java:314)
at com.google.cloud.dataflow.sdk.util.DoFnRunner$DoFnProcessContext.output(DoFnRunner.java:475)
at com.google.cloud.dataflow.sdk.util.ReifyTimestampAndWindowsDoFn.processElement(ReifyTimestampAndWindowsDoFn.java:40)
Caused by: java.io.IOException: INTERNAL: IO error: /var/shuffle/sorted-dataset-4/1011: No space left on device when talking to tcp://localhost:12345
at com.google.cloud.dataflow.sdk.runners.worker.ApplianceShuffleWriter.write(Native Method)
at com.google.cloud.dataflow.sdk.runners.worker.ChunkingShuffleEntryWriter.writeChunk(ChunkingShuffleEntryWriter.java:72)
at com.google.cloud.dataflow.sdk.runners.worker.ChunkingShuffleEntryWriter.put(ChunkingShuffleEntryWriter.java:56)
at com.google.cloud.dataflow.sdk.runners.worker.ShuffleSink$ShuffleSinkWriter.add(ShuffleSink.java:258)
at com.google.cloud.dataflow.sdk.runners.worker.ShuffleSink$ShuffleSinkWriter.add(ShuffleSink.java:169)
at com.google.cloud.dataflow.sdk.util.common.worker.WriteOperation.process(WriteOperation.java:90)
at com.google.cloud.dataflow.sdk.util.common.worker.OutputReceiver.process(OutputReceiver.java:147)
at com.google.cloud.dataflow.sdk.runners.worker.ParDoFnBase$1.output(ParDoFnBase.java:152)
at com.google.cloud.dataflow.sdk.runners.worker.ParDoFnBase$1.output(ParDoFnBase.java:117)
at com.google.cloud.dataflow.sdk.util.DoFnRunner$DoFnContext.outputWindowedValue(DoFnRunner.java:314)
at com.google.cloud.dataflow.sdk.util.DoFnRunner$DoFnProcessContext.output(DoFnRunner.java:475)
at com.google.cloud.dataflow.sdk.util.ReifyTimestampAndWindowsDoFn.processElement(ReifyTimestampAndWindowsDoFn.java:40)
at com.google.cloud.dataflow.sdk.util.DoFnRunner.invokeProcessElement(DoFnRunner.java:167)
at com.google.cloud.dataflow.sdk.util.DoFnRunner.processElement(DoFnRunner.java:152)
at com.google.cloud.dataflow.sdk.runners.worker.ParDoFnBase.processElement(ParDoFnBase.java:188)
at com.google.cloud.dataflow.sdk.util.common.worker.ParDoOperation.process(ParDoOperation.java:52)
at com.google.cloud.dataflow.sdk.util.common.worker.OutputReceiver.process(OutputReceiver.java:147)
at com.google.cloud.dataflow.sdk.runners.worker.ParDoFnBase$1.output(ParDoFnBase.java:152)
at com.google.cloud.dataflow.sdk.runners.worker.ParDoFnBase$1.output(ParDoFnBase.java:117)
at com.google.cloud.dataflow.sdk.util.DoFnRunner$DoFnContext.outputWindowedValue(DoFnRunner.java:314)
at com.google.cloud.dataflow.sdk.util.DoFnRunner$DoFnProcessContext.output(DoFnRunner.java:475)
at com.outfit7.dataflow.ante.Example$5.processElement(Example.java:41)
at com.google.cloud.dataflow.sdk.util.DoFnRunner.invokeProcessElement(DoFnRunner.java:167)
at com.google.cloud.dataflow.sdk.util.DoFnRunner.processElement(DoFnRunner.java:152)
at com.google.cloud.dataflow.sdk.runners.worker.ParDoFnBase.processElement(ParDoFnBase.java:188)
at com.google.cloud.dataflow.sdk.util.common.worker.ParDoOperation.process(ParDoOperation.java:52)
at com.google.cloud.dataflow.sdk.util.common.worker.OutputReceiver.process(OutputReceiver.java:147)
at com.google.cloud.dataflow.sdk.util.common.worker.ReadOperation.runReadLoop(ReadOperation.java:171)
at com.google.cloud.dataflow.sdk.util.common.worker.ReadOperation.start(ReadOperation.java:117)
at com.google.cloud.dataflow.sdk.util.common.worker.MapTaskExecutor.execute(MapTaskExecutor.java:66)
at com.google.cloud.dataflow.sdk.runners.worker.DataflowWorker.executeWork(DataflowWorker.java:220)
at com.google.cloud.dataflow.sdk.runners.worker.DataflowWorker.doWork(DataflowWorker.java:167)
at com.google.cloud.dataflow.sdk.runners.worker.DataflowWorker.getAndPerformWork(DataflowWorker.java:134)
at com.google.cloud.dataflow.sdk.runners.worker.DataflowWorkerHarness$WorkerThread.call(DataflowWorkerHarness.java:146)
at com.google.cloud.dataflow.sdk.runners.worker.DataflowWorkerHarness$WorkerThread.call(DataflowWorkerHarness.java:131)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Is there any way to make sure the job completes successfully? How should I set numWorkers and diskSizeGb, or estimate how much disk a single worker will use? Is GroupByKey executed on one worker, or is it shared/sharded across workers? As far as I understand, GroupByKey "waits" until all the data has been processed before passing the PCollection to the next element in the pipeline.
Answer (score: 1)
From https://cloud.google.com/dataflow/faq#question-45:
This error indicates that your workers do not have enough local disk space to process the job. If you run the job with the default settings, it runs on 3 workers, each with 250 GB of local disk space and no autoscaling. Consider modifying the default settings to increase the number of workers available to your job, to increase the default disk size per worker, or to enable autoscaling.
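A sketch of how those knobs map onto the SDK 1.x options; the numeric values are illustrative only, and the autoscaling enum value shown is the one from later 1.x SDKs (earlier releases exposed an experimental BASIC mode instead):

import com.google.cloud.dataflow.sdk.options.DataflowPipelineOptions;
import com.google.cloud.dataflow.sdk.options.DataflowPipelineWorkerPoolOptions.AutoscalingAlgorithmType;
import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;

DataflowPipelineOptions options = PipelineOptionsFactory.create()
        .as(DataflowPipelineOptions.class);
// DataflowPipelineOptions extends DataflowPipelineWorkerPoolOptions,
// so the worker-pool settings are available directly on it.
options.setNumWorkers(10);   // more workers => more aggregate shuffle disk (example value)
options.setDiskSizeGb(500);  // local disk per worker, in GB (default: 250)
// Alternatively, let the service grow the worker pool as needed:
options.setAutoscalingAlgorithm(AutoscalingAlgorithmType.THROUGHPUT_BASED);
options.setMaxNumWorkers(20); // upper bound when autoscaling is enabled

The same settings can also be passed as command-line flags (--numWorkers, --diskSizeGb, --autoscalingAlgorithm, --maxNumWorkers) when the options are built with PipelineOptionsFactory.fromArgs(args).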