Dataflow job fails after a few 410 errors (while writing to GCS)

Time: 2016-12-19 03:53:25

Tags: google-cloud-storage google-cloud-dataflow

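I found a similar SO question from August that is more or less the problem my team has recently been running into with our Dataflow pipelines: How to recover from Cloud Dataflow job failed on com.google.api.client.googleapis.json.GoogleJsonResponseException: 410 Gone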

Here is the exception (the 410 exceptions were thrown over a span of about 1 hour, but I am pasting only the last one):

(9f012f4bc185d790): java.io.IOException: com.google.api.client.googleapis.json.GoogleJsonResponseException: 410 Gone
{
  "code" : 500,
  "errors" : [ {
    "domain" : "global",
    "message" : "Backend Error",
    "reason" : "backendError"
  } ],
  "message" : "Backend Error"
}
    at com.google.cloud.hadoop.util.AbstractGoogleAsyncWriteChannel.waitForCompletionAndThrowIfUploadFailed(AbstractGoogleAsyncWriteChannel.java:431)
    at com.google.cloud.hadoop.util.AbstractGoogleAsyncWriteChannel.close(AbstractGoogleAsyncWriteChannel.java:289)
    at com.google.cloud.dataflow.sdk.runners.worker.TextSink$TextFileWriter.close(TextSink.java:243)
    at com.google.cloud.dataflow.sdk.util.common.worker.WriteOperation.finish(WriteOperation.java:97)
    at com.google.cloud.dataflow.sdk.util.common.worker.MapTaskExecutor.execute(MapTaskExecutor.java:80)
    at com.google.cloud.dataflow.sdk.runners.worker.DataflowWorker.executeWork(DataflowWorker.java:287)
    at com.google.cloud.dataflow.sdk.runners.worker.DataflowWorker.doWork(DataflowWorker.java:223)
    at com.google.cloud.dataflow.sdk.runners.worker.DataflowWorker.getAndPerformWork(DataflowWorker.java:173)
    at com.google.cloud.dataflow.sdk.runners.worker.DataflowWorkerHarness$WorkerThread.doWork(DataflowWorkerHarness.java:193)
    at com.google.cloud.dataflow.sdk.runners.worker.DataflowWorkerHarness$WorkerThread.call(DataflowWorkerHarness.java:173)
    at com.google.cloud.dataflow.sdk.runners.worker.DataflowWorkerHarness$WorkerThread.call(DataflowWorkerHarness.java:160)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
    Suppressed: java.nio.channels.ClosedChannelException
        at com.google.cloud.hadoop.util.AbstractGoogleAsyncWriteChannel.throwIfNotOpen(AbstractGoogleAsyncWriteChannel.java:408)
        at com.google.cloud.hadoop.util.AbstractGoogleAsyncWriteChannel.close(AbstractGoogleAsyncWriteChannel.java:286)
        at com.google.cloud.dataflow.sdk.runners.worker.TextSink$TextFileWriter.close(TextSink.java:243)
        at com.google.cloud.dataflow.sdk.util.common.worker.WriteOperation.abort(WriteOperation.java:112)
        at com.google.cloud.dataflow.sdk.util.common.worker.MapTaskExecutor.execute(MapTaskExecutor.java:86)
        ... 10 more
Caused by: com.google.api.client.googleapis.json.GoogleJsonResponseException: 410 Gone
{
  "code" : 500,
  "errors" : [ {
    "domain" : "global",
    "message" : "Backend Error",
    "reason" : "backendError"
  } ],
  "message" : "Backend Error"
}
    at com.google.api.client.googleapis.json.GoogleJsonResponseException.from(GoogleJsonResponseException.java:146)
    at com.google.api.client.googleapis.services.json.AbstractGoogleJsonClientRequest.newExceptionOnError(AbstractGoogleJsonClientRequest.java:113)
    at com.google.api.client.googleapis.services.json.AbstractGoogleJsonClientRequest.newExceptionOnError(AbstractGoogleJsonClientRequest.java:40)
    at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:432)
    at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:352)
    at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.execute(AbstractGoogleClientRequest.java:469)
    at com.google.cloud.hadoop.util.AbstractGoogleAsyncWriteChannel$UploadOperation.call(AbstractGoogleAsyncWriteChannel.java:357)
    ... 4 more
 2016-12-18 (18:58:58) Workflow failed. Causes: (8e88b50b0d86156a): S26:up:2016-12-18:userprofile-lifetime-20161218/Write/D...
(d3b59c20088d726e): Workflow failed. Causes: (8e88b50b0d86156a): S26:up:2016-12-18:userprofile-lifetime-20161218/Write/DataflowPipelineRunner.BatchWrite/Finalize failed., (2f2e396b6ba3aaa2): A work item was attempted 4 times without success. Each time the worker eventually lost contact with the service. The work item was attempted on: userprofile-lifetime-diff-12181745-c04d-harness-0xq5, userprofile-lifetime-diff-12181745-c04d-harness-adga, userprofile-lifetime-diff-12181745-c04d-harness-cz30, userprofile-lifetime-diff-12181745-c04d-harness-lt9m

Here is the job ID: 2016-12-18_17_45_23-2873252511714037422

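I am now re-running the same job with a fixed number of shards (4000; the job runs daily and normally outputs about 4k files), based on the answer to the other SO question I mentioned above. Is there a reason why limiting the number of shards to something below 10k helps? Knowing this would be very useful if we need to redesign our pipeline.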

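Also, with the number of shards specified, the job takes longer than it does without (mainly because of the step that writes to GCS). In dollar terms, this job usually costs $75-80 (we run it daily), whereas with the number of shards specified it cost $130-140, a 74% increase. The other steps seem to have run for more or less the same duration; the job ID is 2016-12-18_19_30_32-7274262445792076535. So we would really like to avoid having to specify the number of shards, if possible.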
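For reference, the 74% figure is consistent with comparing the midpoints of the two cost ranges quoted above. The following is a minimal sketch of that arithmetic; the class and method names are hypothetical, and using the midpoints is an assumption about how the percentage was derived.

```java
// Hypothetical helper for checking the quoted cost increase.
// Assumption: the percentage compares the midpoints of the two ranges.
final class CostIncreaseSketch {

    // Midpoint of a cost range, e.g. $75-80 gives 77.5.
    static double midpoint(double low, double high) {
        return (low + high) / 2.0;
    }

    // Relative increase from `before` to `after`, in percent.
    static double percentIncrease(double before, double after) {
        return (after - before) / before * 100.0;
    }

    public static void main(String[] args) {
        double before = midpoint(75, 80);   // usual daily cost range
        double after = midpoint(130, 140);  // cost range with a fixed shard count
        System.out.printf("%.0f%%%n", percentIncrease(before, after)); // about 74%
    }
}
```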

Any help and suggestions are greatly appreciated!

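- Follow-up: When I run 'gsutil ls' on the output directory, this job's output seems to disappear and then reappear in GCS, even more than 10 hours after the job completed. This may be a related issue, but I have created a separate question for it here ("gsutil ls" shows a different list every time).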

2 answers:

Answer 0: (score: 2)

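Yes - specifying the number of shards imposes additional constraints on how Dataflow executes the job and can affect performance. For example, dynamic work rebalancing is incompatible with a fixed number of shards.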

As I understand it, the 410 error is a transient GCS issue, and Dataflow's retry mechanism can usually work around it. However, if it occurs at too high a rate, it can cause the job to fail. In batch mode, Dataflow fails the job if a single bundle fails 4 times.
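The interaction between transient errors and the 4-attempt limit can be illustrated with a small simulation. This is only a sketch of the behavior described above, not Dataflow's actual implementation; `BundleRetrySketch`, `runWithRetries`, and `flakyBundle` are hypothetical names, and the simulated exception merely stands in for a 410 from GCS.

```java
import java.io.IOException;
import java.util.concurrent.Callable;

// Hypothetical simulation of "a bundle that fails 4 times fails the job".
final class BundleRetrySketch {

    // Batch-mode limit mentioned in the answer above.
    static final int MAX_ATTEMPTS = 4;

    // Runs the bundle, retrying on any exception.
    // Returns the attempt number that succeeded, or throws once the limit is reached.
    static int runWithRetries(Callable<Void> bundle) {
        Exception last = null;
        for (int attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
            try {
                bundle.call();
                return attempt;
            } catch (Exception e) {
                last = e; // treated as transient; try again
            }
        }
        throw new IllegalStateException(
                "bundle failed " + MAX_ATTEMPTS + " times, job fails", last);
    }

    // Builds a simulated bundle whose first `failures` calls throw.
    static Callable<Void> flakyBundle(int failures) {
        int[] remaining = {failures};
        return () -> {
            if (remaining[0] > 0) {
                remaining[0]--;
                throw new IOException("410 Gone (simulated)");
            }
            return null;
        };
    }

    public static void main(String[] args) {
        // Two transient failures are absorbed by retries.
        System.out.println("succeeded on attempt " + runWithRetries(flakyBundle(2)));
        try {
            // Four consecutive failures exhaust the limit.
            runWithRetries(flakyBundle(4));
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

In this model an occasional 410 costs only a retry, while a sustained error rate that hits the same bundle four times in a row brings down the whole job.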

Answer 1: (score: 0)

The 410 errors that occurred in this job appear to be harmless (they were retried successfully).

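The job took longer, AFAIK, not because of sharding per se, but because workers were "losing contact with the service". That was caused by hitting an RPC quota issue, which I believe Frances said has been resolved.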

Note that your job would be much cheaper if you did not disable autoscaling, because Dataflow would then be allowed to shut down idle workers.

If you are still seeing performance problems with fixed sharding, please let us know!