DataFlow worker runtime error

Date: 2018-03-13 22:37:16

Tags: google-cloud-platform google-cloud-dataflow apache-beam

I'm running a Dataflow job (job ID: 2018-03-13_13_21_09-13427670439683219454).

The job stopped running after one hour with the following error message:

(f38a2b0cb8c28493): Workflow failed. Causes: [...] 
(6bf57c531051aa32): A work item was attempted 4 times without success. 
Each time the worker eventually lost contact with the service. 
The work item was attempted on: [...]

I have run the same job successfully on other data, but something about this data seems to be different.

I can't find any obviously relevant error message in Stackdriver, apart from the following, which at least seems informative:

Exception in worker loop: Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/batchworker.py", line 778, in run
    deferred_exception_details=deferred_exception_details)
  File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/batchworker.py", line 630, in do_work
    exception_details=exception_details)
  File "/usr/local/lib/python2.7/dist-packages/apache_beam/utils/retry.py", line 175, in wrapper
    return fun(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/batchworker.py", line 491, in report_completion_status
    exception_details=exception_details)
  File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/batchworker.py", line 299, in report_status
    work_executor=self._work_executor)
  File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/workerapiclient.py", line 359, in report_status
    self._client.projects_locations_jobs_workItems.ReportStatus(request))
  File "/usr/local/lib/python2.7/dist-packages/apache_beam/runners/dataflow/internal/clients/dataflow/dataflow_v1b3_client.py", line 557, in ReportStatus
    config, request, global_params=global_params)
  File "/usr/local/lib/python2.7/dist-packages/apitools/base/py/base_api.py", line 722, in _RunMethod
    return self.ProcessHttpResponse(method_config, http_response, request)
  File "/usr/local/lib/python2.7/dist-packages/apitools/base/py/base_api.py", line 728, in ProcessHttpResponse
    self.__ProcessHttpResponse(method_config, http_response, request))
  File "/usr/local/lib/python2.7/dist-packages/apitools/base/py/base_api.py", line 599, in __ProcessHttpResponse
    http_response, method_config=method_config, request=request)
HttpBadRequestError: HttpError accessing <https://dataflow.googleapis.com/v1b3/projects/wikidetox-viz/locations/us-central1/jobs/2018-03-13_15_14_56-7727174963497501590/workItems:reportStatus?alt=json>: response: <{'status': '400', 'content-length': '356', 'x-xss-protection': '1; mode=block', 'x-content-type-options': 'nosniff', 'transfer-encoding': 'chunked', 'vary': 'Origin, X-Origin, Referer', 'server': 'ESF', '-content-encoding': 'gzip', 'cache-control': 'private', 'date': 'Tue, 13 Mar 2018 22:48:10 GMT', 'x-frame-options': 'SAMEORIGIN', 'content-type': 'application/json; charset=UTF-8'}>, content <{ "error": { "code": 400, "message": "(99f9b99d6c881f47): Failed to publish the result of the work update. Causes: (99f9b99d6c881644): Failed to update work status. Causes: (e00f9cd76af5eb): Failed to update work status., (e00f9cd76afcd5): Work \"5154713722864856696\" not leased (or the lease was lost).", "status": "INVALID_ARGUMENT" } } >

Is there any way I can debug this?

Update

After upgrading the cloud storage package, I ran the job again with this job ID (2018-03-13_19_26_59-7765405222195746041).

The error I get now appears to be an Apache Beam write error:

(3c513425829099bc): Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/batchworker.py", line 582, in do_work
    work_executor.execute()
  File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/executor.py", line 170, in execute
    op.finish()
  File "apache_beam/runners/worker/operations.py", line 334, in apache_beam.runners.worker.operations.DoOperation.finish
    def finish(self):
  File "apache_beam/runners/worker/operations.py", line 335, in apache_beam.runners.worker.operations.DoOperation.finish
    with self.scoped_finish_state:
  File "apache_beam/runners/worker/operations.py", line 336, in apache_beam.runners.worker.operations.DoOperation.finish
    self.dofn_runner.finish()
  File "apache_beam/runners/common.py", line 411, in apache_beam.runners.common.DoFnRunner.finish
    self._invoke_bundle_method(self.do_fn_invoker.invoke_finish_bundle)
  File "apache_beam/runners/common.py", line 402, in apache_beam.runners.common.DoFnRunner._invoke_bundle_method
    self._reraise_augmented(exn)
  File "apache_beam/runners/common.py", line 431, in apache_beam.runners.common.DoFnRunner._reraise_augmented
    raise new_exn, None, original_traceback
  File "apache_beam/runners/common.py", line 400, in apache_beam.runners.common.DoFnRunner._invoke_bundle_method
    bundle_method()
  File "apache_beam/runners/common.py", line 174, in apache_beam.runners.common.DoFnInvoker.invoke_finish_bundle
    def invoke_finish_bundle(self):
  File "apache_beam/runners/common.py", line 177, in apache_beam.runners.common.DoFnInvoker.invoke_finish_bundle
    self.output_processor.finish_bundle_outputs(
  File "apache_beam/runners/common.py", line 500, in apache_beam.runners.common._OutputProcessor.finish_bundle_outputs
    for result in results:
  File "/usr/local/lib/python2.7/dist-packages/apache_beam/io/iobase.py", line 969, in finish_bundle
    yield WindowedValue(self.writer.close(), window.MAX_TIMESTAMP,
  File "/usr/local/lib/python2.7/dist-packages/apache_beam/io/filebasedsink.py", line 302, in close
    self.sink.close(self.temp_handle)
  File "/usr/local/lib/python2.7/dist-packages/apache_beam/io/filebasedsink.py", line 144, in close
    file_handle.close()
  File "/usr/local/lib/python2.7/dist-packages/apache_beam/io/gcp/gcsio.py", line 863, in close
    self._flush_write_buffer()
  File "/usr/local/lib/python2.7/dist-packages/apache_beam/io/gcp/gcsio.py", line 896, in _flush_write_buffer
    raise self.upload_thread.last_error  # pylint: disable=raising-bad-type
RuntimeError: BadStatusCodeError: HttpError accessing <https://www.googleapis.com/resumable/upload/storage/v1/b/wikidetox-viz-dataflow/o?uploadType=resumable&alt=json&upload_id=AEnB2UrkJEFlq2t-c9_Zpo_NuYip7Z6yFU12xq4bRtOTRtFPJ0GOhBJ9WhnuYTkR9vsbi59izn1ifO3h5-hc6oHECMD3tFLidQ&name=bakup%2Freconstruction-from-shortpages-pages-week20year2012%2Fbeam-temp-last_rev-9f8513e0273011e8803b42010a80003d%2F365af941-49ae-4867-8793-5e5858dbf048.last_rev>: response: <{'status': '503', 'content-length': '19', 'server': 'UploadServer', 'x-guploader-uploadid': 'AEnB2Ur4-2Nih8b2X2VbCFUcaKviZt7nCj9OYrZ6lfCP_ne5EegXEw5ZJayGLGwg9ix9Xdle1TcSOhFR-T6qGoc63A0zHhL2Qw', 'date': 'Wed, 14 Mar 2018 02:49:35 GMT', 'content-type': 'text/html; charset=UTF-8'}>, content <Service Unavailable> [while running 'WriteBackInput_last_rev/Write/WriteImpl/WriteBundles/WriteBundles']

Thanks in advance.

1 answer:

Answer 0 (score: 0)

HTTP errors like the one from the Python write are usually caused by transient issues and should be retried automatically. Your job failed because the same work item failed four times in a row. So I would look through the worker logs of one of the failed attempts for a different error.
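The transient-retry behavior described above can be sketched as a simple exponential-backoff loop. This is only an illustration of the idea, not Dataflow's actual retry code; the `TransientHttpError` class, attempt count, and delay values below are all made up for the example:

```python
import random
import time


class TransientHttpError(Exception):
    """Stand-in for an HTTP error carrying a status code (hypothetical)."""

    def __init__(self, status):
        super(TransientHttpError, self).__init__("HTTP %d" % status)
        self.status = status


def call_with_retries(fn, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Retry fn() on transient (5xx) errors with exponential backoff.

    Mirrors the idea that a single 503 from a service should not fail the
    job: the error only propagates after max_attempts consecutive failures.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except TransientHttpError as err:
            if err.status < 500 or attempt == max_attempts:
                raise  # non-transient, or out of attempts
            # Back off exponentially, with a little jitter.
            sleep(base_delay * (2 ** (attempt - 1)) + random.uniform(0, 0.1))


# Example: a flaky call that returns 503 twice, then succeeds.
state = {"calls": 0}


def flaky_upload():
    state["calls"] += 1
    if state["calls"] < 3:
        raise TransientHttpError(503)
    return "ok"


result = call_with_retries(flaky_upload, sleep=lambda _: None)  # skip real sleeps
print(result, state["calls"])  # -> ok 3
```

With this model, a lone 503 like the one in the update above is absorbed silently; only a work item that fails on every attempt surfaces as a job failure.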

A common error to look for would be something like this:

The work item's progress reporter thread was unable to send a successful progress report to the Dataflow service in the past 460 seconds. This may be caused by (1) high worker memory usage, (2) user code blocking the progress reporter thread from being scheduled properly, or (3) other issues with the Dataflow service. Delayed progress reports can cause the lease on the current work item to expire. If the same work item expires multiple times, the Dataflow job may fail.

This error indicates that your worker is out of memory or stalled and cannot respond with progress updates. Profiling the workers' memory or CPU usage would be a good next step.
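One cheap way to check whether workers are creeping toward their memory limit is to log the process's peak resident set size from inside the pipeline code. A minimal stdlib-only sketch follows; `LogMemoryDoFn` is a hypothetical name (in a real Beam pipeline it would subclass `apache_beam.DoFn`, but a plain class keeps the sketch dependency-free):

```python
import logging
import resource
import sys


def peak_rss_mib():
    """Peak resident set size of the current process, in MiB.

    Note: ru_maxrss is reported in KiB on Linux but in bytes on macOS.
    """
    rss = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    if sys.platform == "darwin":
        rss //= 1024  # bytes -> KiB
    return rss / 1024.0  # KiB -> MiB


class LogMemoryDoFn(object):
    """Hypothetical DoFn-style wrapper that logs peak memory per bundle."""

    def start_bundle(self):
        logging.info("bundle start, peak RSS: %.1f MiB", peak_rss_mib())

    def process(self, element):
        # Pass elements through unchanged; this DoFn only observes memory.
        yield element

    def finish_bundle(self):
        logging.info("bundle finish, peak RSS: %.1f MiB", peak_rss_mib())
```

Watching these log lines climb across bundles in Stackdriver would point at a memory leak or an unexpectedly large element, either of which can starve the progress reporter thread described above.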