Dataflow job fails with HttpError, NotImplementedError

Date: 2020-07-08 23:51:20

Tags: python-3.x google-cloud-platform google-cloud-dataflow

I'm running a Dataflow job that I believe should work, but after 1.5 hours it fails with what appear to be network errors. It runs fine on a subset of the data.

The first sign of trouble is a long run of warnings like this:

Refusing to split <dataflow_worker.shuffle.GroupedShuffleRangeTracker object at 0x7f2bcb629950> at b'\xa4r\xa6\x85\x00\x01': proposed split position is out of range [b'\xa4^E\xd2\x00\x01', b'\xa4r\xa6\x85\x00\x01'). Position of last group processed was b'\xa4r\xa6\x84\x00\x01'.

Then there are four errors that appear to be related to writing CSV files to GCS:

Error in _start_upload while inserting file gs://(redacted).csv:
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/apache_beam/io/gcp/gcsio.py", line 565, in _start_upload
    self._client.objects.Insert(self._insert_request, upload=self._upload)
  File "/usr/local/lib/python3.7/site-packages/apache_beam/io/gcp/internal/clients/storage/storage_v1_client.py", line 1156, in Insert
    upload=upload, upload_config=upload_config)
  File "/usr/local/lib/python3.7/site-packages/apitools/base/py/base_api.py", line 731, in _RunMethod
    return self.ProcessHttpResponse(method_config, http_response, request)
  File "/usr/local/lib/python3.7/site-packages/apitools/base/py/base_api.py", line 737, in ProcessHttpResponse
    self.__ProcessHttpResponse(method_config, http_response, request))
  File "/usr/local/lib/python3.7/site-packages/apitools/base/py/base_api.py", line 604, in __ProcessHttpResponse
    http_response, method_config=method_config, request=request)
apitools.base.py.exceptions.HttpError: HttpError accessing <https://www.googleapis.com/resumable/upload/storage/v1/b/(redacted).csv&uploadType=resumable&upload_id=(redacted)>: response: <{'content-type': 'text/plain; charset=utf-8', 'x-guploader-uploadid': '(redacted)', 'content-length': '0', 'date': 'Wed, 08 Jul 2020 22:17:28 GMT', 'server': 'UploadServer', 'status': '503'}>, content <>

Error in _start_upload while inserting file gs://(redacted).csv:
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/apache_beam/io/gcp/gcsio.py", line 565, in _start_upload
    self._client.objects.Insert(self._insert_request, upload=self._upload)
  File "/usr/local/lib/python3.7/site-packages/apache_beam/io/gcp/internal/clients/storage/storage_v1_client.py", line 1156, in Insert
    upload=upload, upload_config=upload_config)
  File "/usr/local/lib/python3.7/site-packages/apitools/base/py/base_api.py", line 715, in _RunMethod
    http_request, client=self.client)
  File "/usr/local/lib/python3.7/site-packages/apitools/base/py/transfer.py", line 908, in InitializeUpload
    return self.StreamInChunks()
  File "/usr/local/lib/python3.7/site-packages/apitools/base/py/transfer.py", line 1020, in StreamInChunks
    additional_headers=additional_headers)
  File "/usr/local/lib/python3.7/site-packages/apitools/base/py/transfer.py", line 971, in __StreamMedia
    self.RefreshResumableUploadState()
  File "/usr/local/lib/python3.7/site-packages/apitools/base/py/transfer.py", line 873, in RefreshResumableUploadState
    self.stream.seek(self.progress)
  File "/usr/local/lib/python3.7/site-packages/apache_beam/io/filesystemio.py", line 301, in seek
    offset, whence, self.position, self.last_block_position))
NotImplementedError: offset: 0, whence: 0, position: 411, last: 411
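For context, the failing step is a write of CSV files to GCS, which in my pipeline looks roughly like this sketch (the output path and column names are placeholders for the redacted ones):

    import apache_beam as beam

    # Roughly the shape of the write that fails; the gs:// path here is a
    # placeholder for the redacted location in the tracebacks above.
    def write_csv(rows):
        return rows | 'WriteCsv' >> beam.io.WriteToText(
            'gs://my-bucket/output/results',
            file_name_suffix='.csv',
            header='col_a,col_b')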

The Dataflow job ID is 2020-07-07_13_13_08_31-7649894576933400587. If anyone from Google Cloud Support can take a look at it, I'd be very grateful. Thanks very much.

P.S. I asked a similar question last year (Dataflow job fails at BigQuery write with backend errors); the fix then was to use --experiments=use_beam_bq_sink, which I'm already doing here.
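For reference, here's a minimal sketch of how that flag gets passed via PipelineOptions (the project, region, and bucket names are placeholders, not my real values):

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    # Placeholder project/region/bucket; the experiments flag is the
    # real setting in use.
    options = PipelineOptions(
        runner='DataflowRunner',
        project='my-project',
        region='us-central1',
        temp_location='gs://my-bucket/tmp',
        experiments=['use_beam_bq_sink'],
    )

    pipeline = beam.Pipeline(options=options)
    # ... transforms elided ...
    result = pipeline.run()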

1 answer:

Answer 0 (score: 0):

You can safely ignore the "Refusing to split" warnings. They just mean that the worker probably received the proposed split position from the Dataflow service after it had already read past that position, so the worker has to ignore the split request.
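As a simplified illustration (a hypothetical helper, not Beam's actual GroupedShuffleRangeTracker code), a proposed split is only accepted if it falls strictly inside the unread remainder of the worker's half-open range:

    # Simplified sketch of the refusal logic behind the warning above; this
    # is not Beam's actual GroupedShuffleRangeTracker implementation.
    def try_split(range_start, range_stop, last_processed, proposed):
        # The range is half-open, [range_start, range_stop), so a proposal
        # equal to range_stop (as in the warning) is already out of range.
        if not (range_start <= proposed < range_stop):
            return None  # refuse: "proposed split position is out of range"
        if proposed <= last_processed:
            return None  # refuse: the worker already read past this point
        return proposed  # accept: the worker now owns [range_start, proposed)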

The "Error in _start_upload while inserting" errors look more problematic, and seem similar to https://issues.apache.org/jira/browse/BEAM-7014. I suspect this is a rare flake, so I'm not sure whether it's what caused your job to fail (a job only fails when the same work item fails four times).

Could you contact Google Cloud Support so that they can look into your job?

I will mention this in the JIRA.