Dataflow job fails with HttpError, NotImplementedError

Date: 2020-07-08 23:51:20

Tags: python-3.x google-cloud-platform google-cloud-dataflow

I'm running a Dataflow job that I believe should work, but after 1.5 hours it fails with what appear to be network errors. It runs fine on a subset of the data.

The first sign of trouble is a long run of warnings like this:

Refusing to split <dataflow_worker.shuffle.GroupedShuffleRangeTracker object at 0x7f2bcb629950> at b'\xa4r\xa6\x85\x00\x01': proposed split position is out of range [b'\xa4^E\xd2\x00\x01', b'\xa4r\xa6\x85\x00\x01'). Position of last group processed was b'\xa4r\xa6\x84\x00\x01'.

Then there are four errors that appear to be related to writing CSV files to GCS:

Error in _start_upload while inserting file gs://(redacted).csv:
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/apache_beam/io/gcp/gcsio.py", line 565, in _start_upload
    self._client.objects.Insert(self._insert_request, upload=self._upload)
  File "/usr/local/lib/python3.7/site-packages/apache_beam/io/gcp/internal/clients/storage/storage_v1_client.py", line 1156, in Insert
    upload=upload, upload_config=upload_config)
  File "/usr/local/lib/python3.7/site-packages/apitools/base/py/base_api.py", line 731, in _RunMethod
    return self.ProcessHttpResponse(method_config, http_response, request)
  File "/usr/local/lib/python3.7/site-packages/apitools/base/py/base_api.py", line 737, in ProcessHttpResponse
    self.__ProcessHttpResponse(method_config, http_response, request))
  File "/usr/local/lib/python3.7/site-packages/apitools/base/py/base_api.py", line 604, in __ProcessHttpResponse
    http_response, method_config=method_config, request=request)
apitools.base.py.exceptions.HttpError: HttpError accessing <https://www.googleapis.com/resumable/upload/storage/v1/b/(redacted).csv&uploadType=resumable&upload_id=(redacted)>: response: <{'content-type': 'text/plain; charset=utf-8', 'x-guploader-uploadid': '(redacted)', 'content-length': '0', 'date': 'Wed, 08 Jul 2020 22:17:28 GMT', 'server': 'UploadServer', 'status': '503'}>, content <>

Error in _start_upload while inserting file gs://(redacted).csv:
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/apache_beam/io/gcp/gcsio.py", line 565, in _start_upload
    self._client.objects.Insert(self._insert_request, upload=self._upload)
  File "/usr/local/lib/python3.7/site-packages/apache_beam/io/gcp/internal/clients/storage/storage_v1_client.py", line 1156, in Insert
    upload=upload, upload_config=upload_config)
  File "/usr/local/lib/python3.7/site-packages/apitools/base/py/base_api.py", line 715, in _RunMethod
    http_request, client=self.client)
  File "/usr/local/lib/python3.7/site-packages/apitools/base/py/transfer.py", line 908, in InitializeUpload
    return self.StreamInChunks()
  File "/usr/local/lib/python3.7/site-packages/apitools/base/py/transfer.py", line 1020, in StreamInChunks
    additional_headers=additional_headers)
  File "/usr/local/lib/python3.7/site-packages/apitools/base/py/transfer.py", line 971, in __StreamMedia
    self.RefreshResumableUploadState()
  File "/usr/local/lib/python3.7/site-packages/apitools/base/py/transfer.py", line 873, in RefreshResumableUploadState
    self.stream.seek(self.progress)
  File "/usr/local/lib/python3.7/site-packages/apache_beam/io/filesystemio.py", line 301, in seek
    offset, whence, self.position, self.last_block_position))
NotImplementedError: offset: 0, whence: 0, position: 411, last: 411
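For context, the failing step is a write of CSV files to GCS, which in my pipeline looks roughly like this sketch (the output path and column names are placeholders for the redacted ones):

    import apache_beam as beam

    # Roughly the shape of the write that fails; the gs:// path here is a
    # placeholder for the redacted location in the tracebacks above.
    def write_csv(rows):
        return rows | 'WriteCsv' >> beam.io.WriteToText(
            'gs://my-bucket/output/results',
            file_name_suffix='.csv',
            header='col_a,col_b')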

The Dataflow job ID is 2020-07-07_13_13_08_31-7649894576933400587. If anyone from Google Cloud Support can take a look at it, I'd be very grateful. Thanks very much.

P.S. I asked a similar question last year (Dataflow job fails at BigQuery write with backend errors); the fix then was to use --experiments=use_beam_bq_sink, which I'm already doing here.
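For reference, here's a minimal sketch of how that flag gets passed via PipelineOptions (the project, region, and bucket names are placeholders, not my real values):

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    # Placeholder project/region/bucket; the experiments flag is the
    # real setting in use.
    options = PipelineOptions(
        runner='DataflowRunner',
        project='my-project',
        region='us-central1',
        temp_location='gs://my-bucket/tmp',
        experiments=['use_beam_bq_sink'],
    )

    pipeline = beam.Pipeline(options=options)
    # ... transforms elided ...
    result = pipeline.run()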

1 answer:

Answer 0 (score: 0):

You can safely ignore the "Refusing to split" warnings. They just mean that the worker probably received the proposed split position from the Dataflow service after it had already read past that position, so the worker has to ignore the split request.
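As a simplified illustration (a hypothetical helper, not Beam's actual GroupedShuffleRangeTracker code), a proposed split is only accepted if it falls strictly inside the unread remainder of the worker's half-open range:

    # Simplified sketch of the refusal logic behind the warning above; this
    # is not Beam's actual GroupedShuffleRangeTracker implementation.
    def try_split(range_start, range_stop, last_processed, proposed):
        # The range is half-open, [range_start, range_stop), so a proposal
        # equal to range_stop (as in the warning) is already out of range.
        if not (range_start <= proposed < range_stop):
            return None  # refuse: "proposed split position is out of range"
        if proposed <= last_processed:
            return None  # refuse: the worker already read past this point
        return proposed  # accept: the worker now owns [range_start, proposed)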

The "Error in _start_upload while inserting" errors look more problematic, and seem similar to https://issues.apache.org/jira/browse/BEAM-7014. I suspect this is a rare flake, so I'm not sure whether it's what caused your job to fail (a job only fails when the same work item fails four times).

Could you contact Google Cloud Support so that they can look into your job?

I will mention this in the JIRA.