I'm working on a project where I need to download crawl data for specific URLs from an S3 bucket (from CommonCrawl) and then process that data.

Currently I have one MapReduce job (Python via Hadoop Streaming) that gets the correct S3 file paths for a list of URLs. I then try to use a second MapReduce job to process that output by downloading the data from the commoncrawl S3 bucket. In the mapper I use boto3 to download the gzipped content for a specific URL from the commoncrawl S3 bucket, and then emit some information about that content (word counter information, content length, URLs linked to, etc.). The reducer then goes through this output to get the final word counts, URL lists, and so on.

The output file from the first MapReduce job is only about 6 MB in size (but it will be much larger once we scale to the full dataset). When I run the second MapReduce job, this file is only split twice. Normally that is not a problem for such a small file, but the mapper code I described above (fetching the S3 data, emitting the mapped output, etc.) takes a while to run for each URL. Since the file is only split twice, only 2 mappers run. I need to increase the number of splits so the mapping can finish faster.

I have tried setting "mapreduce.input.fileinputformat.split.maxsize" and "mapreduce.input.fileinputformat.split.minsize" for the MapReduce job, but it does not change the number of splits.

Here is some of the code from the mapper:
import gzip, io
import boto3
from botocore import UNSIGNED
from botocore.client import Config

# Anonymous (unsigned) client for the public commoncrawl bucket
s3 = boto3.client('s3', 'us-west-2', config=Config(signature_version=UNSIGNED))
# Fetch only the byte range of the WARC file that holds this record
offset_end = offset + length - 1
gz_file = s3.get_object(Bucket='commoncrawl', Key=filename,
                        Range='bytes=%s-%s' % (offset, offset_end))['Body'].read()
fileobj = io.BytesIO(gz_file)
with gzip.open(fileobj, 'rb') as file:
    [do stuff]
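For completeness, the mapper gets its input from Hadoop Streaming on stdin and emits tab-separated key/value pairs on stdout for the reducer. A stripped-down sketch of that outer loop is below; the field order and the emitted key/value are illustrative rather than my exact format:

import sys

# Hadoop Streaming hands the mapper one input record per line on stdin.
# Field order here is illustrative; the real input is the first job's output.
for line in sys.stdin:
    full_url, warc_file, offset, length = line.rstrip('\n').split('\t')
    # ... fetch and parse the WARC record as shown above ...
    word_count = 0  # placeholder for whatever gets computed from the record
    # Emit tab-separated key/value pairs; the shuffle/sort groups them for the reducer.
    print('%s\t%d' % (full_url, word_count))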
I also tried manually splitting the input file into multiple files of at most 100 lines each. This had the intended effect of giving me more mappers, but then I started running into ConnectionErrors from the s3client.get_object() call:
Traceback (most recent call last):
File "dmapper.py", line 103, in <module>
commoncrawl_reader(base_url, full_url, offset, length, warc_file)
File "dmapper.py", line 14, in commoncrawl_reader
gz_file = s3.get_object(Bucket='commoncrawl', Key=filename, Range='bytes=%s-%s' % (offset, offset_end))[
File "/usr/lib/python3.6/site-packages/botocore/client.py", line 314, in _api_call
return self._make_api_call(operation_name, kwargs)
File "/usr/lib/python3.6/site-packages/botocore/client.py", line 599, in _make_api_call
operation_model, request_dict)
File "/usr/lib/python3.6/site-packages/botocore/endpoint.py", line 148, in make_request
return self._send_request(request_dict, operation_model)
File "/usr/lib/python3.6/site-packages/botocore/endpoint.py", line 177, in _send_request
success_response, exception):
File "/usr/lib/python3.6/site-packages/botocore/endpoint.py", line 273, in _needs_retry
caught_exception=caught_exception, request_dict=request_dict)
File "/usr/lib/python3.6/site-packages/botocore/hooks.py", line 227, in emit
return self._emit(event_name, kwargs)
File "/usr/lib/python3.6/site-packages/botocore/hooks.py", line 210, in _emit
response = handler(**kwargs)
File "/usr/lib/python3.6/site-packages/botocore/retryhandler.py", line 183, in __call__
if self._checker(attempts, response, caught_exception):
File "/usr/lib/python3.6/site-packages/botocore/retryhandler.py", line 251, in __call__
caught_exception)
File "/usr/lib/python3.6/site-packages/botocore/retryhandler.py", line 277, in _should_retry
return self._checker(attempt_number, response, caught_exception)
File "/usr/lib/python3.6/site-packages/botocore/retryhandler.py", line 317, in __call__
caught_exception)
File "/usr/lib/python3.6/site-packages/botocore/retryhandler.py", line 223, in __call__
attempt_number, caught_exception)
File "/usr/lib/python3.6/site-packages/botocore/retryhandler.py", line 359, in _check_caught_exception
raise caught_exception
File "/usr/lib/python3.6/site-packages/botocore/endpoint.py", line 222, in _get_response
proxies=self.proxies, timeout=self.timeout)
File "/usr/lib/python3.6/site-packages/botocore/vendored/requests/sessions.py", line 573, in send
r = adapter.send(request, **kwargs)
File "/usr/lib/python3.6/site-packages/botocore/vendored/requests/adapters.py", line 415, in send
raise ConnectionError(err, request=request)
botocore.vendored.requests.exceptions.ConnectionError: ('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer'))
I'm currently running this with only a handful of URLs, but once it's working I will need to do it with several thousand (each with many subdirectories).

I'm not sure where to start on fixing this, and I feel there is very likely a better approach than the one I'm trying. The fact that the mapper takes so long for each URL seems like a strong hint that I'm approaching this wrong. I should also mention that both the mapper and the reducer run fine when executed directly as a pipe command:

"cat short_url_list.txt | python mapper.py | sort | python reducer.py" -> produces the desired output, but would take far too long to run on the entire list of URLs.

Any guidance would be greatly appreciated.
Answer 0 (score: 0)
The MapReduce API provides NLineInputFormat. The property "mapreduce.input.lineinputformat.linespermap" lets you control how many lines (here: WARC records) at most are passed to a single mapper. It works with mrjob; see Ilya's WARC indexer.
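A rough sketch of how that might look with mrjob (the job class, the lines-per-map value, and the choice of the old-API org.apache.hadoop.mapred.lib.NLineInputFormat for streaming are assumptions to adapt to your setup):

from mrjob.job import MRJob

class CCRecordJob(MRJob):
    # Hand each mapper at most 50 input lines (WARC record pointers)
    # instead of letting Hadoop create only a couple of large splits.
    HADOOP_INPUT_FORMAT = 'org.apache.hadoop.mapred.lib.NLineInputFormat'
    JOBCONF = {'mapreduce.input.lineinputformat.linespermap': '50'}

    def mapper(self, _, line):
        # fetch the record from S3 and yield stats, as in the question's mapper
        yield line.split('\t')[0], 1

    def reducer(self, key, values):
        yield key, sum(values)

if __name__ == '__main__':
    CCRecordJob.run()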
Regarding the S3 connection error: it is best to run the job in the us-east-1 AWS region, which is where the data is located.
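For the ConnectionResetError itself, it may also help to make the S3 client more tolerant of dropped connections. A minimal sketch, assuming an unsigned client as in the question; the retry counts, backoff, and helper name are arbitrary:

import time
import boto3
from botocore import UNSIGNED
from botocore.client import Config

# Unsigned client pinned to the region hosting the commoncrawl bucket,
# with more botocore-level retries than the default.
s3 = boto3.client('s3', 'us-east-1',
                  config=Config(signature_version=UNSIGNED,
                                retries={'max_attempts': 10}))

def get_range_with_backoff(key, offset, length, tries=5):
    """Fetch a byte range, retrying with exponential backoff on failures."""
    byte_range = 'bytes=%d-%d' % (offset, offset + length - 1)
    for attempt in range(tries):
        try:
            return s3.get_object(Bucket='commoncrawl', Key=key,
                                 Range=byte_range)['Body'].read()
        except Exception:
            # Broad catch: the connection reset surfaces as different
            # exception classes depending on the botocore version.
            if attempt == tries - 1:
                raise
            time.sleep(2 ** attempt)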