Processing many WARC archives from CommonCrawl with Hadoop Streaming and MapReduce

Time: 2018-08-13 23:13:12

Tags: mapreduce boto3 hadoop-streaming common-crawl

I'm working on a project in which I need to download crawl data (from CommonCrawl) for specific URLs from an S3 bucket and then process that data.

Currently, I have a MapReduce job (Python via Hadoop Streaming) that gets the correct S3 file paths for a list of URLs. I then try to use a second MapReduce job to process this output by downloading the data from the commoncrawl S3 bucket. In the mapper, I use boto3 to download the gzipped content for a specific URL from the commoncrawl S3 bucket, and then output some information about the gzipped content (word counter information, content length, URLs linked to, etc.). The reducer then goes through this output to get the final word counts, URL lists, etc.
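To illustrate the shape of that second job, here is a stripped-down sketch of the streaming mapper loop (the real field layout of my index lines and the keys I emit are more involved; the tab-separated format below is just an assumption for the example):

import sys

# Hadoop Streaming feeds the first job's output to this mapper on stdin;
# each line is assumed here to be "warc_filename<TAB>offset<TAB>length".
for line in sys.stdin:
    filename, offset, length = line.rstrip('\n').split('\t')
    # ...download the byte range from S3 and parse the record here
    # (see the boto3 snippet further down), then emit key<TAB>value
    # pairs for the reducer, e.g.:
    print('content_length\t%s' % length)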

The output file from the first MapReduce job is only about 6 MB in size (though it will be larger once we scale to the full dataset). When I run the second MapReduce job, this file is only split twice. Normally that isn't a problem for such a small file, but the mapper code I described above (fetching the S3 data, spitting out the mapped output, etc.) takes a while to run for each URL. Since the file is only split twice, only 2 mappers run. I need to increase the number of splits so the mapping can finish faster.

I've tried setting "mapreduce.input.fileinputformat.split.maxsize" and "mapreduce.input.fileinputformat.split.minsize" for the MapReduce job, but it doesn't change the number of splits.

Here is some of the code from the mapper:

import gzip
import io

import boto3
from botocore import UNSIGNED
from botocore.config import Config

# Anonymous client; the commoncrawl bucket is publicly readable
s3 = boto3.client('s3', 'us-west-2', config=Config(signature_version=UNSIGNED))

# filename, offset, and length come from the first job's output (the WARC index)
offset_end = offset + length - 1

# Fetch only the byte range holding this record's gzipped WARC data
gz_file = s3.get_object(Bucket='commoncrawl', Key=filename,
                        Range='bytes=%s-%s' % (offset, offset_end))['Body'].read()

fileobj = io.BytesIO(gz_file)

with gzip.open(fileobj, 'rb') as file:
    [do stuff]

I also manually split the input file into multiple files with a maximum of 100 lines each. This had the desired effect of giving me more mappers, but then I started encountering ConnectionErrors from the s3client.get_object() call:

Traceback (most recent call last):
  File "dmapper.py", line 103, in <module>
    commoncrawl_reader(base_url, full_url, offset, length, warc_file)
  File "dmapper.py", line 14, in commoncrawl_reader
    gz_file = s3.get_object(Bucket='commoncrawl', Key=filename, Range='bytes=%s-%s' % (offset, offset_end))[
  File "/usr/lib/python3.6/site-packages/botocore/client.py", line 314, in _api_call
    return self._make_api_call(operation_name, kwargs)
  File "/usr/lib/python3.6/site-packages/botocore/client.py", line 599, in _make_api_call
    operation_model, request_dict)
  File "/usr/lib/python3.6/site-packages/botocore/endpoint.py", line 148, in make_request
    return self._send_request(request_dict, operation_model)
  File "/usr/lib/python3.6/site-packages/botocore/endpoint.py", line 177, in _send_request
    success_response, exception):
  File "/usr/lib/python3.6/site-packages/botocore/endpoint.py", line 273, in _needs_retry
    caught_exception=caught_exception, request_dict=request_dict)
  File "/usr/lib/python3.6/site-packages/botocore/hooks.py", line 227, in emit
    return self._emit(event_name, kwargs)
  File "/usr/lib/python3.6/site-packages/botocore/hooks.py", line 210, in _emit
    response = handler(**kwargs)
  File "/usr/lib/python3.6/site-packages/botocore/retryhandler.py", line 183, in __call__
    if self._checker(attempts, response, caught_exception):
  File "/usr/lib/python3.6/site-packages/botocore/retryhandler.py", line 251, in __call__
    caught_exception)
  File "/usr/lib/python3.6/site-packages/botocore/retryhandler.py", line 277, in _should_retry
    return self._checker(attempt_number, response, caught_exception)
  File "/usr/lib/python3.6/site-packages/botocore/retryhandler.py", line 317, in __call__
    caught_exception)
  File "/usr/lib/python3.6/site-packages/botocore/retryhandler.py", line 223, in __call__
    attempt_number, caught_exception)
  File "/usr/lib/python3.6/site-packages/botocore/retryhandler.py", line 359, in _check_caught_exception
    raise caught_exception
  File "/usr/lib/python3.6/site-packages/botocore/endpoint.py", line 222, in _get_response
    proxies=self.proxies, timeout=self.timeout)
  File "/usr/lib/python3.6/site-packages/botocore/vendored/requests/sessions.py", line 573, in send
    r = adapter.send(request, **kwargs)
  File "/usr/lib/python3.6/site-packages/botocore/vendored/requests/adapters.py", line 415, in send
    raise ConnectionError(err, request=request)
botocore.vendored.requests.exceptions.ConnectionError: ('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer'))

I'm currently running this with only a handful of URLs, but once it works I'll need to do it with several thousand (each with many subdirectories).

I'm not sure where to start in fixing this. I feel it's highly likely there's a better approach than what I'm trying. The fact that the mapper takes so long for each URL seems like a strong indication that I'm approaching this wrong. I should also mention that both the mapper and the reducer run fine if run directly as a pipe command:

"cat short_url_list.txt | python mapper.py | sort | python reducer.py" -> produces the desired output, but would take far too long to run on the entire list of URLs.

Any guidance would be greatly appreciated.

1 Answer:

Answer 0 (score: 0)

The MapReduce API provides NLineInputFormat. The property "mapreduce.input.lineinputformat.linespermap" lets you control at most how many lines (here: WARC records) are passed to a single mapper. It works with mrjob as well; cf. Ilya's WARC indexer.
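A minimal sketch of how this might look with mrjob, using its HADOOP_INPUT_FORMAT and JOBCONF hooks (the class name and mapper body are placeholders, not the actual indexer code, and 50 lines per mapper is an arbitrary choice):

from mrjob.job import MRJob

class WARCRangeFetcher(MRJob):
    # Hand each mapper at most 50 input lines instead of whole-file splits
    HADOOP_INPUT_FORMAT = 'org.apache.hadoop.mapred.lib.NLineInputFormat'
    JOBCONF = {'mapreduce.input.lineinputformat.linespermap': 50}

    def mapper(self, _, line):
        # With NLineInputFormat the line may arrive prefixed with its byte
        # offset; parse the index fields and fetch/process the WARC record.
        yield 'records_seen', 1

if __name__ == '__main__':
    WARCRangeFetcher.run()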

Regarding the S3 connection errors: it's better to run the job in the us-east-1 AWS region, where the data is located.
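Pointing the client at that region and giving it more retries may also smooth over the occasional connection reset. A sketch, assuming botocore's standard retry option (tune max_attempts as needed):

import boto3
from botocore import UNSIGNED
from botocore.config import Config

# Anonymous client in the bucket's home region, with extra retries so that
# transient connection resets are retried instead of killing the mapper.
s3 = boto3.client(
    's3',
    region_name='us-east-1',
    config=Config(signature_version=UNSIGNED,
                  retries={'max_attempts': 10}),
)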