Processing many WARC archives from CommonCrawl with Hadoop Streaming and MapReduce

Time: 2018-08-13 23:13:12

Tags: mapreduce boto3 hadoop-streaming common-crawl

I'm working on a project in which I need to download crawl data (from CommonCrawl) for specific URLs from an S3 bucket and then process that data.

Currently, I have a MapReduce job (Python via Hadoop Streaming) that gets the correct S3 file paths for a list of URLs. I then try to use a second MapReduce job to process this output by downloading the data from the commoncrawl S3 bucket. In the mapper, I use boto3 to download the gzipped content for a specific URL from the commoncrawl S3 bucket, and then output some information about the gzipped content (word counter information, content length, URLs linked to, etc.). The reducer then goes through this output to get the final word counts, URL lists, etc.
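To illustrate the shape of that second job, here is a stripped-down sketch of the streaming mapper loop (the real field layout of my index lines and the keys I emit are more involved; the tab-separated format below is just an assumption for the example):

import sys

# Hadoop Streaming feeds the first job's output to this mapper on stdin;
# each line is assumed here to be "warc_filename<TAB>offset<TAB>length".
for line in sys.stdin:
    filename, offset, length = line.rstrip('\n').split('\t')
    # ...download the byte range from S3 and parse the record here
    # (see the boto3 snippet further down), then emit key<TAB>value
    # pairs for the reducer, e.g.:
    print('content_length\t%s' % length)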

The output file from the first MapReduce job is only about 6 MB in size (though it will be larger once we scale to the full dataset). When I run the second MapReduce job, this file is only split twice. Normally that isn't a problem for such a small file, but the mapper code I described above (fetching the S3 data, spitting out the mapped output, etc.) takes a while to run for each URL. Since the file is only split twice, only 2 mappers run. I need to increase the number of splits so the mapping can finish faster.

I've tried setting "mapreduce.input.fileinputformat.split.maxsize" and "mapreduce.input.fileinputformat.split.minsize" for the MapReduce job, but it doesn't change the number of splits.

Here is some of the code from the mapper:

import gzip
import io

import boto3
from botocore import UNSIGNED
from botocore.config import Config

# Anonymous client; the commoncrawl bucket is publicly readable
s3 = boto3.client('s3', 'us-west-2', config=Config(signature_version=UNSIGNED))

# filename, offset, and length come from the first job's output (the WARC index)
offset_end = offset + length - 1

# Fetch only the byte range holding this record's gzipped WARC data
gz_file = s3.get_object(Bucket='commoncrawl', Key=filename,
                        Range='bytes=%s-%s' % (offset, offset_end))['Body'].read()

fileobj = io.BytesIO(gz_file)

with gzip.open(fileobj, 'rb') as file:
    [do stuff]

I also manually split the input file into multiple files with a maximum of 100 lines each. This had the desired effect of giving me more mappers, but then I started encountering ConnectionErrors from the s3client.get_object() call:

Traceback (most recent call last):
  File "dmapper.py", line 103, in <module>
    commoncrawl_reader(base_url, full_url, offset, length, warc_file)
  File "dmapper.py", line 14, in commoncrawl_reader
    gz_file = s3.get_object(Bucket='commoncrawl', Key=filename, Range='bytes=%s-%s' % (offset, offset_end))[
  File "/usr/lib/python3.6/site-packages/botocore/client.py", line 314, in _api_call
    return self._make_api_call(operation_name, kwargs)
  File "/usr/lib/python3.6/site-packages/botocore/client.py", line 599, in _make_api_call
    operation_model, request_dict)
  File "/usr/lib/python3.6/site-packages/botocore/endpoint.py", line 148, in make_request
    return self._send_request(request_dict, operation_model)
  File "/usr/lib/python3.6/site-packages/botocore/endpoint.py", line 177, in _send_request
    success_response, exception):
  File "/usr/lib/python3.6/site-packages/botocore/endpoint.py", line 273, in _needs_retry
    caught_exception=caught_exception, request_dict=request_dict)
  File "/usr/lib/python3.6/site-packages/botocore/hooks.py", line 227, in emit
    return self._emit(event_name, kwargs)
  File "/usr/lib/python3.6/site-packages/botocore/hooks.py", line 210, in _emit
    response = handler(**kwargs)
  File "/usr/lib/python3.6/site-packages/botocore/retryhandler.py", line 183, in __call__
    if self._checker(attempts, response, caught_exception):
  File "/usr/lib/python3.6/site-packages/botocore/retryhandler.py", line 251, in __call__
    caught_exception)
  File "/usr/lib/python3.6/site-packages/botocore/retryhandler.py", line 277, in _should_retry
    return self._checker(attempt_number, response, caught_exception)
  File "/usr/lib/python3.6/site-packages/botocore/retryhandler.py", line 317, in __call__
    caught_exception)
  File "/usr/lib/python3.6/site-packages/botocore/retryhandler.py", line 223, in __call__
    attempt_number, caught_exception)
  File "/usr/lib/python3.6/site-packages/botocore/retryhandler.py", line 359, in _check_caught_exception
    raise caught_exception
  File "/usr/lib/python3.6/site-packages/botocore/endpoint.py", line 222, in _get_response
    proxies=self.proxies, timeout=self.timeout)
  File "/usr/lib/python3.6/site-packages/botocore/vendored/requests/sessions.py", line 573, in send
    r = adapter.send(request, **kwargs)
  File "/usr/lib/python3.6/site-packages/botocore/vendored/requests/adapters.py", line 415, in send
    raise ConnectionError(err, request=request)
botocore.vendored.requests.exceptions.ConnectionError: ('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer'))

I'm currently running this with only a handful of URLs, but once it works I'll need to do it with several thousand (each with many subdirectories).

I'm not sure where to start in fixing this. I feel it's highly likely there's a better approach than what I'm trying. The fact that the mapper takes so long for each URL seems like a strong indication that I'm approaching this wrong. I should also mention that both the mapper and the reducer run fine if run directly as a pipe command:

"cat short_url_list.txt | python mapper.py | sort | python reducer.py" -> produces the desired output, but would take far too long to run on the entire list of URLs.

Any guidance would be greatly appreciated.

1 Answer:

Answer 0 (score: 0)

The MapReduce API provides NLineInputFormat. The property "mapreduce.input.lineinputformat.linespermap" lets you control at most how many lines (here: WARC records) are passed to a single mapper. It works with mrjob as well; cf. Ilya's WARC indexer.
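A minimal sketch of how this might look with mrjob, using its HADOOP_INPUT_FORMAT and JOBCONF hooks (the class name and mapper body are placeholders, not the actual indexer code, and 50 lines per mapper is an arbitrary choice):

from mrjob.job import MRJob

class WARCRangeFetcher(MRJob):
    # Hand each mapper at most 50 input lines instead of whole-file splits
    HADOOP_INPUT_FORMAT = 'org.apache.hadoop.mapred.lib.NLineInputFormat'
    JOBCONF = {'mapreduce.input.lineinputformat.linespermap': 50}

    def mapper(self, _, line):
        # With NLineInputFormat the line may arrive prefixed with its byte
        # offset; parse the index fields and fetch/process the WARC record.
        yield 'records_seen', 1

if __name__ == '__main__':
    WARCRangeFetcher.run()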

Regarding the S3 connection errors: it's better to run the job in the us-east-1 AWS region, where the data is located.
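Pointing the client at that region and giving it more retries may also smooth over the occasional connection reset. A sketch, assuming botocore's standard retry option (tune max_attempts as needed):

import boto3
from botocore import UNSIGNED
from botocore.config import Config

# Anonymous client in the bucket's home region, with extra retries so that
# transient connection resets are retried instead of killing the mapper.
s3 = boto3.client(
    's3',
    region_name='us-east-1',
    config=Config(signature_version=UNSIGNED,
                  retries={'max_attempts': 10}),
)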