Memory allocation failed: growing buffer - Python

Date: 2017-03-05 12:14:27

Tags: python html multithreading threadpool python-multithreading

I am writing a script that scrapes thousands of different web pages. Since these pages are usually unrelated (they come from different sites), I use multithreading to speed up the scraping.

EDIT: brief summary

-------

I load 300 URLs (HTML pages) into a pool of 300 workers. Since the size of each HTML document varies, the sum of the sizes can sometimes be too large, and Python raises: internal buffer error : Memory allocation failed : growing buffer. I would like to somehow check whether this is about to happen and, if so, make workers wait until the buffer is no longer full.

-------

This approach works, but sometimes Python starts throwing:

internal buffer error : Memory allocation failed : growing buffer
internal buffer error : Memory allocation failed : growing buffer
internal buffer error : Memory allocation failed : growing buffer
internal buffer error : Memory allocation failed : growing buffer
internal buffer error : Memory allocation failed : growing buffer
internal buffer error : Memory allocation failed : growing buffer
internal buffer error : Memory allocation failed : growing buffer
internal buffer error : Memory allocation failed : growing buffer
internal buffer error : Memory allocation failed : growing buffer
internal buffer error : Memory allocation failed : growing buffer

to the console. I suspect this is because of the size of the HTML documents I keep in memory, which can be 300 * (e.g. 1 MB) = 300 MB.

EDIT

I know I could reduce the number of workers, and I will. But that is not a real solution; it only lowers the chance of hitting this error. I want to avoid the error entirely...

I started logging the HTML sizes:

ram_logger.debug('SIZE: {}'.format(sys.getsizeof(html)))

结果是(部分):

2017-03-05 13:02:04,914 DEBUG SIZE: 243940
2017-03-05 13:02:05,023 DEBUG SIZE: 138384
2017-03-05 13:02:05,026 DEBUG SIZE: 1185964
2017-03-05 13:02:05,141 DEBUG SIZE: 1203715
2017-03-05 13:02:05,213 DEBUG SIZE: 291415
2017-03-05 13:02:05,213 DEBUG SIZE: 287030
2017-03-05 13:02:05,224 DEBUG SIZE: 1192165
2017-03-05 13:02:05,230 DEBUG SIZE: 1193751
2017-03-05 13:02:05,234 DEBUG SIZE: 359193
2017-03-05 13:02:05,247 DEBUG SIZE: 23703
2017-03-05 13:02:05,252 DEBUG SIZE: 24606
2017-03-05 13:02:05,275 DEBUG SIZE: 302388
2017-03-05 13:02:05,329 DEBUG SIZE: 334925
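As a back-of-the-envelope check (my own addition, using the sample sizes logged above): if all 300 workers simultaneously held a page as large as the biggest one observed, the raw HTML alone would peak near 344 MB, consistent with the 300 * 1 MB estimate above.

```python
# Sample of the logged HTML sizes, in bytes (taken from the log above).
sizes = [243940, 138384, 1185964, 1203715, 291415, 287030,
         1192165, 1193751, 359193, 23703, 24606, 302388, 334925]

workers = 300
max_size = max(sizes)                      # worst observed page: ~1.2 MB
peak_bytes = workers * max_size            # if every worker holds such a page
print(peak_bytes // (1024 * 1024), "MB")   # prints: 344 MB
```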

Here is my simplified scraping method:

def scrape_chunk(chunk):
    pool = Pool(300)
    results = pool.map(scrape_chunk_item, chunk)
    pool.close()
    pool.join()
    return results

def scrape_chunk_item(item):
    root_result = _load_root(item.get('url'))
    # parse using xpath and return

And the function that loads the HTML:

def _load_root(url):
    for i in xrange(settings.ENGINE_NUMBER_OF_CONNECTION_ATTEMPTS):
        try:
            headers = requests.utils.default_headers()
            headers['User-Agent'] = ua.chrome
            r = requests.get(url, headers=headers, timeout=(settings.ENGINE_SCRAPER_REQUEST_TIMEOUT + i, 10 + i), verify=False)
            r.raise_for_status()
        except requests.Timeout:
            if i >= settings.ENGINE_NUMBER_OF_CONNECTION_ATTEMPTS - 1:
                tb = traceback.format_exc()
                return {'success': False, 'root': None, 'error': 'timeout', 'traceback': tb}
        except Exception:
            tb = traceback.format_exc()
            return {'success': False, 'root': None, 'error': 'unknown_error', 'traceback': tb}
        else:
            break

    r.encoding = 'utf-8'
    html = r.content  # raw bytes; note that r.encoding only affects r.text, not r.content
    ram_logger.debug('SIZE: {}'.format(sys.getsizeof(html)))
    try:
        root = etree.fromstring(html, etree.HTMLParser())
    except Exception:
        tb = traceback.format_exc()
        return {'success': False, 'root': None, 'error': 'root_error', 'traceback': tb}

    return {'success': True, 'root': root}

Do you know how to make this safe? Something that would make the workers wait when there is a risk of the buffer overflowing?

1 Answer:

Answer 0 (score: 1)

You could throttle the workers so that each one starts only when X memory is available... untested:
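The answer's original code snippet was not preserved in this copy. A minimal sketch of the idea (my own, not the answer's code), assuming hypothetical budgets `total_mem` and `MAX_PAGE_SIZE`, is to cap the number of pages that may be held in memory at once with a bounded semaphore:

```python
import threading

# Hypothetical budgets -- tune for your machine:
total_mem = 300 * 1024 * 1024      # total bytes allowed for raw HTML at once
MAX_PAGE_SIZE = 2 * 1024 * 1024    # assumed worst-case size of one page

# At most total_mem // MAX_PAGE_SIZE pages may be held in memory at a time;
# acquire() makes a worker block until a slot frees up, instead of crashing.
mem_slots = threading.BoundedSemaphore(total_mem // MAX_PAGE_SIZE)

def with_memory_slot(func, *args):
    """Run func only while holding one memory slot."""
    with mem_slots:
        return func(*args)
```

A worker would then call `with_memory_slot(scrape_chunk_item, item)` instead of `scrape_chunk_item(item)`. Note this only helps if `Pool` is a thread pool (e.g. `multiprocessing.dummy.Pool`); a plain `threading` semaphore is not shared across separate processes.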


total_mem could also be computed automatically, so you would not have to guess the right value for each machine...
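For example (my addition, not from the answer), on Linux the budget can be derived from the free physical memory via `os.sysconf`, rather than hard-coded:

```python
import os

# Linux-only sketch: derive the HTML budget from currently free physical
# memory. The 0.5 factor is an arbitrary safety margin, leaving headroom
# for the parsed trees and the rest of the process.
page_size = os.sysconf('SC_PAGE_SIZE')
free_pages = os.sysconf('SC_AVPHYS_PAGES')
total_mem = int(page_size * free_pages * 0.5)
```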