I am writing a script that scrapes thousands of different web pages. Since the pages are usually unrelated (hosted on different sites), I use multithreading to speed up the scraping.
Edit: a short summary
-------
I load 300 URLs (HTML pages) into a pool of 300 workers. Since the size of each HTML page varies, the sum of the sizes is sometimes too large and Python raises:

    internal buffer error : Memory allocation failed : growing buffer

I would like to somehow detect when this is about to happen and make the workers wait until the buffer is no longer full.
-------
This approach works, but sometimes Python starts throwing:
    internal buffer error : Memory allocation failed : growing buffer
    internal buffer error : Memory allocation failed : growing buffer
    internal buffer error : Memory allocation failed : growing buffer
    internal buffer error : Memory allocation failed : growing buffer
    internal buffer error : Memory allocation failed : growing buffer
    internal buffer error : Memory allocation failed : growing buffer
    internal buffer error : Memory allocation failed : growing buffer
    internal buffer error : Memory allocation failed : growing buffer
    internal buffer error : Memory allocation failed : growing buffer
into the console. I guess this happens because the HTML pages I keep in memory can add up to 300 * (e.g. 1 MB) = 300 MB.
Edit

I know I can reduce the number of workers, and I will. But that is not a solution; it only lowers the chance of hitting this error. I want to avoid the error entirely...
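For illustration (not part of the original question): instead of hard-coding 300 workers, the pool size could be derived from how much RAM is actually free. This is a rough sketch with a hypothetical helper, `max_workers_for_memory`; it assumes Linux (`os.sysconf`) and that each worker holds roughly one page of `avg_page_bytes` at a time:

```python
import os

def max_workers_for_memory(avg_page_bytes, reserve_fraction=0.5, cap=300):
    # Hypothetical helper: estimate how many workers fit in available
    # RAM, assuming each worker holds about one page at a time.
    # Linux-specific: free physical pages * page size = available bytes.
    available = os.sysconf('SC_AVPHYS_PAGES') * os.sysconf('SC_PAGE_SIZE')
    budget = int(available * reserve_fraction)
    return max(1, min(cap, budget // avg_page_bytes))
```

With an average page around 1 MB, `max_workers_for_memory(1200000)` would cap the pool well below 300 on a memory-constrained machine.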
I started logging the HTML sizes:

    ram_logger.debug('SIZE: {}'.format(sys.getsizeof(html)))
The results are (in part):
    2017-03-05 13:02:04,914 DEBUG SIZE: 243940
    2017-03-05 13:02:05,023 DEBUG SIZE: 138384
    2017-03-05 13:02:05,026 DEBUG SIZE: 1185964
    2017-03-05 13:02:05,141 DEBUG SIZE: 1203715
    2017-03-05 13:02:05,213 DEBUG SIZE: 291415
    2017-03-05 13:02:05,213 DEBUG SIZE: 287030
    2017-03-05 13:02:05,224 DEBUG SIZE: 1192165
    2017-03-05 13:02:05,230 DEBUG SIZE: 1193751
    2017-03-05 13:02:05,234 DEBUG SIZE: 359193
    2017-03-05 13:02:05,247 DEBUG SIZE: 23703
    2017-03-05 13:02:05,252 DEBUG SIZE: 24606
    2017-03-05 13:02:05,275 DEBUG SIZE: 302388
    2017-03-05 13:02:05,329 DEBUG SIZE: 334925
Here is my simplified scraping method:
    def scrape_chunk(chunk):
        pool = Pool(300)
        results = pool.map(scrape_chunk_item, chunk)
        pool.close()
        pool.join()
        return results

    def scrape_chunk_item(item):
        root_result = _load_root(item.get('url'))
        # parse using xpath and return
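As an aside (not from the original post): `pool.map` holds all 300 results in memory at once. One way to lower the peak is to stream results with `imap_unordered` and consume each one as it arrives. A minimal sketch, with `func` standing in for `scrape_chunk_item`:

```python
from multiprocessing.pool import ThreadPool

def scrape_streaming(func, items, workers=50):
    # imap_unordered yields results lazily, so only a handful of parsed
    # pages live in memory at any moment if each one is handled right away.
    pool = ThreadPool(workers)
    results = []
    try:
        for result in pool.imap_unordered(func, items):
            results.append(result)  # or write each result to disk instead
    finally:
        pool.close()
        pool.join()
    return results
```

Appending to a list still accumulates everything, so the real saving comes from writing each result out (to disk or a queue) instead of keeping it.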
And the function that loads the HTML:
    def _load_root(url):
        for i in xrange(settings.ENGINE_NUMBER_OF_CONNECTION_ATTEMPTS):
            try:
                headers = requests.utils.default_headers()
                headers['User-Agent'] = ua.chrome
                r = requests.get(url, headers=headers, timeout=(settings.ENGINE_SCRAPER_REQUEST_TIMEOUT + i, 10 + i), verify=False)
                r.raise_for_status()
            except requests.Timeout:
                if i >= settings.ENGINE_NUMBER_OF_CONNECTION_ATTEMPTS - 1:
                    tb = traceback.format_exc()
                    return {'success': False, 'root': None, 'error': 'timeout', 'traceback': tb}
            except Exception:
                tb = traceback.format_exc()
                return {'success': False, 'root': None, 'error': 'unknown_error', 'traceback': tb}
            else:
                break
        r.encoding = 'utf-8'
        html = r.content
        ram_logger.debug('SIZE: {}'.format(sys.getsizeof(html)))
        try:
            root = etree.fromstring(html, etree.HTMLParser())
        except Exception:
            tb = traceback.format_exc()
            return {'success': False, 'root': None, 'error': 'root_error', 'traceback': tb}
        return {'success': True, 'root': root}
Do you know how to make this safe? Something that would make the workers wait whenever there is a risk of the buffer overflowing?
Answer 0 (score: 1)
You could limit each worker so that it only starts when at least X memory is available... untested:
total_mem could also be computed automatically, so you would not have to guess the right value for every machine...
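Something along those lines (an untested sketch, not the answer's exact code) could look like this: each worker calls a hypothetical `wait_for_memory()` before downloading and blocks until enough RAM is free. `available_bytes` is Linux-specific; `psutil.virtual_memory().available` would be a portable replacement:

```python
import os
import threading
import time

_gate = threading.Lock()

def available_bytes():
    # Free physical memory on Linux: free pages * page size.
    return os.sysconf('SC_AVPHYS_PAGES') * os.sysconf('SC_PAGE_SIZE')

def wait_for_memory(headroom_bytes=300 * 1024 * 1024, poll_seconds=0.5):
    # Block the calling worker until at least headroom_bytes of RAM is
    # free. The lock keeps the workers from all waking at the same time
    # and collectively overshooting the memory budget.
    with _gate:
        while available_bytes() < headroom_bytes:
            time.sleep(poll_seconds)
```

Each worker would call `wait_for_memory()` at the top of `_load_root`, before `requests.get`, so downloads pause instead of crashing when memory runs low.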