I am trying to set up a pool of at most 10 concurrent downloads. The function should download the base URL, then parse every URL found on that page and download each of them, but the total number of concurrent downloads should never exceed 10.
from lxml import etree
import gevent
from gevent import monkey, pool
import requests
monkey.patch_all()
urls = [
    'http://www.google.com',
    'http://www.yandex.ru',
    'http://www.python.org',
    'http://stackoverflow.com',
    # ... another 100 urls
]

LINKS_ON_PAGE = []
POOL = pool.Pool(10)

def parse_urls(page):
    html = etree.HTML(page)
    if html:
        links = [link for link in html.xpath("//a/@href") if 'http' in link]
        # Download each url that appears on the main page
        for link in links:
            data = requests.get(link)
            LINKS_ON_PAGE.append('%s: %s bytes: %r' % (link, len(data.content), data.status_code))

def get_base_urls(url):
    # Download the main URL
    data = requests.get(url)
    parse_urls(data.content)
How can I organize this so that the downloads run concurrently, while keeping a single global pool limit on all web requests?
Answer 0 (score: 4)
gevent.pool limits the number of concurrent greenlets, not the number of connections.
You should use a session together with an HTTPAdapter:

connection_limit = 10
adapter = requests.adapters.HTTPAdapter(pool_connections=connection_limit,
                                        pool_maxsize=connection_limit)
session = requests.session()
session.mount('http://', adapter)
session.get('some url')

# or do your work with gevent
from gevent.pool import Pool

# The pool should be bigger than the connection limit if processing the data
# takes longer than downloading it, so the processing greenlets get a chance
# to run.
pool_size = 15
pool = Pool(pool_size)

for url in urls:
    pool.spawn(session.get, url)
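As a small variation on the loop above (still assuming the session, pool and urls from the snippet, and not part of the original answer), you can keep the greenlets returned by pool.spawn so you can wait for every download to finish and read the responses back:

jobs = [pool.spawn(session.get, url) for url in urls]
gevent.joinall(jobs)

# Greenlet.value holds whatever session.get returned (or None if it raised).
responses = [job.value for job in jobs if job.value is not None]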
Answer 1 (score: 4)
I think the following should do what you want. In my example I am using BeautifulSoup instead of the link-stripping code you had.
from bs4 import BeautifulSoup
import requests
import gevent
from gevent import monkey, pool
monkey.patch_all()

jobs = []
links = []
p = pool.Pool(10)

urls = [
    'http://www.google.com',
    # ... another 100 urls
]

def get_links(url):
    r = requests.get(url)
    if r.status_code == 200:
        soup = BeautifulSoup(r.text, 'html.parser')
        # collect every <a> tag on the page into the shared list
        links.extend(soup.find_all('a'))

for url in urls:
    jobs.append(p.spawn(get_links, url))

gevent.joinall(jobs)
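The snippet above only collects the links from the top-level pages. To also download every discovered link through the same 10-greenlet pool, as the question asks, one possible sketch (reusing p, urls, requests and BeautifulSoup from above; the names found, results, collect_links and fetch_link are mine, and filtering on an 'http' prefix mirrors the question's filter) is:

found = []     # hrefs discovered on the top-level pages
results = []   # "<url>: <size> bytes: <status>" strings, like LINKS_ON_PAGE in the question

def collect_links(url):
    r = requests.get(url)
    if r.status_code == 200:
        soup = BeautifulSoup(r.text, 'html.parser')
        for a in soup.find_all('a', href=True):
            if a['href'].startswith('http'):
                found.append(a['href'])

def fetch_link(href):
    r = requests.get(href)
    results.append('%s: %s bytes: %r' % (href, len(r.content), r.status_code))

# First pass: the top-level pages, at most 10 concurrent requests.
gevent.joinall([p.spawn(collect_links, url) for url in urls])

# Second pass: every discovered link, still at most 10 concurrent requests.
gevent.joinall([p.spawn(fetch_link, href) for href in found])

The two passes keep the pool from deadlocking: greenlets never spawn new work into the pool they are running in, so the global limit of 10 concurrent downloads holds throughout.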
Answer 2 (score: 0)
You should use gevent.queue to do this the right way.
Also, this (the eventlet examples) will help you understand the basic idea.
The gevent solution is similar to the eventlet one.
Keep in mind that you will need somewhere to store the URLs you have already visited so you don't go in circles, and you will need to introduce some limit so you don't run out of memory.
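The answer gives no code, so here is only a rough sketch of the queue-based idea it hints at, reusing the urls list from the question; the names TASKS, visited, worker and MAX_PAGES are mine, and the 5-second get timeout is just an arbitrary way to let workers exit once no more work arrives:

import gevent
from gevent import monkey, queue
monkey.patch_all()
import requests
from lxml import etree

TASKS = queue.Queue()    # URLs waiting to be downloaded
visited = set()          # URLs already seen, so nothing is fetched twice
MAX_PAGES = 500          # hard cap so the crawl (and memory use) stays bounded

def worker():
    while True:
        try:
            url = TASKS.get(timeout=5)   # give up when the queue stays empty
        except queue.Empty:
            return
        try:
            data = requests.get(url)
        except requests.RequestException:
            continue
        if not data.content:
            continue
        html = etree.HTML(data.content)
        if html is None:
            continue
        for link in html.xpath("//a/@href"):
            if link.startswith('http') and link not in visited and len(visited) < MAX_PAGES:
                visited.add(link)
                TASKS.put(link)

for url in urls:
    visited.add(url)
    TASKS.put(url)

# 10 workers means at most 10 requests are in flight at any moment.
gevent.joinall([gevent.spawn(worker) for _ in range(10)])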