I'm trying to get Scrapy to perform a broad crawl. The goal is to crawl many different domains concurrently while crawling each individual domain gently, so that the overall crawl speed stays high but the request frequency against each URL stays low.
Here is the spider I'm using:
import re
import scrapy
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from myproject.items import MyprojectItem

class testSpider(CrawlSpider):
    name = "testCrawler16"
    start_urls = [
        "http://example.com",
    ]

    extractor = SgmlLinkExtractor(deny=('.com', '.nl', '.org'),
                                  allow=('.se'))
    rules = (
        Rule(extractor, callback='parse_links', follow=True),
    )

    def parse_links(self, response):
        item = MyprojectItem()
        item['url'] = response.url
        item['depth'] = response.meta['depth']
        yield item
Here are the settings I'm using:
BOT_NAME = 'myproject'
SPIDER_MODULES = ['myproject.spiders']
NEWSPIDER_MODULE = 'myproject.spiders'
REACTOR_THREADPOOL_MAXSIZE = 20
RETRY_ENABLED = False
REDIRECT_ENABLED = False
DOWNLOAD_TIMEOUT = 15
LOG_LEVEL = 'INFO'
COOKIES_ENABLED = False
DEPTH_LIMIT = 10
AUTOTHROTTLE_ENABLED = True
CONCURRENT_REQUESTS = 10
CONCURRENT_REQUESTS_PER_DOMAIN = 1
AUTOTHROTTLE_TARGET_CONCURRENCY = 1
AUTOTHROTTLE_START_DELAY = 5
AUTOTHROTTLE_MAX_DELAY = 60
The problem is that after a while the crawler makes fewer and fewer requests and only alternates between a handful of domains, sometimes just one, so AutoThrottle slows the crawl down. How can I make the spider keep its concurrency, keep many separate connections to many domains, and use that concurrency to stay fast while keeping the request rate per domain low?
Answer 0 (score: 2)
For fast crawling I would suggest setting AUTOTHROTTLE_ENABLED to False and doing the gentle throttling yourself.

The only settings you need for this are CONCURRENT_REQUESTS, CONCURRENT_REQUESTS_PER_DOMAIN and DOWNLOAD_DELAY.

Set DOWNLOAD_DELAY to the delay you want between requests to the same domain, for example 10 if you want 6 requests per minute (one every 10 seconds).

Set CONCURRENT_REQUESTS_PER_DOMAIN to 1 so that the DOWNLOAD_DELAY interval above is respected for every domain.

Set CONCURRENT_REQUESTS to a high value; it could be the number of domains you plan to crawl, or higher. It only needs to be high enough not to interfere with the previous settings.
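Putting those recommendations together, a minimal settings sketch could look like the following (the value 100 for CONCURRENT_REQUESTS is only a placeholder for "number of domains or more"):

AUTOTHROTTLE_ENABLED = False
DOWNLOAD_DELAY = 10                  # one request per domain every 10 seconds (6 per minute)
CONCURRENT_REQUESTS_PER_DOMAIN = 1   # makes DOWNLOAD_DELAY apply per domain
CONCURRENT_REQUESTS = 100            # roughly the number of domains to crawl, or higher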
Answer 1 (score: 0)
You could keep a dictionary keyed by domain with a timestamp as the value and always pick the URL whose domain has the smallest (oldest) timestamp. Then either pop the URL from the list, or make the list global and remove entries from a pipeline.
import time

import scrapy
from scrapy import Spider

# PageItem and PageItemLoader come from the answer author's own project (not shown here)

class PoliteSpider(Spider):
    name = 'polite_spider'
    allowed_urls = ['*']

    custom_settings = {
        'ITEM_PIPELINES': {
            'pipelines.MainPipeline': 90,
        },
        'CONCURRENT_REQUESTS': 200,
        'CONCURRENT_REQUESTS_PER_DOMAIN': 25,
        'ROBOTSTXT_OBEY': False,
        'CONCURRENT_ITEMS': 100,
        'REACTOR_THREADPOOL_MAXSIZE': 400,
        # Hides printing item dicts
        'LOG_LEVEL': 'INFO',
        'RETRY_ENABLED': False,
        'REDIRECT_MAX_TIMES': 1,
        # Stops loading page after 5mb
        'DOWNLOAD_MAXSIZE': 5592405,
        # Grabs xpath before site finishes loading
        'DOWNLOAD_FAIL_ON_DATALOSS': False,
    }

    def __init__(self):
        # note: real requests need a scheme, e.g. http://
        self.links = ['www.test.com', 'www.different.org', 'www.pogostickaddict.net']
        # domain -> timestamp of the last response received from that domain
        self.domain_count = {}

    def start_requests(self):
        while self.links:
            # Prefer a URL whose domain has the oldest (smallest) timestamp; fall back
            # to the first pending link while no domain has been crawled yet
            oldest = min(self.domain_count, key=self.domain_count.get) if self.domain_count else ''
            url = next((x for x in self.links if oldest in x), self.links[0])
            request = scrapy.Request(url, callback=self.parse, dont_filter=True,
                                     meta={'time': time.time(), 'url': url})
            yield request

    def parse(self, response):
        domain = response.url.split('//')[-1].split('/')[0]
        self.domain_count[domain] = time.time()

        pageloader = PageItemLoader(PageItem(), response=response)
        pageloader.add_xpath('search_results', '//div[1]/text()')

        self.links.remove(response.meta['url'])
        yield pageloader.load_item()
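The snippet above removes each URL inside parse(); the answer also mentions doing the removal from a pipeline instead. A rough sketch of what the 'pipelines.MainPipeline' referenced in custom_settings could look like is shown below. It assumes, purely for illustration, that each item also carries the URL it was scraped from in a hypothetical 'url' field (the PageItem usage above does not show one):

# Rough sketch of the pipeline variant: drop the crawled URL from the spider's
# pending links once the item reaches the pipeline, instead of inside parse().
class MainPipeline:
    def process_item(self, item, spider):
        url = item.get('url')          # hypothetical field holding the crawled URL
        if url and url in getattr(spider, 'links', []):
            spider.links.remove(url)   # same bookkeeping as self.links.remove(...) in parse()
        return item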
A small example:
import time

test = {'www.test.com': 1, 'www.different.org': 2, 'www.pogostickaddict.net': 3}
links = ['www.test.com/blog', 'www.different.org/login', 'www.pogostickaddict.net/store/checkout']

url = next(x for x in links if max(test, key=test.get) in x)

print(time.time())
print(links)
print(url)
links.remove(url)
print(links)
print(time.time())
1549868251.3280149
['www.test.com/blog', 'www.different.org/login', 'www.pogostickaddict.net/store/checkout']
www.pogostickaddict.net/store/checkout
['www.test.com/blog', 'www.different.org/login']
1549868251.328043