I have a generic Spider class that gets instantiated with different lists of URLs (different domains), so there is one spider instance for example.com, another one for amazon.com, and so on.
GenericSpider has DOWNLOAD_DELAY set to 0.5 seconds so that I don't get banned or overload anyone.
Each spider has about 5 URLs on average (sometimes 50, sometimes just 1).
I can see a lot of Timeout errors in the log. Could there be something wrong with the spider settings? There are hundreds of these errors, and more than half of the requests end with the retry middleware giving up.
EDIT: when I lower the number of spiders, there are almost no timeouts (about 0.5% of requests instead of 50%).
I only download the HTML, so fetching it within 15 seconds should not be a problem.
In the log there are hundreds of entries like:
2017-03-31 11:08:50 [scrapy.downloadermiddlewares.retry] DEBUG: Gave up retrying ... User timeout caused connection failure.
2017-03-31 11:08:50 [scrapy.downloadermiddlewares.retry] DEBUG: Gave up retrying <GET ...> (failed 3 times): User timeout caused connection failure: Getting ... took longer than 15 seconds..
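The 15 seconds in those messages is the download_timeout = 15 set on the spider class below (as far as I understand it is the per-spider equivalent of Scrapy's DOWNLOAD_TIMEOUT setting), and the "failed 3 times" matches the default RETRY_TIMES of 2 retries. As a sanity check I could relax both, roughly like this (just a sketch, the values are arbitrary):

class GenericSpider(scrapy.Spider):
    # download_timeout is the per-spider equivalent of the DOWNLOAD_TIMEOUT setting
    download_timeout = 60  # arbitrary, only to rule out genuinely slow responses
    custom_settings = {'CONCURRENT_REQUESTS': 10,
                       'DOWNLOAD_DELAY': 0.5,
                       'RETRY_TIMES': 5}  # retry more times before giving up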
Here is my spider:
import scrapy
from scrapy.selector import HtmlXPathSelector  # old selector API used in parse() below

class GenericSpider(scrapy.Spider):
    download_timeout = 15
    name = 'will_be_overriden'
    custom_settings = {'CONCURRENT_REQUESTS': 10,
                       'DOWNLOAD_DELAY': 0.5}

    def __init__(self, occs_occurence_scanning_id_map_dict):
        super(GenericSpider, self).__init__()
        occs = occs_occurence_scanning_id_map_dict[0]
        self.occurence_scanning_id_map_dict = occs_occurence_scanning_id_map_dict[1]
        self.name = occs[0].site.name
        self.occs = occs
        self.xpath = self.occs[0].site.xpaths.first().xpath

    def start_requests(self):
        for occ in self.occs:
            yield scrapy.Request(url=occ.url, callback=self.parse)

    def parse(self, response):
        successlog.debug('GOT RESPONSE')  # successlog is a module-level logger defined elsewhere
        hxs = HtmlXPathSelector(response)
        text_raw = hxs.select(self.xpath + '/text()')
        # DO SOME STUFF IN DATABASE
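As an aside, HtmlXPathSelector is the old, deprecated selector API; the same parse() written against the current response.xpath() API would look roughly like this (a sketch only, the behaviour should be identical):

    def parse(self, response):
        successlog.debug('GOT RESPONSE')
        # response.xpath() replaces HtmlXPathSelector(response).select()
        text_raw = response.xpath(self.xpath + '/text()')
        # DO SOME STUFF IN DATABASE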
And this is how I run the scrapers:
def run_spiders():
    from scrapy.crawler import CrawlerProcess
    from fake_useragent import UserAgent  # assumed source of ua.chrome below
    ...
    ua = UserAgent()
    process = CrawlerProcess({'TELNETCONSOLE_ENABLED': 0,
                              "EXTENSIONS": {
                                  'scrapy.telnet.TelnetConsole': None
                              },
                              "LOG_FILE": 'scrapylog.log',
                              "CONCURRENT_REQUESTS": 10,
                              "ROBOTSTXT_OBEY": False,
                              "USER_AGENT": ua.chrome,
                              })
    for s in Site.objects.all():
        occs = s.occurences.all()
        if occs:
            occs_occurence_scanning_id_map_dict = (occs, occurence_scanning_id_map_dict)
            process.crawl(occurence_spider.GenericSpider, occs_occurence_scanning_id_map_dict)
    process.start()
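My suspicion is that the total load is the problem: every process.crawl() call adds another crawler with its own CONCURRENT_REQUESTS = 10, so with N sites up to 10 * N requests can be in flight from one machine at the same time, which would also explain why fewer spiders means far fewer timeouts. Is throttling the whole process the right direction? A minimal sketch of what I would try (these are standard Scrapy settings; the values are guesses, and the spider's custom_settings and download_timeout attribute would still take precedence over the process-level values, so they would need the same changes):

process = CrawlerProcess({
    'CONCURRENT_REQUESTS': 2,       # fewer parallel requests per crawler
    'DOWNLOAD_TIMEOUT': 30,         # allow more than 15 s for slow sites
    'AUTOTHROTTLE_ENABLED': True,   # adapt the delay to each remote server
    'LOG_FILE': 'scrapylog.log',
    'ROBOTSTXT_OBEY': False,
})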