Scrapy - too many timeouts (more than half)

Asked: 2017-03-31 09:26:03

Tags: python web-scraping scrapy timeout scrapy-spider

I have a generic Spider class which gets instantiated with different URL lists (different domains).

So there is one spider instance for example.com, another for amazon.com, and so on.

The GenericSpider has DOWNLOAD_DELAY set to 0.5 seconds, to avoid getting banned or overloading anyone.

Each spider has 5 URLs on average (sometimes 50, sometimes just 1).

I can see a lot of timeout errors in the log. Could something be wrong with the spider settings? There are hundreds of these errors, and more than half of the requests end with the retry middleware giving up ("Gave up retrying").

EDIT: When I lower the number of spiders, there are almost no timeouts (0.5% of requests instead of 50%).
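One way to make sense of that observation: CrawlerProcess starts every scheduled spider at once, so the total number of in-flight requests scales with the spider count. A rough sketch of the arithmetic, with hypothetical numbers, plus one reactor knob that may be worth a try:

# Rough load estimate (hypothetical numbers): every spider scheduled on
# the CrawlerProcess below runs at the same time once start() is called.
n_spiders = 100                    # one spider instance per domain
per_spider = 10                    # CONCURRENT_REQUESTS per spider
print(n_spiders * per_spider)      # -> up to 1000 requests in flight

# They all share one Twisted reactor; DNS lookups go through its thread
# pool, which defaults to 10 threads. Raising it is one knob to try
# (an assumption on my part, not a confirmed fix):
settings = {'REACTOR_THREADPOOL_MAXSIZE': 50}  # Scrapy default: 10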

I am only downloading the HTML, so fetching it within 15 seconds should not be a problem.

In the log there are hundreds of entries like:

2017-03-31 11:08:50 [scrapy.downloadermiddlewares.retry] DEBUG: Gave up retrying ... User timeout caused connection failure.
2017-03-31 11:08:50 [scrapy.downloadermiddlewares.retry] DEBUG: Gave up retrying <GET ...> (failed 3 times): User timeout caused connection failure: Getting ... took longer than 15 seconds..
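For context, both numbers in those lines map directly onto settings. A minimal sketch of where they come from and how to loosen them (the subclass name and new values are illustrative, not tested):

class GenericSpiderTweaked(GenericSpider):
    # "failed 3 times" in the log = 1 original attempt + 2 retries,
    # because RETRY_TIMES defaults to 2. "15 seconds" comes from the
    # spider's download_timeout attribute, which takes precedence over
    # the DOWNLOAD_TIMEOUT setting.
    download_timeout = 30                  # double the per-request budget
    custom_settings = {'RETRY_TIMES': 4}   # retry more before giving up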

Here is my spider:

import scrapy


class GenericSpider(scrapy.Spider):
    # Per-request timeout, read by DownloadTimeoutMiddleware.
    download_timeout = 15
    name = 'will_be_overriden'
    custom_settings = {'CONCURRENT_REQUESTS': 10,
                       'DOWNLOAD_DELAY': 0.5}

    def __init__(self, occs_occurence_scanning_id_map_dict):
        # was super(GenericScraper, ...), which raises a NameError
        super(GenericSpider, self).__init__()
        occs = occs_occurence_scanning_id_map_dict[0]
        self.occurence_scanning_id_map_dict = occs_occurence_scanning_id_map_dict[1]
        self.name = occs[0].site.name
        self.occs = occs
        self.xpath = self.occs[0].site.xpaths.first().xpath

    def start_requests(self):
        for occ in self.occs:
            yield scrapy.Request(url=occ.url, callback=self.parse)

    def parse(self, response):
        successlog.debug('GOT RESPONSE')  # successlog: module-level logger defined elsewhere
        # HtmlXPathSelector is deprecated; response.xpath() does the same job.
        text_raw = response.xpath(self.xpath + '/text()')

        # DO SOME STUFF IN DATABASE

And this is how I run the scrapers:

def run_spiders():
    from scrapy.crawler import CrawlerProcess
    ...

    ua = UserAgent()  # from fake_useragent import UserAgent

    process = CrawlerProcess({'TELNETCONSOLE_ENABLED': 0,
                              'EXTENSIONS': {
                                  'scrapy.telnet.TelnetConsole': None
                              },
                              'LOG_FILE': 'scrapylog.log',
                              'CONCURRENT_REQUESTS': 10,
                              'ROBOTSTXT_OBEY': False,
                              'USER_AGENT': ua.chrome,
                              })
    for s in Site.objects.all():
        occs = s.occurences.all()
        if occs:
            occs_occurence_scanning_id_map_dict = (occs, occurence_scanning_id_map_dict)
            # All crawlers are scheduled up front and run concurrently
            # once process.start() is called. Was GenericScraper; renamed
            # to match the class definition above.
            process.crawl(occurence_spider.GenericSpider, occs_occurence_scanning_id_map_dict)

    process.start()
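To check whether the spider count itself is the problem, the crawls can be chained so they run one at a time instead of all at once. A minimal sketch using Scrapy's documented CrawlerRunner pattern, assuming the same GenericSpider and argument tuples as above:

from twisted.internet import reactor, defer
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging

configure_logging({'LOG_FILE': 'scrapylog.log'})
runner = CrawlerRunner({'CONCURRENT_REQUESTS': 10, 'ROBOTSTXT_OBEY': False})

@defer.inlineCallbacks
def crawl_sequentially(spider_args):
    # Each yield waits for one spider to finish before the next starts,
    # so at most one spider's requests are in flight at any time.
    for args in spider_args:
        yield runner.crawl(occurence_spider.GenericSpider, args)
    reactor.stop()

spider_args = []  # fill with the (occs, occurence_scanning_id_map_dict)
                  # tuples built in run_spiders() above
crawl_sequentially(spider_args)
reactor.run()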

0 Answers:

There are no answers yet.