I have a generic Spider class that gets instantiated with different lists of URLs (different domains), so there is one spider instance for example.com, another one for amazon.com, and so on.
GenericSpider has DOWNLOAD_DELAY set to 0.5 seconds so that I don't get banned or overload anyone.
Each spider has about 5 URLs on average (sometimes 50, sometimes just 1).
I can see a lot of Timeout errors in the log. Could there be something wrong with the spider settings? There are hundreds of these errors, and more than half of the requests end with the retry middleware giving up.
EDIT: when I lower the number of spiders, there are almost no timeouts (about 0.5% of requests instead of 50%).
I only download the HTML, so fetching it within 15 seconds should not be a problem.
In the log there are hundreds of entries like:
2017-03-31 11:08:50 [scrapy.downloadermiddlewares.retry] DEBUG: Gave up retrying ... User timeout caused connection failure.
2017-03-31 11:08:50 [scrapy.downloadermiddlewares.retry] DEBUG: Gave up retrying <GET ...> (failed 3 times): User timeout caused connection failure: Getting ... took longer than 15 seconds..
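The 15 seconds in those messages is the download_timeout = 15 set on the spider class below (as far as I understand it is the per-spider equivalent of Scrapy's DOWNLOAD_TIMEOUT setting), and the "failed 3 times" matches the default RETRY_TIMES of 2 retries. As a sanity check I could relax both, roughly like this (just a sketch, the values are arbitrary):

class GenericSpider(scrapy.Spider):
    # download_timeout is the per-spider equivalent of the DOWNLOAD_TIMEOUT setting
    download_timeout = 60  # arbitrary, only to rule out genuinely slow responses
    custom_settings = {'CONCURRENT_REQUESTS': 10,
                       'DOWNLOAD_DELAY': 0.5,
                       'RETRY_TIMES': 5}  # retry more times before giving up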
Here is my spider:
import scrapy
from scrapy.selector import HtmlXPathSelector  # old selector API used in parse() below

class GenericSpider(scrapy.Spider):
    download_timeout = 15
    name = 'will_be_overriden'
    custom_settings = {'CONCURRENT_REQUESTS': 10,
                       'DOWNLOAD_DELAY': 0.5}

    def __init__(self, occs_occurence_scanning_id_map_dict):
        super(GenericSpider, self).__init__()
        occs = occs_occurence_scanning_id_map_dict[0]
        self.occurence_scanning_id_map_dict = occs_occurence_scanning_id_map_dict[1]
        self.name = occs[0].site.name
        self.occs = occs
        self.xpath = self.occs[0].site.xpaths.first().xpath

    def start_requests(self):
        for occ in self.occs:
            yield scrapy.Request(url=occ.url, callback=self.parse)

    def parse(self, response):
        successlog.debug('GOT RESPONSE')  # successlog is a module-level logger defined elsewhere
        hxs = HtmlXPathSelector(response)
        text_raw = hxs.select(self.xpath + '/text()')
        # DO SOME STUFF IN DATABASE
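As an aside, HtmlXPathSelector is the old, deprecated selector API; the same parse() written against the current response.xpath() API would look roughly like this (a sketch only, the behaviour should be identical):

    def parse(self, response):
        successlog.debug('GOT RESPONSE')
        # response.xpath() replaces HtmlXPathSelector(response).select()
        text_raw = response.xpath(self.xpath + '/text()')
        # DO SOME STUFF IN DATABASE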
And this is how I run the scrapers:
def run_spiders():
    from scrapy.crawler import CrawlerProcess
    from fake_useragent import UserAgent  # assumed source of ua.chrome below
    ...
    ua = UserAgent()
    process = CrawlerProcess({'TELNETCONSOLE_ENABLED': 0,
                              "EXTENSIONS": {
                                  'scrapy.telnet.TelnetConsole': None
                              },
                              "LOG_FILE": 'scrapylog.log',
                              "CONCURRENT_REQUESTS": 10,
                              "ROBOTSTXT_OBEY": False,
                              "USER_AGENT": ua.chrome,
                              })
    for s in Site.objects.all():
        occs = s.occurences.all()
        if occs:
            occs_occurence_scanning_id_map_dict = (occs, occurence_scanning_id_map_dict)
            process.crawl(occurence_spider.GenericSpider, occs_occurence_scanning_id_map_dict)
    process.start()
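My suspicion is that the total load is the problem: every process.crawl() call adds another crawler with its own CONCURRENT_REQUESTS = 10, so with N sites up to 10 * N requests can be in flight from one machine at the same time, which would also explain why fewer spiders means far fewer timeouts. Is throttling the whole process the right direction? A minimal sketch of what I would try (these are standard Scrapy settings; the values are guesses, and the spider's custom_settings and download_timeout attribute would still take precedence over the process-level values, so they would need the same changes):

process = CrawlerProcess({
    'CONCURRENT_REQUESTS': 2,       # fewer parallel requests per crawler
    'DOWNLOAD_TIMEOUT': 30,         # allow more than 15 s for slow sites
    'AUTOTHROTTLE_ENABLED': True,   # adapt the delay to each remote server
    'LOG_FILE': 'scrapylog.log',
    'ROBOTSTXT_OBEY': False,
})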