How do I hold further requests until my function completes, and then resume the request queue in Scrapy?

Date: 2013-05-17 07:44:56

Tags: python scrapy

I am new to Scrapy and Python, and I am using Scrapy 0.17.0. I have set up a crawler on a site that serves me a captcha page after a number of requests. I have configured 10 concurrent requests. Now, when I get the captcha page, I want to hold all further requests until I have downloaded the captcha image and solved it.

Once the captcha is solved, I want to resume my request queue, but I don't know how to pause the request queue. I added a sleep when I get a 302 status (which is the captcha page), but that does not work.
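Conceptually, what I am after is something like the sketch below. I have not verified that driving the engine from inside a spider like this is supported; engine.pause() and engine.unpause() are the methods the telnet console exposes, and solveCaptcha is a placeholder for my own solving code:

    # Sketch only (unverified): pause the engine while the captcha is
    # solved, then resume. self.crawler is set on the spider by Scrapy;
    # solveCaptcha is a placeholder, not a real function.
    def handleCaptcha(self, response):
        self.crawler.engine.pause()    # stop scheduling further downloads
        solveCaptcha(response)         # placeholder: fetch and solve the image
        self.crawler.engine.unpause()  # resume the request queue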

Below is my settings.py:

    BOT_NAME = 'testBot'
    SPIDER_MODULES = ['testCrawler.spiders']
    NEWSPIDER_MODULE = 'testCrawler.spiders'

    CONCURRENT_REQUESTS_PER_DOMAIN = 10
    CONCURRENT_SPIDERS = 5

    DOWNLOAD_DELAY = 5
    COOKIES_ENABLED = False  # must be a boolean; the string 'false' is truthy and would leave cookies enabled

    # SET USER AGENTS LIST
    USER_AGENTS = ['Mozilla/4.0  (compatible; MSIE 6.0; Windows NT 5.1; SV1; BTRS106490)',
                'Mozilla/4.0  (compatible; MSIE 7.0; Windows NT 6.2; .NET4.0E; .NET4.0C)',
                'Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.2; Trident/6.0)',
                'Mozilla/5.0 (X11; Linux i686; rv:8.0) Gecko/20100101 Firefox/8.0']

    PROXIES = ['http://192.168.100.225:8123']

    DOWNLOADDELAYLIST = ['3', '4', '6', '5']

    RETRY_TIMES = 20
    RETRY_HTTP_CODES = [500, 502, 503, 504, 400, 408, 302]
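(USER_AGENTS, PROXIES and DOWNLOADDELAYLIST are not built-in Scrapy settings; they are custom keys that my downloader middleware reads. A simplified sketch of such a middleware follows; the class is a stand-in rather than my exact code, and the details depend on the Scrapy version:)

    import random

    # Sketch of a downloader middleware consuming the custom USER_AGENTS
    # and PROXIES settings above.
    class RandomProxyUserAgentMiddleware(object):

        def __init__(self, settings):
            self.user_agents = settings.getlist('USER_AGENTS')
            self.proxies = settings.getlist('PROXIES')

        @classmethod
        def from_crawler(cls, crawler):
            return cls(crawler.settings)

        def process_request(self, request, spider):
            # Rotate the User-Agent header and route through a random proxy;
            # request.meta['proxy'] is honoured by the built-in HttpProxyMiddleware.
            request.headers['User-Agent'] = random.choice(self.user_agents)
            request.meta['proxy'] = random.choice(self.proxies)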

And here is my spider:

    import time
    import re
    from scrapy.http import Request
    from scrapy.selector import HtmlXPathSelector
    from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
    from scrapy.contrib.spiders import CrawlSpider, Rule
    from testCrawler.items import linkItem
    from testCrawler.imageItems import linkImageItem

    class CategorySpider(CrawlSpider):
        name = 'categoryLink'
        allowed_domains = ['somedomail.com']
        start_urls = ['http://somesite.com/topsearches']

        def parse(self, response):
            self.state['items_count'] = self.state.get('items_count', 0) + 1
            self.logCaptchaPages(response.status, response.url)

            hxs = HtmlXPathSelector(response)
            catLinks = hxs.select('//div[@class="topsearcheschars"]/a/@href').extract()

            for catLink in catLinks:
                if re.match('(.*?)/[0-9]+$', catLink):
                    continue
                else:
                    yield Request(catLink, callback=self.alphaDetailPage)

        def alphaDetailPage(self, aResponse):
            self.logCaptchaPages(aResponse.status, aResponse.url)
            hxs = HtmlXPathSelector(aResponse)
            pageLinks = hxs.select('//div[@class="topsearcheschars"]/a/@href').extract()
            dtlLinks = hxs.select('//div[@class="topsearches"]/a/@href').extract()

            for dtlLink in dtlLinks:
                yield Request(dtlLink, callback=self.listPageLinks)

            for pageLink in pageLinks:
                if re.match('(.*?)/[0-9]+$', pageLink):
                    yield Request(pageLink,callback=self.pageDetail)

        def pageDetail(self, bResponse):
            self.logCaptchaPages(bResponse.status, bResponse.url)
            hxs = HtmlXPathSelector(bResponse)
            dtlLinks = hxs.select('//div[@class="topsearches"]/a/@href').extract()

            for dtlLink in dtlLinks:
                yield Request(dtlLink, callback=self.listPageLinks)

        def listPageLinks(self, lResponse):
            self.logCaptchaPages(lResponse.status, lResponse.url)
            hxs = HtmlXPathSelector(lResponse)
            similarSearchLinks = hxs.select('//a[@class="similar_search"]/@href').extract()

            if len(similarSearchLinks) > 0:
                for i in range(len(similarSearchLinks)):
                    yield Request(similarSearchLinks[i], callback=self.listPageLinks)

            itm = linkItem()
            titleList = hxs.select('//div[@id="h1-wrapper"]/h1/text()').extract()

            if len(titleList) > 0:
                itm['url'] = lResponse.url
                itm['title'] = titleList[0]
                yield itm
            else:
                yield

        def logCaptchaPages(self, statusCode, urlToLog):
            if statusCode == 302:
                yield Request(urlToLog, callback=self.downloadImage)
                time.sleep(10)

        def downloadImage(self, iResponse):
            hxs = HtmlXPathSelector(iResponse)
            imageUrl = hxs.select('//body/img/@src').extract()[0]
            itm = linkImageItem()
            itm['url'] = iResponse.url
            itm['image_urls'] = [imageUrl]
            yield itm

At the moment I am only testing downloading a single captcha image; once that works, I plan to call another function that will send a request to the captcha page with the captcha text. Once that captcha page is passed, I want to process the next request.
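For that follow-up step I have something like this sketch in mind (FormRequest is Scrapy's form-submission request; the field name and the callback are placeholders, since the real form fields depend on the site):

    from scrapy.http import FormRequest

    # Sketch: POST the solved captcha text back to the site. The field
    # name 'captcha' and the afterCaptchaPassed callback are placeholders.
    def submitCaptcha(self, response, solvedText):
        return FormRequest.from_response(response,
                                         formdata={'captcha': solvedText},
                                         callback=self.afterCaptchaPassed)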

Any ideas why this isn't working?

Am I doing something wrong in this situation? Can anyone point out where I went wrong?

Any help is greatly appreciated. Thanks :)

1 Answer:

Answer 0 (score: 0):

You could try swapping time.sleep(10) and yield Request(urlToLog, callback=self.downloadImage) in the logCaptchaPages method, so that the Request is only yielded after the 10-second pause:

    def logCaptchaPages(self, statusCode, urlToLog):
        if statusCode == 302:
            print "Got CAPTCHA page"
            time.sleep(10)
            yield Request(urlToLog, callback=self.downloadImage)
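
One more thing worth noting: because logCaptchaPages contains a yield statement, Python treats it as a generator. Calling self.logCaptchaPages(...) as a plain statement only creates a generator object and never executes the body, so neither the sleep nor the Request ever runs. The calling callbacks would have to re-yield its output, for example:

    def parse(self, response):
        self.state['items_count'] = self.state.get('items_count', 0) + 1
        # Re-yield whatever logCaptchaPages produces; without this loop
        # the generator body never runs.
        for request in self.logCaptchaPages(response.status, response.url):
            yield request
        # ... rest of parse as before ...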