I am new to Scrapy and Python. I am using Scrapy 0.17.0. I have set up a crawler against a site that starts sending me a captcha page after a certain number of requests. I have configured 10 concurrent requests. Now, when I get the captcha page, I want to hold off on any further requests until I have downloaded the captcha image and solved it.
Once the captcha is solved, I want to resume my request queue, but I don't know how to pause the queue. I added a sleep when I get a 302 status (which is the captcha page), but that does not work.
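For reference, Scrapy's execution engine does expose pause() and unpause() methods (the same ones the telnet console uses). Below is a minimal sketch of wrapping captcha solving between them, assuming the spider can reach the running engine through a self.crawler attribute (newer Scrapy versions attach it; on 0.17 you may have to wire up the crawler reference yourself) and assuming a hypothetical solve_captcha() helper:

def solveCaptchaAndResume(self, response):
    # Sketch only: pause the engine so no new downloads are scheduled
    # while the captcha is being solved, then resume the queue.
    self.crawler.engine.pause()              # stop scheduling new requests
    try:
        solution = solve_captcha(response)   # hypothetical solver helper
    finally:
        self.crawler.engine.unpause()        # resume the request queue
    return solution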
Below is my settings.py:
BOT_NAME = 'testBot'

SPIDER_MODULES = ['testCrawler.spiders']
NEWSPIDER_MODULE = 'testCrawler.spiders'

CONCURRENT_REQUESTS_PER_DOMAIN = 10
CONCURRENT_SPIDERS = 5
DOWNLOAD_DELAY = 5
COOKIES_ENABLED = False  # was the string 'false', which is not a valid boolean setting value

# SET USER AGENTS LIST (custom settings, presumably read by middlewares not shown here)
USER_AGENTS = ['Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; BTRS106490)',
               'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.2; .NET4.0E; .NET4.0C)',
               'Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.2; Trident/6.0)',
               'Mozilla/5.0 (X11; Linux i686; rv:8.0) Gecko/20100101 Firefox/8.0']
PROXIES = ['http://192.168.100.225:8123']
DOWNLOADDELAYLIST = ['3', '4', '6', '5']

RETRY_TIMES = 20
RETRY_HTTP_CODES = [500, 502, 503, 504, 400, 408, 302]
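One thing to note about these settings: by default Scrapy's RedirectMiddleware follows 302 responses before the retry middleware or any spider callback ever sees them, so listing 302 in RETRY_HTTP_CODES and checking response.status == 302 in the spider will normally have no effect. A minimal sketch of letting 302s through, using standard Scrapy knobs:

# Sketch: stop the redirect middleware from swallowing 302s, so they reach
# the retry middleware and the spider callbacks.
REDIRECT_ENABLED = False

# Alternatively, per spider, declare which status codes callbacks may receive:
#     handle_httpstatus_list = [302]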
And here is my spider:
import time
import re

from scrapy.http import Request
from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule
from testCrawler.items import linkItem
from testCrawler.imageItems import linkImageItem


class CategorySpider(CrawlSpider):
    name = 'categoryLink'
    allowed_domains = ['somedomail.com']
    start_urls = ['http://somesite.com/topsearches']

    def parse(self, response):
        self.state['items_count'] = self.state.get('items_count', 0) + 1
        self.logCaptchaPages(response.status, response.url)
        hxs = HtmlXPathSelector(response)
        catLinks = hxs.select('//div[@class="topsearcheschars"]/a/@href').extract()
        for catLink in catLinks:
            if re.match('(.*?)/[0-9]+$', catLink):
                continue
            else:
                yield Request(catLink, callback=self.alphaDetailPage)

    def alphaDetailPage(self, aResponse):
        self.logCaptchaPages(aResponse.status, aResponse.url)
        hxs = HtmlXPathSelector(aResponse)
        pageLinks = hxs.select('//div[@class="topsearcheschars"]/a/@href').extract()
        dtlLinks = hxs.select('//div[@class="topsearches"]/a/@href').extract()
        for dtlLink in dtlLinks:
            yield Request(dtlLink, callback=self.listPageLinks)
        for pageLink in pageLinks:
            if re.match('(.*?)/[0-9]+$', pageLink):
                yield Request(pageLink, callback=self.pageDetail)

    def pageDetail(self, bResponse):
        self.logCaptchaPages(bResponse.status, bResponse.url)
        hxs = HtmlXPathSelector(bResponse)
        dtlLinks = hxs.select('//div[@class="topsearches"]/a/@href').extract()
        for dtlLink in dtlLinks:
            yield Request(dtlLink, callback=self.listPageLinks)

    def listPageLinks(self, lResponse):
        self.logCaptchaPages(lResponse.status, lResponse.url)
        hxs = HtmlXPathSelector(lResponse)
        similarSearchLinks = hxs.select('//a[@class="similar_search"]/@href').extract()
        if len(similarSearchLinks) > 0:
            for i in range(len(similarSearchLinks)):
                yield Request(similarSearchLinks[i], callback=self.listPageLinks)
        itm = linkItem()
        titleList = hxs.select('//div[@id="h1-wrapper"]/h1/text()').extract()
        if len(titleList) > 0:
            itm['url'] = lResponse.url
            itm['title'] = titleList[0]
            yield itm
        else:
            yield

    def logCaptchaPages(self, statusCode, urlToLog):
        if statusCode == 302:
            yield Request(urlToLog, callback=self.downloadImage)
            time.sleep(10)

    def downloadImage(self, iResponse):
        hxs = HtmlXPathSelector(iResponse)
        imageUrl = hxs.select('//body/img/@src').extract()[0]
        itm = linkImageItem()
        itm['url'] = iResponse.url
        itm['image_urls'] = [imageUrl]
        yield itm
For now I am only testing the download of a single captcha image; once that works, I plan to call another function that will send a request to the captcha page with the solved captcha text. Once that captcha page is passed, I want to go on processing the next requests.
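A minimal sketch of that next step, assuming the captcha page contains a form with a field named captcha (the real field name and form layout would have to be read off the actual page):

from scrapy.http import FormRequest

def submitCaptcha(self, response, captchaText):
    # Hypothetical: post the solved captcha text back via the form on the
    # captcha page; 'captcha' is an assumed field name.
    return FormRequest.from_response(
        response,
        formdata={'captcha': captchaText},
        callback=self.afterCaptcha,  # hypothetical callback that resumes crawling
    )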
Any idea why this isn't working?
Am I doing something wrong here? Can anyone point out where the mistake is?
Any help is greatly appreciated. Thanks :)
Answer 0 (score: 0):
You could try swapping time.sleep(10) and yield Request(urlToLog, callback=self.downloadImage) in the logCaptchaPages method, so that your request is yielded after the 10-second pause:
def logCaptchaPages(self, statusCode, urlToLog):
    if statusCode == 302:
        print "Got CAPTCHA page"
        time.sleep(10)
        yield Request(urlToLog, callback=self.downloadImage)
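Even with that swap there is a more fundamental problem: logCaptchaPages contains a yield, which makes it a generator function. The callbacks call self.logCaptchaPages(...) without iterating the result, so its body never runs and the Request inside is never scheduled. A minimal sketch of consuming it from parse() (the same pattern applies to the other callbacks):

def parse(self, response):
    self.state['items_count'] = self.state.get('items_count', 0) + 1
    # logCaptchaPages is a generator: it must be iterated, otherwise the
    # Request it yields is silently discarded.
    for request in self.logCaptchaPages(response.status, response.url):
        yield request
    # ... rest of parse() unchanged ...

Also bear in mind that time.sleep() blocks Scrapy's single-threaded reactor, so it stalls every in-flight request, not just the captcha one; pausing the engine, as sketched near the top of the question, is a cleaner way to hold the queue.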