I have behaviour common to several spiders that I want to run when the spider_idle signal is received, and I would like to move this behaviour into an extension.
My extension already listens successfully for the spider_opened and spider_closed signals. However, the spider_idle signal is not received.
Here is my extension (edited for brevity):
import logging
import MySQLdb
import MySQLdb.cursors
from scrapy import signals
from scrapy.http import Request

logger = logging.getLogger(__name__)

class MyExtension(object):
    def __init__(self, settings, stats):
        self.settings = settings
        self.stats = stats

    @classmethod
    def from_crawler(cls, crawler):
        # instantiate the extension object
        ext = cls(crawler.settings, crawler.stats)
        # connect the extension object to signals
        crawler.signals.connect(ext.spider_opened, signal=signals.spider_opened)
        crawler.signals.connect(ext.spider_closed, signal=signals.spider_closed)
        crawler.signals.connect(ext.spider_idle, signal=signals.spider_idle)
        # return the extension object
        return ext

    def spider_opened(self, spider):
        logger.info("start logging spider %s", spider.name)

    def spider_closed(self, spider):
        logger.info("end logging spider %s", spider.name)

    def spider_idle(self, spider):
        logger.info("idle logging spider %s", spider.name)
        # attempt to crawl orphaned products
        db = MySQLdb.connect(host=self.settings['AWS_RDS_HOST'],
                             port=self.settings['AWS_RDS_PORT'],
                             user=self.settings['AWS_RDS_USER'],
                             passwd=self.settings['AWS_RDS_PASSWD'],
                             db=self.settings['AWS_RDS_DB'],
                             cursorclass=MySQLdb.cursors.DictCursor,
                             use_unicode=True,
                             charset="utf8")
        c = db.cursor()
        c.execute("""SELECT p.url
                     FROM products p
                     LEFT JOIN product_data pd
                       ON p.id = pd.product_id AND pd.scrape_date = CURDATE()
                     WHERE p.website_id = %s AND pd.id IS NULL""",
                  (spider.website_id,))
        while True:
            product = c.fetchone()
            if product is None:
                break
            # record orphaned product
            self.stats.inc_value('orphaned_count')
            yield self.crawler.engine.crawl(Request(url=product['url'], callback=spider.parse_item), spider)
        db.close()
Why is the signal not received?
Update
As requested, here is some more information.
Here is my settings.py:
BOT_NAME = 'myproject'

SPIDER_MODULES = ['myproject.spiders']
NEWSPIDER_MODULE = 'myproject.spiders'

DOWNLOADER_MIDDLEWARES = {
    'rotating_proxies.middlewares.RotatingProxyMiddleware': 610,
    'rotating_proxies.middlewares.BanDetectionMiddleware': 620,
}

AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 2
AUTOTHROTTLE_MAX_DELAY = 30
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0

EXTENSIONS = {
    'myproject.extensions.MyExtension': 500,
}
I get the same problem whether or not the rotating_proxies middleware is enabled.
Here is an example spider I am testing with:
import scrapy
from furl import furl
from scrapy.http.request import Request
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from myproject.items import MyProjectItem

class ScrapeThisSpider(CrawlSpider):
    name = "scrapethis"
    website_id = 5

    custom_settings = {
        'IMAGES_STORE': 's3://myproject-dev/images/scrapethis/',
        'LOG_FILE': 'scrapethis.log',
        'LOG_LEVEL': 'DEBUG',
    }

    allowed_domains = ['scrapethis.co.uk']
    start_urls = ['https://www.scrapethis.co.uk']

    def start_requests(self):
        for url in self.start_urls:
            yield Request(url, dont_filter=True)

    def customise_url(link):
        f = furl(link)
        f.args['ar'] = '3'
        return f.url

    rules = (
        Rule(LinkExtractor(allow=(), deny=('articleId=',), process_value=customise_url)),
        Rule(LinkExtractor(allow=('articleId=',), process_value=customise_url), callback='parse_item'),
    )

    def parse_item(self, response):
        price = response.xpath("//article[@class='productbox'] //strong[@class='pricefield__price']//text()").extract()
        item = MyProjectItem()
        f = furl(response.url)
        item['id'] = f.args['articleId']
        item['spider'] = self.name
        item['price'] = price
        return item
Update 2
I think I have found what is causing the spider_idle signal handler to fail: in the handler I connect to an Amazon RDS database, query it, and process the results.
If I comment that code out, my handler runs (I get the log entries); if the query code is present, it does not run (or at least, I get no log entries).
This is strange, because the very first thing I do in the method is log the signal?
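For what it's worth, ordinary Python generator semantics could explain this, independently of Scrapy: a function whose body contains yield anywhere becomes a generator function, so calling it only creates a generator object and executes none of the body, not even the first log line. A minimal sketch:

def handler(spider):
    # never reached by a bare call: the yield below makes this a generator function
    print("idle logging spider %s" % spider)
    yield "anything"

result = handler("scrapethis")   # executes none of the body; prints nothing
print(result)                    # <generator object handler at 0x...>

As far as I can tell, Scrapy's signal dispatcher calls each connected handler but never iterates over whatever it returns, so a handler containing yield would silently do nothing.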
Update 3
I have found that if I remove the yield keyword from the query-results loop, it works correctly, but only 1 request is made. I need every URL returned by the query to be added to the crawler. (Apologies if I am asking anything stupid; I am still new to Python and Scrapy.)
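In case the intent helps, here is a minimal sketch of the shape I am aiming for. Assumptions: the crawler object is stored in from_crawler; fetch_orphaned_urls is a hypothetical stand-in for the MySQL query shown above; and the Scrapy version in use has an engine.crawl that takes (request, spider) (newer versions take only the request). The idea is to schedule each request directly on the engine instead of yielding, then raise DontCloseSpider so the spider stays open to process them:

from scrapy import signals
from scrapy.exceptions import DontCloseSpider
from scrapy.http import Request

class MyExtension(object):
    def __init__(self, crawler):
        self.crawler = crawler
        self.stats = crawler.stats

    @classmethod
    def from_crawler(cls, crawler):
        ext = cls(crawler)
        crawler.signals.connect(ext.spider_idle, signal=signals.spider_idle)
        return ext

    def spider_idle(self, spider):
        # no yield anywhere in this method, so the body actually executes
        urls = fetch_orphaned_urls(spider.website_id)  # hypothetical stand-in for the MySQL query
        if not urls:
            return  # nothing left to do; let the spider close normally
        for url in urls:
            self.stats.inc_value('orphaned_count')
            # schedule directly on the engine instead of yielding
            self.crawler.engine.crawl(
                Request(url, callback=spider.parse_item), spider)
        # keep the spider open so the newly scheduled requests get processed
        raise DontCloseSpider

Raising DontCloseSpider from a spider_idle handler is the documented way to prevent the spider from closing once new requests have been scheduled.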