Question

I use Selenium's Firefox driver to load and scrape web pages in some of the spiders in my Scrapy project.

Problem:
Selenium launches an instance of Firefox when I run any spider, even those in which I have not imported webdriver and have not called webdriver.Firefox().

Expected behavior:
Selenium should launch a Firefox instance only when I run a spider that uses webdriver.Firefox().

Why does this matter?
I quit the Firefox instance after a spider finishes, but obviously that does not happen in the spiders that do not use Selenium.

A spider that does not use Selenium
This spider does not use Selenium, and I expect it not to launch Firefox.

import scrapy


class MySpider(scrapy.Spider):
    name = "MySpider"
    domain = 'www.example.com'
    # allowed_domains takes bare domain names, not URLs
    allowed_domains = ['example.com']
    start_urls = ['http://example.com']

    def parse(self, response):
        for sel in response.css('.main-content'):
            # Article is a scrapy.Item subclass (defined elsewhere)
            item = Article()
            item['title'] = sel.css('h1::text').extract()[0]
            item['body'] = sel.css('p::text').extract()[0]
            yield item

Answer 1

The problem was actually how I instantiated webdriver.Firefox in the spiders that do use Selenium:

import scrapy
from selenium import webdriver


class MySpider(scrapy.Spider):
    # basic scrapy settings
    # class attribute: evaluated as soon as the class body is executed
    driver = webdriver.Firefox()

    def parse(self, response):
        self.driver.get(response.url)
        result = scrapy.Selector(text=self.driver.page_source)
        # scrape and yield items to the pipeline
        # then, under a certain condition:
        self.driver.quit()

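Why does this happen?
When a Scrapy command runs, Python imports the project's spider modules and executes every class body in them. So no matter which spider I try to run, a new webdriver.Firefox instance is started for every spider class that contains this line.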
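The difference between a class attribute and an attribute set in `__init__` can be reproduced without Scrapy or Selenium. The following is a minimal sketch: `make_driver`, `EagerSpider`, and `LazySpider` are hypothetical names, and `make_driver` merely stands in for `webdriver.Firefox()` by recording each "launch".

```python
launched = []


def make_driver(label):
    # Hypothetical stand-in for webdriver.Firefox(): records each "launch".
    launched.append(label)
    return object()


class EagerSpider:
    # Class attribute: this call runs while the class body is executed,
    # i.e. as soon as the module is imported.
    driver = make_driver("eager")


class LazySpider:
    def __init__(self):
        # Instance attribute: this call runs only when the class is instantiated.
        self.driver = make_driver("lazy")


# Neither class has been instantiated yet, but one "browser" already started.
launched_after_definition = list(launched)

LazySpider()
launched_after_instantiation = list(launched)

print(launched_after_definition)
print(launched_after_instantiation)
```

Defining `EagerSpider` is enough to trigger its driver, which mirrors what happens when the spider modules are imported; `LazySpider` starts its driver only once an instance is created.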

Solution:
I simply moved the webdriver instantiation into the class's __init__ method:

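def __init__(self, *args, **kwargs):
    # pass spider arguments through to scrapy.Spider
    super().__init__(*args, **kwargs)
    self.driver = webdriver.Firefox()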
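A related variation, offered only as a sketch, is to create the driver on first use and quit it in the spider's `closed(reason)` method, which Scrapy calls when a spider finishes. `LazyDriverMixin`, `create_driver`, `FakeDriver`, and `DemoSpider` are hypothetical names; `FakeDriver` replaces the real browser so the example runs without Selenium installed.

```python
class LazyDriverMixin:
    """Hypothetical helper: start the browser on first use, quit it on close."""

    _driver = None  # nothing is launched at class-definition time

    def create_driver(self):
        # Assumption: a real spider would override this, for example with
        #     from selenium import webdriver
        #     return webdriver.Firefox()
        raise NotImplementedError

    @property
    def driver(self):
        # Create the driver only the first time it is actually needed.
        if self._driver is None:
            self._driver = self.create_driver()
        return self._driver

    def closed(self, reason):
        # Quit the browser only if one was ever started.
        if self._driver is not None:
            self._driver.quit()
            self._driver = None


class FakeDriver:
    """Stand-in for a browser driver; only tracks whether quit() was called."""

    def __init__(self):
        self.quit_called = False

    def quit(self):
        self.quit_called = True


class DemoSpider(LazyDriverMixin):
    def create_driver(self):
        return FakeDriver()


spider = DemoSpider()
started_before_use = spider._driver is not None
first = spider.driver
same_instance = spider.driver is first
spider.closed("finished")
quit_after_close = first.quit_called
```

In a real project the mixin would be combined with `scrapy.Spider`, for example `class MySpider(LazyDriverMixin, scrapy.Spider)`. With this layout, spiders that never touch `self.driver` never start Firefox, and cleanup does not depend on a condition inside `parse`.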

Selenium runs the Firefox driver for spiders that do not use it
