Able to change settings when running Scrapy from a script

Date: 2015-10-13 04:45:37

Tags: python scrapy

I want to run Scrapy from a single script, and I want to get all my settings from settings.py, but I'd like to be able to change a few of them:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

process = CrawlerProcess(get_project_settings())

# ### so what I'm missing here is being able to set or override one or two of the settings ###


# 'testspider' is the name of one of the spiders of the project.
process.crawl('testspider', domain='scrapinghub.com')
process.start() # the script will block here until the crawling is finished

I couldn't figure out how to do this. I tried the following:

settings = scrapy.settings.Settings()
settings.set('RETRY_TIMES', 10)

But it doesn't work: the new Settings object is created but never passed to the CrawlerProcess, so the crawl still runs with the unchanged project settings.

Note: I am using the latest version of Scrapy.

3 Answers:

Answer 0 (score: 4):

One way to override some settings is to set/override `custom_settings`, the spider's class-level attribute, in our script.

So I imported the spider's class and then overrode its `custom_settings`:

from testspiders.spiders.followall import FollowAllSpider

FollowAllSpider.custom_settings = {'RETRY_TIMES': 10}

So here is the whole script:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
from testspiders.spiders.followall import FollowAllSpider

FollowAllSpider.custom_settings = {'RETRY_TIMES': 10}
process = CrawlerProcess(get_project_settings())


# 'followall' is the name of the spider we imported above; the name passed to
# crawl() must match FollowAllSpider for the custom_settings override to apply.
process.crawl('followall', domain='scrapinghub.com')
process.start() # the script will block here until the crawling is finished
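The reason this works is that Scrapy merges a spider's `custom_settings` class attribute on top of the project-wide settings when the crawler for that spider is created, so assigning to the class attribute before `process.crawl(...)` takes effect. A toy sketch of that layering (not Scrapy's actual code) using plain dicts:

```python
# Toy model of how a spider's class-level custom_settings are layered
# over the project settings; names mirror the script above.
project_settings = {'RETRY_TIMES': 2, 'DOWNLOAD_DELAY': 1}

class FollowAllSpider:
    custom_settings = None  # default: no per-spider overrides

# Overriding the class attribute from a script affects the merge below.
FollowAllSpider.custom_settings = {'RETRY_TIMES': 10}

# Per-spider values win; untouched project settings pass through.
effective = {**project_settings, **(FollowAllSpider.custom_settings or {})}
print(effective)
```

Because `custom_settings` is a class attribute, the override applies to every crawl of that spider started afterwards in the same process.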

Answer 1 (score: 1):

For some reason the above script didn't work for me. Instead I wrote the following, and it works. Posting it in case anyone else runs into the same issue.

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

process = CrawlerProcess(get_project_settings())
process.settings.set('RETRY_TIMES', 10, priority='cmdline')

process.crawl('testspider', domain='scrapinghub.com')
process.start()
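The `priority='cmdline'` argument matters: Scrapy settings carry a per-setting priority, and a new value only replaces the current one if its priority is at least as high. `'cmdline'` outranks `'project'` (the priority of values loaded from settings.py), which is why this override sticks. A minimal pure-Python sketch of the mechanism (not Scrapy's actual implementation; the numeric levels mirror Scrapy's documented order):

```python
# Toy priority-based settings store: a write only lands if its priority
# is >= the priority of the value already stored.
PRIORITIES = {'default': 0, 'project': 20, 'cmdline': 40}

store = {}  # setting name -> (value, numeric priority)

def set_setting(name, value, priority='project'):
    p = PRIORITIES[priority]
    if name not in store or p >= store[name][1]:
        store[name] = (value, p)

set_setting('RETRY_TIMES', 2, 'project')   # e.g. loaded from settings.py
set_setting('RETRY_TIMES', 10, 'cmdline')  # script override: higher, wins
set_setting('RETRY_TIMES', 0, 'default')   # lower priority: ignored
print(store['RETRY_TIMES'])
```

Had the script called `set()` without a priority, the default `'project'` level would tie with the value already loaded from settings.py, making the outcome depend on write order; `'cmdline'` removes that ambiguity.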

Answer 2 (score: 0):

I ran into this problem myself and have a slightly different solution, using a modern Python (>= 3.5) approach:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

settings = {
    **get_project_settings(),
    'RETRY_TIMES': 2
}


process = CrawlerProcess(settings)
process.crawl('testspider', domain='scrapinghub.com')
process.start()
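This works because the Settings object returned by `get_project_settings()` behaves as a mapping, so `**` unpacking copies its items into a plain dict, where later keys overwrite earlier ones. The order is therefore essential, as this pure-dict illustration shows:

```python
# In a {**a, ...} dict literal, later keys overwrite earlier ones,
# so the override must appear after the unpacked defaults.
defaults = {'RETRY_TIMES': 2, 'DOWNLOAD_DELAY': 1}

merged = {**defaults, 'RETRY_TIMES': 10}       # override wins
wrong_order = {'RETRY_TIMES': 10, **defaults}  # defaults win back

print(merged['RETRY_TIMES'], wrong_order['RETRY_TIMES'])
```

One trade-off to note: the result is a plain dict, so any per-setting priorities are flattened before being handed to CrawlerProcess, unlike the `priority='cmdline'` approach above.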