I'm trying to override some settings for a spider that is run from a script, but the settings don't seem to take effect:
```python
from scrapy import log
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

from someproject.spiders import SomeSpider

spider = SomeSpider()

overrides = {
    'LOG_ENABLED': True,
    'LOG_STDOUT': True,
}

settings = get_project_settings()
settings.overrides.update(overrides)

log.start()

crawler = CrawlerProcess(settings)
crawler.install()
crawler.configure()
crawler.crawl(spider)
crawler.start()
```
In the spider:
```python
from scrapy.spider import BaseSpider

class SomeSpider(BaseSpider):
    def __init__(self):
        self.start_urls = ['http://somedomain.com']

    def parse(self, response):
        print 'some test'  # won't print anything
        exit(0)            # will normally exit, failing the crawler
```
With `LOG_ENABLED` and `LOG_STDOUT` set, I would expect to see the "some test" string in the log. Also, I can't seem to redirect the log to a `LOG_FILE` with the other settings I tried.

I must be doing something wrong... please help. =D
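For readers on Scrapy 1.0 or later, note that `settings.overrides` was removed; a minimal sketch of the same script against the newer `Settings.set` / `CrawlerProcess` API (untested against this project, adjust to your version):

```python
# Sketch of the same script on Scrapy 1.0+, where settings.overrides
# no longer exists; assumes the same someproject layout as above.
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

from someproject.spiders import SomeSpider

settings = get_project_settings()
settings.set('LOG_ENABLED', True)   # replaces settings.overrides[...]
settings.set('LOG_STDOUT', True)

process = CrawlerProcess(settings)
process.crawl(SomeSpider)  # pass the class; Scrapy instantiates it
process.start()            # starts the Twisted reactor and blocks
```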
Answer 0 (score: 0)
Use `log.msg('some test')` to print to the log.
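For example, inside the spider's callback (a sketch using the legacy `scrapy.log` API the question already imports; Scrapy 1.0+ replaced it with the stdlib `logging` module):

```python
from scrapy import log

def parse(self, response):
    # log.msg goes through Scrapy's log observer, so it honours
    # LOG_ENABLED / LOG_STDOUT / LOG_FILE, unlike a bare print
    log.msg('some test', level=log.INFO)
```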
Answer 1 (score: 0)
After starting the crawler, you may need to start Twisted's reactor:
```python
from twisted.internet import reactor
# ...other imports...

# ...set up the crawler...
crawler.start()
reactor.run()
```
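`reactor.run()` blocks until the reactor is stopped, so such a script usually also connects the `spider_closed` signal to `reactor.stop()`; a sketch, assuming the legacy pre-1.0 `Crawler` API used in the question:

```python
from scrapy import log, signals
from scrapy.crawler import Crawler
from scrapy.utils.project import get_project_settings
from twisted.internet import reactor

from someproject.spiders import SomeSpider

spider = SomeSpider()
crawler = Crawler(get_project_settings())
# stop the reactor once the spider finishes, so the script can exit
crawler.signals.connect(reactor.stop, signal=signals.spider_closed)
crawler.configure()
crawler.crawl(spider)
crawler.start()
log.start()
reactor.run()  # the script blocks here until spider_closed fires
```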
Related question with more code: Scrapy crawl from script always blocks script execution after scraping