Running a spider from the crawl command and from CrawlerProcess does not produce the same output

Time: 2017-07-11 20:12:03

Tags: python django scrapy

I implemented a Scrapy spider that I run with:

scrapy crawl myspider -a start_url='http://www.google.com'

Now I need to run that spider from a script (from a Django application, using django-rq, but that should not make any difference to this question).

So, following the CrawlerProcess documentation, I ended up with a script like this:

from scrapy.crawler import CrawlerProcess
from scrapy.settings import Settings
# project-specific imports (cotextractor_settings, MySpider) omitted

crawler_settings = Settings()
crawler_settings.setmodule(cotextractor_settings)

process = CrawlerProcess(settings=crawler_settings)
process.crawl(MySpider(start_url='http://www.google.com'))
process.start()

The problem is that, when run from the script, my crawl fails because the start_url argument is missing. After digging into both outputs, I noticed that the second run (from the script) prints the debug statement I placed in my spider's constructor twice.

Here is the constructor:

def __init__(self, *args, **kwargs):
    super(MySpider, self).__init__(*args, **kwargs)
    logger.debug(kwargs)  # logs the keyword arguments the spider was built with
    self.start_urls = [kwargs.get('start_url')]
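
For context, the surrounding spider looks roughly like this (the logging setup and the empty parse body are placeholders, not the real project code):

import logging

import scrapy

logger = logging.getLogger(__name__)

class MySpider(scrapy.Spider):
    name = 'myspider'

    def __init__(self, *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs)
        logger.debug(kwargs)  # every -a key=value from the CLI lands in kwargs
        self.start_urls = [kwargs.get('start_url')]

    def parse(self, response):
        pass  # extraction logic omitted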

Here is the crawl command output; note there is only one debug line:

2017-07-11 21:53:12 [scrapy.utils.log] INFO: Scrapy 1.4.0 started (bot: cotextractor)
2017-07-11 21:53:12 [scrapy.utils.log] INFO: Overridden settings: {'BOT_NAME': 'cotextractor', 'DUPEFILTER_CLASS': 'cotextractor.dupefilters.PersistentDupeFilter', 'NEWSPIDER_MODULE': 'cotextractor.spiders', 'SPIDER_MODULES': ['cotextractor.spiders'], 'USER_AGENT': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.110 Safari/537.36'}
2017-07-11 21:53:12 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.logstats.LogStats']
2017-07-11 21:53:12 [cotextractor.spiders.spiders] DEBUG: {'start_url': 'http://www.google.com'}
2017-07-11 21:53:13 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'cotextractor.middlewares.RotatingProxyMiddleware',
 'cotextractor.middlewares.BanDetectionMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2017-07-11 21:53:13 [scrapy.middleware] INFO: Enabled spider middlewares:
...

And finally here is the script output (from the django-rq worker); note the debug line appears twice, once populated and once empty:

2017-07-11 21:59:27 [scrapy.utils.log] INFO: Scrapy 1.4.0 started (bot: cotextractor)
2017-07-11 21:59:27 [scrapy.utils.log] INFO: Overridden settings: {'BOT_NAME': 'cotextractor', 'DUPEFILTER_CLASS': 'cotextractor.dupefilters.PersistentDupeFilter', 'NEWSPIDER_MODULE': 'cotextractor.spiders', 'SPIDER_MODULES': ['cotextractor.spiders'], 'USER_AGENT': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.110 Safari/537.36'}
2017-07-11 21:59:27 [cotextractor.spiders.spiders] DEBUG: {'start_url': 'http://www.google.com'}
2017-07-11 21:59:27 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.logstats.LogStats']
2017-07-11 21:59:27 [cotextractor.spiders.spiders] DEBUG: {}
2017-07-11 21:59:27 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'cotextractor.middlewares.RotatingProxyMiddleware',
 'cotextractor.middlewares.BanDetectionMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2017-07-11 21:59:27 [scrapy.middleware] INFO: Enabled spider middlewares:
...

My guess is that my script fails because the constructor is called twice: once with the arguments and once without. However, I cannot figure out why CrawlerProcess would trigger the spider constructor twice.

Thanks for your support,

1 Answer:

Answer 0 (score: 1)

OK.

As already explained here: Passing arguments to process.crawl in Scrapy python

I was not actually using the crawl method correctly. I do not need to pass a spider object, just the spider class (or its name)! So here is the script I should have used:

from scrapy.crawler import CrawlerProcess
from scrapy.settings import Settings
# project-specific imports (cotextractor_settings, MySpider) omitted

crawler_settings = Settings()
crawler_settings.setmodule(cotextractor_settings)

process = CrawlerProcess(settings=crawler_settings)
process.crawl(MySpider, start_url='http://www.google.com')
process.start()
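
For completeness, crawl can also take the spider's name as a plain string, in which case CrawlerProcess resolves the class through its spider loader via the SPIDER_MODULES setting. A minimal sketch, assuming the spider declares name = 'myspider' (the name used with the crawl command above):

# equivalent call by spider name; requires SPIDER_MODULES in the settings
# so the spider loader can resolve 'myspider' to the class
process.crawl('myspider', start_url='http://www.google.com')
process.start()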

The documentation is straightforward: crawl accepts either a Crawler instance, a Spider subclass, or a spider name string as its first argument, and any remaining arguments are forwarded to the spider's constructor...

https://doc.scrapy.org/en/latest/topics/api.html#scrapy.crawler.CrawlerProcess
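
This also explains the double debug line in the question: most likely the first one comes from the script itself constructing MySpider(start_url=...), and the second from Scrapy calling from_crawler on that object to build the spider it actually runs. Because from_crawler is a classmethod, calling it on an instance still creates a brand-new spider, this time with empty kwargs. A simplified illustration of that mechanism (not Scrapy's actual source):

class FakeSpider(object):
    def __init__(self, *args, **kwargs):
        print('constructor kwargs:', kwargs)

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        # a classmethod ignores the instance it is called on,
        # so this always builds a fresh spider
        return cls(*args, **kwargs)

spider = FakeSpider(start_url='http://www.google.com')  # first debug line
spider.from_crawler(None)                               # second, empty one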

Too bad I spent several hours on this one.

Hope this helps someone ;)