Passing arguments to a Scrapy spider from a Python script

Time: 2015-06-26 07:04:04

Tags: python web-scraping arguments scrapy scrapy-spider

I consulted several related questions before posting this one; I am mentioning only some of them because I no longer have links to all of them.

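I can run this code without problems if I do not pass any arguments and instead ask the user for input inside the BBSpider class (without a main function, just below the `name = "dmoz"` line), or if I supply the values as predefined (that is, static) arguments.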

My code is here.

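I am basically trying to run a Scrapy spider from a Python script without needing any other files (not even a settings file). That is why I have also specified the settings inside the code itself.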

This is the output I get when I run the script:

http://bigbasket.com/ps/?q=apple
2015-06-26 12:12:34 [scrapy] INFO: Scrapy 1.0.0 started (bot: scrapybot)
2015-06-26 12:12:34 [scrapy] INFO: Optional features available: ssl, http11
2015-06-26 12:12:34 [scrapy] INFO: Overridden settings: {}
2015-06-26 12:12:35 [scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, LogStats, CoreStats, SpiderState
None
2015-06-26 12:12:35 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2015-06-26 12:12:35 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2015-06-26 12:12:35 [scrapy] INFO: Enabled item pipelines: 
2015-06-26 12:12:35 [scrapy] INFO: Spider opened
2015-06-26 12:12:35 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-06-26 12:12:35 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2015-06-26 12:12:35 [scrapy] ERROR: Error while obtaining start requests
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/scrapy/core/engine.py", line 110, in _next_request
    request = next(slot.start_requests)
  File "/usr/local/lib/python2.7/dist-packages/scrapy/spiders/__init__.py", line 70, in start_requests
    yield self.make_requests_from_url(url)
  File "/usr/local/lib/python2.7/dist-packages/scrapy/spiders/__init__.py", line 73, in make_requests_from_url
    return Request(url, dont_filter=True)
  File "/usr/local/lib/python2.7/dist-packages/scrapy/http/request/__init__.py", line 24, in __init__
    self._set_url(url)
  File "/usr/local/lib/python2.7/dist-packages/scrapy/http/request/__init__.py", line 57, in _set_url
    raise TypeError('Request url must be str or unicode, got %s:' % type(url).__name__)
TypeError: Request url must be str or unicode, got NoneType:
2015-06-26 12:12:35 [scrapy] INFO: Closing spider (finished)
2015-06-26 12:12:35 [scrapy] INFO: Dumping Scrapy stats:
{'finish_reason': 'finished',
 'finish_time': datetime.datetime(2015, 6, 26, 6, 42, 35, 342543),
 'log_count/DEBUG': 1,
 'log_count/ERROR': 1,
 'log_count/INFO': 7,
 'start_time': datetime.datetime(2015, 6, 26, 6, 42, 35, 339158)}
2015-06-26 12:12:35 [scrapy] INFO: Spider closed (finished)

The problems I am currently facing:

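  • If you look carefully at lines 1 and 6 of my output, the start_url I passed to my spider is printed twice, even though I wrote the print statement only once, on line 31 of my code (linked above). Why does this happen, and with different values (the first print, on line 1 of my output, gives the correct result, while the one on line 6 of my output prints None)? Not only that: even if I write `print 'hi'`, it is also printed twice. Why is that?
  • Next, look at this line of my output: `TypeError: Request url must be str or unicode, got NoneType:`. Why does this occur, even though the questions I linked above describe the same approach? I do not know how to resolve it. I even tried `self.start_urls = [str(kwargs.get('start_url'))]`, and then it gave the following output: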
http://bigbasket.com/ps/?q=apple
2015-06-26 12:28:00 [scrapy] INFO: Scrapy 1.0.0 started (bot: scrapybot)
2015-06-26 12:28:00 [scrapy] INFO: Optional features available: ssl, http11
2015-06-26 12:28:00 [scrapy] INFO: Overridden settings: {}
2015-06-26 12:28:00 [scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, LogStats, CoreStats, SpiderState
None
2015-06-26 12:28:01 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2015-06-26 12:28:01 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2015-06-26 12:28:01 [scrapy] INFO: Enabled item pipelines: 
2015-06-26 12:28:01 [scrapy] INFO: Spider opened
2015-06-26 12:28:01 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-06-26 12:28:01 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2015-06-26 12:28:01 [scrapy] ERROR: Error while obtaining start requests
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/scrapy/core/engine.py", line 110, in _next_request
    request = next(slot.start_requests)
  File "/usr/local/lib/python2.7/dist-packages/scrapy/spiders/__init__.py", line 70, in start_requests
    yield self.make_requests_from_url(url)
  File "/usr/local/lib/python2.7/dist-packages/scrapy/spiders/__init__.py", line 73, in make_requests_from_url
    return Request(url, dont_filter=True)
  File "/usr/local/lib/python2.7/dist-packages/scrapy/http/request/__init__.py", line 24, in __init__
    self._set_url(url)
  File "/usr/local/lib/python2.7/dist-packages/scrapy/http/request/__init__.py", line 59, in _set_url
    raise ValueError('Missing scheme in request url: %s' % self._url)
ValueError: Missing scheme in request url: None
2015-06-26 12:28:01 [scrapy] INFO: Closing spider (finished)
2015-06-26 12:28:01 [scrapy] INFO: Dumping Scrapy stats:
{'finish_reason': 'finished',
 'finish_time': datetime.datetime(2015, 6, 26, 6, 58, 1, 248350),
 'log_count/DEBUG': 1,
 'log_count/ERROR': 1,
 'log_count/INFO': 7,
 'start_time': datetime.datetime(2015, 6, 26, 6, 58, 1, 236056)}
2015-06-26 12:28:01 [scrapy] INFO: Spider closed (finished)

Please help me resolve the two errors above.

1 answer:

Answer 0 (score: 7):

You need to pass the arguments on the `crawl` method of `CrawlerProcess`, so you need to run it like this: