When I try to create and run a very basic version of a spider, whether through the scrapy shell or directly via a script, I get the following error:
C:\Users\aayus\scraping\datablogger_scraper>scrapy shell data-blogger.com
2018-03-09 12:06:47 [scrapy.utils.log] INFO: Scrapy 1.5.0 started (bot: datablogger_scraper)
2018-03-09 12:06:47 [scrapy.utils.log] INFO: Versions: lxml 4.1.1.0, libxml2 2.9.5, cssselect 1.0.3, parsel 1.4.0, w3lib 1.19.0, Twisted 17.9.0, Python 3.5.4 |Anaconda 4.4.0 (64-bit)| (default, Aug 14 2017, 13:41:13) [MSC v.1900 64 bit (AMD64)], pyOpenSSL 17.0.0 (OpenSSL 1.1.0g 2 Nov 2017), cryptography 2.1.4, Platform Windows-10-10.0.16299-SP0
2018-03-09 12:06:47 [scrapy.crawler] INFO: Overridden settings: {'DUPEFILTER_CLASS': 'scrapy.dupefilters.BaseDupeFilter', 'LOGSTATS_INTERVAL': 0, 'BOT_NAME': 'datablogger_scraper', 'SPIDER_MODULES': ['datablogger_scraper.spiders'], 'NEWSPIDER_MODULE': 'datablogger_scraper.spiders'}
2018-03-09 12:06:47 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole']
2018-03-09 12:06:47 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2018-03-09 12:06:47 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2018-03-09 12:06:47 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2018-03-09 12:06:47 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2018-03-09 12:06:47 [scrapy.core.engine] INFO: Spider opened
Traceback (most recent call last):
File "c:\programdata\anaconda3\lib\runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "c:\programdata\anaconda3\lib\runpy.py", line 85, in _run_code
exec(code, run_globals)
File "C:\ProgramData\Anaconda3\Scripts\scrapy.exe\__main__.py", line 9, in <module>
File "c:\programdata\anaconda3\lib\site-packages\scrapy\cmdline.py", line 150, in execute
_run_print_help(parser, _run_command, cmd, args, opts)
File "c:\programdata\anaconda3\lib\site-packages\scrapy\cmdline.py", line 90, in _run_print_help
func(*a, **kw)
File "c:\programdata\anaconda3\lib\site-packages\scrapy\cmdline.py", line 157, in _run_command
cmd.run(args, opts)
File "c:\programdata\anaconda3\lib\site-packages\scrapy\commands\shell.py", line 73, in run
shell.start(url=url, redirect=not opts.no_redirect)
File "c:\programdata\anaconda3\lib\site-packages\scrapy\shell.py", line 48, in start
self.fetch(url, spider, redirect=redirect)
File "c:\programdata\anaconda3\lib\site-packages\scrapy\shell.py", line 115, in fetch
reactor, self._schedule, request, spider)
File "c:\programdata\anaconda3\lib\site-packages\twisted\internet\threads.py", line 122, in blockingCallFromThread
result.raiseException()
File "c:\programdata\anaconda3\lib\site-packages\twisted\python\failure.py", line 385, in raiseException
raise self.value.with_traceback(self.tb)
ValueError: invalid literal for int() with base 10: 'port'
C:\Users\aayus\scraping\datablogger_scraper>scrapy shell google.com
2018-03-09 12:07:04 [scrapy.utils.log] INFO: Scrapy 1.5.0 started (bot: datablogger_scraper)
2018-03-09 12:07:04 [scrapy.utils.log] INFO: Versions: lxml 4.1.1.0, libxml2 2.9.5, cssselect 1.0.3, parsel 1.4.0, w3lib 1.19.0, Twisted 17.9.0, Python 3.5.4 |Anaconda 4.4.0 (64-bit)| (default, Aug 14 2017, 13:41:13) [MSC v.1900 64 bit (AMD64)], pyOpenSSL 17.0.0 (OpenSSL 1.1.0g 2 Nov 2017), cryptography 2.1.4, Platform Windows-10-10.0.16299-SP0
2018-03-09 12:07:04 [scrapy.crawler] INFO: Overridden settings: {'BOT_NAME': 'datablogger_scraper', 'DUPEFILTER_CLASS': 'scrapy.dupefilters.BaseDupeFilter', 'LOGSTATS_INTERVAL': 0, 'NEWSPIDER_MODULE': 'datablogger_scraper.spiders', 'SPIDER_MODULES': ['datablogger_scraper.spiders']}
2018-03-09 12:07:04 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole']
2018-03-09 12:07:04 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2018-03-09 12:07:04 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2018-03-09 12:07:04 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2018-03-09 12:07:04 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2018-03-09 12:07:04 [scrapy.core.engine] INFO: Spider opened
Traceback (most recent call last):
File "c:\programdata\anaconda3\lib\runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "c:\programdata\anaconda3\lib\runpy.py", line 85, in _run_code
exec(code, run_globals)
File "C:\ProgramData\Anaconda3\Scripts\scrapy.exe\__main__.py", line 9, in <module>
File "c:\programdata\anaconda3\lib\site-packages\scrapy\cmdline.py", line 150, in execute
_run_print_help(parser, _run_command, cmd, args, opts)
File "c:\programdata\anaconda3\lib\site-packages\scrapy\cmdline.py", line 90, in _run_print_help
func(*a, **kw)
File "c:\programdata\anaconda3\lib\site-packages\scrapy\cmdline.py", line 157, in _run_command
cmd.run(args, opts)
File "c:\programdata\anaconda3\lib\site-packages\scrapy\commands\shell.py", line 73, in run
shell.start(url=url, redirect=not opts.no_redirect)
File "c:\programdata\anaconda3\lib\site-packages\scrapy\shell.py", line 48, in start
self.fetch(url, spider, redirect=redirect)
File "c:\programdata\anaconda3\lib\site-packages\scrapy\shell.py", line 115, in fetch
reactor, self._schedule, request, spider)
File "c:\programdata\anaconda3\lib\site-packages\twisted\internet\threads.py", line 122, in blockingCallFromThread
result.raiseException()
File "c:\programdata\anaconda3\lib\site-packages\twisted\python\failure.py", line 385, in raiseException
raise self.value.with_traceback(self.tb)
ValueError: invalid literal for int() with base 10: 'port'
I have tried other websites as well, to no avail. The Scrapy shell seemed to be working fine until yesterday, and I don't know what has changed since then.
Answer 0 (score: 0)
My best guess is that you have somehow pushed an invalid proxy into HttpProxyMiddleware (most likely via meta['proxy']). That is the only way I can imagine the literal string 'port' ending up in a place inside Twisted that expects an integer port number.
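As a rough sketch of what goes wrong (the proxy URL below is a hypothetical placeholder, since the actual scraper code isn't shown): the proxy value gets parsed as a URL, and a non-numeric port segment such as the literal text 'port' raises the same kind of ValueError when converted to an integer:

```python
from urllib.parse import urlsplit

def proxy_port(proxy_url):
    # .port converts the port segment to int; a non-numeric segment
    # raises ValueError, much like the traceback above
    return urlsplit(proxy_url).port

print(proxy_port("http://myproxy.example:8080"))  # a valid, numeric port

try:
    # the placeholder 'port' was never replaced with a real number
    proxy_port("http://myproxy.example:port")
except ValueError as exc:
    print("ValueError:", exc)
```

So the fix would be to make sure every `meta['proxy']` value (if you set one) looks like `http://host:1234` with an actual numeric port.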
Of course, diagnosing the problem would be much easier if you posted your scraper code.
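One other thing worth checking (an assumption, since the code isn't shown): HttpProxyMiddleware also picks up proxies from environment variables via `urllib.request.getproxies()`, so a malformed `http_proxy` variable set outside the project could trigger the same error. A small sketch to inspect what would be picked up:

```python
import os
from urllib.request import getproxies

# Print any proxy-related environment variables that Scrapy's
# HttpProxyMiddleware would pick up automatically.
for key, value in sorted(os.environ.items()):
    if "proxy" in key.lower():
        print(key, "=", value)

# getproxies() reflects what urllib (and hence Scrapy) will actually use
print(getproxies())
```

If one of these shows up without a numeric port, unset it (or fix it) and try the shell again.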