Scrapy spider shows ValueError: invalid literal for int()

Posted: 2018-03-09 06:44:25

Tags: python scrapy web-crawler

When I try to create and run a very basic spider, either through the Scrapy shell or directly from a script, I get the following error:

C:\Users\aayus\scraping\datablogger_scraper>scrapy shell data-blogger.com
2018-03-09 12:06:47 [scrapy.utils.log] INFO: Scrapy 1.5.0 started (bot: datablogger_scraper)
2018-03-09 12:06:47 [scrapy.utils.log] INFO: Versions: lxml 4.1.1.0, libxml2 2.9.5, cssselect 1.0.3, parsel 1.4.0, w3lib 1.19.0, Twisted 17.9.0, Python 3.5.4 |Anaconda 4.4.0 (64-bit)| (default, Aug 14 2017, 13:41:13) [MSC v.1900 64 bit (AMD64)], pyOpenSSL 17.0.0 (OpenSSL 1.1.0g  2 Nov 2017), cryptography 2.1.4, Platform Windows-10-10.0.16299-SP0
2018-03-09 12:06:47 [scrapy.crawler] INFO: Overridden settings: {'DUPEFILTER_CLASS': 'scrapy.dupefilters.BaseDupeFilter', 'LOGSTATS_INTERVAL': 0, 'BOT_NAME': 'datablogger_scraper', 'SPIDER_MODULES': ['datablogger_scraper.spiders'], 'NEWSPIDER_MODULE': 'datablogger_scraper.spiders'}
2018-03-09 12:06:47 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole']
2018-03-09 12:06:47 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2018-03-09 12:06:47 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2018-03-09 12:06:47 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2018-03-09 12:06:47 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2018-03-09 12:06:47 [scrapy.core.engine] INFO: Spider opened
Traceback (most recent call last):
  File "c:\programdata\anaconda3\lib\runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "c:\programdata\anaconda3\lib\runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "C:\ProgramData\Anaconda3\Scripts\scrapy.exe\__main__.py", line 9, in <module>
  File "c:\programdata\anaconda3\lib\site-packages\scrapy\cmdline.py", line 150, in execute
    _run_print_help(parser, _run_command, cmd, args, opts)
  File "c:\programdata\anaconda3\lib\site-packages\scrapy\cmdline.py", line 90, in _run_print_help
    func(*a, **kw)
  File "c:\programdata\anaconda3\lib\site-packages\scrapy\cmdline.py", line 157, in _run_command
    cmd.run(args, opts)
  File "c:\programdata\anaconda3\lib\site-packages\scrapy\commands\shell.py", line 73, in run
    shell.start(url=url, redirect=not opts.no_redirect)
  File "c:\programdata\anaconda3\lib\site-packages\scrapy\shell.py", line 48, in start
    self.fetch(url, spider, redirect=redirect)
  File "c:\programdata\anaconda3\lib\site-packages\scrapy\shell.py", line 115, in fetch
    reactor, self._schedule, request, spider)
  File "c:\programdata\anaconda3\lib\site-packages\twisted\internet\threads.py", line 122, in blockingCallFromThread
    result.raiseException()
  File "c:\programdata\anaconda3\lib\site-packages\twisted\python\failure.py", line 385, in raiseException
    raise self.value.with_traceback(self.tb)
ValueError: invalid literal for int() with base 10: 'port'

C:\Users\aayus\scraping\datablogger_scraper>scrapy shell google.com
(same startup log and traceback as above, ending with:)
ValueError: invalid literal for int() with base 10: 'port'

I have tried different websites, but to no avail. The Scrapy shell seemed to work fine until yesterday, and I have no idea what has changed since then.

1 Answer:

Answer 0 (score: 0):

My best guess is that you are somehow pushing an invalid proxy to HttpProxyMiddleware (most likely via meta['proxy']). That is the only way I can imagine the string 'port' ending up in a place where Twisted expects an integer port number.
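Purely as an illustration (the question does not include the spider code, so the spider and proxy URL below are hypothetical): a request whose meta['proxy'] still contains a literal "port" placeholder would trigger exactly this ValueError, while a numeric port works:

import scrapy

class ProxyExampleSpider(scrapy.Spider):
    # Hypothetical spider, only to show how meta['proxy'] is typically set.
    name = "proxy_example"

    def start_requests(self):
        # Broken: "port" is not a number, so the download handler fails with
        # ValueError: invalid literal for int() with base 10: 'port'
        # yield scrapy.Request("http://data-blogger.com/",
        #                      meta={"proxy": "http://127.0.0.1:port"})

        # Working: the proxy URL must carry a real integer port.
        yield scrapy.Request(
            "http://data-blogger.com/",
            meta={"proxy": "http://127.0.0.1:8080"},
        )

    def parse(self, response):
        self.logger.info("Fetched %s", response.url)

It is also worth checking whether a proxy is being injected from outside the spider, e.g. via the http_proxy/https_proxy environment variables, since HttpProxyMiddleware picks those up as well.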

Of course, diagnosing the problem would be much easier if you posted your scraper code.
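For what it's worth, here is a minimal, purely hypothetical reproduction of the failure mode (not the exact Scrapy/Twisted code path): when a proxy or endpoint URL is split into host and port and the port part is not numeric, the int() conversion raises the same error seen in the traceback:

from urllib.parse import urlparse

proxy = "http://someproxy:port"    # hypothetical bad proxy value
netloc = urlparse(proxy).netloc    # 'someproxy:port'
host, port = netloc.rsplit(":", 1)
int(port)  # ValueError: invalid literal for int() with base 10: 'port'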