Scrapy shell doesn't respond to certain URLs

Date: 2015-11-20 09:28:00

Tags: python html shell web-crawler scrapy

I'm working on a web scraper for the NIH funding database, and I'm running into a problem with Scrapy's shell when I try to inspect the response object.

After typing in this command:

scrapy shell https://projectreporter.nih.gov/project_info_results.cfm?aid=8535284&icde=27160266

the shell starts up, but unfortunately it freezes right there and never continues, so I can't inspect the response object or move forward at all.

If I enter the same command with any other URL, I get a response like this almost immediately:

2015-11-20 01:18:10 [scrapy] INFO: Scrapy 1.0.3 started (bot: tutorial)
2015-11-20 01:18:10 [scrapy] INFO: Optional features available: ssl, http11
2015-11-20 01:18:10 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'tutorial.spiders', 'SPIDER_MODULES': ['tutorial.spiders'], 'LOGSTATS_INTERVAL': 0, 'BOT_NAME': 'tutorial'}
2015-11-20 01:18:10 [scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, CoreStats, SpiderState
2015-11-20 01:18:10 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2015-11-20 01:18:10 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2015-11-20 01:18:10 [scrapy] INFO: Enabled item pipelines: 
2015-11-20 01:18:10 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6027
2015-11-20 01:18:10 [scrapy] INFO: Spider opened
2015-11-20 01:18:11 [scrapy] DEBUG: Crawled (200) <GET https://projectreporter.nih.gov/project_info_results.cfm?aid=8535284> (referer: None)
[s] Available Scrapy objects:
[s]   crawler    <scrapy.crawler.Crawler object at 0x101465d10>
[s]   item       {}
[s]   request    <GET https://projectreporter.nih.gov/project_info_results.cfm?aid=8535284>
[s]   response   <200 https://projectreporter.nih.gov/project_info_results.cfm?aid=8535284>
[s]   settings   <scrapy.settings.Settings object at 0x1028bbbd0>
[s]   spider     <DefaultSpider 'default' at 0x1047b7510>
[s] Useful shortcuts:
[s]   shelp()           Shell help (print this help)
[s]   fetch(req_or_url) Fetch request (or URL) and update local objects
[s]   view(response)    View response in a browser
2015-11-20 01:18:12 [root] DEBUG: Using default logger
2015-11-20 01:18:12 [root] DEBUG: Using default logger

In [1]:

^^^ The `In [1]:` prompt shows that the request and response went through properly.
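One detail that stands out in the log above (my own observation, not a confirmed diagnosis): the crawled URL ends at `aid=8535284`, with the `&icde=27160266` part missing. From a terminal, an unquoted `&` ends the command, so the query string would get truncated exactly like that. A quick standalone check of what each form of the URL actually carries:

```python
from urllib.parse import urlsplit, parse_qs

# The URL as typed vs. the URL that shows up in Scrapy's "Crawled (200)" line.
typed = "https://projectreporter.nih.gov/project_info_results.cfm?aid=8535284&icde=27160266"
logged = "https://projectreporter.nih.gov/project_info_results.cfm?aid=8535284"

print(parse_qs(urlsplit(typed).query))   # {'aid': ['8535284'], 'icde': ['27160266']}
print(parse_qs(urlsplit(logged).query))  # {'aid': ['8535284']} — icde never made it
```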

What's more, if I download the HTML of my initial URL to a local file via 'Save As' and query that with the scrapy command, or point the shell at a different site entirely:

$ scrapy shell https://reddit.com
2015-11-20 01:20:24 [scrapy] INFO: Scrapy 1.0.3 started (bot: tutorial)
2015-11-20 01:20:24 [scrapy] INFO: Optional features available: ssl, http11
2015-11-20 01:20:24 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'tutorial.spiders', 'SPIDER_MODULES': ['tutorial.spiders'], 'LOGSTATS_INTERVAL': 0, 'BOT_NAME': 'tutorial'}
2015-11-20 01:20:24 [scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, CoreStats, SpiderState
2015-11-20 01:20:24 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2015-11-20 01:20:24 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2015-11-20 01:20:24 [scrapy] INFO: Enabled item pipelines: 
2015-11-20 01:20:24 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6028
2015-11-20 01:20:24 [scrapy] INFO: Spider opened
2015-11-20 01:20:24 [scrapy] DEBUG: Redirecting (301) to <GET https://www.reddit.com/> from <GET https://reddit.com>
2015-11-20 01:20:25 [scrapy] DEBUG: Crawled (200) <GET https://www.reddit.com/> (referer: None)
[s] Available Scrapy objects:
[s]   crawler    <scrapy.crawler.Crawler object at 0x107ea0d10>
[s]   item       {}
[s]   request    <GET https://reddit.com>
[s]   response   <200 https://www.reddit.com/>
[s]   settings   <scrapy.settings.Settings object at 0x1092f7bd0>
[s]   spider     <DefaultSpider 'default' at 0x10ba31490>
[s] Useful shortcuts:
[s]   shelp()           Shell help (print this help)
[s]   fetch(req_or_url) Fetch request (or URL) and update local objects
[s]   view(response)    View response in a browser
2015-11-20 01:20:26 [root] DEBUG: Using default logger
2015-11-20 01:20:26 [root] DEBUG: Using default logger

In [1]: 

scrapy shell works exactly as expected.
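If the bare `&` in the query string is what's tripping things up (an assumption on my part, I haven't confirmed it), quoting the URL on the command line would keep the terminal from treating `&` as a control operator. Python's `shlex` can produce a safely quoted version of the command:

```python
import shlex

url = "https://projectreporter.nih.gov/project_info_results.cfm?aid=8535284&icde=27160266"
# shlex.quote wraps the argument in single quotes so the terminal passes the
# full query string through instead of splitting the command at '&'.
print("scrapy shell " + shlex.quote(url))
# → scrapy shell 'https://projectreporter.nih.gov/project_info_results.cfm?aid=8535284&icde=27160266'
```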

I guess the long-winded question is: what is it about the URL I'm using that trips up the scrapy shell? Is it that I have to submit a search to reach the page? Something else? Any help would be greatly appreciated.

0 Answers:

No answers yet