I'm using a web-scraping tool on the NIH funding database, and I've hit a problem with scrapy's shell when I try to inspect the response object.
After entering this command:
$ scrapy shell https://projectreporter.nih.gov/project_info_results.cfm?aid=8535284&icde=27160266
the shell begins printing its startup output.
Unfortunately, it freezes there and never continues, so I can't manipulate the response object or move forward.
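One thing I noticed while writing this up (an assumption on my part, so treat the snippet as a sketch): my URL contains an unescaped &, which POSIX shells treat as a background operator, so everything after it may never reach scrapy at all. A minimal demonstration, using a stand-in function in place of scrapy shell:

```shell
# Stand-in for `scrapy shell`: print the first argument it receives.
show_first_arg() { printf '%s\n' "$1"; }

# Unquoted, exactly as I typed it: the shell treats '&' as the
# background operator, so only the part before it reaches the command,
# and 'icde=27160266' is parsed as a separate variable assignment.
show_first_arg https://projectreporter.nih.gov/project_info_results.cfm?aid=8535284&icde=27160266
wait

# Quoted: the whole URL arrives intact as a single argument.
show_first_arg "https://projectreporter.nih.gov/project_info_results.cfm?aid=8535284&icde=27160266"
```

If that's what is happening here, it would at least explain why the log below shows the URL with the icde parameter stripped off.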
If I enter the same command with any other URL, I get a response almost immediately:
2015-11-20 01:18:10 [scrapy] INFO: Scrapy 1.0.3 started (bot: tutorial)
2015-11-20 01:18:10 [scrapy] INFO: Optional features available: ssl, http11
2015-11-20 01:18:10 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'tutorial.spiders', 'SPIDER_MODULES': ['tutorial.spiders'], 'LOGSTATS_INTERVAL': 0, 'BOT_NAME': 'tutorial'}
2015-11-20 01:18:10 [scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, CoreStats, SpiderState
2015-11-20 01:18:10 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2015-11-20 01:18:10 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2015-11-20 01:18:10 [scrapy] INFO: Enabled item pipelines:
2015-11-20 01:18:10 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6027
2015-11-20 01:18:10 [scrapy] INFO: Spider opened
2015-11-20 01:18:11 [scrapy] DEBUG: Crawled (200) <GET https://projectreporter.nih.gov/project_info_results.cfm?aid=8535284> (referer: None)
[s] Available Scrapy objects:
[s] crawler <scrapy.crawler.Crawler object at 0x101465d10>
[s] item {}
[s] request <GET https://projectreporter.nih.gov/project_info_results.cfm?aid=8535284>
[s] response <200 https://projectreporter.nih.gov/project_info_results.cfm?aid=8535284>
[s] settings <scrapy.settings.Settings object at 0x1028bbbd0>
[s] spider <DefaultSpider 'default' at 0x1047b7510>
[s] Useful shortcuts:
[s] shelp() Shell help (print this help)
[s] fetch(req_or_url) Fetch request (or URL) and update local objects
[s] view(response) View response in a browser
2015-11-20 01:18:12 [root] DEBUG: Using default logger
2015-11-20 01:18:12 [root] DEBUG: Using default logger
^^^ The 'In [1]:' prompt indicates that the request and response executed correctly.
What's more, the same is true if I download the HTML of my initial URL as a local file via 'Save As' and query that with a scrapy command, or if I use a different site entirely:
$ scrapy shell https://reddit.com
2015-11-20 01:20:24 [scrapy] INFO: Scrapy 1.0.3 started (bot: tutorial)
2015-11-20 01:20:24 [scrapy] INFO: Optional features available: ssl, http11
2015-11-20 01:20:24 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'tutorial.spiders', 'SPIDER_MODULES': ['tutorial.spiders'], 'LOGSTATS_INTERVAL': 0, 'BOT_NAME': 'tutorial'}
2015-11-20 01:20:24 [scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, CoreStats, SpiderState
2015-11-20 01:20:24 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2015-11-20 01:20:24 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2015-11-20 01:20:24 [scrapy] INFO: Enabled item pipelines:
2015-11-20 01:20:24 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6028
2015-11-20 01:20:24 [scrapy] INFO: Spider opened
2015-11-20 01:20:24 [scrapy] DEBUG: Redirecting (301) to <GET https://www.reddit.com/> from <GET https://reddit.com>
2015-11-20 01:20:25 [scrapy] DEBUG: Crawled (200) <GET https://www.reddit.com/> (referer: None)
[s] Available Scrapy objects:
[s] crawler <scrapy.crawler.Crawler object at 0x107ea0d10>
[s] item {}
[s] request <GET https://reddit.com>
[s] response <200 https://www.reddit.com/>
[s] settings <scrapy.settings.Settings object at 0x1092f7bd0>
[s] spider <DefaultSpider 'default' at 0x10ba31490>
[s] Useful shortcuts:
[s] shelp() Shell help (print this help)
[s] fetch(req_or_url) Fetch request (or URL) and update local objects
[s] view(response) View response in a browser
2015-11-20 01:20:26 [root] DEBUG: Using default logger
2015-11-20 01:20:26 [root] DEBUG: Using default logger
In [1]:
the scrapy shell works exactly as expected.
I suppose this is a long-winded way of asking: what is it about the URL I'm using that trips up the scrapy shell? Is it that I have to submit a search to reach the page? Something else? Any help would be appreciated.