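When I pull up a mindbodyonline client's schedule in a browser, I can easily write XPath expressions for the items I want to scrape from the page. However, when I try to scrape the site with the scrapy shell, my XPaths never return any objects.
For example, I tried scraping the following URL from the scrapy shell: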
$ scrapy shell https://clients.mindbodyonline.com/ASP/adm/home.asp?studioid=2260
2013-07-15 15:50:45-0700 [scrapy] INFO: Scrapy 0.14.4 started (bot: scrapybot)
2013-07-15 15:50:46-0700 [scrapy] DEBUG: Enabled extensions: TelnetConsole, CloseSpider, WebService, CoreStats, MemoryUsage, SpiderState
2013-07-15 15:50:46-0700 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, RedirectMiddleware, CookiesMiddleware, HttpCompressionMiddleware, ChunkedTransferMiddleware, DownloaderStats
2013-07-15 15:50:46-0700 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2013-07-15 15:50:46-0700 [scrapy] DEBUG: Enabled item pipelines:
2013-07-15 15:50:46-0700 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2013-07-15 15:50:46-0700 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2013-07-15 15:50:46-0700 [default] INFO: Spider opened
2013-07-15 15:50:53-0700 [default] DEBUG: Redirecting (302) to <GET https://clients.mindbodyonline.com/ASP/ws.asp?studioid=2260> from <GET https://clients.mindbodyonline.com/ASP/adm/home.asp?studioid=2260>
2013-07-15 15:50:55-0700 [default] DEBUG: Redirecting (302) to <GET https://clients.mindbodyonline.com/ASP/ws.asp?studioid=2260&sessionChecked=true> from <GET https://clients.mindbodyonline.com/ASP/ws.asp?studioid=2260>
2013-07-15 15:51:01-0700 [default] DEBUG: Crawled (200) <GET https://clients.mindbodyonline.com/ASP/ws.asp?studioid=2260&sessionChecked=true> (referer: None)
[s] Available Scrapy objects:
[s] hxs <HtmlXPathSelector xpath=None data=u'<html>\r\n\t<head>\r\n\t<title>Yoga Now Online'>
[s] item {}
[s] request <GET https://clients.mindbodyonline.com/ASP/adm/home.asp?studioid=2260>
[s] response <200 https://clients.mindbodyonline.com/ASP/ws.asp?studioid=2260&sessionChecked=true>
[s] settings <CrawlerSettings module=None>
[s] spider <BaseSpider 'default' at 0x99480ac>
[s] Useful shortcuts:
[s] shelp() Shell help (print this help)
[s] fetch(req_or_url) Fetch request (or URL) and update local objects
[s] view(response) View response in a browser
Python 2.7.4 (default, Apr 19 2013, 18:32:33)
In [1]: response.body
Out[1]: '\r\n\t<html>\r\n\t<head>\r\n\t<title>Yoga Now Online</title>\r\n\t<meta http-equiv="Content-Type" content="text/html">\r\n\t<LINK REL="ICON" HREF="/favicon.ico">\r\n\t<LINK REL="SHORTCUT ICON" HREF="/favicon.ico">\r\n\t<script type="text/javascript">\r\n\r\nvar _gaq = _gaq || [];\r\n_gaq.push([\'_setAccount\', \'UA-19985881-2\']);\r\n_gaq.push([\'_setDomainName\', \'none\']);\r\n_gaq.push([\'_setAllowLinker\', true]);\r\n_gaq.push([\'_trackPageview\']);\r\n\r\n(function() {var ga = document.createElement(\'script\'); ga.type = \'text/javascript\'; ga.async = true;\r\nga.src = (\'https:\' == document.location.protocol ? \'https://ssl\' : \'http://www\') + \'.google-analytics.com/ga.js\';\r\nvar s = document.getElementsByTagName(\'script\')[0]; s.parentNode.insertBefore(ga, s);\r\n})();\r\n\r\n</script><link rel="stylesheet" type="text/css" href="https://static.mindbodyonline.com/v33438/styles/jquery.tooltip.css" /><link rel="stylesheet" type="text/css" href="https://static.mindbodyonline.com/v33438/styles/base/jquery.ui.all.css" /><script type="text/javascript" src="https://static.mindbodyonline.com/v33438/scripts/jquery-1.8.2.js"></script><script type="text/javascript" src="https://static.mindbodyonline.com/v33438/scripts/plugins/jquery.cookie-1.0.js"></script><script type="text/javascript" src="https://static.mindbodyonline.com/v33438/scripts/plugins/jquery.mb.js"></script><script type="text/javascript" src="https://static.mindbodyonline.com/v33438/scripts/plugins/jquery.libasync.js"></script><script type="text/javascript" src="https://static.mindbodyonline.com/v33438/scripts/jquery-ui-1.8.23/jquery.ui.core.js"></script><script type="text/javascript" src="https://static.mindbodyonline.com/v33438/scripts/jquery-ui-1.8.23/jquery.ui.widget.js"></script><script type="text/javascript" src="https://static.mindbodyonline.com/v33438/scripts/jquery-ui-1.8.23/jquery.ui.mouse.js"></script><script type="text/javascript" 
src="https://static.mindbodyonline.com/v33438/scripts/jquery-ui-1.8.23/jquery.ui.draggable.js"></script><script type="text/javascript" src="https://static.mindbodyonline.com/v33438/scripts/jquery-ui-1.8.23/jquery.ui.droppable.js"></script><script type="text/javascript" src="https://static.mindbodyonline.com/v33438/scripts/jquery-ui-1.8.23/jquery.ui.resizable.js"></script><script type="text/javascript" src="https://static.mindbodyonline.com/v33438/scripts/jquery-ui-1.8.23/jquery.ui.dialog.js"></script><script type="text/javascript" src="https://static.mindbodyonline.com/v33438/scripts/jquery-ui-1.8.23/jquery.ui.autocomplete.js"></script><script type="text/javascript" src="https://static.mindbodyonline.com/v33438/scripts/jquery-ui-1.8.23/jquery.ui.position.js"></script><script type="text/javascript" src="https://static.mindbodyonline.com/v33438/scripts/jquery-ui-1.8.23/jquery.effects.core.js"></script><script type="text/javascript" src="https://static.mindbodyonline.com/v33438/scripts/jquery-ui-1.8.23/jquery.effects.highlight.js"></script><script type="text/javascript" src="https://static.mindbodyonline.com/v33438/scripts/jquery-ui-1.8.23/jquery.ui.datepicker.js"></script><script type="text/javascript" src="https://static.mindbodyonline.com/v33438/scripts/plugins/jquery.tooltip.js"></script><script type="text/javascript" src="https://static.mindbodyonline.com/v33438/scripts/plugins/jquery.ba-resize.js"></script><script type="text/javascript" src="https://static.mindbodyonline.com/v33438/scripts/plugins/jquery.lightboxLib.js"></script><script type="text/javascript" src="https://static.mindbodyonline.com/v33438/scripts/plugins/jquery.hoverIntent.js"></script><script type="text/javascript" src="https://static.mindbodyonline.com/v33438/scripts/plugins/jquery.smartFocus-0.1.js"></script>\r\n\r\n\r\n<script type="text/javascript">\r\n// filePath must be absolute with leading slash\r\nfunction contentUrl(filePath) {\r\n\t\r\n\treturn 
"https://static.mindbodyonline.com/v33438" + filePath;\r\n\t\r\n}\r\n\r\n(function ($) {\r\n\t//$.fn.extend({\r\n\t$.contentUrl = function (filePath) {\r\n\t\t//contentUrl: function (filePath) {\r\n\t\t\t\r\n\t\t\treturn "https://static.mindbodyonline.com/v33438" + filePath;\r\n\t\t\t\r\n\t};\r\n})(jQuery);\r\n\r\n$(function() {\r\n\r\n\t\r\n\t\t// init tooltips\r\n\t\t$("img[title],span[title],select[title],input[title],legend[title]").tooltip({\r\n\t\t\ttrack: true,\r\n\t\t\tshowURL: false,\r\n\t\t\tfade: 250\r\n\t\t});\r\n\t\t\r\n\t\r\n\t$(\'fieldset.collapsible\').setCollapseEvents();\r\n\t\r\n});\r\n</script>\r\n\r\n\r\n<script type="text/javascript">\r\n\r\nfunction launchHome() {\r\n\t\r\n\t\t\tdocument.wsLaunch.action = "home.asp?studioid=2260";\r\n\t\t\r\n\t\tdocument.wsLaunch.submit();\r\n\t}\r\n\t</script>\r\n\t</head>\r\n\t<body onLoad="launchHome();">\r\n\t<form name="wsLaunch" action="home.asp?studioid=2260" method="post">\r\n\t<input type="hidden" name="tg" value="" /> <input type="hidden" name="vt" value="" /> <input type="hidden" name="lvl" value="" /> <input type="hidden" name="stype" value="" /> <input type="hidden" name="qParam" value="" /> <input type="hidden" name="view" value="" /> <input type="hidden" name="trn" value="0" /> <input type="hidden" name="page" value="" /> <input type="hidden" name="catid" value="" /> <input type="hidden" name="prodid" value="" /> <input type="hidden" name="date" value="7/16/2013" /> <input type="hidden" name="classid" value="0" /> <input type="hidden" name="sSU" value="" /> <input type="hidden" name="optForwardingLink" value="" /> \r\n\t<input type="hidden" name="launchGUID" value="" />\r\n\t<input type="hidden" name="launchUID" value="" />\r\n\t<input type="hidden" name="launchPWDChange" value="" />\r\n\t<input type="hidden" name="launchPWDChangeKey" value="" />\r\n\t<input type="hidden" name="launchLostPWD" value="" />\r\n\t\r\n\t\r\n\t<input type="hidden" name="extLink" value="" 
/>\r\n\t</form>\r\n\t<noscript>\r\n\tYou must have javascript enabled to use Yoga Now Online.\r\n\t</noscript>\r\n\t</body>\r\n\t</html>\r\n'
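Sorry for the unformatted HTML; I'll try to attach a prettier version later. The problem is that the data I need is not in the response that Scrapy receives. However, when I go to the URL manually, or even call view(response), the following HTML is present (this is the data I want to scrape):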
<tr class="oddRow" style="width: 929px;">
<td style="width: 90px;"> 4:00 pm </td><td style="width: 167px;"></td>
<td style="width: 172px;"><a class="modalClassDesc" name="cid617" href="javascript:;">Vinyasa (Level 1-2)</a></td>
<td style="width: 172px;"><a class="modalBio" name="bio100000375" href="javascript:;">Dietrich McGaffey</a></td>
<td style="width: 106px;">Main Yoga Room</td><td style="width: 162px;"> 1 hour & 30 minutes</td></tr>
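So that is the big picture, and I hope it gives you a good idea of what I'm trying to accomplish. The HTML I want to scrape is available in the browser but not through the scrapy shell. I know Scrapy is being redirected. From the time I've spent investigating, I believe the problem is either that the site uses JavaScript detection to block bots, or that Scrapy is not handling cookies correctly.
To confuse myself further, here is the output from cURL: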
curl https://clients.mindbodyonline.com/ASP/adm/home.asp?studioid=2260
<head><title>Object moved</title></head>
<body><h1>Object Moved</h1>This object may be found <a HREF="/ASP/ws.asp?studioid=2260">here</a>.</body>
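When I follow the link cURL gives me, it seems to send me into an infinite loop of "Object moved" links.
Sorry for being long-winded, but I wanted to describe my problem thoroughly. If anyone has a solution or pointers on how to investigate further, I'd value your input. Thank you for taking the time to try and help me.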
Answer 0 (score: 1)
Using Chrome, I get redirected from https://clients.mindbodyonline.com/ASP/adm/home.asp?studioid=2260 to https://clients.mindbodyonline.com/ASP/home.asp?studioid=2260 (see the edit below for an explanation).
Still using Chrome, view-source:https://clients.mindbodyonline.com/ASP/home.asp?studioid=2260 shows that the page contains a frameset:
<frameset id="mainFrameset" frameborder="0" framespacing="0" NORESIZE>
<frame name="mainFrame" src="main_class.asp?tg=&vt=&lvl=&stype=&view=&trn=0&page=&catid=&prodid=&date=7%2F16%2F2013&classid=0&sSU=&optForwardingLink=&qParam=&justloggedin=&nLgIn=&pMode=" frameborder="10" scrolling="YES" width="320">
</frameset>
<noframes>
<body style="background-color:#FFFFFF;" text="#000000">
</body>
</noframes>
</html>
So I think you need to fetch the page referenced by the @src attribute of frame[@name="mainFrame"].
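As a rough illustration of that step, here is a minimal sketch that pulls the frame's src out of the frameset and resolves it against the page URL. It uses only the Python 3 standard library (the thread itself is on Python 2.7), the names FrameSrcParser and frame_url are my own, and the sample HTML is a shortened copy of the frameset above, so treat it as an assumption-laden example rather than what the site requires.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class FrameSrcParser(HTMLParser):
    """Collect the src attribute of every <frame>, keyed by its name."""

    def __init__(self):
        super().__init__()
        self.frames = {}

    def handle_starttag(self, tag, attrs):
        if tag == "frame":
            attrs = dict(attrs)
            self.frames[attrs.get("name")] = attrs.get("src")


def frame_url(page_url, html, frame_name="mainFrame"):
    """Return the absolute URL of the named frame (hypothetical helper)."""
    parser = FrameSrcParser()
    parser.feed(html)
    parser.close()
    # The src is relative, so resolve it against the page that holds the frameset.
    return urljoin(page_url, parser.frames[frame_name])


# Shortened version of the frameset shown above (query string trimmed for brevity).
FRAMESET_HTML = """<frameset id="mainFrameset" frameborder="0" framespacing="0" NORESIZE>
<frame name="mainFrame" src="main_class.asp?tg=&vt=&trn=0&classid=0" frameborder="10" scrolling="YES" width="320">
</frameset>"""

print(frame_url("https://clients.mindbodyonline.com/ASP/home.asp?studioid=2260", FRAMESET_HTML))
```

Inside the scrapy shell the resulting URL could then be passed to fetch(), although the server may still expect the session cookies set during the earlier redirects.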
Still in Chrome, view-source:https://clients.mindbodyonline.com/ASP/main_class.asp?tg=&vt=&lvl=&stype=&view=&trn=0&page=&catid=&prodid=&date=7%2F16%2F2013&classid=0&sSU=&optForwardingLink=&qParam=&justloggedin=&nLgIn=&pMode=
does indeed contain the <table id="classSchedule-mainTable" class="" cellspacing="0"> you are looking for.
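Once you have that frame's HTML, extracting the schedule cells is straightforward. Below is a small sketch that collects the text of each cell, row by row, using the sample row quoted in the question. It is written for Python 3's standard library rather than Scrapy's selectors, and ScheduleRowParser is a made-up name, so adapt it to whatever parser you prefer.

```python
from html.parser import HTMLParser


class ScheduleRowParser(HTMLParser):
    """Collect the text of each <td>, grouped by <tr> (illustrative only)."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None outside a <tr>
        self._cell = None  # text chunks of the cell being read, or None outside a <td>

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            # Join the chunks and collapse runs of whitespace.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# The sample row from the question.
ROW_HTML = """<tr class="oddRow" style="width: 929px;">
<td style="width: 90px;"> 4:00 pm </td><td style="width: 167px;"></td>
<td style="width: 172px;"><a class="modalClassDesc" name="cid617" href="javascript:;">Vinyasa (Level 1-2)</a></td>
<td style="width: 172px;"><a class="modalBio" name="bio100000375" href="javascript:;">Dietrich McGaffey</a></td>
<td style="width: 106px;">Main Yoga Room</td><td style="width: 162px;"> 1 hour & 30 minutes</td></tr>"""

parser = ScheduleRowParser()
parser.feed(ROW_HTML)
parser.close()
rows = parser.rows
print(rows)
```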
Edit: I tested this with the scrapy shell (I like to use lxml.etree directly):
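import lxml.etree
import lxml.html
# Parse the raw response body with lxml's lenient HTML parser
doc = lxml.etree.fromstring(response.body, parser=lxml.html.HTMLParser())
# Pretty-print the <head> element to inspect its scripts
print(lxml.etree.tostring(doc.xpath('head')[0], pretty_print=True))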
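It turns out that the redirection seen in browsers comes from a little bit of JavaScript (I don't know exactly what it does, but it seems to match the behavior):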
<script type="text/javascript">
function launchHome() {
document.wsLaunch.action = "home.asp?studioid=2260";
document.wsLaunch.submit();
}
</script>
</head>
<body onload="launchHome();">
Since response.url is:
response.url
'https://clients.mindbodyonline.com/ASP/ws.asp?studioid=2260&sessionChecked=true'
the relative form action resolves to https://clients.mindbodyonline.com/ASP/home.asp?studioid=2260, which is where you get redirected.
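Since Scrapy does not run JavaScript, one way to mimic what launchHome() does is to submit the wsLaunch form yourself. The sketch below only builds the POST target and body from the form's hidden inputs; it does not send anything. It assumes Python 3's standard library, FormParser and build_post are hypothetical names, and the sample form is a trimmed copy of the one in response.body. Scrapy's FormRequest.from_response may be a shorter route if your version lets you select the form by name.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode, urljoin


class FormParser(HTMLParser):
    """Record the action and named inputs of one form (illustrative only)."""

    def __init__(self, form_name):
        super().__init__()
        self.form_name = form_name
        self.action = None
        self.fields = []
        self._inside = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form" and attrs.get("name") == self.form_name:
            self._inside = True
            self.action = attrs.get("action")
        elif tag == "input" and self._inside and attrs.get("name"):
            self.fields.append((attrs["name"], attrs.get("value") or ""))

    def handle_endtag(self, tag):
        if tag == "form":
            self._inside = False


def build_post(page_url, html, form_name="wsLaunch"):
    """Return (absolute_url, urlencoded_body) for submitting the named form."""
    parser = FormParser(form_name)
    parser.feed(html)
    parser.close()
    # launchHome() sets the action to the same relative URL the form already has.
    return urljoin(page_url, parser.action), urlencode(parser.fields)


# Trimmed copy of the form found in response.body (most hidden inputs omitted).
FORM_HTML = """<form name="wsLaunch" action="home.asp?studioid=2260" method="post">
<input type="hidden" name="tg" value="" /> <input type="hidden" name="trn" value="0" />
<input type="hidden" name="date" value="7/16/2013" /> <input type="hidden" name="classid" value="0" />
</form>"""

url, body = build_post(
    "https://clients.mindbodyonline.com/ASP/ws.asp?studioid=2260&sessionChecked=true",
    FORM_HTML,
)
print(url)
print(body)
```

Whether the server accepts such a POST without the cookies set during the earlier redirects is something I have not verified, so keep cookie handling enabled when you try it.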