Pinpointing how Scrapy is being redirected, and how to get around it

Time: 2013-07-17 01:24:59

Tags: web-scraping scrapy bots

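When I pull up a mindbodyonline client's schedule in a browser, I can write XPath expressions for the items I want to scrape without any trouble. But when I try to scrape the site with the scrapy shell, my XPaths never return any objects.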

For example, I tried to scrape the following URL from the scrapy shell:

$ scrapy shell https://clients.mindbodyonline.com/ASP/adm/home.asp?studioid=2260

2013-07-15 15:50:45-0700 [scrapy] INFO: Scrapy 0.14.4 started (bot: scrapybot)
2013-07-15 15:50:46-0700 [scrapy] DEBUG: Enabled extensions: TelnetConsole, CloseSpider, WebService, CoreStats, MemoryUsage, SpiderState
2013-07-15 15:50:46-0700 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, RedirectMiddleware, CookiesMiddleware, HttpCompressionMiddleware, ChunkedTransferMiddleware, DownloaderStats
2013-07-15 15:50:46-0700 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2013-07-15 15:50:46-0700 [scrapy] DEBUG: Enabled item pipelines: 
2013-07-15 15:50:46-0700 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2013-07-15 15:50:46-0700 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2013-07-15 15:50:46-0700 [default] INFO: Spider opened
2013-07-15 15:50:53-0700 [default] DEBUG: Redirecting (302) to <GET https://clients.mindbodyonline.com/ASP/ws.asp?studioid=2260> from <GET https://clients.mindbodyonline.com/ASP/adm/home.asp?studioid=2260>
2013-07-15 15:50:55-0700 [default] DEBUG: Redirecting (302) to <GET https://clients.mindbodyonline.com/ASP/ws.asp?studioid=2260&sessionChecked=true> from <GET https://clients.mindbodyonline.com/ASP/ws.asp?studioid=2260>
2013-07-15 15:51:01-0700 [default] DEBUG: Crawled (200) <GET https://clients.mindbodyonline.com/ASP/ws.asp?studioid=2260&sessionChecked=true> (referer: None)
[s] Available Scrapy objects:
[s]   hxs        <HtmlXPathSelector xpath=None data=u'<html>\r\n\t<head>\r\n\t<title>Yoga Now Online'>
[s]   item       {}
[s]   request    <GET https://clients.mindbodyonline.com/ASP/adm/home.asp?studioid=2260>
[s]   response   <200 https://clients.mindbodyonline.com/ASP/ws.asp?studioid=2260&sessionChecked=true>
[s]   settings   <CrawlerSettings module=None>
[s]   spider     <BaseSpider 'default' at 0x99480ac>
[s] Useful shortcuts:
[s]   shelp()           Shell help (print this help)
[s]   fetch(req_or_url) Fetch request (or URL) and update local objects
[s]   view(response)    View response in a browser
Python 2.7.4 (default, Apr 19 2013, 18:32:33) 

In [1]: response.body
Out[1]: '\r\n\t<html>\r\n\t<head>\r\n\t<title>Yoga Now Online</title>\r\n\t<meta http-equiv="Content-Type" content="text/html">\r\n\t<LINK REL="ICON" HREF="/favicon.ico">\r\n\t<LINK REL="SHORTCUT ICON" HREF="/favicon.ico">\r\n\t<script type="text/javascript">\r\n\r\nvar _gaq = _gaq || [];\r\n_gaq.push([\'_setAccount\', \'UA-19985881-2\']);\r\n_gaq.push([\'_setDomainName\', \'none\']);\r\n_gaq.push([\'_setAllowLinker\', true]);\r\n_gaq.push([\'_trackPageview\']);\r\n\r\n(function() {var ga = document.createElement(\'script\'); ga.type = \'text/javascript\'; ga.async = true;\r\nga.src = (\'https:\' == document.location.protocol ? \'https://ssl\' : \'http://www\') + \'.google-analytics.com/ga.js\';\r\nvar s = document.getElementsByTagName(\'script\')[0]; s.parentNode.insertBefore(ga, s);\r\n})();\r\n\r\n</script><link rel="stylesheet" type="text/css" href="https://static.mindbodyonline.com/v33438/styles/jquery.tooltip.css"  /><link rel="stylesheet" type="text/css" href="https://static.mindbodyonline.com/v33438/styles/base/jquery.ui.all.css"  /><script type="text/javascript" src="https://static.mindbodyonline.com/v33438/scripts/jquery-1.8.2.js"></script><script type="text/javascript" src="https://static.mindbodyonline.com/v33438/scripts/plugins/jquery.cookie-1.0.js"></script><script type="text/javascript" src="https://static.mindbodyonline.com/v33438/scripts/plugins/jquery.mb.js"></script><script type="text/javascript" src="https://static.mindbodyonline.com/v33438/scripts/plugins/jquery.libasync.js"></script><script type="text/javascript" src="https://static.mindbodyonline.com/v33438/scripts/jquery-ui-1.8.23/jquery.ui.core.js"></script><script type="text/javascript" src="https://static.mindbodyonline.com/v33438/scripts/jquery-ui-1.8.23/jquery.ui.widget.js"></script><script type="text/javascript" src="https://static.mindbodyonline.com/v33438/scripts/jquery-ui-1.8.23/jquery.ui.mouse.js"></script><script type="text/javascript" 
src="https://static.mindbodyonline.com/v33438/scripts/jquery-ui-1.8.23/jquery.ui.draggable.js"></script><script type="text/javascript" src="https://static.mindbodyonline.com/v33438/scripts/jquery-ui-1.8.23/jquery.ui.droppable.js"></script><script type="text/javascript" src="https://static.mindbodyonline.com/v33438/scripts/jquery-ui-1.8.23/jquery.ui.resizable.js"></script><script type="text/javascript" src="https://static.mindbodyonline.com/v33438/scripts/jquery-ui-1.8.23/jquery.ui.dialog.js"></script><script type="text/javascript" src="https://static.mindbodyonline.com/v33438/scripts/jquery-ui-1.8.23/jquery.ui.autocomplete.js"></script><script type="text/javascript" src="https://static.mindbodyonline.com/v33438/scripts/jquery-ui-1.8.23/jquery.ui.position.js"></script><script type="text/javascript" src="https://static.mindbodyonline.com/v33438/scripts/jquery-ui-1.8.23/jquery.effects.core.js"></script><script type="text/javascript" src="https://static.mindbodyonline.com/v33438/scripts/jquery-ui-1.8.23/jquery.effects.highlight.js"></script><script type="text/javascript" src="https://static.mindbodyonline.com/v33438/scripts/jquery-ui-1.8.23/jquery.ui.datepicker.js"></script><script type="text/javascript" src="https://static.mindbodyonline.com/v33438/scripts/plugins/jquery.tooltip.js"></script><script type="text/javascript" src="https://static.mindbodyonline.com/v33438/scripts/plugins/jquery.ba-resize.js"></script><script type="text/javascript" src="https://static.mindbodyonline.com/v33438/scripts/plugins/jquery.lightboxLib.js"></script><script type="text/javascript" src="https://static.mindbodyonline.com/v33438/scripts/plugins/jquery.hoverIntent.js"></script><script type="text/javascript" src="https://static.mindbodyonline.com/v33438/scripts/plugins/jquery.smartFocus-0.1.js"></script>\r\n\r\n\r\n<script type="text/javascript">\r\n// filePath must be absolute with leading slash\r\nfunction contentUrl(filePath) {\r\n\t\r\n\treturn 
"https://static.mindbodyonline.com/v33438" + filePath;\r\n\t\r\n}\r\n\r\n(function ($) {\r\n\t//$.fn.extend({\r\n\t$.contentUrl = function (filePath) {\r\n\t\t//contentUrl: function (filePath) {\r\n\t\t\t\r\n\t\t\treturn "https://static.mindbodyonline.com/v33438" + filePath;\r\n\t\t\t\r\n\t};\r\n})(jQuery);\r\n\r\n$(function() {\r\n\r\n\t\r\n\t\t// init tooltips\r\n\t\t$("img[title],span[title],select[title],input[title],legend[title]").tooltip({\r\n\t\t\ttrack: true,\r\n\t\t\tshowURL: false,\r\n\t\t\tfade: 250\r\n\t\t});\r\n\t\t\r\n\t\r\n\t$(\'fieldset.collapsible\').setCollapseEvents();\r\n\t\r\n});\r\n</script>\r\n\r\n\r\n<script type="text/javascript">\r\n\r\nfunction launchHome() {\r\n\t\r\n\t\t\tdocument.wsLaunch.action = "home.asp?studioid=2260";\r\n\t\t\r\n\t\tdocument.wsLaunch.submit();\r\n\t}\r\n\t</script>\r\n\t</head>\r\n\t<body onLoad="launchHome();">\r\n\t<form name="wsLaunch" action="home.asp?studioid=2260" method="post">\r\n\t<input type="hidden" name="tg" value="" /> <input type="hidden" name="vt" value="" /> <input type="hidden" name="lvl" value="" /> <input type="hidden" name="stype" value="" /> <input type="hidden" name="qParam" value="" /> <input type="hidden" name="view" value="" /> <input type="hidden" name="trn" value="0" /> <input type="hidden" name="page" value="" /> <input type="hidden" name="catid" value="" /> <input type="hidden" name="prodid" value="" /> <input type="hidden" name="date" value="7/16/2013" /> <input type="hidden" name="classid" value="0" /> <input type="hidden" name="sSU" value="" /> <input type="hidden" name="optForwardingLink" value="" /> \r\n\t<input type="hidden" name="launchGUID" value="" />\r\n\t<input type="hidden" name="launchUID" value="" />\r\n\t<input type="hidden" name="launchPWDChange" value="" />\r\n\t<input type="hidden" name="launchPWDChangeKey" value="" />\r\n\t<input type="hidden" name="launchLostPWD" value="" />\r\n\t\r\n\t\r\n\t<input type="hidden" name="extLink" value="" 
/>\r\n\t</form>\r\n\t<noscript>\r\n\tYou must have javascript enabled to use Yoga Now Online.\r\n\t</noscript>\r\n\t</body>\r\n\t</html>\r\n'

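Sorry that the HTML is such a mess; I'll try to attach a prettified version later. The problem is that the data I need isn't in the response from the Scrapy crawl. However, when I go to the URL manually, or even use view(response), the following HTML is present (this is the data I want to scrape):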

<tr class="oddRow" style="width: 929px;">
<td style="width: 90px;">&nbsp;&nbsp;&nbsp;4:00&nbsp;pm </td><td style="width: 167px;"></td>
<td style="width: 172px;"><a class="modalClassDesc" name="cid617" href="javascript:;">Vinyasa (Level 1-2)</a></td>
<td style="width: 172px;"><a class="modalBio" name="bio100000375" href="javascript:;">Dietrich McGaffey</a></td>
<td style="width: 106px;">Main Yoga Room</td><td style="width: 162px;">&nbsp;1&nbsp;hour&nbsp;&amp;&nbsp;30&nbsp;minutes</td></tr>

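So that's the big picture, and I hope it gives you a good idea of what I'm trying to accomplish. The HTML I want to scrape is available in the browser but not through the scrapy shell. I know Scrapy is being redirected. Based on the time I've spent investigating, I believe the problem is either that the site uses JavaScript detection to block bots, or that Scrapy isn't handling cookies correctly.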

To confuse myself even further, here is the output from cURL:

curl https://clients.mindbodyonline.com/ASP/adm/home.asp?studioid=2260
<head><title>Object moved</title></head>
<body><h1>Object Moved</h1>This object may be found <a HREF="/ASP/ws.asp?studioid=2260">here</a>.</body>

When I follow cURL's link, it seems to send me into an infinite loop of "Object Moved" links.

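Sorry for being so long-winded, but I wanted to describe my problem thoroughly. If anyone has a solution or pointers on how to investigate further, I'd value your input. Thanks for taking the time to try to help me.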

1 Answer:

Answer 0 (score: 1)

Using Chrome, I get redirected from https://clients.mindbodyonline.com/ASP/adm/home.asp?studioid=2260 to https://clients.mindbodyonline.com/ASP/home.asp?studioid=2260 (see the edit below for an explanation).

Still using Chrome, view-source:https://clients.mindbodyonline.com/ASP/home.asp?studioid=2260 shows that the page contains a frameset:

<frameset id="mainFrameset" frameborder="0" framespacing="0" NORESIZE>   
  <frame name="mainFrame" src="main_class.asp?tg=&amp;vt=&amp;lvl=&amp;stype=&amp;view=&amp;trn=0&amp;page=&amp;catid=&amp;prodid=&amp;date=7%2F16%2F2013&amp;classid=0&amp;sSU=&amp;optForwardingLink=&amp;qParam=&amp;justloggedin=&amp;nLgIn=&amp;pMode=" frameborder="10"  scrolling="YES" width="320">
</frameset>
<noframes> 
<body style="background-color:#FFFFFF;" text="#000000">
</body>
</noframes> 
</html>

So I think you need to fetch the page that the @src attribute of frame[@name="mainFrame"] points to.
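As a rough sketch of that step, the snippet below pulls the src of the mainFrame frame out of the frameset and resolves it against the page URL. It uses only the Python 3 standard library rather than Scrapy selectors, the markup is a shortened copy of the frameset above (the real src carries more parameters), and the names FrameSrcFinder and frame_url are made up for this illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# URL of the page that contains the frameset (taken from the answer).
FRAMESET_URL = "https://clients.mindbodyonline.com/ASP/home.asp?studioid=2260"

# Shortened copy of the frameset markup; the real src has more parameters.
FRAMESET_HTML = """
<frameset id="mainFrameset" frameborder="0" framespacing="0" NORESIZE>
  <frame name="mainFrame" src="main_class.asp?tg=&amp;trn=0&amp;date=7%2F16%2F2013&amp;classid=0" frameborder="10" scrolling="YES" width="320">
</frameset>
"""


class FrameSrcFinder(HTMLParser):
    """Hypothetical helper: remembers the src of the frame with a given name."""

    def __init__(self, frame_name):
        super().__init__()
        self.frame_name = frame_name
        self.src = None

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "frame" and attributes.get("name") == self.frame_name:
            # HTMLParser has already unescaped &amp; to & in attribute values.
            self.src = attributes.get("src")


def frame_url(page_url, html, frame_name="mainFrame"):
    """Return the absolute URL of the named frame, or None if it is absent."""
    finder = FrameSrcFinder(frame_name)
    finder.feed(html)
    finder.close()
    if finder.src is None:
        return None
    # The src is relative, so resolve it against the URL of the frameset page.
    return urljoin(page_url, finder.src)


schedule_url = frame_url(FRAMESET_URL, FRAMESET_HTML)
print(schedule_url)
```

The resulting URL is what you would then request (for example with fetch() in the scrapy shell) to get the frame's content.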

Still in Chrome, the source of https://clients.mindbodyonline.com/ASP/main_class.asp?tg=&vt=&lvl=&stype=&view=&trn=0&page=&catid=&prodid=&date=7%2F16%2F2013&classid=0&sSU=&optForwardingLink=&qParam=&justloggedin=&nLgIn=&pMode= does contain the <table id="classSchedule-mainTable" class="" cellspacing="0"> you are looking for.
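Once that page is in hand, turning a schedule row into plain values might look like the sketch below. It runs on the sample row quoted in the question, uses the Python 3 standard library instead of XPath, and RowTextParser is a made-up name; it is only meant to show how the &nbsp; padding can be normalised away.

```python
from html.parser import HTMLParser

# The sample row from the question, copied as-is.
ROW_HTML = """
<tr class="oddRow" style="width: 929px;">
<td style="width: 90px;">&nbsp;&nbsp;&nbsp;4:00&nbsp;pm </td><td style="width: 167px;"></td>
<td style="width: 172px;"><a class="modalClassDesc" name="cid617" href="javascript:;">Vinyasa (Level 1-2)</a></td>
<td style="width: 172px;"><a class="modalBio" name="bio100000375" href="javascript:;">Dietrich McGaffey</a></td>
<td style="width: 106px;">Main Yoga Room</td><td style="width: 162px;">&nbsp;1&nbsp;hour&nbsp;&amp;&nbsp;30&nbsp;minutes</td></tr>
"""


class RowTextParser(HTMLParser):
    """Hypothetical helper: collects the text of every <td>, grouped by <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            # split() also splits on non-breaking spaces, so the padding goes away.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


parser = RowTextParser()
parser.feed(ROW_HTML)
parser.close()
print(parser.rows)
```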


Edit: I tested this using the scrapy shell (I like to use lxml.etree directly):

  import lxml.etree
  import lxml.html

  # Parse the raw response body with lxml's HTML parser, then pretty-print <head>.
  doc = lxml.etree.fromstring(response.body, parser=lxml.html.HTMLParser())
  print(lxml.etree.tostring(doc.xpath('head')[0], pretty_print=True))

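It turns out that the redirect you see in the browser comes from a little bit of JavaScript (I don't know exactly what it is for, but it seems to match the behavior):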

    <script type="text/javascript">&#13;
&#13;
function launchHome() {&#13;
    &#13;
            document.wsLaunch.action = "home.asp?studioid=2260";&#13;
        &#13;
        document.wsLaunch.submit();&#13;
    }&#13;
    </script>
  </head>
  <body onload="launchHome();">&#13;

response.url is:

  response.url
  'https://clients.mindbodyonline.com/ASP/ws.asp?studioid=2260&sessionChecked=true'

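You are then redirected to https://clients.mindbodyonline.com/ASP/home.asp?studioid=2260, because the form's relative action, home.asp?studioid=2260, is resolved against that URL.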
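To make that redirect concrete, here is a minimal sketch of what launchHome() amounts to: it collects the hidden fields of the wsLaunch form and works out where they would be POSTed. It uses only the Python 3 standard library, the markup is an abbreviated copy of the form in the response body (only a few of the hidden inputs are kept), and FormParser and build_submission are names invented for this example.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode, urljoin

# The URL Scrapy ended up on (response.url above).
PAGE_URL = "https://clients.mindbodyonline.com/ASP/ws.asp?studioid=2260&sessionChecked=true"

# Abbreviated copy of the form from the response body; most inputs are omitted.
FORM_HTML = """
<body onLoad="launchHome();">
<form name="wsLaunch" action="home.asp?studioid=2260" method="post">
<input type="hidden" name="tg" value="" />
<input type="hidden" name="trn" value="0" />
<input type="hidden" name="date" value="7/16/2013" />
<input type="hidden" name="classid" value="0" />
</form>
</body>
"""


class FormParser(HTMLParser):
    """Hypothetical helper: records action, method and inputs of a named form."""

    def __init__(self, form_name):
        super().__init__()
        self.form_name = form_name
        self.action = None
        self.method = None
        self.fields = []
        self._inside = False

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "form" and attributes.get("name") == self.form_name:
            self._inside = True
            self.action = attributes.get("action", "")
            self.method = (attributes.get("method") or "get").upper()
        elif tag == "input" and self._inside and attributes.get("name"):
            self.fields.append((attributes["name"], attributes.get("value") or ""))

    def handle_endtag(self, tag):
        if tag == "form":
            self._inside = False


def build_submission(page_url, html, form_name="wsLaunch"):
    """Return (method, absolute URL, urlencoded body), or None if no such form."""
    parser = FormParser(form_name)
    parser.feed(html)
    parser.close()
    if parser.action is None:
        return None
    # Like a browser, resolve the relative action against the current page URL.
    return parser.method, urljoin(page_url, parser.action), urlencode(parser.fields)


method, url, body = build_submission(PAGE_URL, FORM_HTML)
print(method, url)
print(body)
```

Inside Scrapy itself, FormRequest.from_response is meant for resubmitting a form found in a response, so it may be a shorter route to the same request; I have not checked whether it reproduces this site's behavior, session cookies included.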