How do I stop a 302 URL redirect in a simple web scrape?

Time: 2016-12-28 08:56:14

Tags: python http python-requests httpresponse http-status-code-302

I am trying to scrape a website using the Requests library in Python. When I try:

>>> import requests
>>> r = requests.get('http://www.cell.com/cell-stem-cell/home', allow_redirects=False)
>>> r.status_code
302
>>> r.text
'The URL has moved <a href="https://secure.jbs.elsevierhealth.com/action/getSharedSiteSession?redirect=http%3A%2F%2Fwww.cell.com%2Fcell-stem-cell%2Fhome&rc=0&code=cell-site">here</a>\n'

When I then request the URL from that response:

>>> r = requests.get("https://secure.jbs.elsevierhealth.com/action/getSharedSiteSession?redirect=http%3A%2F%2Fwww.cell.com%2Fcell-stem-cell%2Fhome&rc=0&code=cell-site")
>>>
>>> r.text
'\n\n\n\n\n<style type="text/css">\n    .hidden {\n        display: none;\n        visibility: hidden;\n    }\n</style>\n\n<!-- hidden iFrame for each of the SSO URLs -->\n<div class="hidden">\n    \n        <iframe src="//acw.secure.jbs.elsevierhealth.com/SSOCore/update?utt=81c120bb854495181ef4ef3f679b12261e956c5-JKh">Your browser doesn\'t support iFrames!</iframe>\n    \n        <iframe src="//acw.sciencedirect.com/SSOCore/update?utt=81c120bb854495181ef4ef3f679b12261e956c5-JKh">Your browser doesn\'t support iFrames!</iframe>\n    \n        <iframe src="//acw.scopus.com/SSOCore/update?utt=81c120bb854495181ef4ef3f679b12261e956c5-JKh">Your browser doesn\'t support iFrames!</iframe>\n    \n        <iframe src="//acw.sciverse.com/SSOCore/update?utt=81c120bb854495181ef4ef3f679b12261e956c5-JKh">Your browser doesn\'t support iFrames!</iframe>\n    \n        <iframe src="//acw.mendeley.com/SSOCore/update?utt=81c120bb854495181ef4ef3f679b12261e956c5-JKh">Your browser doesn\'t support iFrames!</iframe>\n    \n        <iframe src="//acw.elsevier.com/SSOCore/update?utt=81c120bb854495181ef4ef3f679b12261e956c5-JKh">Your browser doesn\'t support iFrames!</iframe>\n    \n</div>\n\n\n\n<noscript>\n    <a href="CANT POST LINK BECAUSE OF LACK OF REPUTATION POINTS OF STACK OVERFLOW">Redirect</a>\n</noscript>\n\n<!-- redirect to the product page after all iFrames are rendered -->\n<script>\n    setTimeout(redirectFun,2000);\n    var iFramesList = document.getElementsByTagName("iframe");\n    var renderedIFramesCount = 0;\n    var numberOfIFrames = iFramesList.length;\n    for (var i = 0; i < iFramesList.length; i++) {\n        var iFrame = iFramesList[i];\n        bindEvent(iFrame, \'load\', function(){\n            renderedIFramesCount = renderedIFramesCount + 1;\n            if (renderedIFramesCount >= numberOfIFrames)\n            {\n                redirectFun();\n            }\n        });\n    }\n    var doRedirect = true;\n    function redirectFun() {\n        if (doRedirect)\n  
          window.location.href = "CANT POST THIS WEBSITE BECAUSE OF MY REPUTATION POINTS ON STACKOVERFLOW";\n        doRedirect = false;\n    }\n\n    function bindEvent(el, eventName, eventHandler) {\n        if (el.addEventListener){\n            el.addEventListener(eventName, eventHandler, false);\n        } else if (el.attachEvent){\n            el.attachEvent(eventName, eventHandler);\n        }\n    }\n</script>\n\n'

I just want to get the HTML of the original website.

2 answers:

Answer 0 (score: 1)

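You need to send a User-Agent header with the request so that the site treats it as coming from a real web browser. So, if you want the content of the non-redirected URL, your code should be: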

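from requests import get

# Browser-like User-Agent; adjacent string literals are joined into one value.
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 '
                         '(KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36'}
# .text is the decoded response body; allow_redirects=False keeps the 302 response itself.
content = get('http://www.cell.com/cell-stem-cell/home', headers=headers, allow_redirects=False).text
print(content)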

The output will be:

The URL has moved <a href="https://secure.jbs.elsevierhealth.com/action/getSharedSiteSession?redirect=http%3A%2F%2Fwww.cell.com%2Fcell-stem-cell%2Fhome&rc=0&code=cell-site">here</a>

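If you want the content of the redirected URL, allow redirects but still include the User-Agent header. This approach works for most sites that do not rely on dynamic content. If you need to scrape data from a site that builds its content dynamically, you will have to use a browser automation tool such as Selenium.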
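The second case described above (redirects allowed, browser-like User-Agent sent) could look roughly like the sketch below. The helper name fetch_with_user_agent and its session parameter are illustrative and not part of Requests; the session is assumed to behave like requests.Session, and the User-Agent string is the one from the snippet above. Whether a particular site accepts it is not guaranteed.

```python
# Browser-like User-Agent taken from the answer above.
BROWSER_UA = ("Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 "
              "(KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36")


def fetch_with_user_agent(session, url, user_agent=BROWSER_UA):
    """Fetch `url`, following redirects, and report the hops taken.

    `session` is a hypothetical parameter expected to behave like
    requests.Session; passing it in lets the helper be tried offline.
    """
    response = session.get(url, headers={"User-Agent": user_agent},
                           allow_redirects=True)
    # response.history holds the intermediate redirect responses, oldest first.
    hops = [(r.status_code, r.url) for r in response.history]
    return response.text, hops
```

With Requests installed, it might be called like this:

    import requests
    html, hops = fetch_with_user_agent(requests.Session(), 'http://www.cell.com/cell-stem-cell/home')

The hops list shows which redirects were followed, which helps when the final page is not the one you expected.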

Answer 1 (score: 0)

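You can get it directly with very little work. When a redirect is required, the server sends a Location header. You just need to request the URL given in that Location header.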

r1.content

You will find the data you need in r1.text.
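The snippet above does not show how r1 is obtained. One way to follow the Location header by hand is sketched below. The names redirect_target and fetch_following_location, the session parameter (assumed to behave like requests.Session), and the max_hops limit are all illustrative assumptions, not something the answer specifies.

```python
from urllib.parse import urljoin

# Status codes treated as redirects in this sketch.
REDIRECT_CODES = {301, 302, 303, 307, 308}


def redirect_target(url, status_code, headers):
    """Return the absolute URL a redirect response points to, or None."""
    location = headers.get("Location")
    if status_code in REDIRECT_CODES and location:
        # Location may be relative, so resolve it against the request URL.
        return urljoin(url, location)
    return None


def fetch_following_location(session, url, max_hops=5):
    """Follow Location headers manually and return the first non-redirect response."""
    for _ in range(max_hops):
        response = session.get(url, allow_redirects=False)
        target = redirect_target(url, response.status_code, response.headers)
        if target is None:
            return response
        url = target
    raise RuntimeError("too many redirects")
```

With Requests installed, r1 could then be produced like this:

    import requests
    r1 = fetch_following_location(requests.Session(), 'http://www.cell.com/cell-stem-cell/home')

Note that the question's second snippet suggests the redirect target for this site relies on JavaScript to finish the redirect, so following Location headers alone may still not return the final page.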