Python Requests and the Forbes 'Welcome' page redirect

Time: 2016-08-30 22:58:42

Tags: javascript python selenium beautifulsoup python-requests

Can Requests navigate past the Forbes welcome page? I am trying to access this article:

http://www.forbes.com/sites/andygreenberg/2012/10/15/how-i-accidentally-helped-compromise-the-secret-keys-of-high-security-handcuffs/

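For most visitors, this URL first lands on a splash-screen welcome page, which then redirects to the actual article. In Chrome I noticed that once the URL resolves to the actual article, a value is appended to it, and that value is random each time: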

http://www.forbes.com/sites/andygreenberg/2012/10/15/how-i-accidentally-helped-compromise-the-secret-keys-of-high-security-handcuffs/#216cc0922071
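A note on that appended value: everything after `#` is a URL fragment, which a browser keeps on the client and does not send in the HTTP request. That suggests the random suffix is set by script running in the page rather than by a server-side redirect, though this is an inference and not something the output confirms. A minimal sketch using only the standard library to split the two parts:

```python
from urllib.parse import urldefrag

# The resolved URL as seen in Chrome's address bar (copied from the question).
resolved = (
    "http://www.forbes.com/sites/andygreenberg/2012/10/15/"
    "how-i-accidentally-helped-compromise-the-secret-keys-of-"
    "high-security-handcuffs/#216cc0922071"
)

# urldefrag separates the part that is sent to the server from the fragment.
base_url, fragment = urldefrag(resolved)

print("sent to server:", base_url)
print("client-side fragment:", fragment)
```

Since the fragment never reaches the server, appending it to the URL passed to Requests would not be expected to change the response.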

I have a feeling this may involve cookies, but so far my code has not retrieved any HTML other than the markup that makes up the welcome page.

import requests

url = 'http://www.forbes.com/sites/andygreenberg/2012/10/15/how-i-accidentally-helped-compromise-the-secret-keys-of-high-security-handcuffs/'
hdrs = {"User-Agent": 'Mozilla/5.0 (Windows NT 6.0; WOW64; rv:24.0) Gecko/20100101 Firefox/24.0'}
# use a session so any cookies set by the site persist across redirects
session = requests.Session()
response = session.get(url, headers=hdrs, allow_redirects=True)
print('headers', response.headers)
print('cookies', requests.utils.dict_from_cookiejar(session.cookies))
print('html', response.text)

Output:

headers {'Content-Type': 'text/html;charset=utf-8', 'Backend': 'templates', 'Date': 'Tue, 30 Aug 2016 22:37:15 GMT', 'Connection': 'keep-alive', 'Accept-Ranges': 'bytes', 'Content-Language': 'en-US', 'X-Cnection': 'close', 'X-Frame-Options': 'SAMEORIGIN', 'Content-Length': '1983', 'Server': '', 'Vary': 'Accept-Encoding', 'Content-Encoding': 'gzip'}
cookies {'forbesbeta': 'A'}
html <!DOCTYPE html><html class="no-js" lang=""><head><title>Forbes Welcome</title><meta charset="UTF-8"><meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1"><meta name="viewport" content="width=device-width,initial-scale=1,maximum-scale=2"><meta name="description" content="Forbes Welcome page -- Forbes is a global media company, focusing on business, investing, technology, entrepreneurship, leadership, and lifestyle."><meta name="keywords" content="business news, market analysis, company profiles, personal finance, management, entrepreneurship, investments, financial advice, economy, technology news"><link rel="stylesheet" href="http://i.forbesimg.com/welcomead/styles/abd4e3d6.main.css"><script type="text/javascript">fbs_settings = {
                mobile: 'false',
                preview: 'false',
                test: 'false',
                classes: 'WyJwYWdlR29vZ2xlQWRTdWJjb250ZW50IiwiYWRoaSIsImFkX2tleXdvcmRzX2JvdF9yIiwiZ29vZ2xlLWFkLWFmYy1oZWFkZXIiLCJhcnRpY2xlX2JvdHRvbV9hZCIsImFkc1lOIiwidG9wQWRXcmFwcGVyIiwicmVnaW9uLW1pZGRsZS1hZCIsImFkc0RpdiIsInNfYWQyIiwiYWR3b3JkLWJveCIsImpzLWFkLWltdSIsImFkLXNwb25zb3JlZC1wb3N0IiwiY2VudGVyQWQiLCJiei1hZCIsImFkLTcyOHg5MCIsImdwdC1hZHMiLCJzcG9uc29yLXRleHQtY29udGFpbmVyIiwiYWRfcmVjdGFuZ3VsYXIiLCJob21lQWRCb3hJbkJpZ25ld3MiLCJwb3NfYWR2ZXJ0IiwiY29udGFpbnMtYWQiLCJ0b3AtYWRzZW5zZS1iYW5uZXIiLCJwYWdlSGVhZGVyQWQiLCJibG9jay1zcG9uc29yZWQtbGlua3MiLCJhZDI1MC1oMSIsImNoYW5nZV9BZENvbnRhaW5lciIsImFkX2dyaWQiLCJzcG9uc29yLXNlcnZpY2VzIiwidmlld19hZHNfYm90dG9tX2JnIl0='
            };</script><script type="text/javascript">try {
                fbs_settings.data = {"channel":"channel_0","section":"section_0","location":"welcomead_default","panel":"welcome_ad","contentPositions":[{"position":1,"title":"Quote of the Day","description":"\"Success is a terrible thing and a wonderful thing... Just do what you love.”","following":false,"byline":"Gene Wilder","hideDescription":false,"sponsored":false,"twitterHandle":"","hashtag":""}],"panelId":"panel4","limit":0,"swimlane":false,"more":false,"enableAds":false,"removeBVPrepend":false,"brandvoiceHeader":false,"profileLink":false,"fullListLink":false,"pagination":false,"filters":false,"year":0};
            } catch (err) {
                fbs_settings.data = null;
            }</script><script type="text/javascript">try {
                fbs_settings.angular_preload = ["//i.forbesimg.com/forbes/scripts/c632bd7f.vendor.js","//i.forbesimg.com/forbes/scripts/99f3b378.scripts.js","//i.forbesimg.com/forbes/styles/860430fd.main.css"];
            } catch (err) {
                fbs_settings.angular_preload = null;
            }</script><script src="http://i.forbesimg.com/welcomead/scripts/vendor/69216742.modernizr.js"></script></head><body><div id="app" class="container clearfix default-template ad-300-by-250"><div id="navigation"></div><div id="content"><div id="adblock-hover" class="hidden"><span class="close-btn preloaded"><span class="close">CLOSE</span> <i class="icon icon-close"></i></span> <img> <a href="//www.forbes.com/adblock/instructions/" target="_blank">More Options</a></div>  <script>(function() {
                        setTimeout(function() {
                            var inviEles = document.getElementsByClassName('invisible');
                            for (var ele in inviEles) {
                                if (!inviEles[0]) {
                                    return;
                                }
                                inviEles[0].className = inviEles[0].className.replace('invisible', '');
                            }
                            if (window.performance && performance.mark) {
                                performance.mark('content_visible');
                            }
                        });
                    })();</script><div class="content-container"><div class="content-inner"><h1 class="title">  <i class="invisible branding icon icon-forbes-logo"></i> <span class="top invisible">Quote of</span> <span class="bottom invisible">the Day</span></h1><div class="body">  <p class="body-content invisible">"Success is a terrible thing and a wonderful thing... Just do what you love.”</p>  <p class="body-byline invisible">Gene Wilder</p>  </div></div></div><div class="circle-wrapper"><div class="circle invisible"></div><img class="circle fallback hidden" src="http://i.forbesimg.com/welcomead/images/circle.png"></div>  </div><div id="ads"></div></div><!--[if lte IE 9]>
        <script src="http://i.forbesimg.com/welcomead/scripts/b9b8347c.legacy.js"></script>
        <![endif]--><script src="http://i.forbesimg.com/welcomead/scripts/1a364ca6.vendor.js"></script><script src="http://i.forbesimg.com/welcomead/scripts/8951c3c8.main.js"></script></body></html>
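
One detail in this output that may help explain what the page is doing: the `classes` value in `fbs_settings` looks like Base64-encoded JSON. Decoded, it appears to be a list of ad-related CSS class names (the first entry is `pageGoogleAdSubcontent`), which, together with the `adblock-hover` element, hints that the welcome page's script checks for ad blockers. The sketch below is one way to pull that value out and decode it; the helper name `decode_fbs_classes` and the shortened sample are made up for illustration.

```python
import base64
import json
import re


def decode_fbs_classes(html):
    """Return the decoded `classes` list from an fbs_settings script, or None."""
    match = re.search(r"classes:\s*'([A-Za-z0-9+/=]+)'", html)
    if match is None:
        return None
    raw = base64.b64decode(match.group(1))
    return json.loads(raw.decode("utf-8"))


# A shortened stand-in for the real page, built the same way (Base64 of a JSON list).
sample_names = ["adsDiv", "gpt-ads", "centerAd"]
encoded = base64.b64encode(json.dumps(sample_names).encode("utf-8")).decode("ascii")
sample_html = "<script>fbs_settings = { classes: '%s' };</script>" % encoded

decoded = decode_fbs_classes(sample_html)
print(decoded)
```

Running the same helper on the HTML printed above should yield the full list of class names.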

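I figured that since a browser can eventually resolve to the article, Requests should be able to as well, but because I cannot work out what Forbes is doing, I cannot work out how to craft the right request parameters. Any ideas?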
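
One way to narrow down what the welcome page is doing is to check whether the returned HTML contains anything a non-browser client could follow. The sketch below, using only the standard library, reports the page title, any `<meta http-equiv="refresh">` target, and the number of `<script>` tags. The class and function names (`PageProbe`, `probe`) are made up for illustration, and the sample string is a trimmed version of the output above.

```python
from html.parser import HTMLParser


class PageProbe(HTMLParser):
    """Collects the title, any meta-refresh target, and a count of script tags."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta_refresh = None
        self.script_count = 0
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and (attrs.get("http-equiv") or "").lower() == "refresh":
            self.meta_refresh = attrs.get("content")
        elif tag == "script":
            self.script_count += 1

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def probe(html):
    parser = PageProbe()
    parser.feed(html)
    return {
        "title": parser.title.strip(),
        "meta_refresh": parser.meta_refresh,
        "scripts": parser.script_count,
    }


# Trimmed stand-in for the welcome page HTML shown in the output.
sample = (
    '<!DOCTYPE html><html><head><title>Forbes Welcome</title>'
    '<meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1">'
    '<script src="http://i.forbesimg.com/welcomead/scripts/8951c3c8.main.js"></script>'
    '</head><body></body></html>'
)
report = probe(sample)
print(report)
```

The output above contains no meta refresh, and the response's `history` attribute in Requests would show any HTTP-level redirects that were followed. If both come up empty, the hop to the article is most likely driven by JavaScript, which Requests does not execute.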

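1 answer:

Answer 0 (score: 1)

I never got around to this at the time, but I have since used Selenium on a different project, and a user asked for an answer, so here are the basics of using Selenium to get past the Forbes splash page.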

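You need to install a driver for Selenium, whether that is the Firefox driver, the Chrome driver, or the headless PhantomJS. If you are on a Mac, you can easily install chromedriver via Homebrew, or copy the single PhantomJS driver file to the path indicated in the # comment in the code.
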
from selenium import webdriver
from selenium.webdriver.common.by import By

url = 'http://www.forbes.com/sites/andygreenberg/2012/10/15/how-i-accidentally-helped-compromise-the-secret-keys-of-high-security-handcuffs/'
browser = webdriver.Chrome()  # or webdriver.PhantomJS('usr/bin/phantomjs')

browser.get(url)
browser.implicitly_wait(5)  # poll for up to 5 seconds when locating elements
# a very explicit XPath to the continue button
browser.find_element(By.XPATH, '/html/body/div/div[1]/div/div[1]').click()

# now grab whatever you want from the resulting page using...

browser.find_element(By.CSS_SELECTOR, 'css selector info').get_attribute('innerHTML')
browser.find_element(By.XPATH, 'xpath info').get_attribute('innerHTML')
# 'innerHTML' grabs whatever the selected tags surround; other attributes are also possible, such as 'href' on an <a> tag.