Can requests get past the Forbes welcome page?
I am trying to access this article:
http://www.forbes.com/sites/andygreenberg/2012/10/15/how-i-accidentally-helped-compromise-the-secret-keys-of-high-security-handcuffs/
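Most visitors first land on a splash welcome page, which then redirects to the actual article. In Chrome I noticed that once the URL resolves to the article, a value is appended to it, and that value is random each time: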
http://www.forbes.com/sites/andygreenberg/2012/10/15/how-i-accidentally-helped-compromise-the-secret-keys-of-high-security-handcuffs/#216cc0922071
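One detail about the appended value: everything after `#` is a URL fragment, which a browser keeps on the client side and does not send in the HTTP request. The random suffix is therefore most likely added by JavaScript running in the page rather than required by the server. A minimal sketch using the standard library's `urllib.parse.urldefrag` (the variable names are illustrative) shows that both URLs point at the same resource:

```python
from urllib.parse import urldefrag

base = 'http://www.forbes.com/sites/andygreenberg/2012/10/15/how-i-accidentally-helped-compromise-the-secret-keys-of-high-security-handcuffs/'
resolved = base + '#216cc0922071'

# urldefrag splits a URL into the part sent to the server and the fragment.
url_without_fragment, fragment = urldefrag(resolved)

print(fragment)                      # 216cc0922071
print(url_without_fragment == base)  # True
```

In other words, requesting the URL with or without the suffix should reach the same server-side resource; the suffix alone is unlikely to be what unlocks the article.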
I have a feeling cookies may be involved, but so far my code has not fetched anything other than the HTML that makes up the welcome page.
import requests

url = 'http://www.forbes.com/sites/andygreenberg/2012/10/15/how-i-accidentally-helped-compromise-the-secret-keys-of-high-security-handcuffs/'
# Pretend to be a desktop Firefox browser.
hdrs = {"User-Agent": 'Mozilla/5.0 (Windows NT 6.0; WOW64; rv:24.0) Gecko/20100101 Firefox/24.0'}
# A session keeps cookies across requests.
session = requests.Session()
text = session.get(url, headers=hdrs, allow_redirects=True)
print('headers', text.headers)
print('cookies', requests.utils.dict_from_cookiejar(session.cookies))
print('html', text.text)
Output:
headers {'Content-Type': 'text/html;charset=utf-8', 'Backend': 'templates', 'Date': 'Tue, 30 Aug 2016 22:37:15 GMT', 'Connection': 'keep-alive', 'Accept-Ranges': 'bytes', 'Content-Language': 'en-US', 'X-Cnection': 'close', 'X-Frame-Options': 'SAMEORIGIN', 'Content-Length': '1983', 'Server': '', 'Vary': 'Accept-Encoding', 'Content-Encoding': 'gzip'}
cookies {'forbesbeta': 'A'}
html <!DOCTYPE html><html class="no-js" lang=""><head><title>Forbes Welcome</title><meta charset="UTF-8"><meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1"><meta name="viewport" content="width=device-width,initial-scale=1,maximum-scale=2"><meta name="description" content="Forbes Welcome page -- Forbes is a global media company, focusing on business, investing, technology, entrepreneurship, leadership, and lifestyle."><meta name="keywords" content="business news, market analysis, company profiles, personal finance, management, entrepreneurship, investments, financial advice, economy, technology news"><link rel="stylesheet" href="http://i.forbesimg.com/welcomead/styles/abd4e3d6.main.css"><script type="text/javascript">fbs_settings = {
mobile: 'false',
preview: 'false',
test: 'false',
classes: 'WyJwYWdlR29vZ2xlQWRTdWJjb250ZW50IiwiYWRoaSIsImFkX2tleXdvcmRzX2JvdF9yIiwiZ29vZ2xlLWFkLWFmYy1oZWFkZXIiLCJhcnRpY2xlX2JvdHRvbV9hZCIsImFkc1lOIiwidG9wQWRXcmFwcGVyIiwicmVnaW9uLW1pZGRsZS1hZCIsImFkc0RpdiIsInNfYWQyIiwiYWR3b3JkLWJveCIsImpzLWFkLWltdSIsImFkLXNwb25zb3JlZC1wb3N0IiwiY2VudGVyQWQiLCJiei1hZCIsImFkLTcyOHg5MCIsImdwdC1hZHMiLCJzcG9uc29yLXRleHQtY29udGFpbmVyIiwiYWRfcmVjdGFuZ3VsYXIiLCJob21lQWRCb3hJbkJpZ25ld3MiLCJwb3NfYWR2ZXJ0IiwiY29udGFpbnMtYWQiLCJ0b3AtYWRzZW5zZS1iYW5uZXIiLCJwYWdlSGVhZGVyQWQiLCJibG9jay1zcG9uc29yZWQtbGlua3MiLCJhZDI1MC1oMSIsImNoYW5nZV9BZENvbnRhaW5lciIsImFkX2dyaWQiLCJzcG9uc29yLXNlcnZpY2VzIiwidmlld19hZHNfYm90dG9tX2JnIl0='
};</script><script type="text/javascript">try {
fbs_settings.data = {"channel":"channel_0","section":"section_0","location":"welcomead_default","panel":"welcome_ad","contentPositions":[{"position":1,"title":"Quote of the Day","description":"\"Success is a terrible thing and a wonderful thing... Just do what you love.”","following":false,"byline":"Gene Wilder","hideDescription":false,"sponsored":false,"twitterHandle":"","hashtag":""}],"panelId":"panel4","limit":0,"swimlane":false,"more":false,"enableAds":false,"removeBVPrepend":false,"brandvoiceHeader":false,"profileLink":false,"fullListLink":false,"pagination":false,"filters":false,"year":0};
} catch (err) {
fbs_settings.data = null;
}</script><script type="text/javascript">try {
fbs_settings.angular_preload = ["//i.forbesimg.com/forbes/scripts/c632bd7f.vendor.js","//i.forbesimg.com/forbes/scripts/99f3b378.scripts.js","//i.forbesimg.com/forbes/styles/860430fd.main.css"];
} catch (err) {
fbs_settings.angular_preload = null;
}</script><script src="http://i.forbesimg.com/welcomead/scripts/vendor/69216742.modernizr.js"></script></head><body><div id="app" class="container clearfix default-template ad-300-by-250"><div id="navigation"></div><div id="content"><div id="adblock-hover" class="hidden"><span class="close-btn preloaded"><span class="close">CLOSE</span> <i class="icon icon-close"></i></span> <img> <a href="//www.forbes.com/adblock/instructions/" target="_blank">More Options</a></div> <script>(function() {
setTimeout(function() {
var inviEles = document.getElementsByClassName('invisible');
for (var ele in inviEles) {
if (!inviEles[0]) {
return;
}
inviEles[0].className = inviEles[0].className.replace('invisible', '');
}
if (window.performance && performance.mark) {
performance.mark('content_visible');
}
});
})();</script><div class="content-container"><div class="content-inner"><h1 class="title"> <i class="invisible branding icon icon-forbes-logo"></i> <span class="top invisible">Quote of</span> <span class="bottom invisible">the Day</span></h1><div class="body"> <p class="body-content invisible">"Success is a terrible thing and a wonderful thing... Just do what you love.”</p> <p class="body-byline invisible">Gene Wilder</p> </div></div></div><div class="circle-wrapper"><div class="circle invisible"></div><img class="circle fallback hidden" src="http://i.forbesimg.com/welcomead/images/circle.png"></div> </div><div id="ads"></div></div><!--[if lte IE 9]>
<script src="http://i.forbesimg.com/welcomead/scripts/b9b8347c.legacy.js"></script>
<![endif]--><script src="http://i.forbesimg.com/welcomead/scripts/1a364ca6.vendor.js"></script><script src="http://i.forbesimg.com/welcomead/scripts/8951c3c8.main.js"></script></body></html>
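My thinking is that since a browser can eventually resolve the article, requests should be able to as well. But because I cannot figure out what Forbes is doing, I cannot work out how to construct the right request parameters. Any ideas?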
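Whatever approach is used to get past the interstitial, it helps to detect in code whether a fetched page is still the welcome page. The output above shows that page's title is `Forbes Welcome`, so a minimal sketch can check for that. It assumes the interstitial keeps that exact title; `TitleParser` and `is_welcome_page` are hypothetical names, and only the standard library's `html.parser` is used:

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text inside the first <title> element."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = None

    def handle_starttag(self, tag, attrs):
        # Only record the first <title> encountered.
        if tag == 'title' and self.title is None:
            self._in_title = True
            self.title = ''

    def handle_data(self, data):
        if self._in_title:
            self.title += data

    def handle_endtag(self, tag):
        if tag == 'title':
            self._in_title = False


def is_welcome_page(html):
    """Return True if the HTML looks like the Forbes interstitial."""
    parser = TitleParser()
    parser.feed(html)
    return (parser.title or '').strip() == 'Forbes Welcome'
```

With the code from the question, `is_welcome_page(text.text)` would be expected to return `True` for the output shown, which confirms the request never reached the article.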
Answer 0 (score: 1)
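I never bothered with this at the time, but I have since used Selenium on a different project, and a user asked for an answer, so here are the basics of using Selenium to get past the Forbes splash page.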
You need to install a driver for Selenium: the Firefox driver, the Chrome driver, or the headless PhantomJS. On a Mac you can easily install chromedriver via Homebrew, or copy the single PhantomJS driver file to #comment
from selenium import webdriver
from selenium.webdriver.common.by import By

url = 'http://www.forbes.com/sites/andygreenberg/2012/10/15/how-i-accidentally-helped-compromise-the-secret-keys-of-high-security-handcuffs/'
browser = webdriver.Chrome()  # or webdriver.PhantomJS('usr/bin/phantomjs'); PhantomJS support was removed in Selenium 4
browser.get(url)
browser.implicitly_wait(5)  # wait up to 5 seconds for an element to appear before a lookup fails
# A very explicit XPath to the continue button on the welcome page.
browser.find_element(By.XPATH, '/html/body/div/div[1]/div/div[1]').click()
# Now grab whatever you want from the resulting page using...
browser.find_element(By.CSS_SELECTOR, 'css selector info').get_attribute('innerHTML')
browser.find_element(By.XPATH, 'xpath info').get_attribute('innerHTML')
# 'innerHTML' returns whatever the selected tags surround; other attributes are also possible, such as 'href' on an <a> tag.
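As an alternative to clicking through, the question notes that the welcome page eventually redirects by itself, so a script could also simply wait for that. The following is a minimal sketch, not a tested recipe for the live site: it assumes the interstitial keeps the title `Forbes Welcome` seen in the question's output and that the article page has a different title. `wait_past_welcome` is a hypothetical helper name, and it relies only on the `title` property of a Selenium browser object:

```python
import time


def wait_past_welcome(browser, timeout=10.0, poll=0.5):
    """Poll browser.title until it is no longer the welcome page title.

    Returns True once the title changes, False if the timeout expires.
    """
    deadline = time.monotonic() + timeout
    while True:
        if browser.title != 'Forbes Welcome':
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll)
```

A possible usage is to call `browser.get(url)`, then `wait_past_welcome(browser)`, and read `browser.page_source` only if it returned `True`. Polling the title avoids depending on the exact position of the continue button in the page markup.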