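I'm trying to understand the different scenarios for reading web page content with urllib2, and the gifts website seems to have some check in place that prevents me from reading all of the HTML.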
import urllib2
from bs4 import BeautifulSoup
url = 'http://www.gifts.com'
request = urllib2.Request(url)
request.add_header('User-Agent', 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.116 Safari/537.36')
page = urllib2.urlopen(request)
soup = BeautifulSoup(page,'html.parser')
print soup
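I've run into similar problems in the past and solved them by adding a 'User-Agent' header, but this looks like some JavaScript check blocking access... The result is all of the page content up to: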
>>><script>var readyStateHandlerPDP = document.onreadystatechange;var AddPDPPrefetchFiles = function (SiteVersionData) {var _siteVersionNumber = SiteVersionData.GetSiteVersionNumber();var onDeferredLoadPDP = function () {/* append prefetch files for PDP to head */var head = $('head');head.append('<link rel="prefetch" href="//static.prvd.com/client/javascript/harmony/harmonytop.min.js?v=' + _siteVersionNumber + '">');head.append('<link rel="prefetch" href="//static.prvd.com/client/javascript/pdpcommon/pdpcommon.min.js?v=' + _siteVersionNumber + '">');head.append('<link rel="prefetch" href="//static.prvd.com/client/javascript/harmony/harmony.min.js?v=' + _siteVersionNumber + '">');head.append('<link rel="prefetch" href="//www.proflowers.com/product/controls/harmonytemplates/harmonytemplates.aspx?v=' + _siteVersionNumber + '">');};if (!readyStateHandlerPDP) {document.onreadystatechange = function () {if (document.readyState === "complete")onDeferredLoadPDP();}} else {readyStateHandlerPDP();onDeferredLoadPDP();}}(window.SiteVersionData);</script>
<link href="http://static.prvd.com/client/stylesheets/widgets/pseudoproduct.css?v=2016.2.24.1" rel="stylesheet" type="text/css"/></meta></meta></meta></meta></meta></head></html>
The page ends abruptly at that point, even though the HTML continues when I go to the site and view the page source.
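One thing that might help narrow this down is checking whether the truncation happens in the download or only after parsing. The sketch below inspects the raw response bytes before BeautifulSoup touches them. The helper names `fetch_raw` and `summarize_raw` are made up for illustration, and the `try`/`except` import is only there so the same code also runs on Python 3.

```python
try:
    import urllib2 as urlrequest          # Python 2, as in the question
except ImportError:
    import urllib.request as urlrequest   # Python 3 fallback


def fetch_raw(url, user_agent='Mozilla/5.0'):
    """Return the response body as raw bytes, before any parser sees it.

    Hypothetical helper; the default User-Agent is a placeholder.
    """
    request = urlrequest.Request(url)
    request.add_header('User-Agent', user_agent)
    return urlrequest.urlopen(request).read()


def summarize_raw(raw):
    """Report the body size and whether the closing tags were received."""
    lowered = raw.lower()
    return {
        'length': len(raw),
        'has_body_close': b'</body>' in lowered,
        'has_html_close': b'</html>' in lowered,
    }


# Example use (needs network access, so it is left commented out):
# print(summarize_raw(fetch_raw('http://www.gifts.com')))
```

If `has_body_close` comes back true for the raw bytes while the printed soup still stops at `</head>`, the server probably sent the whole page and the content is being lost during parsing. In that case it may be worth passing a different parser name to BeautifulSoup, such as 'lxml' or 'html5lib', assuming one of them is installed.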
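I'm not interested in masking my identity or making anonymous requests, so any help on how best to mimic normal web browsing to get past this check would be appreciated.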
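For reference, this is roughly what I understand "mimicking normal browsing" to mean with urllib2 alone: sending a fuller set of browser-style headers and keeping cookies between requests. It is only a sketch. The `Accept` and `Accept-Language` values are assumptions about what a browser typically sends, and `build_browser_like_opener` is a made-up helper name. Nothing here executes JavaScript, so it can only help if the check is based on headers or cookies.

```python
try:
    import urllib2 as urlrequest          # Python 2, as in the question
    import cookielib as cookiejar
except ImportError:
    import urllib.request as urlrequest   # Python 3 fallback
    import http.cookiejar as cookiejar

# Header values are illustrative assumptions, not copied from a real browser session.
BROWSER_HEADERS = [
    ('User-Agent', 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_3) '
                   'AppleWebKit/537.36 (KHTML, like Gecko) '
                   'Chrome/48.0.2564.116 Safari/537.36'),
    ('Accept', 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'),
    ('Accept-Language', 'en-US,en;q=0.8'),
]


def build_browser_like_opener():
    """Return (opener, jar): an opener that stores cookies and sends BROWSER_HEADERS."""
    jar = cookiejar.CookieJar()
    opener = urlrequest.build_opener(urlrequest.HTTPCookieProcessor(jar))
    opener.addheaders = list(BROWSER_HEADERS)  # replaces the default Python User-Agent
    return opener, jar


# Example use (needs network access, so it is left commented out):
# opener, jar = build_browser_like_opener()
# html = opener.open('http://www.gifts.com').read()
# print(len(html), len(jar))  # body size and number of cookies the site set
```

Cookies set by the first response stay in `jar` and are sent again on later `opener.open(...)` calls, which is closer to how a browser session behaves than a single bare request.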