HTTPS prevents website scraping in Python 3

Asked: 2018-09-29 16:39:07

Tags: python https web-scraping

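I am trying to scrape a website with Python code by following a tutorial, but the site has since been secured with "https", and running the code now returns the following error.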

[Image: screenshot of the error]


2 Answers:

Answer 0: (score: 1)

Can you try adding this to your code? It should bypass SSL certificate verification.

import ssl
ssl._create_default_https_context = ssl._create_unverified_context
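
The two lines above replace the default HTTPS context for every urllib request in the process. A narrower variant, sketched here on the assumption that the scraper fetches pages with urllib.request.urlopen, builds an unverified context and passes it to a single call. The helper names make_unverified_context and fetch_without_verification are made up for illustration. Skipping verification removes protection against man-in-the-middle attacks, so this is probably best kept to debugging.

```python
import ssl
import urllib.request


def make_unverified_context():
    """Return an SSLContext that skips certificate and hostname checks."""
    context = ssl.create_default_context()
    # check_hostname must be turned off before verify_mode can be CERT_NONE.
    context.check_hostname = False
    context.verify_mode = ssl.CERT_NONE
    return context


def fetch_without_verification(url, timeout=10):
    """Fetch url with verification disabled for this one request only."""
    context = make_unverified_context()
    with urllib.request.urlopen(url, context=context, timeout=timeout) as response:
        return response.read()
```

Calling fetch_without_verification(url) returns the raw response body as bytes, and other HTTPS requests in the same program keep their normal verification.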

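Answer 1: (score: 1)

The problem here is that the URL has anti-scraping protection in place, which blocks programmatic extraction of the HTML.

Try requests to see the full response: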

import requests
from bs4 import BeautifulSoup

# Specify the URL
quote_page = 'https://www.bloomberg.com/quote/SPX:IND'
result = requests.get(quote_page)
print(result.headers)

# Parse the HTML using Beautiful Soup and store it in the variable `soup`
c = result.content
soup = BeautifulSoup(c, "lxml")

print(soup)

Output:

{'Cache-Control': 'private, no-store, no-cache, must-revalidate, proxy-revalidate, max-age=0', 'Content-Encoding': 'gzip', 'Content-Type': 'text/html, text/html; charset=utf-8', 'ETag': 'W/"5bae6ca0-97f"', 'Last-Modified': 'Fri, 28 Sep 2018 18:02:08 GMT', 'Server': 'nginx', 'Accept-Ranges': 'bytes, bytes', 'Age': '0, 0', 'Content-Length': '1174', 'Date': 'Sat, 29 Sep 2018 17:03:02 GMT', 'Via': '1.1 varnish', 'Connection': 'keep-alive', 'X-Served-By': 'cache-fra19128-FRA', 'X-Cache': 'MISS', 'X-Cache-Hits': '0', 'X-Timer': 'S1538240583.834133,VS0,VE107', 'Vary': ', Accept-Encoding'}
<html>
<head>
<title>Terms of Service Violation</title>
<style rel="stylesheet" type="text/css">
        .container {
            font-family: Helvetica, Arial, sans-serif;
        }
    </style>
<script>
        window._pxAppId = "PX8FCGYgk4";
        window._pxJsClientSrc = "/8FCGYgk4/init.js";
        window._pxFirstPartyEnabled = true;
        window._pxHostUrl = "/8FCGYgk4/xhr";
        window._pxreCaptchaTheme = "light";

        function qs(name) {
            var search = window.location.search;
            var rx = new RegExp("[?&]" + name + "(=([^&#]*)|&|#|$)");
            var match = rx.exec(search);
            return match ? decodeURIComponent(match[2].replace(/\+/g, " ")) : null;
        }
    </script>
</head>
<body>
<div class="container">
<img src="https://www.bloomberg.com/graphics/assets/img/BB-Logo-2line.svg" style="margin-bottom: 40px;" width="310"/>
<h1 class="text-center" style="margin: 0 auto;">Terms of Service Violation</h1>
<p>Your usage has been flagged as a violation of our <a href="http://www.bloomberg.com/tos" rel="noopener noreferrer" target="_blank">terms of service</a>.
    </p>
<p>
        For inquiries related to this message please <a href="http://www.bloomberg.com/feedback">contact support</a>.
        For sales
        inquiries, please visit <a href="http://www.bloomberg.com/professional/request-demo">http://www.bloomberg.com/professional/request-demo</a>
</p>
<h3 style="margin: 0 auto;">
        If you believe this to be in error, please confirm below that you are not a robot by clicking "I'm not a robot"
        below.</h3>
<br/>
<div id="px-captcha" style="width: 310px"></div>
<br/>
<h3 style="margin: 0 auto;">Please make sure your browser supports JavaScript and cookies and
        that you are not blocking them from loading. For more information you can review the Terms of Service and Cookie
        Policy.</h3>
<br/>
<h3 id="block_uuid" style="margin: 0 auto; color: #C00;">Block reference ID: </h3>
<script src="/8FCGYgk4/captcha/captcha.js?a=c&amp;m=0"></script>
<script type="text/javascript">document.getElementById("block_uuid").innerText = "Block reference ID: " + qs("uuid");</script>
</div>
</body>
</html>
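
The page returned above is a block page rather than the quote page, even though the request itself went through. One way to catch this in a script, sketched here with only the standard library, is to look for the two markers visible in that output: the "Terms of Service Violation" title and the px-captcha element. The names BlockPageDetector and looks_blocked are hypothetical, and the markers come from this one response, so they may not hold for other sites or for later versions of the page.

```python
from html.parser import HTMLParser


class BlockPageDetector(HTMLParser):
    """Collect the <title> text and note whether a px-captcha element exists."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = ""
        self.has_captcha = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        # The block page above contains <div id="px-captcha" ...>.
        if dict(attrs).get("id") == "px-captcha":
            self.has_captcha = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def looks_blocked(html):
    """Return True if html carries either block-page marker seen in the output."""
    detector = BlockPageDetector()
    detector.feed(html)
    detector.close()
    return detector.has_captcha or "Terms of Service Violation" in detector.title
```

With the code from this answer, looks_blocked(result.text) could be checked before parsing, so that a block page is not mistaken for real data.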

By the way, if you are a student, you can register for an account with limited downloads.