Question

Good morning all,

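I have been trying to access a website with Python 2.7, but I cannot retrieve its content, and several days of research have not helped. The site is https://www.cioh.org.co/. In Python, I want to be able to access the page and retrieve all of its HTML. In the past, I used the ssl module and added the following lines at the top: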

import ssl
ssl._create_default_https_context = ssl._create_unverified_context

This time that does not work. When I call requests.get('https://www.cioh.org.co/') with the requests module, I get the error: SSLError: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:661)

Some sites suggest using:

import requests
r = requests.get(URL, verify=False)
print r.text

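I tried that as well, but it does not actually fetch the content. It only retrieves what looks like internal header information from the site: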

<html>

<head>

<META NAME="robots" CONTENT="noindex,nofollow">

<script src="/_Incapsula_Resource?SWJIYLWA=5074a744e2e3d891814e9a2dace20bd4,719d34d31c8e3a6e6fffd425f7e032f3">

</script>

<body>

</body></html>

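The printed response is completely different from the website. After a lot of research, I tried the certifi module. I also installed OpenSSL and extracted the .crt, .key and .pem files (and tried using them), but still no luck. If needed, I can elaborate on the further research I have done.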

The site can be accessed without any trouble in any browser. Any help would be greatly appreciated.

Side note: this is my first time creating an account and asking a question. If anything is unclear, please let me know. Thanks in advance.

Answer 1

Judging by the _Incapsula_Resource script in the response, your request is being blocked by a WAF (web application firewall).
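To tell this placeholder apart from the real page in a script, a rough check along these lines might help. It is only a heuristic sketch: the marker string is taken from the response shown in the question, and the function name `looks_like_incapsula_challenge` is made up for illustration.

```python
import re

# Marker copied from the response in the question; other WAFs use other markers.
INCAPSULA_MARKER = "_Incapsula_Resource"


def looks_like_incapsula_challenge(html):
    """Heuristic: True if the HTML looks like the WAF placeholder, not the real page."""
    has_marker = INCAPSULA_MARKER in html
    # The placeholder in the question has a <body> with nothing in it.
    body = re.search(r"<body[^>]*>(.*?)</body>", html, re.IGNORECASE | re.DOTALL)
    body_is_empty = body is None or not body.group(1).strip()
    return has_marker and body_is_empty
```

If this returns True for r.text, the certificate settings are no longer the problem; the server answered, but with a challenge page instead of the content.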

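You could try changing the User-Agent string sent with your request so that it looks more like a normal browser, but the owner of the site clearly does not want automated scripts scraping their pages.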
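A minimal sketch of sending browser-like headers, using only the standard library so it runs on both Python 2.7 and Python 3. The header values and the helper names `build_request` and `fetch` are illustrative assumptions, not something the site is known to accept, and this may still return the placeholder page, since that page contains a script the client is expected to run.

```python
try:
    from urllib.request import Request, urlopen  # Python 3
except ImportError:
    from urllib2 import Request, urlopen  # Python 2.7

# Illustrative values only; copy the strings your own browser sends if possible.
BROWSER_HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) "
                  "Gecko/20100101 Firefox/115.0",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.5",
}


def build_request(url, headers=BROWSER_HEADERS):
    """Build (but do not send) a request that carries browser-like headers."""
    return Request(url, headers=dict(headers))


def fetch(url, timeout=30):
    """Send the request and return the raw response body as bytes."""
    response = urlopen(build_request(url), timeout=timeout)
    try:
        return response.read()
    finally:
        response.close()

# Example call (performs a real network request):
# html = fetch("https://www.cioh.org.co/")
```

With the requests module, the same dictionary can be passed as requests.get(url, headers=BROWSER_HEADERS).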

Answer 2

Evidently your code has to imitate a browser in some way, so I think you could do something like this:

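from selenium import webdriver


def scrape_page(url):
    # Drive a real Firefox instance so the page's JavaScript runs as in a browser.
    browser = webdriver.Firefox()
    try:
        browser.get(url)
        # page_source is the page as the browser currently sees it.
        return browser.page_source
    finally:
        # quit() ends the browser session even if get() raised.
        browser.quit()


if __name__ == "__main__":
    print(scrape_page('https://www.cioh.org.co/'))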

The implementation is quite clumsy, but it does work, and I hope it gets the idea across.

For this to work you must install geckodriver (see its installation instructions). To install selenium, just run: pip3 install selenium

Python 2.7: accessing an HTTPS website and retrieving its content
