Question

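I want to fetch the HTML code from a website and write it to a file. It works for http sites, but with an SSL (https) link I get a lot of errors. Any idea how to handle this?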

from __future__ import print_function
import io
import os
import re
import ssl
from urllib.request import urlopen

with io.open('words.txt', 'a', encoding="utf-8") as g:
    url = "https://www.something.some"
    html = urlopen(url).read()
    print(html, file=g)

Here is the error:

Traceback (most recent call last):
  File "...\Desktop\mined.py", line 54, in <module>
    html = urlopen(url).read()
  File "...\Local\Programs\Python\Python35-32\lib\urllib\request.py", line 163, in urlopen
    return opener.open(url, data, timeout)
  File "....\Local\Programs\Python\Python35-32\lib\urllib\request.py", line 472, in open
    response = meth(req, response)
  File "...\Local\Programs\Python\Python35-32\lib\urllib\request.py", line 582, in http_response
    'http', request, response, code, msg, hdrs)
  File "...\Local\Programs\Python\Python35-32\lib\urllib\request.py", line 510, in error
    return self._call_chain(*args)
  File "...\Local\Programs\Python\Python35-32\lib\urllib\request.py", line 444, in _call_chain
    result = func(*args)
  File "...\Local\Programs\Python\Python35-32\lib\urllib\request.py", line 590, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 403: Forbidden

Answer 1

I would do it like this:

import urllib.request

resp = urllib.request.urlopen('https://somewebsite.com')  # open url (Python 3 location of urlopen)
page = resp.read()                                        # copy website source (bytes) to 'page' variable
with open("Output.txt", "wb") as text_file:               # open txt file in binary mode; closed automatically
    text_file.write(page)                                 # insert website source into txt file

Answer 2

urllib.error.HTTPError: HTTP Error 403: Forbidden

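A 403 Forbidden error means you connected to the website successfully, but the web server explicitly refused to serve you the content. The server may not want you to access the site over https, in which case you would probably get the same error when opening the same URL in a browser. It may also be that the server is not configured correctly for https.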

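If you can access the exact same URL with a browser but not with your script, the server may be filtering on the User-Agent header or something else (i.e. to block non-browser access). In that case, knowing the real URL of the site would help in giving you a better answer.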
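
If User-Agent filtering does turn out to be the cause, one thing to try is sending an explicit User-Agent with the request. The following is only a sketch under that assumption: the header value, the helper names build_request and save_html, and the output path are illustrative and not taken from the question, and the server may still refuse if it filters on something else.

```python
import io
from urllib.error import HTTPError
from urllib.request import Request, urlopen

# Assumption: a generic browser-like User-Agent string; the exact value is illustrative.
BROWSER_USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"


def build_request(url, user_agent=BROWSER_USER_AGENT):
    """Return a Request that carries an explicit User-Agent header."""
    return Request(url, headers={"User-Agent": user_agent})


def save_html(url, path, encoding="utf-8"):
    """Fetch url and append the decoded HTML to path; return True on success."""
    try:
        with urlopen(build_request(url)) as response:
            # read() returns bytes, so decode before writing to a text file.
            html = response.read().decode(encoding, errors="replace")
    except HTTPError as err:
        # The server can still answer 403 if it filters on something other
        # than the User-Agent header.
        print("Request failed:", err.code, err.reason)
        return False
    with io.open(path, "a", encoding=encoding) as out:
        out.write(html)
    return True
```

With the placeholder URL from the question, this would be called as save_html("https://www.something.some", "words.txt"). Decoding as UTF-8 is also an assumption; the page may declare a different charset.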

Get HTML content from an https site in Python
