Question

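I am downloading some data from an HTTPS page, https://www.spar.si/sl_SI/zaposlitev/prosta-delovna-mesta-.html, and I get the error below because of HTTPS. When I manually change the URL to HTTP, the download works fine. I looked for similar examples that solve this problem but did not find any. Do you know what to do?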

Traceback (most recent call last):
  File "down.py", line 34, in <module>
    soup = BeautifulSoup(urllib.urlopen(url).read(), "html.parser")
  File "g:\python\Lib\urllib.py", line 87, in urlopen
    return opener.open(url)
  File "g:\python\Lib\urllib.py", line 213, in open
    return getattr(self, name)(url)
  File "g:\python\Lib\urllib.py", line 443, in open_https
    h.endheaders(data)
  File "g:\python\Lib\httplib.py", line 1049, in endheaders
    self._send_output(message_body)
  File "g:\python\Lib\httplib.py", line 893, in _send_output
    self.send(msg)
  File "g:\python\Lib\httplib.py", line 855, in send
    self.connect()
  File "g:\python\Lib\httplib.py", line 1274, in connect
    server_hostname=server_hostname)
  File "g:\python\Lib\ssl.py", line 352, in wrap_socket
    _context=self)
  File "g:\python\Lib\ssl.py", line 579, in __init__
    self.do_handshake()
  File "g:\python\Lib\ssl.py", line 808, in do_handshake
    self._sslobj.do_handshake()
IOError: [Errno socket error] [SSL: UNKNOWN_PROTOCOL] unknown protocol (_ssl.c:590)

Here is my program:

#!/usr/bin/python
# -*- coding: utf-8 -*-
# encoding=UTF-8  
#
# DOWNLOADER
# To grab the text content of webpages and save it to TinyDB database.


import re, time, urllib, tinydb
from bs4 import BeautifulSoup 


start_time = time.time()



#Open file with urls.
with open("G:/myVE/vacancies/urls2.csv") as f:
    lines = f.readlines()

#Open file to write HTML to.
with open("G:/myVE/downloader/urls2_html.txt", 'wb') as g:

    #We parse the content of url file to get just urls without the first line and without the text.
    for line in lines[1:]:

        #Read the url from the file, dropping the trailing newline.
        #url = line.split(",")[0]
        url = line.strip()

        print "test"

        #Read the HTML of url
        soup = BeautifulSoup(urllib.urlopen(url).read(), "html.parser")

        print url
        #Mark of new HTML in HTML file.
        g.write("\n\nNEW HTML\n\n")

        #Write new HTML to file.
        g.write(str(soup))


print "Html saved to html.txt"
print "--- %s seconds ---" % round((time.time() - start_time),2)



"""
#We read HTML of the employment webpage that we intend to parse.
soup = BeautifulSoup(urllib.urlopen('http://www.simplybusiness.co.uk/about-us/careers/jobs/').read(), "html.parser")



#We write HTML to a file.
with open("E:/analitika/SURS/tutorial/tutorial/html.txt", 'wb') as f:
   f.write(str(soup)) 



print "Html saved to html.txt"
print "--- %s seconds ---" % round((time.time() - start_time),2)
"""

Thanks!

Answer 1

You should use the requests library; see http://docs.python-requests.org/en/latest/user/advanced/#ssl-cert-verification for reference.

Updated to add:

Here is an example that uses the requests library with your URL.

import requests

url = "https://www.spar.si/sl_SI/zaposlitev/prosta-delovna-mesta-.html" 
r = requests.get(url, verify=True)
print(r.text)
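
If the handshake can still fail for some of the URLs in your list, it may help to catch the error instead of letting one page stop the whole run. The following is only a sketch: the helper name fetch_html, the optional session argument, and the 10-second timeout are my own assumptions, not part of requests or of your program.

```python
import requests


def fetch_html(url, session=None, timeout=10):
    """Return the page body as text, or None if the request fails.

    fetch_html is a hypothetical helper; session lets the caller reuse
    a requests.Session (or pass a stand-in object when testing).
    """
    session = session or requests.Session()
    try:
        # verify=True is the default; it checks the server certificate.
        response = session.get(url, verify=True, timeout=timeout)
        # Turn 4xx/5xx status codes into exceptions as well.
        response.raise_for_status()
    except requests.exceptions.SSLError as exc:
        print("TLS problem for %s: %s" % (url, exc))
        return None
    except requests.exceptions.RequestException as exc:
        print("Request failed for %s: %s" % (url, exc))
        return None
    return response.text
```

Calling fetch_html(url) with the URL above should return the same text as r.text, and a failed page gives None, so the loop can move on to the next URL.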

Here is an example with BeautifulSoup and Python 3.3, which also seems to work.

import urllib.request  # "import urllib" alone does not load the request submodule
from bs4 import BeautifulSoup

url = "https://www.spar.si/sl_SI/zaposlitev/prosta-delovna-mesta-.html"
soup = BeautifulSoup(urllib.request.urlopen(url).read(), "html.parser")
print(soup)
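
For the loop in your program, a Python 3 version could look roughly like the sketch below. It uses only the standard library and writes the raw bytes rather than going through BeautifulSoup, since the original only saved str(soup). The function names clean_urls, fetch_bytes and save_pages, the fetch parameter, and the timeout are my own assumptions for illustration.

```python
import ssl
import urllib.request


def clean_urls(lines):
    """Skip the header row, strip newlines, and drop blank lines."""
    return [line.strip() for line in lines[1:] if line.strip()]


def fetch_bytes(url, timeout=10):
    """Download one page with certificate verification switched on."""
    context = ssl.create_default_context()
    with urllib.request.urlopen(url, timeout=timeout, context=context) as resp:
        return resp.read()


def save_pages(urls, out_path, fetch=fetch_bytes):
    """Write each page to out_path after a marker; return the number saved."""
    saved = 0
    with open(out_path, "wb") as out:
        for url in urls:
            try:
                body = fetch(url)
            except OSError as exc:
                # URLError, ssl.SSLError and socket timeouts all derive from OSError.
                print("Skipping %s: %s" % (url, exc))
                continue
            out.write(b"\n\nNEW HTML\n\n")
            out.write(body)
            saved += 1
    return saved
```

You would read the lines from urls2.csv as before, pass them through clean_urls, and hand the result to save_pages together with the output path. A URL that fails the handshake is reported and skipped instead of ending the run.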
