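I'm trying to download a website with Python, but I'm getting an error. My goal is to (1) download the site, (2) use Python to extract the relevant information from it, and (3) save the result to another file on my hard drive. I'm stuck on step 1. The other steps worked fine until a strange SSL error started showing up. I'm using Python 2.7.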
import urllib
testsite = urllib.URLopener()
# Raw string so that "\f" in the path is not read as a form-feed escape.
testsite.retrieve("https://thepiratebay.se/top/207", r"C:\file.html")
This is what I get:
Traceback (most recent call last):
  File "C:\Users\Xaero\Desktop\Python\class related\scratch.py", line 10, in <module>
    testsite.retrieve("https://thepiratebay.se/top/207", "C:\file.html")
  File "C:\Python27\lib\urllib.py", line 237, in retrieve
    fp = self.open(url, data)
  File "C:\Python27\lib\urllib.py", line 205, in open
    return getattr(self, name)(url)
  File "C:\Python27\lib\urllib.py", line 435, in open_https
    h.endheaders(data)
  File "C:\Python27\lib\httplib.py", line 940, in endheaders
    self._send_output(message_body)
  File "C:\Python27\lib\httplib.py", line 803, in _send_output
    self.send(msg)
  File "C:\Python27\lib\httplib.py", line 755, in send
    self.connect()
  File "C:\Python27\lib\httplib.py", line 1156, in connect
    self.sock = ssl.wrap_socket(sock, self.key_file, self.cert_file)
  File "C:\Python27\lib\ssl.py", line 342, in wrap_socket
    ciphers=ciphers)
  File "C:\Python27\lib\ssl.py", line 121, in __init__
    self.do_handshake()
  File "C:\Python27\lib\ssl.py", line 281, in do_handshake
    self._sslobj.do_handshake()
IOError: [Errno socket error] [Errno 1] _ssl.c:499: error:14077438:SSL routines:SSL23_GET_SERVER_HELLO:tlsv1 alert internal error
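After some research online, it turns out that The Pirate Bay is very Python-unfriendly. I found some code that sends a different user agent, which got the page to load, but that code has recently stopped working too. >_<
It produces the same error: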
import urllib2
import os
import datetime
import time
from urllib import FancyURLopener
from random import choice
today = datetime.datetime.today()
today = today.strftime('%Y.%m.%d')
user_agents = [
'Mozilla/5.0 (Windows; U; Windows NT 5.1; it; rv:1.8.1.11) Gecko/20071127 Firefox/2.0.0.11',
'Opera/9.25 (Windows NT 5.1; U; en)',
'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)',
'Mozilla/5.0 (compatible; Konqueror/3.5; Linux) KHTML/3.5.5 (like Gecko) (Kubuntu)',
'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.0.12) Gecko/20070731 Ubuntu/dapper-security Firefox/1.5.0.12']
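class MyOpener(FancyURLopener, object):
    # FancyURLopener sends this class attribute as the User-Agent header.
    version = choice(user_agents)
myopener = MyOpener()
# Raw string keeps the backslash in the Windows path literal.
page = myopener.retrieve('https://thepiratebay.se/top/207', r'C:\TPB.HDMovies' + today + '.html')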
Has anyone managed to get this working?
Answer 0 (score: 0)
Have you tried using Selenium?
pip install selenium
For further installation instructions, see the Selenium documentation.
First, import Selenium:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
Then start the webdriver and load the page:
driver = webdriver.Firefox()
driver.get("https://thepiratebay.se/top/207")
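Since the goal in the question is to end up with the page saved on disk, the snippet above can be extended along these lines. This is only a sketch: `save_page_source` is a hypothetical helper name, and it assumes the driver exposes the usual `get()` method and `page_source` attribute of a Selenium WebDriver.

```python
import io


def save_page_source(driver, url, out_path):
    """Load url in the given WebDriver and write the rendered HTML to out_path.

    Hypothetical helper, not part of Selenium itself.
    """
    driver.get(url)
    html = driver.page_source  # HTML of the page as the browser currently sees it
    # io.open gives the same text-mode behaviour on Python 2.7 and Python 3.
    with io.open(out_path, "w", encoding="utf-8") as fh:
        fh.write(html)
    return len(html)


# Example use (opens a real browser window, so it is left commented out):
# from selenium import webdriver
# driver = webdriver.Firefox()
# try:
#     save_page_source(driver, "https://thepiratebay.se/top/207", r"C:\file.html")
# finally:
#     driver.quit()
```

Because a real browser performs the TLS handshake, this route should not depend on the `ssl` module bundled with the Python install.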
Answer 1 (score: 0)
Updating my Python install fixed it. I think I had 2.7.0; after updating to 2.7.11 the problem went away.
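The thread does not say why the upgrade helped. One plausible explanation, which is an assumption on my part, is that Python 2.7.9 and later backported newer TLS features (such as SNI support) to the `ssl` module, and servers that require them reject the handshake from older builds. A quick way to see what a given install offers might look like this; `tls_report` is a hypothetical helper name.

```python
import ssl
import sys


def tls_report():
    """Collect interpreter and OpenSSL details relevant to HTTPS handshakes."""
    return {
        "python": sys.version.split()[0],
        "openssl": ssl.OPENSSL_VERSION,
        # ssl.HAS_SNI does not exist on older 2.7 builds, hence the default.
        "has_sni": getattr(ssl, "HAS_SNI", False),
        # ssl.create_default_context was added in 2.7.9 / 3.4.
        "has_default_context": hasattr(ssl, "create_default_context"),
    }


for key, value in sorted(tls_report().items()):
    print("%s: %s" % (key, value))
```

If `has_sni` comes back `False`, upgrading the interpreter as described above is probably the simplest remedy.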
The script below now retrieves the page flawlessly:
import urllib2
import os
import datetime
import time
from urllib import FancyURLopener
from random import choice
today = datetime.datetime.today()
today = today.strftime('%Y.%m.%d')
user_agents = [
'Mozilla/5.0 (Windows; U; Windows NT 5.1; it; rv:1.8.1.11) Gecko/20071127 Firefox/2.0.0.11',
'Opera/9.25 (Windows NT 5.1; U; en)',
'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)',
'Mozilla/5.0 (compatible; Konqueror/3.5; Linux) KHTML/3.5.5 (like Gecko) (Kubuntu)',
'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.0.12) Gecko/20070731 Ubuntu/dapper-security Firefox/1.5.0.12']
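class MyOpener(FancyURLopener, object):
    # FancyURLopener sends this class attribute as the User-Agent header.
    version = choice(user_agents)
myopener = MyOpener()
# Raw string keeps the backslash in the Windows path literal.
page = myopener.retrieve('https://thepiratebay.se/top/207', r'C:\TPB.HDMovies' + today + '.html')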
Selenium looks interesting too, though. I'll take a look. Thanks for all your help! =D
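For anyone reading this on Python 3, where `FancyURLopener` is deprecated, the same idea (random user agent, dated output file) could be sketched with `urllib.request`. This is a rough port under my own assumptions, not the code from the thread: the function names `build_request`, `download`, and `dated_filename` are hypothetical, and the user-agent list is shortened to two of the original entries.

```python
import datetime
import random
import urllib.request

# Two of the user-agent strings from the script above.
USER_AGENTS = [
    "Opera/9.25 (Windows NT 5.1; U; en)",
    "Mozilla/5.0 (compatible; Konqueror/3.5; Linux) KHTML/3.5.5 (like Gecko) (Kubuntu)",
]


def build_request(url, user_agents=USER_AGENTS):
    """Return a Request carrying a randomly chosen User-Agent header."""
    headers = {"User-Agent": random.choice(user_agents)}
    return urllib.request.Request(url, headers=headers)


def dated_filename(prefix, day=None):
    """Build a name such as 'TPB.HDMovies2016.01.02.html'."""
    day = day or datetime.date.today()
    return "%s%s.html" % (prefix, day.strftime("%Y.%m.%d"))


def download(url, out_path, timeout=30):
    """Fetch url and write the raw response bytes to out_path."""
    with urllib.request.urlopen(build_request(url), timeout=timeout) as resp:
        body = resp.read()
    with open(out_path, "wb") as fh:
        fh.write(body)
    return len(body)


# Example use (needs network access, so it is left commented out):
# download("https://thepiratebay.se/top/207", dated_filename("TPB.HDMovies"))
```

Writing the bytes unmodified leaves any decoding decisions to the later extraction step.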