Downloading a website with Python

Time: 2016-01-10 23:58:06

Tags: python

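I'm trying to download a website with Python, but I'm getting an error. My goal is to download the site, extract the relevant information from it with Python, and save the result to another file on my hard drive. I'm stuck at step 1. The other steps had been working until some strange SSL error showed up. I'm using Python 2.7.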

import urllib

testsite = urllib.URLopener()
# Raw string: in a normal string literal "\f" would be read as a form feed
testsite.retrieve("https://thepiratebay.se/top/207", r"C:\file.html")

This is what I get:

Traceback (most recent call last):
  File "C:\Users\Xaero\Desktop\Python\class related\scratch.py", line 10, in <module>
    testsite.retrieve("https://thepiratebay.se/top/207", "C:\file.html")
  File "C:\Python27\lib\urllib.py", line 237, in retrieve
    fp = self.open(url, data)
  File "C:\Python27\lib\urllib.py", line 205, in open
    return getattr(self, name)(url)
  File "C:\Python27\lib\urllib.py", line 435, in open_https
    h.endheaders(data)
  File "C:\Python27\lib\httplib.py", line 940, in endheaders
    self._send_output(message_body)
  File "C:\Python27\lib\httplib.py", line 803, in _send_output
    self.send(msg)
  File "C:\Python27\lib\httplib.py", line 755, in send
    self.connect()
  File "C:\Python27\lib\httplib.py", line 1156, in connect
    self.sock = ssl.wrap_socket(sock, self.key_file, self.cert_file)
  File "C:\Python27\lib\ssl.py", line 342, in wrap_socket
    ciphers=ciphers)
  File "C:\Python27\lib\ssl.py", line 121, in __init__
    self.do_handshake()
  File "C:\Python27\lib\ssl.py", line 281, in do_handshake
    self._sslobj.do_handshake()
IOError: [Errno socket error] [Errno 1] _ssl.c:499: error:14077438:SSL routines:SSL23_GET_SERVER_HELLO:tlsv1 alert internal error

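After some research online, it turns out that Piratebay is very Python-unfriendly. I found some code that gives it a different user agent and got it to load the page, but that code recently stopped working too. >_<

It now produces the same error: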

import urllib2
import os
import datetime
import time
from urllib import FancyURLopener
from random import choice

today = datetime.datetime.today()
today = today.strftime('%Y.%m.%d')

user_agents = [
    'Mozilla/5.0 (Windows; U; Windows NT 5.1; it; rv:1.8.1.11) Gecko/20071127 Firefox/2.0.0.11',
    'Opera/9.25 (Windows NT 5.1; U; en)',
    'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)',
    'Mozilla/5.0 (compatible; Konqueror/3.5; Linux) KHTML/3.5.5 (like Gecko) (Kubuntu)',
    'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.0.12) Gecko/20070731 Ubuntu/dapper-security Firefox/1.5.0.12']


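class MyOpener(FancyURLopener, object):
    # The opener sends this attribute as its User-Agent header
    version = choice(user_agents)

myopener = MyOpener()
# Raw string keeps the backslash in the Windows path literal
page = myopener.retrieve('https://thepiratebay.se/top/207', r'C:\TPB.HDMovies' + today + '.html')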

Has anyone been able to get this working?

2 answers:

Answer 0 (score: 0)

Have you tried using Selenium?

pip install selenium

For further installation instructions, see here.

First import Selenium:

from selenium import webdriver
from selenium.webdriver.common.keys import Keys

Then start the webdriver and load the page:

driver = webdriver.Firefox()
driver.get("https://thepiratebay.se/top/207")
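Once the page is loaded, the HTML still has to be written to disk, which is what the question asks for. A minimal sketch of that step, assuming the driver exposes `get` and `page_source` as Selenium's Python bindings do; the helper names `save_page` and `download_with_firefox` are made up for illustration:

```python
import io


def save_page(driver, url, path):
    """Load url in the given driver and write the rendered HTML to path."""
    driver.get(url)
    html = driver.page_source  # HTML of the page as the browser currently holds it
    with io.open(path, "w", encoding="utf-8") as handle:
        handle.write(html)
    return len(html)


def download_with_firefox(url, path):
    """Hypothetical wrapper: start Firefox, save one page, close the browser."""
    # Imported here so save_page itself has no hard dependency on Selenium.
    from selenium import webdriver

    driver = webdriver.Firefox()
    try:
        return save_page(driver, url, path)
    finally:
        driver.quit()  # close the browser even if the download fails
```

Calling `download_with_firefox("https://thepiratebay.se/top/207", r"C:\file.html")` would then cover step 1 of the question, provided Firefox and its driver are installed.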

Answer 1 (score: 0)

Updating my Python install fixed it. I think I had 2.7.0; I updated to 2.7.11 and the problem went away.
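To see which interpreter and SSL library a given install actually uses, something like the sketch below may help. It only reports facts about the running interpreter. The guess that the missing SNI support in older 2.7 builds was behind the handshake error is an assumption, not something this thread confirms. The helper name `tls_summary` is made up for illustration.

```python
from __future__ import print_function

import ssl
import sys


def tls_summary():
    """Return basic facts about the interpreter's SSL/TLS support."""
    return {
        "python": sys.version.split()[0],
        "openssl": ssl.OPENSSL_VERSION,
        # HAS_SNI was added in 2.7.9; older 2.7.x builds lack the attribute.
        "sni": getattr(ssl, "HAS_SNI", False),
    }


if __name__ == "__main__":
    for name, value in sorted(tls_summary().items()):
        print(name, "=", value)
```

If `sni` comes back `False`, the install predates 2.7.9 and upgrading as described here is probably the simplest way forward.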

It now retrieves the page flawlessly:

import urllib2
import os
import datetime
import time
from urllib import FancyURLopener
from random import choice

today = datetime.datetime.today()
today = today.strftime('%Y.%m.%d')

user_agents = [
    'Mozilla/5.0 (Windows; U; Windows NT 5.1; it; rv:1.8.1.11) Gecko/20071127 Firefox/2.0.0.11',
    'Opera/9.25 (Windows NT 5.1; U; en)',
    'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)',
    'Mozilla/5.0 (compatible; Konqueror/3.5; Linux) KHTML/3.5.5 (like Gecko) (Kubuntu)',
    'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.0.12) Gecko/20070731 Ubuntu/dapper-security Firefox/1.5.0.12']


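class MyOpener(FancyURLopener, object):
    # The opener sends this attribute as its User-Agent header
    version = choice(user_agents)

myopener = MyOpener()
# Raw string keeps the backslash in the Windows path literal
page = myopener.retrieve('https://thepiratebay.se/top/207', r'C:\TPB.HDMovies' + today + '.html')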

Selenium looks interesting too, though. I'll take a look. Thanks for all your help! =D
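For anyone who would rather not subclass FancyURLopener, the same User-Agent idea can be sketched with a plain request object. This is only an illustration, not the code from this thread: the names `build_request` and `download` and the User-Agent string are made up, and the import fallback is there so the snippet also runs on Python 3. Note that the TLS handshake happens before any HTTP header is sent, so a different user agent alone would not be expected to fix the SSL error above.

```python
try:  # Python 2.7, as used in this thread
    from urllib2 import Request, urlopen
except ImportError:  # Python 3
    from urllib.request import Request, urlopen

# Hypothetical User-Agent string, chosen only for illustration.
USER_AGENT = "Mozilla/5.0 (X11; Linux x86_64; rv:43.0) Gecko/20100101 Firefox/43.0"


def build_request(url, user_agent=USER_AGENT):
    """Create a request object that carries a custom User-Agent header."""
    return Request(url, headers={"User-Agent": user_agent})


def download(url, path, user_agent=USER_AGENT):
    """Fetch url and write the raw response bytes to path."""
    response = urlopen(build_request(url, user_agent))
    try:
        data = response.read()
    finally:
        response.close()
    with open(path, "wb") as handle:  # binary mode: no decoding, no newline changes
        handle.write(data)
    return len(data)
```

Usage would be along the lines of `download("https://thepiratebay.se/top/207", r"C:\file.html")`.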