I am using html2text in Python to get the raw text of an HTML page (including its tags) fetched from a URL, but I am getting an error.
My code -
import html2text
import urllib2

# Route requests through an authenticated HTTP proxy
proxy = urllib2.ProxyHandler({'http': 'http://<proxy>:<pass>@<ip>:<port>'})
auth = urllib2.HTTPBasicAuthHandler()
opener = urllib2.build_opener(proxy, auth, urllib2.HTTPHandler)
urllib2.install_opener(opener)

# Fetch the page and convert the HTML to text
html = urllib2.urlopen("http://www.ndtv.com/india-news/this-stunt-for-a-facebook-like-got-the-hyderabad-youth-arrested-740851").read()
print html2text.html2text(html)
The error -
Traceback (most recent call last):
File "t.py", line 8, in <module>
html = urllib2.urlopen("http://www.ndtv.com/india-news/this-stunt-for-a-facebook-like-got-the-hyderabad-youth-arrested-740851").read()
File "/usr/lib/python2.7/urllib2.py", line 127, in urlopen
return _opener.open(url, data, timeout)
File "/usr/lib/python2.7/urllib2.py", line 404, in open
response = self._open(req, data)
File "/usr/lib/python2.7/urllib2.py", line 422, in _open
'_open', req)
File "/usr/lib/python2.7/urllib2.py", line 382, in _call_chain
result = func(*args)
File "/usr/lib/python2.7/urllib2.py", line 1214, in http_open
return self.do_open(httplib.HTTPConnection, req)
File "/usr/lib/python2.7/urllib2.py", line 1184, in do_open
raise URLError(err)
urllib2.URLError: <urlopen error [Errno 110] Connection timed out>
Can anyone explain what I am doing wrong?
Answer 0 (score: 13)
If you do not need SSL, this script should work in Python 2.7.x:
import urllib

url = "http://stackoverflow.com"
# Fetch the page and print the raw HTML
f = urllib.urlopen(url)
print f.read()
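To tie this back to the question's goal, the page fetched with urllib can be passed to html2text the same way as in the question's code; a minimal sketch, assuming html2text is installed and using the same example URL:
import urllib
import html2text

url = "http://stackoverflow.com"
# Fetch the raw HTML, then strip the markup with html2text
html = urllib.urlopen(url).read()
print html2text.html2text(html)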
In Python 3.x, use urllib.request instead of urllib, since Python 2's urllib2 was merged into urllib in Python 3.
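A minimal Python 3 sketch of the same fetch with urllib.request (same example URL as above):
from urllib.request import urlopen

url = "http://stackoverflow.com"
# In Python 3, read() returns bytes, so decode before printing
html = urlopen(url).read().decode("utf-8")
print(html)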
The http:// prefix is required in the URL.
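As a rough illustration of that point (the exact exception text can vary by Python version), urllib2 rejects a URL that has no scheme:
import urllib2

try:
    urllib2.urlopen("stackoverflow.com")  # missing the http:// scheme
except ValueError as err:
    # Expected: something like "unknown url type: stackoverflow.com"
    print err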