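I can't open one particular URL with urllib2. The same approach works for other sites, such as "http://www.google.com", but not for this one (which displays fine in a browser).
My simple code: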
from BeautifulSoup import BeautifulSoup
import urllib2
url="http://www.experts.scival.com/einstein/"
response=urllib2.urlopen(url)
html=response.read()
soup=BeautifulSoup(html)
print soup
Can anyone help me get this working?
This is the error I get:
Traceback (most recent call last):
  File "/Users/jontaotao/Documents/workspace/MedicalSchoolInfo/src/AlbertEinsteinCollegeOfMedicine_SciValExperts/getlink.py", line 12, in <module>
    response=urllib2.urlopen(url);
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 126, in urlopen
    return _opener.open(url, data, timeout)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 400, in open
    response = meth(req, response)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 513, in http_response
    'http', request, response, code, msg, hdrs)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 432, in error
    result = self._call_chain(*args)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 372, in _call_chain
    result = func(*args)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 619, in http_error_302
    return self.parent.open(new, timeout=req.timeout)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 400, in open
    response = meth(req, response)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 513, in http_response
    'http', request, response, code, msg, hdrs)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 438, in error
    return self._call_chain(*args)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 372, in _call_chain
    result = func(*args)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 521, in http_error_default
    raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
urllib2.HTTPError: HTTP Error 404: Not Found
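Thanks.
Answer 0 (score: 9)
I just tried this and got a 404 code along with a page.
My guess is that the site does user-agent detection and, whether by accident or on purpose, does not serve content to Python's urllib.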
To clarify: with urllib, urlopen returns a response object that contains the 404 code and the HTML content. urllib2.urlopen raises a urllib2.HTTPError exception instead.
I suggest you try setting the user agent to something that looks like a browser. There is a question about that here: Changing user agent on urllib2.urlopen
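A minimal sketch of that suggestion, assuming the 404 really is caused by user-agent filtering (this thread does not confirm it). The helper names build_browser_request and fetch_html and the User-Agent string are illustrative and do not come from the question. The import fallback is only there so the same code also runs on Python 3, where urllib2 became urllib.request.

```python
try:
    import urllib2 as urlrequest          # Python 2
except ImportError:
    import urllib.request as urlrequest   # Python 3

# Assumed value: any browser-like string should do; this one is only an example.
BROWSER_UA = ("Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
              "(KHTML, like Gecko) Chrome/120.0 Safari/537.36")


def build_browser_request(url, user_agent=BROWSER_UA):
    """Return a Request that sends a browser-like User-Agent header."""
    return urlrequest.Request(url, headers={"User-Agent": user_agent})


def fetch_html(url):
    """Open the URL with the browser-like header and return the raw body."""
    response = urlrequest.urlopen(build_browser_request(url))
    return response.read()

# Usage (performs a real network request):
#   html = fetch_html("http://www.experts.scival.com/einstein/")
```

If the server still answers 404 with this header, the cause is probably something other than the user agent, for example a redirect that depends on cookies.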
Answer 1 (score: 4)
You can catch the error with try/except:
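import urllib2

req = urllib2.Request("http://www.experts.scival.com/einstein/")
try:
    u = urllib2.urlopen(req)
except urllib2.HTTPError as e:
    # The server answered with an error status such as 404.
    print e.code
    print e.msg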
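Because the first answer reports that the 404 arrives together with an HTML page, it can help to look at the body as well as the status code. HTTPError also behaves like a response object, so read() works on it. The sketch below uses a hypothetical helper, fetch_status_and_body, whose opener parameter is an assumption added only so the function can be exercised without a network connection.

```python
try:
    import urllib2 as urlrequest          # Python 2
    from urllib2 import HTTPError
except ImportError:
    import urllib.request as urlrequest   # Python 3
    from urllib.error import HTTPError


def fetch_status_and_body(url, opener=urlrequest.urlopen):
    """Return (status_code, body), even when the server answers with an error."""
    try:
        response = opener(url)
        return response.getcode(), response.read()
    except HTTPError as e:
        # HTTPError doubles as a response: it carries the status code and the body.
        return e.code, e.read()

# Usage (performs a real network request):
#   code, body = fetch_status_and_body("http://www.experts.scival.com/einstein/")
```

Inspecting the returned body may show whether the site is serving a generic "not found" page or something specific to non-browser clients.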
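Answer 2 (score: 0)
Hmm... are you sure this URL is valid? I have similar code that works with "http://www.google.com", and urllib gives me no problems. Alternatively, you can use a try-except statement to see the details of the error. Of course, MattH's answer is most likely the right one :)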