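I'm currently trying to use BeautifulSoup to scrape some information from the Discogs website that isn't available through their API. Unfortunately, I can't seem to connect to the site with urllib2, httplib, or requests without running into a BadStatusLine exception.
I think this is because any request to http://www.discogs.com is redirected to https://www.discogs.com. I was able to confirm that there is a redirect using the following code: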
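import requests

r_link = "http://www.discogs.com"
print "Trying " + r_link
# allow_redirects=False returns the redirect response itself instead of following it
r = requests.get(r_link, allow_redirects=False)
print(r.status_code, r.reason, r.history, r.headers['Location'])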
This returns:
Trying http://www.discogs.com
(301, 'Moved Permanently', [], 'https://www.discogs.com/')
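If I understand correctly, this means that any request to http://www.discogs.com gets redirected to https://www.discogs.com. So the obvious solution would seem to be to request https://www.discogs.com directly. Unfortunately, doing that with the code above (i.e. adding the s to the URL in r_link) results in a BadStatusLine error: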
Trying https://www.discogs.com
Traceback (most recent call last):
  File "start.py", line 26, in <module>
    r = requests.get(r_link, allow_redirects=False)
  File "/usr/local/lib/python2.7/site-packages/requests/api.py", line 67, in get
    return request('get', url, params=params, **kwargs)
  File "/usr/local/lib/python2.7/site-packages/requests/api.py", line 53, in request
    return session.request(method=method, url=url, **kwargs)
  File "/usr/local/lib/python2.7/site-packages/requests/sessions.py", line 468, in request
    resp = self.send(prep, **send_kwargs)
  File "/usr/local/lib/python2.7/site-packages/requests/sessions.py", line 576, in send
    r = adapter.send(request, **kwargs)
  File "/usr/local/lib/python2.7/site-packages/requests/adapters.py", line 426, in send
    raise ConnectionError(err, request=request)
requests.exceptions.ConnectionError: ('Connection aborted.', BadStatusLine("''",))
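Judging by the examples in the requests documentation, handling https links should not be a problem. In fact, when I use https://www.google.com as the URL, the code above yields a 302 response with r.headers['Location'] set, and the redirect succeeds.
So what is the problem? Why is this happening? Is it due to a mistake I've made? Could it be something specific to my machine or setup? Is it specific to Discogs' servers? I'm at a loss as to how to diagnose this.
Thanks.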
Answer 0 (score: 0)
Add a User-Agent header and the request will work:
h = {"User-Agent":"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.75 Safari/537.36"}
r_link = "https://www.discogs.com"
print ("Trying " + r_link)
r = requests.get(r_link,headers=h)
print(r.status_code, r.reason, r.history, r.headers)
print(r.content)
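The question mentions that urllib2 fails in the same way. As a minimal sketch, assuming the same header fix applies there, this is how a browser-like User-Agent could be attached with Python 3's urllib.request (the successor to urllib2). The helper names build_request and fetch are hypothetical, and this has not been verified against Discogs. Only the request object is built at import time; fetch needs network access and is not called here.

```python
import urllib.request

# Browser-like User-Agent copied from the answer above.
BROWSER_UA = (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/50.0.2661.75 Safari/537.36"
)


def build_request(url, user_agent=BROWSER_UA):
    """Return a Request that carries an explicit User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch(url, timeout=10):
    """Send the request and return (status, body). Needs network access."""
    with urllib.request.urlopen(build_request(url), timeout=timeout) as resp:
        return resp.status, resp.read()


req = build_request("https://www.discogs.com")
# urllib stores header names capitalized, hence "User-agent".
print(req.get_header("User-agent"))
```

Without an explicit header, urllib identifies itself as Python-urllib and requests as python-requests, which is presumably what the server objects to.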
A working example:
In [19]: h = {"User-Agent":"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.75 Safari/537.36"}
In [20]: r_link = "https://www.discogs.com"
In [21]: r = requests.get(r_link, headers=h)
In [22]: print(r.status_code, r.reason, r.history, r.headers)
(200, 'OK', [], {'Content-Encoding': 'gzip', 'Transfer-Encoding': 'chunked', 'Set-Cookie': 'sid=fad997b268420522ac0242de41fc694c; Domain=www.discogs.com; Expires=Sun, 19-Apr-2026 17:04:09 GMT; Path=/, language2=en; Domain=www.discogs.com; Path=/, session="9H1LFLTWiCMSowA7nKbUYlHU4N8=?"; Domain=www.discogs.com; Secure; HttpOnly; Path=/', 'Server': 'nginx/1.8.1', 'Connection': 'keep-alive', 'Date': 'Thu, 21 Apr 2016 17:04:10 GMT', 'Content-Type': 'text/html; charset=utf-8'})
In [23]: from bs4 import BeautifulSoup; soup = BeautifulSoup(r.content, "html.parser")
In [24]: soup.select("#email")
Out[24]: [<input autocaptialize="off" autocomplete="off" id="email" name="email" placeholder="Enter your email address" type="text"/>]
In [25]: soup.select("#username")
Out[25]: [<input autocaptialize="off" autocomplete="off" id="username" name="username" placeholder="Choose a username" type="text"/>]
If you want to log in:
h = {"User-Agent":"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.75 Safari/537.36"}
login = "https://www.discogs.com/login?return_to=%2F"
# The session keeps the cookies set by the login response for later requests
with requests.session() as s:
    r = s.post(login, data={"username":"your_user","password":"your_pass","Action.Login":""}, headers=h)
    print(r.content)
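When a dict is passed as data, requests sends it as a form-encoded body. As a rough illustration, the standard library's urlencode produces the equivalent string for the fields used above. The credentials are placeholders, and the field names are simply those from the snippet above, not something checked against the current Discogs login form.

```python
from urllib.parse import urlencode

# Same form fields as in the login snippet above (placeholder credentials).
form = {"username": "your_user", "password": "your_pass", "Action.Login": ""}

# urlencode builds an application/x-www-form-urlencoded string, which is
# the format requests uses for a dict passed as `data`.
body = urlencode(form)
print(body)  # username=your_user&password=your_pass&Action.Login=
```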
If we run it, you can see that we end up at https://www.discogs.com/my:
In [27]: h = {"User-Agent":"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.75 Safari/537.36"}
In [28]: login = "https://www.discogs.com/login?return_to=%2F"
In [29]: with requests.session() as s:
   ....:     r = s.post(login, data={"username":"xxxxxxxx","password":"xxxxxxxx","Action.Login":""}, headers=h)
   ....:     print(r.url)
   ....:
https://www.discogs.com/my
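The final URL can also be checked in code rather than by eye. The sketch below assumes, as the session above suggests, that a successful login ends at /my. What a failed login returns is not shown here, so treat this as a heuristic. The function name landed_on_profile is hypothetical.

```python
from urllib.parse import urlsplit


def landed_on_profile(final_url):
    """Return True if the final URL after redirects is the /my page."""
    parts = urlsplit(final_url)
    return parts.netloc == "www.discogs.com" and parts.path.rstrip("/") == "/my"


print(landed_on_profile("https://www.discogs.com/my"))  # True
print(landed_on_profile("https://www.discogs.com/login?return_to=%2F"))  # False
```

With the session code above, this would be called as landed_on_profile(r.url).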