Why can't I receive data from this website?

Date: 2015-01-13 23:26:27

Tags: python lxml python-requests

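I'm trying to eventually write a program that parses the HTML of a specific website, but the site I want to use fails with a bad status line error. The same code works for every other website I've tried. Is this something they are doing on purpose, and is there nothing I can do about it?

My code: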

from lxml import html
import requests

webpage = 'http://www.whosampled.com/search/?q=de+la+soul'
page = requests.get(webpage)
tree = html.fromstring(page.text)

The error message I get:

Traceback (most recent call last):
  File "/home/kyle/Documents/web.py", line 6, in <module>
    page = requests.get(webpage)
  File "/usr/local/lib/python2.7/dist-packages/requests/api.py", line 65, in get
    return request('get', url, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/requests/api.py", line 49, in request
    response = session.request(method=method, url=url, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/requests/sessions.py", line 461, in request
    resp = self.send(prep, **send_kwargs)
  File "/usr/local/lib/python2.7/dist-packages/requests/sessions.py", line 573, in send
    r = adapter.send(request, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/requests/adapters.py", line 415, in send
    raise ConnectionError(err, request=request)
ConnectionError: ('Connection aborted.', BadStatusLine("''",))

1 Answer:

Answer 0: (score: 1)

Provide a User-Agent header and it works:

webpage = 'http://www.whosampled.com/search/?q=de+la+soul'
page = requests.get(webpage, 
                    headers={'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'})
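
If several requests go to the same site, one option is to set the header once on a Session so that every request carries it. The following is a minimal sketch under that assumption; `make_session` and `preview_headers` are hypothetical helper names, not part of the original answer. `preview_headers` only prepares the request, so the merged headers can be inspected without touching the network.

```python
import requests

# Same browser-like User-Agent string as in the answer above.
BROWSER_UA = ('Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) '
              'AppleWebKit/537.36 (KHTML, like Gecko) '
              'Chrome/39.0.2171.95 Safari/537.36')


def make_session(user_agent=BROWSER_UA):
    """Hypothetical helper: a Session whose default headers carry user_agent."""
    session = requests.Session()
    # Replaces the library's default 'python-requests/...' User-Agent.
    session.headers.update({'User-Agent': user_agent})
    return session


def preview_headers(session, url):
    """Hypothetical helper: headers that would be sent, without sending anything."""
    # prepare_request merges the session defaults into the request.
    prepared = session.prepare_request(requests.Request('GET', url))
    return dict(prepared.headers)
```

With this in place, `make_session().get(webpage)` should behave like the call above, with the header applied automatically.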

Proof:

>>> from lxml import html
>>> import requests
>>> 
>>> webpage = 'http://www.whosampled.com/search/?q=de+la+soul'
>>> page = requests.get(webpage, headers={'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'})
>>> tree = html.fromstring(page.content)
>>> tree.findtext('.//title')
'Search Results for "de la soul" | WhoSampled'

FYI, it also works if you switch to https:

>>> webpage = 'https://www.whosampled.com/search/?q=de+la+soul'
>>> page = requests.get(webpage)
>>> tree = html.fromstring(page.content)
>>> tree.findtext('.//title')
'Search Results for "de la soul" | WhoSampled'
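
For a script that has to cope with both kinds of site, the header workaround can be applied only when the plain request fails. This is a rough sketch, not a definitive implementation: `fetch_with_fallback` is a hypothetical name, and the `get` parameter exists only so the function can be exercised with a stand-in instead of a real network call.

```python
import requests


def fetch_with_fallback(url, user_agent, get=requests.get):
    """Try a plain GET first; on ConnectionError retry with a User-Agent header.

    Hypothetical helper. `get` defaults to requests.get and can be swapped
    for a stand-in with the same (url, headers=None) calling convention.
    """
    try:
        return get(url)
    except requests.exceptions.ConnectionError:
        # The error in the question (BadStatusLine) surfaces as ConnectionError.
        return get(url, headers={'User-Agent': user_agent})
```

A second failure is deliberately not caught, so a site that rejects both attempts still raises ConnectionError to the caller.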