Trying to scrape data from a website but getting a 403 error

Asked: 2017-07-23 17:27:42

Tags: python web-scraping

from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup

my_url = 'http://www.csgoanalyst.win'
uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()
page_soup = soup(page_html, "html.parser")
page_soup.body

I'm trying to scrape hltv.org to find out which maps each team bans and picks. However, I keep getting the following error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/anaconda/lib/python3.6/urllib/request.py", line 223, in urlopen
    return opener.open(url, data, timeout)
  File "/anaconda/lib/python3.6/urllib/request.py", line 532, in open
    response = meth(req, response)
  File "/anaconda/lib/python3.6/urllib/request.py", line 642, in http_response
    'http', request, response, code, msg, hdrs)
  File "/anaconda/lib/python3.6/urllib/request.py", line 570, in error
    return self._call_chain(*args)
  File "/anaconda/lib/python3.6/urllib/request.py", line 504, in _call_chain
    result = func(*args)
  File "/anaconda/lib/python3.6/urllib/request.py", line 650, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 403: Forbidden
>>> page_html = uClient.read()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name 'uClient' is not defined
>>> uClient.close()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name 'uClient' is not defined

I've tried this script on another website, so I know it works. I think hltv has blocked bots or something like that. I know I shouldn't really be doing this if they don't want people to, but I'd love to get at the data.

Any help would be much appreciated. Thanks.

2 Answers:

Answer 0 (score: 0):

from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup

my_url = 'https://www.hltv.org/stats/teams/maps/6665/Astralis'

u_client = uReq(my_url)                              # open the connection to the stats page
page_soup = BeautifulSoup(u_client, "html.parser")   # parse the response

print(page_soup)

If you want to strip the tags:

import bleach

# strip every tag and keep only the text content
print(bleach.clean(str(page_soup), tags=[], strip=True))
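
If all you need is the visible text, a possible alternative (not part of the original answer) is BeautifulSoup's own get_text(), which avoids the extra bleach dependency. A minimal sketch, reusing the page_soup object built above:

# assumes page_soup was created as in the snippet above
text_only = page_soup.get_text(separator="\n", strip=True)   # drop all tags, keep readable text
print(text_only)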

Answer 1 (score: 0):

I'd recommend using the requests module instead of urllib. It's fast and has other advantages too. You're being blocked because your request is missing a User-Agent header. Try the following:

import requests

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Safari/537.36'}
my_url = 'https://www.hltv.org/stats/teams/maps/6665/Astralis'

page = requests.get(my_url, headers=headers)
page_html = page.text
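
From here you can hand page_html to BeautifulSoup just as in your original script; a minimal sketch, assuming the same html.parser backend:

from bs4 import BeautifulSoup

page_soup = BeautifulSoup(page_html, "html.parser")   # parse the downloaded HTML
print(page_soup.title)                                # quick sanity check that the page came through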
requests can be installed easily with pip:

pip install requests

You can also add the header using urllib, but it's a bit more verbose syntactically and perhaps a little slower.
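
For completeness, a minimal sketch of the urllib variant, assuming the same User-Agent string as above and a UTF-8 encoded page:

from urllib.request import Request, urlopen

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Safari/537.36'}
req = Request(my_url, headers=headers)            # attach the User-Agent header to the request
page_html = urlopen(req).read().decode('utf-8')   # send it and read the HTML (assumes UTF-8)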