Question

I'm struggling to pull data from a table on a particular esports website.

I've been told that the pandas library can do this in just a few lines.

import pandas as pd


tables = pd.read_html('https://www.hltv.org/stats/teams/matches/5752/Cloud9')

print(tables[0])

I tried editing it to make it work for me, but without success.

import pandas as pd

from urllib.request import Request, urlopen

req = Request('https://www.hltv.org/stats/teams/matches/5752/Cloud9',     headers={'User-Agent': 'Mozilla/5.0'})
webpage = urlopen(req).read()

tables = pd.read_html ('https://www.hltv.org/stats/teams/matches/5752/Cloud9)

print(tables[0])

I was led to believe this might be the solution I'm looking for, or something close to it, but when I tried it this way I had no success.

"Traceback (most recent call last):
  File "C:\Users\antho\OneDrive\Documents\Python\tables clloud9.py", line 6, in <module>
webpage = urlopen(req).read()
  File "C:\Users\antho\AppData\Local\Programs\Python\Python37-32\lib\urllib\request.py", line 222, in urlopen
return opener.open(url, data, timeout)
  File "C:\Users\antho\AppData\Local\Programs\Python\Python37-32\lib\urllib\request.py", line 531, in open
response = meth(req, response)
  File "C:\Users\antho\AppData\Local\Programs\Python\Python37-32\lib\urllib\request.py", line 641, in http_response
'http', request, response, code, msg, hdrs)
  File "C:\Users\antho\AppData\Local\Programs\Python\Python37-32\lib\urllib\request.py", line 569, in error
return self._call_chain(*args)
  File "C:\Users\antho\AppData\Local\Programs\Python\Python37-32\lib\urllib\request.py", line 503, in _call_chain
result = func(*args)
  File "C:\Users\antho\AppData\Local\Programs\Python\Python37-32\lib\urllib\request.py", line 649, in http_error_default
raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 403: Forbidden"

All I want at this point is to pull the table from that link.

Answer 1

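This is probably because of a server security feature that blocks known spider/bot user agents. urllib is easily detected and blocked by anti-scraping tools, especially through the headers it sends. Try passing a full browser user-agent string in the headers and see whether one of them works.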
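As a rough sketch of what that could look like with urllib alone: the user-agent string below is just an assumed example of a full desktop-browser string, and build_request and fetch_html are hypothetical helper names, not part of any library. There is no guarantee this particular site accepts it.

```python
from urllib.error import HTTPError
from urllib.request import Request, urlopen

URL = "https://www.hltv.org/stats/teams/matches/5752/Cloud9"

# Assumed example of a full browser user-agent string; swap in any current one.
BROWSER_USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
    "AppleWebKit/537.36 (KHTML, like Gecko) "
    "Chrome/120.0.0.0 Safari/537.36"
)


def build_request(url, user_agent=BROWSER_USER_AGENT):
    """Create a Request that sends a browser-like User-Agent header."""
    return Request(url, headers={"User-Agent": user_agent})


def fetch_html(url, user_agent=BROWSER_USER_AGENT):
    """Return the page as text, or None if the server answers 403 Forbidden."""
    try:
        with urlopen(build_request(url, user_agent), timeout=30) as response:
            charset = response.headers.get_content_charset() or "utf-8"
            return response.read().decode(charset, errors="replace")
    except HTTPError as err:
        if err.code == 403:
            return None  # still blocked with this user agent
        raise  # any other HTTP error is unexpected here
```

Calling fetch_html(URL) with a few different user-agent strings shows quickly whether the header alone makes a difference. A None result means the server still refuses the request.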

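However, in your particular case the site's robots.txt file disallows crawlers on the stats pages, so the server is probably blocking all known scrapers, including urllib.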
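If you want to check such rules yourself, the standard library's urllib.robotparser can evaluate them. The sketch below runs against a made-up robots.txt (SAMPLE_ROBOTS_TXT is hypothetical and is not the site's real file), and is_allowed is just an illustrative helper name.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content for illustration; NOT the site's actual file.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /stats
"""


def is_allowed(robots_txt, url, user_agent="*"):
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


stats_allowed = is_allowed(
    SAMPLE_ROBOTS_TXT, "https://www.hltv.org/stats/teams/matches/5752/Cloud9"
)
home_allowed = is_allowed(SAMPLE_ROBOTS_TXT, "https://www.hltv.org/")
print("stats page allowed:", stats_allowed)
print("home page allowed:", home_allowed)
```

To test against the live file you would use parser.set_url(...) followed by parser.read() instead of parse(). Keep in mind that robots.txt only states the site's wishes. The 403 itself comes from the server's own blocking.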

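Try scraping with Selenium instead. Selenium looks more like a real user than a scraper, so it usually (at least for me) works as a workaround when you get HTTP Error 403: Forbidden.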
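A minimal sketch of the Selenium route, assuming Selenium 4 and a local Chrome install (both are assumptions on my part, and any other supported browser should work the same way). fetch_page_source and main are hypothetical names, and whether the site lets an automated browser through is not guaranteed.

```python
import io

URL = "https://www.hltv.org/stats/teams/matches/5752/Cloud9"


def fetch_page_source(url):
    """Load the page in a real browser and return the rendered HTML."""
    # Imported here so this file still loads on machines without Selenium.
    from selenium import webdriver

    driver = webdriver.Chrome()  # assumes Chrome is installed locally
    try:
        driver.get(url)
        return driver.page_source
    finally:
        driver.quit()  # always close the browser, even on errors


def main():
    """Fetch the page through the browser, then let pandas parse the tables."""
    import pandas as pd

    html = fetch_page_source(URL)
    tables = pd.read_html(io.StringIO(html))
    print(tables[0])
```

Calling main() opens a browser window, loads the page, and prints the first table pandas finds in the rendered HTML.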

Answer 2

import pandas as pd

from urllib.request import Request, urlopen

req = Request('https://www.hltv.org/stats/teams/matches/5752/Cloud9',     headers={'User-Agent': 'Mozilla/5.0'})
webpage = urlopen(req).read()

tables = pd.read_html('https://www.hltv.org/stats/teams/matches/5752/Cloud9')  # the closing quote was missing here

print(tables[0])
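One thing to watch in the snippet above: read_html is given the URL again, so pandas downloads the page itself and the bytes already stored in webpage (fetched with the custom header) go unused. A sketch of parsing already-downloaded HTML instead is shown here. SAMPLE_HTML is invented data standing in for the real page, and tables_from_html is a hypothetical helper name.

```python
import io

import pandas as pd

# Made-up HTML standing in for the downloaded page; not real site data.
SAMPLE_HTML = """
<table>
  <tr><th>Map</th><th>Result</th></tr>
  <tr><td>Mirage</td><td>W</td></tr>
  <tr><td>Inferno</td><td>L</td></tr>
</table>
"""


def tables_from_html(html):
    """Parse every <table> in an HTML string into a list of DataFrames."""
    # Wrapping the text in StringIO hands pandas a file-like object,
    # so it parses the given HTML rather than trying to download anything.
    return pd.read_html(io.StringIO(html))


tables = tables_from_html(SAMPLE_HTML)
print(tables[0])
```

With the real page this would become tables_from_html(webpage.decode('utf-8')), assuming the urlopen call succeeds and the page is UTF-8 encoded.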

Why does Pandas give an HTTP 403 error?
