I am trying to scrape a table on a public website using Python. I've checked that it connects to the site by running `page_soup.p`, which returns an element with a `p` tag.
When I check that grabbing the tag works with containers[0], I get:
Traceback (most recent call last):
  File "", line 1, in <module>
IndexError: list index out of range
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
my_url = 'https://overwatchleague.com/en-us/stats'
# opening the connection, grabbing the page
uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()
# html parsing
page_soup = soup(page_html, "html.parser")
# grabs each player
containers = page_soup.findAll("tr",{"class":"Table-row"})
There should be roughly 183 rows with that tag, so 0 is clearly not what I expected. Any insight into what I'm doing wrong?
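The empty result can be reproduced offline: BeautifulSoup only sees the static HTML the server sends, so if the rows are filled in later by JavaScript, find_all returns an empty list and indexing it raises exactly this error. A minimal sketch with a made-up HTML snippet standing in for the server response:

```python
from bs4 import BeautifulSoup

# Hypothetical static HTML, similar in spirit to what urlopen returns:
# the table shell is present, but the rows are injected client-side.
static_html = """
<html><body>
  <table class="Table">
    <tbody><!-- rows injected by JavaScript --></tbody>
  </table>
</body></html>
"""

page_soup = BeautifulSoup(static_html, "html.parser")
containers = page_soup.find_all("tr", {"class": "Table-row"})
print(len(containers))   # 0 -- so containers[0] raises IndexError
```

This matches the symptom in the question: the connection works and `page_soup.p` finds static elements, but the table rows simply aren't in the downloaded HTML.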
Answer (score: 2)
The data is loaded via JSON. To find the correct URL, look at which network requests the page makes, e.g. in the Firefox developer tools:
import requests
from datetime import timedelta

url = 'https://api.overwatchleague.com/stats/players?stage_id=regular_season&season=2019'
data = requests.get(url).json()

print('{:^12}{:^12}{:^12}{:^20}'.format('Name', 'Team', 'Deaths', 'Time Played'))
print('-' * (12*3+20))
for row in data['data']:
    print('{:^12}'.format(row['name']), end='')
    print('{:^12}'.format(row['team']), end='')
    print('{:^12.2f}'.format(row['deaths_avg_per_10m']), end='')
    t = timedelta(seconds=float(row['time_played_total']))
    print('{:>20}'.format(str(t)))
Prints:
Name Team Deaths Time Played
--------------------------------------------------------
Ado WAS 5.47 15:23:08.217194
Adora HZS 3.72 9:08:57.586787
Agilities VAL 5.27 17:16:59.668653
Aid TOR 5.08 8:02:19.102897
AimGod BOS 4.69 17:04:31.769137
aKm DAL 4.64 16:57:14.261245
alemao BOS 4.99 2:36:25.171021
ameng CDH 6.24 16:06:12.084212
Anamo NYE 2.36 17:33:31.143450
Architect SFS 4.33 3:18:45.065564
ArHaN HOU 6.39 1:54:10.439213
ArK WAS 2.50 9:32:57.421203
...and so on.
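Once the JSON is parsed it's plain Python data, so you can sort or filter the rows before printing rather than scraping rendered HTML. A small sketch with made-up sample rows (the real API entries carry the same keys, among others):

```python
# Sample rows mimicking entries from the API's "data" list
# (values here are made up for illustration).
rows = [
    {'name': 'Ado',   'team': 'WAS', 'deaths_avg_per_10m': 5.47},
    {'name': 'Anamo', 'team': 'NYE', 'deaths_avg_per_10m': 2.36},
    {'name': 'ameng', 'team': 'CDH', 'deaths_avg_per_10m': 6.24},
]

# Sort by average deaths, lowest first, and print with the
# same format spec used above.
for row in sorted(rows, key=lambda r: r['deaths_avg_per_10m']):
    print('{:^12}{:^12}{:^12.2f}'.format(
        row['name'], row['team'], row['deaths_avg_per_10m']))
```

The same pattern extends to any of the fields the endpoint returns; it's just list-of-dicts manipulation at that point.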