I'm trying to scrape a table from the ESPN website, but I can't seem to find the right name to access it.
url="https://www.espn.com/nba/stats/player/_/table/offensive/sort/avgAssists/dir/desc"
import requests
from bs4 import BeautifulSoup
headers={'User-Agent': 'Mozilla/5.0'}
response=requests.get(url,headers=headers)
soup=BeautifulSoup(response.content, 'html.parser')
soup.find_all('table',class_ ="ResponsiveTable ResponsiveTable--fixed-left mt4 Table2__title--remove-capitalization")
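The code just gives me an empty list :(
Answer 0 (score: 0)
Why not get the element with the flex class first, and then get the players table from it?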
import requests
from bs4 import BeautifulSoup
url="https://www.espn.com/nba/stats/player/_/table/offensive/sort/avgAssists/dir/desc"
headers={'User-Agent': 'Mozilla/5.0'}
response=requests.get(url, headers=headers)
soup=BeautifulSoup(response.content, 'html.parser')
all_tables = soup.find('div', {'class':'flex'})
all_tables.find('table')  # the first table inside the flex container holds the player names
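The two-step lookup above fails with an AttributeError if the container is missing, because `find` returns `None` when nothing matches. A minimal sketch with a guard, run against a made-up, heavily simplified HTML snippet rather than the real ESPN page (the helper name `first_table_cells` and the markup are assumptions for illustration):

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified stand-in for the page markup (not the real ESPN HTML).
html = (
    '<div class="flex"><table>'
    '<tr><td>LeBron James</td></tr>'
    '<tr><td>Ricky Rubio</td></tr>'
    '</table></div>'
)

def first_table_cells(markup):
    """Return the cell texts of the first table inside the first div.flex."""
    soup = BeautifulSoup(markup, "html.parser")
    container = soup.find("div", {"class": "flex"})
    if container is None:  # find() returns None when nothing matches
        return []
    table = container.find("table")
    if table is None:
        return []
    return [td.get_text(strip=True) for td in table.find_all("td")]

names = first_table_cells(html)
missing = first_table_cells("<p>no table here</p>")
print(names)
print(missing)
```

Returning an empty list keeps the calling code simple; raising a descriptive error would be an equally reasonable choice.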
Answer 1 (score: 0)
The tag in your selector is wrong. In
soup.find_all('table', class_="ResponsiveTable ResponsiveTable--fixed-left mt4 Table2__title--remove-capitalization")
the tag should not be 'table' but 'section':
soup.find_all('section', class_="ResponsiveTable ResponsiveTable--fixed-left mt4 Table2__title--remove-capitalization")
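The effect of the tag name can be reproduced offline. This is a small sketch on a made-up snippet in which the class sits on a section element that wraps the table; the markup is a simplified assumption, not the real ESPN HTML, and only a single class is matched to keep the example short:

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup: the class is on <section>, not on <table>.
html = """
<section class="ResponsiveTable ResponsiveTable--fixed-left mt4">
  <div class="flex">
    <table class="Table"><tr><td>LeBron James</td></tr></table>
  </div>
</section>
"""

soup = BeautifulSoup(html, "html.parser")

# Same class filter, different tag names.
as_table = soup.find_all("table", class_="ResponsiveTable")
as_section = soup.find_all("section", class_="ResponsiveTable")

print(len(as_table), len(as_section))
```

`find_all` only returns elements where both the tag name and the class filter match, so the first call comes back empty even though the class exists in the document.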
To get all the data, you can use the following example:
import requests
from bs4 import BeautifulSoup
url="https://www.espn.com/nba/stats/player/_/table/offensive/sort/avgAssists/dir/desc"
headers={'User-Agent': 'Mozilla/5.0'}
response=requests.get(url,headers=headers)
soup=BeautifulSoup(response.content, 'html.parser')
# Pair each row of the fixed-left table (rank, name) with the matching row of the stats table
for tr1, tr2 in zip(soup.select('table.Table.Table--align-right.Table--fixed.Table--fixed-left tr'),
                    soup.select('table.Table.Table--align-right.Table--fixed.Table--fixed-left ~ div tr')):
    data = tr1.select('td') + tr2.select('td')
    if not data:
        # Header rows contain no <td> cells
        continue
    print('{:<25}'.format(data[1].get_text(strip=True, separator='-').split()[-1]), end=' ')
    for td in data[2:]:
        print('{:<6}'.format(td.get_text(strip=True)), end=' ')
    print()
Prints:
James-LAL SF 30 34.9 25.7 9.9 20.2 49.1 2.2 6.4 34.4 3.6 5.3 67.9 7.6 10.6 1.2 0.6 3.9 23 7 26.33
Rubio-PHX PG 25 31.8 13.8 5.0 12.2 41.0 1.1 3.7 30.1 2.7 3.2 84.8 4.8 9.2 1.2 0.2 2.6 11 1 16.30
Doncic-DAL SF 26 32.2 29.1 9.4 19.8 47.7 3.0 9.2 32.2 7.3 9.1 79.7 9.6 8.8 1.2 0.1 4.3 17 8 31.43
Simmons-PHI PG 32 34.9 14.3 5.9 10.4 56.3 0.1 0.2 40.0 2.5 4.3 58.3 7.0 8.6 2.2 0.6 3.7 15 2 18.92
Young-ATL PG 31 34.9 28.5 9.3 20.9 44.4 3.4 9.3 36.8 6.5 7.7 84.5 4.3 8.3 1.2 0.1 4.7 9 1 23.21
Graham-CHA PG 34 34.7 19.2 6.1 15.9 38.2 3.8 9.5 39.8 3.2 4.1 79.7 3.9 7.6 0.8 0.3 3.0 9 0 17.20
Brogdon-IND PG 26 31.4 18.3 6.6 14.5 45.2 1.4 4.3 33.3 3.8 4.0 93.3 4.5 7.6 0.9 0.2 2.7 7 0 20.31
Harden-HOU SG 31 37.6 38.1 11.1 24.5 45.2 5.1 13.8 37.2 10.9 12.4 87.5 5.8 7.5 1.9 0.7 4.7 9 0 31.72
Lillard-POR PG 30 36.7 26.9 8.4 19.0 44.3 3.4 9.4 35.8 6.6 7.4 89.6 4.2 7.5 1.0 0.4 2.9 6 0 24.42
Westbrook-HOU PG 28 35.3 24.1 8.9 20.9 42.6 1.2 5.1 23.8 5.1 6.5 79.1 8.1 7.1 1.5 0.4 4.4 12 6 18.68
VanVleet-TOR SG 26 36.3 18.1 5.9 14.5 40.5 2.4 6.6 36.8 3.9 4.5 87.2 3.9 7.0 2.0 0.2 2.6 5 0 16.82
Jokic-DEN C 30 31.3 17.6 7.0 14.4 48.5 1.3 4.1 30.6 2.4 3.0 82.0 10.0 6.8 1.0 0.6 2.5 17 6 23.01
...and so on.
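Answer 2 (score: 0)
You can also use the same API that the web page uses to populate its table with player information. If you make a GET request to that API directly (with the right headers and query string), you receive all the player information you could need as JSON.
The API URL, the relevant headers, and the query-string parameters are all visible in Google Chrome's network log (most modern browsers have an equivalent). I found them by filtering the log down to XMLHttpRequest (XHR) resources and then clicking the "Show More" button at the bottom of the table.
I set the "limit" parameter to "3" because I was only interested in printing the data for the top three players. Changing this string to "50", for example, queries the API for the top 50 players.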
def main():
    import requests
    headers = {
        "accept": "application/json, text/plain, */*",
        "origin": "https://www.espn.com",
        "user-agent": "Mozilla/5.0"
    }
    params = {
        "region": "us",
        "lang": "en",
        "contentorigin": "espn",
        "isqualified": "true",
        "page": "1",
        "limit": "3",
        "sort": "offensive.avgAssists:desc"
    }
    base_url = "https://site.web.api.espn.com/apis/common/v3/sports/basketball/nba/statistics/byathlete"
    response = requests.get(base_url, headers=headers, params=params)
    response.raise_for_status()
    data = response.json()
    print(data["athletes"])
    return 0
if __name__ == "__main__":
    import sys
    sys.exit(main())
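To query a different number of players, only the "limit" value has to change. The sketch below builds the request URL for a given limit without sending anything, so the query string can be inspected first. The helper names `build_params` and `build_url` are made up for illustration, and whether the API accepts arbitrary "page" and "limit" values is an assumption:

```python
from urllib.parse import urlencode

BASE_URL = "https://site.web.api.espn.com/apis/common/v3/sports/basketball/nba/statistics/byathlete"

def build_params(limit, page=1):
    """Query-string parameters from the answer above, with limit and page made variable."""
    return {
        "region": "us",
        "lang": "en",
        "contentorigin": "espn",
        "isqualified": "true",
        "page": str(page),
        "limit": str(limit),
        "sort": "offensive.avgAssists:desc",
    }

def build_url(limit, page=1):
    """Return the full URL that a GET request with these parameters would target."""
    return f"{BASE_URL}?{urlencode(build_params(limit, page))}"

url = build_url(50)
print(url)
```

Passing `build_params(50)` as `params=` to `requests.get` should produce an equivalent request, since requests also URL-encodes the dictionary.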
Answer 3 (score: 0)
If the page has table tags, let Pandas do the work for you. Under the hood, pd.read_html parses the HTML with lxml or BeautifulSoup.
import pandas as pd
url = "https://www.espn.com/nba/stats/player/_/table/offensive/sort/avgAssists/dir/desc"
dfs = pd.read_html(url)
df = dfs[0].join(dfs[1])
df[['Name','Team']] = df['Name'].str.extract('^(.*?)([A-Z]+)$', expand=True)
Output:
print(df.head(5).to_string())
RK Name POS GP MIN PTS FGM FGA FG% 3PM 3PA 3P% FTM FTA FT% REB AST STL BLK TO DD2 TD3 PER Team
0 1 LeBron James SF 35 35.1 24.9 9.6 19.7 48.6 2.0 6.0 33.8 3.7 5.5 67.7 7.9 11.0 1.3 0.5 3.7 28 9 26.10 LAL
1 2 Ricky Rubio PG 30 32.0 13.6 4.9 11.9 41.3 1.2 3.7 31.8 2.6 3.1 83.7 4.6 9.3 1.3 0.2 2.5 12 1 16.40 PHX
2 3 Luka Doncic SF 32 32.8 29.7 9.6 20.2 47.5 3.1 9.4 33.1 7.3 9.1 80.5 9.7 8.9 1.2 0.2 4.2 22 11 31.74 DAL
3 4 Ben Simmons PG 36 35.4 14.9 6.1 10.8 56.3 0.1 0.1 40.0 2.7 4.6 59.0 7.5 8.6 2.2 0.7 3.6 19 3 19.49 PHI
4 5 Trae Young PG 34 35.1 28.9 9.3 20.8 44.8 3.5 9.4 37.5 6.7 7.9 85.0 4.3 8.4 1.2 0.1 4.8 11 1 23.47 ATL
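The name and team split above relies on the regular expression '^(.*?)([A-Z]+)$': a lazy group for the name followed by a trailing run of capital letters for the team abbreviation. Here is a small sketch of the same pattern using the standard `re` module; the helper `split_name_team` and the second input string are made up for illustration:

```python
import re

# Same pattern as in the pandas snippet: lazy name group, then trailing capitals.
NAME_TEAM = re.compile(r'^(.*?)([A-Z]+)$')

def split_name_team(cell):
    """Split a combined 'NameTEAM' cell into (name, team); team is None if no match."""
    match = NAME_TEAM.match(cell)
    if match is None:
        return cell, None
    return match.group(1), match.group(2)

typical = split_name_team("LeBron JamesLAL")
# Hypothetical cell where the name itself ends in capital letters.
edge_case = split_name_team("Some Player IINY")

print(typical)    # ('LeBron James', 'LAL')
print(edge_case)  # ('Some Player ', 'IINY')
```

The second result shows a limitation: when a name ends in capitals (a suffix such as "II"), those letters are absorbed into the team group, so such rows may need manual correction.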