I am trying to scrape data from the table on the following web page:
http://ontariohockeyleague.com/stats/players/60
Here is the code I have written so far:
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
my_url = 'http://ontariohockeyleague.com/stats/players/60'
#open webpage, read html, close webpage
uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()
#html parsing
page_soup = soup(page_html, "html.parser")
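The problem is that, as far as I can tell, the table is not actually contained in the HTML. Inspecting the page in the browser shows the table inside the main block, but for some reason BeautifulSoup does not see it: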
page_soup.main
<main class="container">
<div class="container-content" data-feed_key="2976319eb44abe94" data-is-league="1" data-lang="en" data-league="ohl" data-league-code="" data-pagesize="100" data-season="63" id="stats"></div>
</main>
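If I view the page source, it does not contain the table either, only the main block shown above. I have also tried other parsers with BeautifulSoup, and they return the same result.
How can I access the table?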
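Answer 0 (score: 1)
The table is rendered with JavaScript, so it does not appear in the initial HTML that urllib downloads. You can either find the API the page uses and fetch the data from there, or use a headless browser to get the fully JavaScript-rendered HTML.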
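As a rough illustration of the first option, the placeholder div in the static HTML already carries several data-* attributes, and these hint at what the page's script sends to its API. The sketch below uses only the standard library and the HTML snippet quoted in the question; the class and function names (DataAttrParser, placeholder_config) are made up for this example, and how the site's script actually uses these values is an assumption.

```python
from html.parser import HTMLParser

# Static HTML as returned by urllib in the question (copied from page_soup.main).
STATIC_HTML = '''<main class="container">
<div class="container-content" data-feed_key="2976319eb44abe94" data-is-league="1" data-lang="en" data-league="ohl" data-league-code="" data-pagesize="100" data-season="63" id="stats"></div>
</main>'''


class DataAttrParser(HTMLParser):
    """Collect the data-* attributes of the element with a given id."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.data_attrs = {}

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        if attr_map.get("id") == self.target_id:
            # Keep only the data-* attributes; these look like API parameters.
            self.data_attrs = {
                name: value
                for name, value in attr_map.items()
                if name.startswith("data-")
            }


def placeholder_config(html, target_id="stats"):
    """Return the data-* attributes of the placeholder element (hypothetical helper)."""
    parser = DataAttrParser(target_id)
    parser.feed(html)
    return parser.data_attrs


config = placeholder_config(STATIC_HTML)
for name, value in sorted(config.items()):
    print(name, "=", repr(value))
```

Values such as data-feed_key and data-league are good search terms in the browser's network inspector when looking for the request that actually returns the table data.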
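Answer 1 (score: 0)
According to the browser's network inspector, the page loads its data dynamically as JSON from http://lscluster.hockeytech.com/feed/. Every request needs a key that is taken from the main site. Here is an example (the data ends up in the variables seasons_data, teamsbyseason_data, and statviewtype_data):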
import requests
from bs4 import BeautifulSoup
import json
url = "http://ontariohockeyleague.com/stats/players/60"
r = requests.get(url)
soup = BeautifulSoup(r.text, 'lxml')
seasons_url = "http://lscluster.hockeytech.com/feed/?feed=modulekit&view=seasons&key=%s&fmt=json&client_code=ohl&lang=en&league_code=&fmt=json"
teamsbyseason_url = "http://lscluster.hockeytech.com/feed/?feed=modulekit&view=teamsbyseason&key=%s&fmt=json&client_code=ohl&lang=en&season_id=60&league_code=&fmt=json"
statviewtype_url = "http://lscluster.hockeytech.com/feed/?feed=modulekit&view=statviewtype&type=topscorers&key=%s&fmt=json&client_code=ohl&lang=en&league_code=&season_id=60&first=0&limit=100&sort=active&stat=all&order_direction="
key = soup.find('div', id='stats')['data-feed_key']
r = requests.get(seasons_url % key)
seasons_data = json.loads(r.text)
r = requests.get(teamsbyseason_url % key)
teamsbyseason_data = json.loads(r.text)
r = requests.get(statviewtype_url % key)
statviewtype_data = json.loads(r.text)
# print(json.dumps(seasons_data, indent=4, sort_keys=True))
# print(json.dumps(teamsbyseason_data, indent=4, sort_keys=True))
print(json.dumps(statviewtype_data, indent=4, sort_keys=True))
Prints:
{
"SiteKit": {
"Copyright": {
"powered_by": "Powered by HockeyTech.com",
"powered_by_url": "http://hockeytech.com",
"required_copyright": "Official statistics provided by Ontario Hockey League",
"required_link": "http://leaguestat.com"
},
"Parameters": {
"client_code": "ohl",
"feed": "modulekit",
"first": "0",
"fmt": "json",
"key": "2976319eb44abe94",
"lang": "en",
"lang_id": 1,
"league_code": "",
"league_id": "1",
"limit": "100",
"order_direction": "",
"season_id": 60,
"sort": "active",
"stat": "all",
"team_id": 0,
"type": "topscorers",
"view": "statviewtype"
},
... and so on...
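The output above is cut off before the player rows, so the exact key that holds them is not visible here. One way to get at the rows without guessing key names is to search the decoded JSON for the first list made up entirely of dicts. This is only a sketch: find_records is a hypothetical helper, the sample payload below is invented for the demonstration, and in the real response the first such list may not be the player table, so inspect the result before relying on it.

```python
def find_records(node):
    """Depth-first search for the first non-empty list whose items are all dicts."""
    if isinstance(node, list):
        if node and all(isinstance(item, dict) for item in node):
            return node
        children = node
    elif isinstance(node, dict):
        children = node.values()
    else:
        # Strings, numbers, None: nothing to search.
        return []
    for child in children:
        found = find_records(child)
        if found:
            return found
    return []


# Invented sample payload; "HypotheticalRows" is NOT a key from the real feed.
sample = {
    "SiteKit": {
        "Parameters": {"limit": "100", "type": "topscorers"},
        "HypotheticalRows": [
            {"name": "Player A", "points": "10"},
            {"name": "Player B", "points": "7"},
        ],
    }
}

rows = find_records(sample)
for row in rows:
    print(row["name"], row["points"])
```

With the real data, calling find_records(statviewtype_data) and printing the keys of the first row is a quick way to see which fields the feed provides.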