An example URL is 'http://www.hockey-reference.com/players/c/crosbsi01/gamelog/2016'
The table I want to grab is the one named Regular Season.
This is what I have used in previous examples:
import requests
from bs4 import *
from bs4 import NavigableString
import pandas as pd
url = 'http://www.hockey-reference.com/players/o/ovechal01/gamelog/2016'
resultsPage = requests.get(url)
soup = BeautifulSoup(resultsPage.text, "html5lib")
comment = soup.find(text=lambda x: isinstance(x, NavigableString) and "Regular Season Table" in x)
df = pd.read_html(comment)
This is the approach I have used on similar sites, but on this page I cannot locate the table correctly. I am not sure what I am missing.
Answer 0 (score: 1)
There is one table which you can get using the id:
import requests
from bs4 import BeautifulSoup
url = 'http://www.hockey-reference.com/players/o/ovechal01/gamelog/2016'
resultsPage = requests.get(url)
soup = BeautifulSoup(resultsPage.text, "html5lib")
table = soup.select_one("#gamelog")
print(table)
Or using just pandas (read_html returns a list of DataFrames, so take the first one):
import pandas as pd
df = pd.read_html(url, attrs={'id': 'gamelog'})[0]
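To try the id lookup without hitting the site, here is a minimal offline sketch. The SAMPLE_HTML string is a made-up stand-in for the page (the real table has many more columns, and the cell values here are invented), and I use the built-in "html.parser" only so the snippet needs nothing beyond bs4.

```python
from bs4 import BeautifulSoup

# Hypothetical, heavily trimmed stand-in for the downloaded page.
# Column names and cell values are invented for illustration.
SAMPLE_HTML = """
<div id="all_gamelog">
  <table id="gamelog">
    <caption>Regular Season Table</caption>
    <thead><tr><th>Rk</th><th>Date</th><th>G</th></tr></thead>
    <tbody>
      <tr><td>1</td><td>2015-10-10</td><td>1</td></tr>
      <tr><td>2</td><td>2015-10-13</td><td>0</td></tr>
    </tbody>
  </table>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# CSS id selector: returns the first matching tag, or None if nothing matches.
table = soup.select_one("#gamelog")

# Pull the header cells and the body rows out as plain text.
header = [th.get_text(strip=True) for th in table.select("thead th")]
rows = [
    [td.get_text(strip=True) for td in tr.find_all("td")]
    for tr in table.select("tbody tr")
]

print(header)
print(rows)
```

On the real page you would build the soup from resultsPage.text instead of SAMPLE_HTML; checking that table is not None before using it is a cheap guard in case the markup changes.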
Your code could never work because the NavigableString you find is the text inside the caption tag, <caption>Regular Season Table</caption>, not the table itself, so pd.read_html is handed only the caption text. You would need to call .find_previous to get the table:
comment = soup.find(text=lambda x: isinstance(x, NavigableString) and "Regular Season Table" in x)
table = comment.find_previous("table")
You could also use table = comment.parent.parent, but find_previous is a better approach.
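Putting the caption lookup together with pandas, a rough sketch might look like the following. SAMPLE_HTML is again a made-up fragment with invented values, string= is the newer spelling of the text= argument in bs4, and wrapping the markup in StringIO is what recent pandas versions expect when you pass HTML rather than a URL. It assumes pandas has an HTML parser available (lxml, or bs4 with html5lib).

```python
from io import StringIO

import pandas as pd
from bs4 import BeautifulSoup, NavigableString

# Hypothetical stand-in for the page; values are invented for illustration.
SAMPLE_HTML = """
<table id="gamelog">
  <caption>Regular Season Table</caption>
  <thead><tr><th>Rk</th><th>Date</th><th>G</th></tr></thead>
  <tbody>
    <tr><td>1</td><td>2015-10-10</td><td>1</td></tr>
    <tr><td>2</td><td>2015-10-13</td><td>0</td></tr>
  </tbody>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# This finds the text node inside <caption>, not the table.
caption_text = soup.find(
    string=lambda s: isinstance(s, NavigableString) and "Regular Season Table" in s
)

# Walk backwards through the document to the enclosing <table> tag.
table = caption_text.find_previous("table")

# The parent chain reaches the same tag, but only while the nesting
# stays exactly text -> caption -> table.
same_tag = caption_text.parent.parent is table

# Hand pandas the table markup, not the caption string.
df = pd.read_html(StringIO(str(table)))[0]
print(df)
```

The find_previous call keeps working if the site wraps the caption text in an extra tag, whereas parent.parent would then land on the caption instead of the table.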