Scraping table data into a DataFrame

Time: 2016-10-20 17:59:36

Tags: python beautifulsoup bs4

An example URL is 'http://www.hockey-reference.com/players/c/crosbsi01/gamelog/2016'

The table I want to grab is the one named Regular Season.

What I have done in previous examples is something like this:

import requests
from bs4 import BeautifulSoup, NavigableString
import pandas as pd


url = 'http://www.hockey-reference.com/players/o/ovechal01/gamelog/2016'
resultsPage = requests.get(url)
soup = BeautifulSoup(resultsPage.text, "html5lib")
comment = soup.find(text=lambda x: isinstance(x, NavigableString) and "Regular Season  Table" in x)
df = pd.read_html(comment)

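This is the kind of approach I have used on sites similar to this one, but I cannot locate the table correctly on this page. I am not sure what I am missing.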

1 Answer:

Answer 0: (score: 1)

There is one table which you can get using the id:

import requests
from bs4 import BeautifulSoup


url = 'http://www.hockey-reference.com/players/o/ovechal01/gamelog/2016'
resultsPage = requests.get(url)
soup = BeautifulSoup(resultsPage.text, "html5lib")
table = soup.select_one("#gamelog")
print(table)
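
To go from the selected table to a DataFrame, something like the following could work. It is only a minimal sketch: SAMPLE_HTML is a made-up stand-in for the downloaded page (the real column names and rows will differ), table_to_dataframe is a hypothetical helper, and the built-in "html.parser" is used so that the example needs no extra parser.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page; the values are placeholders.
SAMPLE_HTML = """
<table id="gamelog">
  <caption>Regular Season Table</caption>
  <thead><tr><th>Rk</th><th>Opp</th><th>G</th></tr></thead>
  <tbody>
    <tr><td>1</td><td>AAA</td><td>2</td></tr>
    <tr><td>2</td><td>BBB</td><td>0</td></tr>
  </tbody>
</table>
"""


def table_to_dataframe(html, table_id):
    """Return the table with the given id as a DataFrame of strings."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.select_one("#" + table_id)
    if table is None:
        raise ValueError(f"no table with id {table_id!r}")
    # Header cells become the column names.
    header = [th.get_text(strip=True) for th in table.select("thead th")]
    # Each body row becomes one list of cell texts.
    rows = [
        [td.get_text(strip=True) for td in tr.find_all("td")]
        for tr in table.select("tbody tr")
    ]
    return pd.DataFrame(rows, columns=header)


df = table_to_dataframe(SAMPLE_HTML, "gamelog")
print(df)
```

All values come back as strings here, so numeric columns would still need converting (for example with pd.to_numeric) before doing arithmetic on them.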

or using just pandas:

import pandas as pd
df = pd.read_html(url, attrs={'id': 'gamelog'})[0]  # read_html returns a list of DataFrames

Your code could never work because you are searching for a NavigableString that sits inside the caption tag, <caption>Regular Season Table</caption>, not for the table itself. What you find is only the caption text, so you would need to call .find_previous() on it to get the table:

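from bs4 import NavigableString  # needed for the isinstance check

comment = soup.find(text=lambda x: isinstance(x, NavigableString) and "Regular Season  Table" in x)
table = comment.find_previous("table")  # nearest <table> before the caption text, i.e. the one that owns it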

You could also use table = comment.parent.parent, but find_previous is a better approach because it does not depend on the caption being a direct child of the table.
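
A small offline sketch of the caption lookup, under stated assumptions: SAMPLE_HTML is a made-up two-table snippet rather than the real page, the search string matches the caption exactly as written in that snippet, and string= is used as the newer name for the text= argument.

```python
from bs4 import BeautifulSoup, NavigableString

# Hypothetical snippet with two tables, so the lookup has to pick the right one.
SAMPLE_HTML = """
<table id="other"><caption>Playoffs Table</caption><tr><td>x</td></tr></table>
<table id="gamelog"><caption>Regular Season Table</caption><tr><td>1</td></tr></table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# This finds the text node inside <caption>, not the table.
caption_text = soup.find(
    string=lambda s: isinstance(s, NavigableString) and "Regular Season Table" in s
)

# Walk backwards through the document to the nearest <table> tag,
# which is the table that contains the caption.
table = caption_text.find_previous("table")

# The parent chain reaches the same tag: text -> <caption> -> <table>.
same_table = caption_text.parent.parent

print(table["id"])
```

Once the table tag is in hand, its HTML (str(table), wrapped in io.StringIO on recent pandas versions) is what pd.read_html should receive, rather than the caption text.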