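Scraping a table with Python / BS4

Time: 2016-07-25 18:35:18

Tags: python python-2.7 beautifulsoup

I am trying to scrape the "Team Stats" table from http://www.pro-football-reference.com/boxscores/201602070den.htm with BS4 and Python 2.7. However, I am unable to get anywhere close to it: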

import requests
from bs4 import BeautifulSoup

url = 'http://www.pro-football-reference.com/boxscores/201602070den.htm'
page = requests.get(url)
soup = BeautifulSoup(page.text, "html5lib")
table = soup.findAll('table', {'id': "team_stats", "class": "stats_table"})
print table

I figured the code above would work, but no luck.

1 answer:

Answer 0: (score: 1)

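The problem in this case is that the "Team Stats" table is located inside a comment in the HTML source that you download with requests. Locate the comment and reparse it with BeautifulSoup into a "soup" object:

import requests
from bs4 import BeautifulSoup, NavigableString

url = 'http://www.pro-football-reference.com/boxscores/201602070den.htm'
page = requests.get(url, headers={'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36'})

soup = BeautifulSoup(page.content, "html5lib")
# Find the text node (here, an HTML comment) that contains the table markup
comment = soup.find(text=lambda x: isinstance(x, NavigableString) and "team_stats" in x)

# Reparse the comment's contents as HTML so the table becomes a real element
soup = BeautifulSoup(comment, "html5lib")
table = soup.find("table", id="team_stats")
print(table)  # prints the HTML of the <table id="team_stats"> element

And/or, you can load the table into, for example, a pandas DataFrame, which is very convenient to work with: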

from io import StringIO

import pandas as pd
import requests
from bs4 import BeautifulSoup, NavigableString

url = 'http://www.pro-football-reference.com/boxscores/201602070den.htm'
page = requests.get(url, headers={'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36'})

soup = BeautifulSoup(page.content, "html5lib")
# Find the text node (here, an HTML comment) that contains the table markup
comment = soup.find(text=lambda x: isinstance(x, NavigableString) and "team_stats" in x)

# Newer pandas releases expect a file-like object rather than a literal HTML string
df = pd.read_html(StringIO(comment))[0]
print(df)
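
To see why the original lookup comes back empty, the comment trick can be tried offline on a small inline document. The following is only a sketch under stated assumptions: SAMPLE_HTML and its cell values are made-up placeholders (not data from the real page), find_commented_table is a hypothetical helper name, and the built-in "html.parser" builder is used in place of html5lib so that nothing extra needs to be installed. It filters on bs4's Comment class, which is a narrower test than checking for NavigableString.

```python
from bs4 import BeautifulSoup, Comment

# Hypothetical stand-in for the downloaded page: the table exists only
# inside an HTML comment, so it is not part of the parsed element tree.
SAMPLE_HTML = """
<html><body>
<div id="all_team_stats">
<!--
<table id="team_stats" class="stats_table">
  <tr><th>Stat</th><th>A</th><th>B</th></tr>
  <tr><td>Example</td><td>1</td><td>2</td></tr>
</table>
-->
</div>
</body></html>
"""


def find_commented_table(html, table_id):
    """Return the <table> with the given id if it is hidden in a comment, else None."""
    soup = BeautifulSoup(html, "html.parser")
    # Comment is a NavigableString subclass, so this matches comment nodes only.
    for comment in soup.find_all(string=lambda s: isinstance(s, Comment)):
        if table_id not in comment:
            continue
        # Reparse the comment's text as HTML so the table becomes a real element.
        inner = BeautifulSoup(comment, "html.parser")
        table = inner.find("table", id=table_id)
        if table is not None:
            return table
    return None


# A direct search misses the table because it is commented out.
direct_hit = BeautifulSoup(SAMPLE_HTML, "html.parser").find("table", id="team_stats")

# Searching inside the comments finds it.
table = find_commented_table(SAMPLE_HTML, "team_stats")
rows = [
    [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
    for tr in table.find_all("tr")
]

print(direct_hit)
print(rows)
```

If the real page wraps other tables in comments the same way, the same helper could presumably be called with a different id, but that is untested against the live site.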