Question

I am trying to grab the "Team Stats" table from http://www.pro-football-reference.com/boxscores/201602070den.htm with BS4 and Python 2.7. However, I am unable to get anywhere close to it:

url = 'http://www.pro-football-reference.com/boxscores/201602070den.htm'
page = requests.get(url)
soup = BeautifulSoup(page.text, "html5lib")
table=soup.findAll('table', {'id':"team_stats", "class":"stats_table"})  
print table

I thought the code above would work, but no luck.

Answer 1

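The problem in this case is that the "Team Stats" table sits inside a comment in the HTML source that you download with requests, so BeautifulSoup never sees it as a table element. Locate the comment and reparse it with BeautifulSoup into a new "soup" object:

import requests
from bs4 import BeautifulSoup, Comment

url = 'http://www.pro-football-reference.com/boxscores/201602070den.htm'
page = requests.get(url, headers={'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36'})

soup = BeautifulSoup(page.content, "html5lib")
# Find the comment node whose text contains the table id.
comment = soup.find(text=lambda x: isinstance(x, Comment) and "team_stats" in x)

# Parse the comment text as HTML so the table becomes a real element.
soup = BeautifulSoup(comment, "html5lib")
table = soup.find("table", id="team_stats")
print(table)

This prints the HTML of the team_stats table.

Alternatively, you can load the table into a pandas DataFrame, which is very convenient to work with: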

from io import StringIO

import pandas as pd
import requests
from bs4 import BeautifulSoup, Comment

url = 'http://www.pro-football-reference.com/boxscores/201602070den.htm'
page = requests.get(url, headers={'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36'})

soup = BeautifulSoup(page.content, "html5lib")
# The table is inside an HTML comment, so search the comment nodes for its id.
comment = soup.find(text=lambda x: isinstance(x, Comment) and "team_stats" in x)

# Wrap the markup in StringIO: recent pandas versions deprecate passing literal HTML.
df = pd.read_html(StringIO(comment))[0]
print(df)
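
If you want to confirm that a table on some other page is hidden the same way, a standard-library check along these lines may help. This is only a sketch: CommentFinder, inspect_page, and SAMPLE_HTML are hypothetical names, and the sample markup is made up for illustration rather than copied from the real boxscore page.

```python
from html.parser import HTMLParser


class CommentFinder(HTMLParser):
    """Record HTML comments and the ids of real (non-commented) elements."""

    def __init__(self):
        super().__init__()
        self.comments = []
        self.live_ids = []

    def handle_starttag(self, tag, attrs):
        # Only tags outside comments reach this hook.
        for name, value in attrs:
            if name == "id":
                self.live_ids.append(value)

    def handle_comment(self, data):
        # Receives the raw text between <!-- and -->.
        self.comments.append(data)


def inspect_page(html):
    """Parse `html` and return the populated CommentFinder."""
    finder = CommentFinder()
    finder.feed(html)
    finder.close()
    return finder


# Made-up sample page (assumption): a table wrapped in an HTML comment.
SAMPLE_HTML = """
<html><body>
<div id="all_team_stats">
<!--
<table id="team_stats" class="stats_table">
  <tr><th>Stat</th><th>Away</th><th>Home</th></tr>
  <tr><td>Stat A</td><td>1</td><td>2</td></tr>
</table>
-->
</div>
</body></html>
"""

finder = inspect_page(SAMPLE_HTML)
hidden = [c for c in finder.comments if 'id="team_stats"' in c]

print("live ids:", finder.live_ids)
print("comments holding team_stats:", len(hidden))
```

If the id shows up in a comment but not among the live ids, a plain find or findAll on the parsed page will come back empty, and the comment text needs to be reparsed as shown above.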
