Scraping tables inside comment tags with BeautifulSoup

Time: 2017-09-19 16:17:56

Tags: python web-scraping beautifulsoup

I'm trying to scrape the tables from the following page using BeautifulSoup: https://www.pro-football-reference.com/boxscores/201702050atl.htm

import requests
from bs4 import BeautifulSoup

url = 'https://www.pro-football-reference.com/boxscores/201702050atl.htm'
page = requests.get(url)
html = page.text
soup = BeautifulSoup(html, 'lxml')

Most of the tables on the page are inside comment tags, so they can't be accessed in a straightforward way. For example,

print(soup.table.text)

returns:

1
2
3
4
OT
Final







via Sports Logos.net
About logos


New England Patriots
0
3
6
19 
6
34





via Sports Logos.net
About logos


Atlanta Falcons
0
21
7
0
0
28

i.e. the main tables containing the player stats are missing. I tried to simply remove the comment tags with

html = html.replace('<!--', '')
html = html.replace('-->', '')

but to no avail. How can I access these commented-out tables?

3 Answers:

Answer 0 (score: 3)

Here you go. You can get any table from that page just by changing the index number.

import requests
from bs4 import BeautifulSoup

page = requests.get('https://www.pro-football-reference.com/boxscores/201702050atl.htm').text

soup = BeautifulSoup(page,'lxml')
table = soup.find_all('table')[1]  # Index of the desired table; change it to get a different one
tab_data = [[celldata.text for celldata in rowdata.find_all(["th","td"])]
                        for rowdata in table.find_all("tr")]
for data in tab_data:
    print(' '.join(data))

Every table except the first two is embedded in a comment tag and only injected into the page by JavaScript in the browser, which is why you need Selenium to gatecrash and parse them. With it you can definitely access any table on that page. Here is the modified version:

from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.Chrome()
driver.get('https://www.pro-football-reference.com/boxscores/201702050atl.htm')
soup = BeautifulSoup(driver.page_source,'lxml')
driver.quit()
table = soup.find_all('table')[7]  # Index of the desired table; change it to get a different one
tab_data = [[celldata.text for celldata in rowdata.find_all(["th","td"])]
                        for rowdata in table.find_all("tr")]
for data in tab_data:
    print(' '.join(data))

Answer 1 (score: 2)

In case anyone else is interested in grabbing tables from the comments without using Selenium.

You can grab all of the comments, then check whether a table is present and pass that text back to BeautifulSoup to parse.

import requests
from bs4 import BeautifulSoup, Comment

url = 'https://www.pro-football-reference.com/boxscores/201702050atl.htm'
soup = BeautifulSoup(requests.get(url).content, 'lxml')

# Grab every comment node in the document
comments = soup.find_all(string=lambda text: isinstance(text, Comment))

# If a comment contains a table, pass it back to BeautifulSoup and parse it
for comment in comments:
    if '<table' in comment:
        table = BeautifulSoup(comment, 'lxml').find('table')
        print(table.get('id'))

To make this more robust, it would probably be wise to check that the entire table is present within the same comment.
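A minimal offline sketch of that robustness check (the sample HTML, the `'</table>' in comment` test, and the `ValueError` handling are illustrative assumptions, not code from the answer above):

```python
from bs4 import BeautifulSoup, Comment
import pandas as pd
from io import StringIO

# Stand-in for the real page: one visible table, one hidden inside a comment
html = """
<table id="visible"><tr><th>a</th></tr><tr><td>1</td></tr></table>
<!-- <table id="hidden"><tr><th>b</th></tr><tr><td>2</td></tr></table> -->
"""

soup = BeautifulSoup(html, 'html.parser')
tables = []
for comment in soup.find_all(string=lambda t: isinstance(t, Comment)):
    # Only parse comments that hold a complete table (open and close tag)
    if '<table' in comment and '</table>' in comment:
        try:
            tables.extend(pd.read_html(StringIO(str(comment))))
        except ValueError:  # comment mentions a table but none is parseable
            pass

print(len(tables))                 # 1
print(tables[0].columns.tolist())  # ['b']
```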

Answer 2 (score: 1)

I was able to parse the tables using Beautiful Soup and Pandas. Here is some code to help you with the problem.

import requests
from bs4 import BeautifulSoup
import pandas as pd    

url = 'https://www.pro-football-reference.com/boxscores/201702050atl.htm'
page = requests.get(url)

soup = BeautifulSoup(page.content,'lxml')
# Find the second table on the page
t = soup.find_all('table')[1]
# Read the table into a Pandas DataFrame
df = pd.read_html(str(t))[0]

df now contains:

    Quarter Time    Tm  Detail  NWE ATL
0   2   12:15   Falcons Devonta Freeman 5 yard rush (Matt Bryant kick)  0   7
1   NaN 8:48    Falcons Austin Hooper 19 yard pass from Matt Ryan (Mat...   0   14
2   NaN 2:21    Falcons Robert Alford 82 yard interception return (Mat...   0   21
3   NaN 0:02    Patriots    Stephen Gostkowski 41 yard field goal   3   21
4   3   8:31    Falcons Tevin Coleman 6 yard pass from Matt Ryan (Matt...   3   28
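Building on this, the comment-stripping idea from the question does work once the cleaned HTML is handed to pd.read_html, because pandas skips tables that are still inside comments. A minimal offline sketch (the sample HTML and player name are stand-ins for the real page, not taken from it):

```python
import pandas as pd
from io import StringIO

# Stand-in for the boxscore page: one plain table, one commented out
html = """
<table><tr><th>Quarter</th></tr><tr><td>2</td></tr></table>
<!-- <table><tr><th>Player</th></tr><tr><td>Tom Brady</td></tr></table> -->
"""

# pd.read_html ignores the commented table...
print(len(pd.read_html(StringIO(html))))  # 1

# ...but stripping the comment markers first exposes it
uncommented = html.replace('<!--', '').replace('-->', '')
dfs = pd.read_html(StringIO(uncommented))
print(len(dfs))                 # 2
print(dfs[1].columns.tolist())  # ['Player']
```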