BeautifulSoup - only returning the first table

Date: 2017-11-22 11:24:43

Tags: python web-scraping beautifulsoup

我最近一直在与BeautifulSoup合作。我试图从https://www.pro-football-reference.com/teams/mia/2000_roster.htm网站获取数据。具体来说,我想要的是玩家名称和' gs' (游戏开始)。

However, when I do this, it only returns data from the first ('Starters') table. I'm not actually interested in that top table at all; what I want is the second table, the one titled 'Roster'.

Here is the code I'm working with. Like I said, I don't really want/need anything beyond the player names and games started, but I'm just practicing and learning BeautifulSoup.

import pandas as pd
import requests
import bs4

alpha = requests.get('https://www.pro-football-reference.com/teams/mia/2000_roster.htm')

beta = bs4.BeautifulSoup(alpha.text,'lxml')


gama = beta.findAll('th',{'data-stat':'pos'})
position = [th.text for th in gama]
position = position[1:]
position = list(filter(None, position))

gama = beta.findAll('td',{'data-stat':'player'})
player = [td.text for td in gama]
player = player[1:]
while 'Defensive Starters' in player: player.remove('Defensive Starters')
while 'Special Teams Starters' in player: player.remove('Special Teams Starters')

gama = beta.findAll('td',{'data-stat':'age'})
age = [td.text for td in gama]
age = list(filter(None, age))

gama = beta.findAll('td',{'data-stat':'gs'})
gs = [td.text for td in gama]
gs = list(filter(None, gs))

target = pd.DataFrame({
    'player_name': player,
    'position': position,
    'gs': gs,
    'age': age
})

Does anyone know where I'm going wrong? Or perhaps another way to approach it?

1 Answer:

Answer 0 (score: 3)

To get the content from that table you need a browser simulator, because that portion of the page is generated dynamically. The data in the first table, however, is easily accessible without any browser simulator. I tried Selenium in this case:

from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome()
page_url = "https://www.pro-football-reference.com/teams/mia/2000_roster.htm"
driver.get(page_url)
soup = BeautifulSoup(driver.page_source, "lxml")
# index 1 picks the second table container, i.e. the 'Roster' table
table = soup.select(".table_outer_container")[1]
for items in table.select("tr"):
    player = items.select("[data-stat='player']")[0].text
    gs = items.select("[data-stat='gs']")[0].text
    print(player,gs)

driver.quit()

Partial output:

Player  GS
Trace Armstrong* 0
John Bock 1
Tim Bowens 15
Lorenzo Bromell 0
Autry Denson 0
Mark Dixon 15
Kevin Donnalley 16

If for some reason you run into any error with the above (an IndexError when a row is missing one of those cells, for example), this version will not have that problem:

from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome()
page_url = "https://www.pro-football-reference.com/teams/mia/2000_roster.htm"
driver.get(page_url)
soup = BeautifulSoup(driver.page_source, "lxml")
table = soup.select(".table_outer_container")[1]
for items in table.select("tr"):
    # guard against rows without these cells, which would otherwise raise an IndexError
    player = items.select("[data-stat='player']")[0].text if items.select("[data-stat='player']") else ""
    gs = items.select("[data-stat='gs']")[0].text if items.select("[data-stat='gs']") else ""
    print(player,gs)

driver.quit()
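
Since the original goal was a pandas DataFrame with the player names and games started, here is a minimal sketch of how those rows could be collected into one, reusing the Selenium approach above (the player_name/gs column names simply mirror the question's DataFrame, and the check against the literal header text 'Player' is an assumption based on the printed output):

import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome()
driver.get("https://www.pro-football-reference.com/teams/mia/2000_roster.htm")
soup = BeautifulSoup(driver.page_source, "lxml")
driver.quit()

# index 1 picks the second table container, i.e. the 'Roster' table
table = soup.select(".table_outer_container")[1]

rows = []
for items in table.select("tr"):
    player = items.select("[data-stat='player']")[0].text if items.select("[data-stat='player']") else ""
    gs = items.select("[data-stat='gs']")[0].text if items.select("[data-stat='gs']") else ""
    if player and player != "Player":  # skip the header row and any empty rows
        rows.append({"player_name": player, "gs": gs})

roster = pd.DataFrame(rows)
print(roster.head())

Building each row as one dict keeps player_name and gs paired, which also sidesteps the length-mismatch problem that separate findAll lists can run into when each column is filtered independently.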