BeautifulSoup roster scraping - list index out of range

Posted: 2014-09-23 23:41:02

Tags: python beautifulsoup

I'm working on a small personal project and have hit a roadblock. I'm trying to scrape each player's full name, jersey number, and position. I have successfully gotten the list of URLs for each team roster, and I can grab the first player from each roster (the starting QB) and print him in the desired format with this code:

import requests
import bs4
from time import sleep

root_url = 'http://espn.go.com'
index_url = root_url + '/nfl/players'

def get_nfl_team_urls():
    response = requests.get(index_url)
    soup = bs4.BeautifulSoup(response.text)
    return [a.attrs.get('href') for a in soup.select('div.span-4 a[href^=/nfl/team/roster]')]
print(get_nfl_team_urls())

def get_nfl_player_info(nfl_team_url):

    response = requests.get(root_url + nfl_team_url)
    soup = bs4.BeautifulSoup(response.text)
    team_name = soup.body.b.text
    sport = "Football"
    league = "NFL"


    for tr in soup.findAll('tr')[2:]:
        tds = tr.findAll('td')
        jersey_no = tds[0].text
        full_name = tds[1].text
        position = tds[2].text

        return {
        "Name": full_name,
        "Team": team_name,
        "No": jersey_no,
        "Position": position,
        "Sport": sport,
        "League": league
        }

if __name__ == '__main__':

    nfl_team_urls = get_nfl_team_urls()
    for nfl_team_urls in nfl_team_urls:
        print get_nfl_player_info(nfl_team_urls)

However, I want every player on each team. When I try this:

import requests
import bs4
from time import sleep

root_url = 'http://espn.go.com'
index_url = root_url + '/nfl/players'

def get_nfl_team_urls():
    response = requests.get(index_url)
    soup = bs4.BeautifulSoup(response.text)
    return [a.attrs.get('href') for a in soup.select('div.span-4 a[href^=/nfl/team/roster]')]
print(get_nfl_team_urls())

def get_nfl_player_info(nfl_team_url):

    response = requests.get(root_url + nfl_team_url)
    soup = bs4.BeautifulSoup(response.text)
    team_name = soup.body.b.text
    sport = "Football"
    league = "NFL"


    for tr in soup.findAll('tr')[2:]:
        tds = tr.findAll('td')
        jersey_no = tds[0].text
        full_name = tds[1].text
        position = tds[2].text

        print {
        "Name": full_name,
        "Team": team_name,
        "No": jersey_no,
        "Position": position,
        "Sport": sport,
        "League": league
        }

if __name__ == '__main__':

    nfl_team_urls = get_nfl_team_urls()
    for nfl_team_urls in nfl_team_urls:
        get_nfl_player_info(nfl_team_urls)

it prints all of the offensive players from the first team, then stops and gives me this error:

Traceback (most recent call last):
  File "nfl_player_scraper.py", line 42, in <module>
    get_nfl_player_info(nfl_team_urls)
  File "nfl_player_scraper.py", line 26, in get_nfl_player_info
    full_name = tds[1].text
IndexError: list index out of range

Eventually I want to output this as JSON, but right now I'm just trying to figure out how to get the data. I tried storing each player's info in a list and got the same error, with no player info printed at all. Any help/tips are appreciated.
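For reference, this is a rough sketch of the JSON step I have in mind once the scraping works, assuming the per-player dicts end up in a plain list (the player data here is just a placeholder):

import json

# assuming `players` is a list of the per-player dicts built in get_nfl_player_info
players = [
    {"Name": "John Doe", "Team": "Example Team", "No": "12",
     "Position": "QB", "Sport": "Football", "League": "NFL"}
]

# write the whole list out as a JSON file
with open('nfl_players.json', 'w') as f:
    json.dump(players, f, indent=2)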

1 Answer:

Answer 0: (score: 0)

You use for tr in soup.findAll('tr')[2:] to skip the first 2 rows of the table. However, the table contains 2 more header rows further down (DEFENSE and SPECIAL TEAMS). After inspecting the HTML, a better way to skip the header rows is:

for tr in soup.findAll('tr'):
    classtag = tr.get('class') or []  # list of CSS classes; guard against rows with no class attribute
    if 'stathead' in classtag or 'colhead' in classtag:
        continue

At least until ESPN changes its HTML, this works fine.

Also, as roippi said, remember to fix nfl_team_urls in the last two lines :)
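Putting both fixes together, here is a rough sketch of how the loop could look (untested, and it assumes the ESPN markup described above; the rest of the script stays the same):

def get_nfl_player_info(nfl_team_url):
    response = requests.get(root_url + nfl_team_url)
    soup = bs4.BeautifulSoup(response.text)
    team_name = soup.body.b.text

    players = []
    for tr in soup.findAll('tr'):
        classtag = tr.get('class') or []
        # skip the section and column header rows instead of slicing off the first 2
        if 'stathead' in classtag or 'colhead' in classtag:
            continue
        tds = tr.findAll('td')
        players.append({
            "Name": tds[1].text,
            "Team": team_name,
            "No": tds[0].text,
            "Position": tds[2].text,
            "Sport": "Football",
            "League": "NFL",
        })
    return players

if __name__ == '__main__':
    for team_url in get_nfl_team_urls():  # don't shadow the list with the loop variable
        for player in get_nfl_player_info(team_url):
            print player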