Getting names from an HTML file with Python

Asked: 2016-02-05 15:18:54

Tags: python python-3.x


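Hello, I want to get the player names from this URL: http://espn.go.com/nba/team/stats/_/name/por/year/2015/ but my code always returns some HTML text instead of the list of players. I need to find the first occurrence of each name and keep it only once. Can you help me? Thanks. By the way, I am using BeautifulSoup.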

2 answers:

Answer 0: (score: 1)

I fetched the HTML in a different way:

# Python 3: urllib2 became urllib.request (needs "import urllib.request")
source = urllib.request.urlopen(link)
html = source.read()
source.close()
soup = BeautifulSoup(html, "html.parser")  

I noticed that the class of every player row ends with a number, so I used a regular expression to extract those rows:

players_table = soup.find_all("tr",{"class" : re.compile(r"\d+$")})
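To get a feel for which rows that pattern keeps, the regular expression can be tried on a few class strings directly. This is only a sketch: the class values below are hypothetical stand-ins, not copied from the ESPN page. BeautifulSoup applies a compiled pattern with its search() method, so the sketch does the same.

```python
import re

# Same pattern as in the answer: one or more digits at the end of the string.
ROW_CLASS = re.compile(r"\d+$")

# Hypothetical class values, made up for illustration only.
sample_classes = ["player-46-6606", "colhead", "stathead", "row12", "total"]

# Keep only the class strings that end in digits.
matching = [c for c in sample_classes if ROW_CLASS.search(c)]
print(matching)  # ['player-46-6606', 'row12']
```

Header-style classes such as colhead do not end in a digit, so rows carrying only those classes would be skipped.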

Here is the full code:

import re
import urllib.request  # Python 3 replacement for urllib2

from bs4 import BeautifulSoup


def get_players(team_id, year):
    link = "http://espn.go.com/nba/team/stats/_/name/{}/year/{}/".format(team_id, year)
    # Download the page; the with-block closes the connection afterwards
    with urllib.request.urlopen(link) as source:
        html = source.read()
    soup = BeautifulSoup(html, "html.parser")

    # Player rows are the <tr> elements whose class ends in digits
    players_table = soup.find_all("tr", {"class": re.compile(r"\d+$")})
    player_names = []
    for row in players_table:
        # Collect the text of every link found in the row
        for a in row.find_all("a"):
            player_names.append(a.get_text())

    return player_names
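The question also asks for each name only once, at its first occurrence. If the same player shows up in more than one row (an assumption about the page, not something verified here), the returned list would contain repeats. A minimal sketch of order-preserving de-duplication follows; unique_names is a hypothetical helper, not part of the answer, and the sample list is made up from names that appear elsewhere in this thread.

```python
def unique_names(names):
    """Return names with duplicates removed, keeping first-seen order.

    Hypothetical helper for illustration; not part of the original answer.
    """
    # dict keeps insertion order (Python 3.7+), so the first occurrence
    # of each name decides its position in the result.
    return list(dict.fromkeys(names))


# Made-up input that mimics players being listed twice.
sample = ["Avery Bradley", "Jae Crowder", "Avery Bradley", "Jae Crowder"]
print(unique_names(sample))  # ['Avery Bradley', 'Jae Crowder']
```

It could be applied to the function's result, for example unique_names(get_players("por", 2015)).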

Answer 1: (score: 0)

This part is causing the problem:

game_stat_row = soup.find("tr", {"class": "colhead"})

player_names = []
for name in game_stat_row:
    player_names.append(name)

Instead, assuming this is an example roster, you need to find all the player names inside the my-teams-table table. Working sample:

player_names = [player.get_text() for player in soup.select("#my-teams-table a[href*=player]")]

Demo:

>>> import requests
>>> from bs4 import BeautifulSoup
>>> 
>>> content = requests.get("http://espn.go.com/nba/team/roster/_/name/bos/year/2014").content
>>> soup = BeautifulSoup(content, "html.parser")
>>> 
>>> [player.get_text() for player in soup.select("#my-teams-table a[href*=player]")]
['Avery Bradley', 'Jae Crowder', 'R.J. Hunter', 'Jonas Jerebko', 'Amir Johnson', 'David Lee', 'Jordan Mickey', 'Kelly Olynyk', 'Terry Rozier', 'Marcus Smart', 'Jared Sullinger', 'Isaiah Thomas', 'Evan Turner', 'James Young', 'Tyler Zeller']
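
To make clearer what the selector "#my-teams-table a[href*=player]" asks for, here is a rough standard-library approximation of it. Note that this swaps in html.parser for soup.select, so it is not how BeautifulSoup works internally. It assumes the element with id my-teams-table is a table tag, and the markup and URLs below are made up for illustration. PlayerLinkParser is a hypothetical name.

```python
from html.parser import HTMLParser


class PlayerLinkParser(HTMLParser):
    """Collect the text of <a> tags whose href contains "player",
    but only inside the <table> with id="my-teams-table"."""

    def __init__(self):
        super().__init__()
        self.table_depth = 0   # > 0 while inside the target table
        self.in_link = False   # True while inside a matching <a>
        self.names = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "table":
            # Enter the target table, or track a table nested inside it.
            if self.table_depth or attrs.get("id") == "my-teams-table":
                self.table_depth += 1
        elif tag == "a" and self.table_depth and "player" in (attrs.get("href") or ""):
            # Equivalent of a[href*=player]: substring match on href.
            self.in_link = True
            self.names.append("")

    def handle_endtag(self, tag):
        if tag == "table" and self.table_depth:
            self.table_depth -= 1
        elif tag == "a":
            self.in_link = False

    def handle_data(self, data):
        if self.in_link:
            self.names[-1] += data


# Made-up markup: one link outside the table, one non-player link inside it.
html = """
<a href="/nba/player/_/id/0/outside">Outside Link</a>
<table id="my-teams-table">
  <tr><td><a href="/nba/player/_/id/1/avery-bradley">Avery Bradley</a></td></tr>
  <tr><td><a href="/nba/team/_/name/bos">Team page</a></td></tr>
  <tr><td><a href="/nba/player/_/id/2/jae-crowder">Jae Crowder</a></td></tr>
</table>
"""

parser = PlayerLinkParser()
parser.feed(html)
print(parser.names)  # ['Avery Bradley', 'Jae Crowder']
```

The link outside the table and the team link are both skipped, which mirrors the two parts of the selector: the id scope and the substring test on href.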