Problem parsing NBA boxscore data with BeautifulSoup

Asked: 2015-02-11 05:56:22

Tags: python web-scraping beautifulsoup

I am trying to parse player-level NBA boxscore data from ESPN. Here is the initial part of my attempt:

import numpy as np
import pandas as pd
import requests
from bs4 import BeautifulSoup
from datetime import datetime, date

request = requests.get('http://espn.go.com/nba/boxscore?gameId=400277722')
soup = BeautifulSoup(request.text,'html.parser')
table = soup.find_all('table')

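BeautifulSoup seems to be giving me a strange result. The last 'table' in the page source contains the player data, which is what I want to extract. Viewing the source online shows that this table is closed on line 421, after the box scores of both teams. However, if we look at the 'soup', there is an additional line that closes the table before the Miami stats. It corresponds to line 350 of the online source.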
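One way to check whether the raw markup itself contains an unbalanced closing tag, independently of BeautifulSoup, is to count table tags with the standard library tokenizer. The following is only a sketch under assumptions: it targets Python 3, and `TableBalanceChecker` and the `sample` string are made up for illustration rather than taken from the ESPN page.

```python
from html.parser import HTMLParser


class TableBalanceChecker(HTMLParser):
    """Record the position of every </table> that has no matching <table>."""

    def __init__(self):
        super().__init__()
        self.depth = 0          # number of currently open <table> elements
        self.stray_closes = []  # (line, column) of each unmatched </table>

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag != "table":
            return
        if self.depth == 0:
            # getpos() reports where the tag currently being handled starts
            self.stray_closes.append(self.getpos())
        else:
            self.depth -= 1


# Hypothetical markup: the table is closed once too early, as described above.
sample = (
    "<table>\n"
    "<tr><td>Boston Celtics</td></tr>\n"
    "</table>\n"
    "<tr><td>Miami Heat</td></tr>\n"
    "</table>\n"
)

checker = TableBalanceChecker()
checker.feed(sample)
checker.close()
print(checker.stray_closes)
```

Feeding the real page text (`request.text`) instead of `sample` would show whether the extra closing tag is present in the source or is introduced by the parser.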

The output with the 'html.parser' parser is:

Game 1: Tuesday, October 30thCeltics107FinalHeat120Recap »Boxscore »
Game 2: Sunday, January 27thHeat98Final2OTCeltics100Recap »Boxscore »
Game 3: Monday, March 18thHeat105FinalCeltics103Recap »Boxscore »
Game 4: Friday, April 12thCeltics101FinalHeat109Recap »Boxscore »

1 2 3 4 T

BOS 25 29 22 31107MIA 31 31 31 27120

Boston Celtics
STARTERS    
MIN
FGM-A
3PM-A
FTM-A
OREB
DREB
REB
AST
STL
BLK
TO
PF
+/-
PTS

Kevin Garnett, PF324-80-01-11111220254-49
Brandon Bass, PF286-110-03-4651110012-815
Paul Pierce, SF416-152-49-905552003-1723
Rajon Rondo, PG449-140-22-4077130044-1320
Courtney Lee, SG245-61-10-001110015-711
BENCH
MIN
FGM-A
3PM-A
FTM-A
OREB
DREB
REB
AST
STL
BLK
TO
PF
+/-
PTS

Jared Sullinger, PF81-20-00-001100001-32
Jeff Green, SF230-40-03-403301010-73
Jason Terry, SG252-70-34-400011033-108
Leandro Barbosa, SG166-83-31-201110001+416
Chris Wilcox, PFDNP COACH'S DECISION
Kris Joseph, SFDNP COACH'S DECISION
Jason Collins, CDNP COACH'S DECISION
Darko Milicic, CDNP COACH'S DECISIONTOTALS
FGM-A
3PM-A  
FTM-A
OREB

As you can see, it ends in the middle of the table at 'OREB' and never gets to the Miami Heat. The output with the 'lxml' parser is:

Game 1: Tuesday, October 30thCeltics107FinalHeat120Recap »Boxscore »
Game 2: Sunday, January 27thHeat98Final2OTCeltics100Recap »Boxscore »
Game 3: Monday, March 18thHeat105FinalCeltics103Recap »Boxscore »
Game 4: Friday, April 12thCeltics101FinalHeat109Recap »Boxscore »

1 2 3 4T

BOS 25 29 22 31107MIA 31 31 31 27120

This does not include the box score at all. The full code I am using (courtesy of Daniel Rodriguez) looks like this:

import numpy as np
import pandas as pd
import requests
from bs4 import BeautifulSoup
from datetime import datetime, date

games = pd.read_csv('games_13.csv').set_index('id')
BASE_URL = 'http://espn.go.com/nba/boxscore?gameId={0}'

request = requests.get(BASE_URL.format(games.index[0]))
table = BeautifulSoup(request.text,'html.parser').find('table', class_='mod-data')
heads = table.find_all('thead')
headers = heads[0].find_all('tr')[1].find_all('th')[1:]
headers = [th.text for th in headers]
columns = ['id', 'team', 'player'] + headers

players = pd.DataFrame(columns=columns)

def get_players(players, team_name):
    array = np.zeros((len(players), len(headers)+1), dtype=object)
    array[:] = np.nan
    for i, player in enumerate(players):
        cols = player.find_all('td')
        array[i, 0] = cols[0].text.split(',')[0]
        for j in range(1, len(headers) + 1):
            if not cols[1].text.startswith('DNP'):
                array[i, j] = cols[j].text

    frame = pd.DataFrame(columns=columns)
    for x in array:
        line = np.concatenate(([index, team_name], x)).reshape(1,len(columns))
        new = pd.DataFrame(line, columns=frame.columns)
        frame = frame.append(new)
    return frame

for index, row in games.iterrows():
    print(index)
    request = requests.get(BASE_URL.format(index))
    table = BeautifulSoup(request.text, 'html.parser').find('table', class_='mod-data')
    heads = table.find_all('thead')
    bodies = table.find_all('tbody')

    team_1 = heads[0].th.text
    team_1_players = bodies[0].find_all('tr') + bodies[1].find_all('tr')
    team_1_players = get_players(team_1_players, team_1)
    players = players.append(team_1_players)

    team_2 = heads[3].th.text
    team_2_players = bodies[3].find_all('tr') + bodies[4].find_all('tr')
    team_2_players = get_players(team_2_players, team_2)
    players = players.append(team_2_players)

players = players.set_index('id')
print(players)
players.to_csv('players_13.csv')

A sample of the output I want is:

,id,team,player,MIN,FGM-A,3PM-A,FTM-A,OREB,DREB,REB,AST,STL,BLK,TO,PF,+/-,PTS
0,400277722,Boston Celtics,Brandon Bass,28,6-11,0-0,3-4,6,5,11,1,0,0,1,2,-8,15
0,400277722,Boston Celtics,Paul Pierce,41,6-15,2-4,9-9,0,5,5,5,2,0,0,3,-17,23
...
0,400277722,Miami Heat,Shane Battier,29,2-4,2-3,0-0,0,2,2,1,1,0,0,3,+12,6
0,400277722,Miami Heat,LeBron James,29,10-16,2-4,4-5,1,9,10,3,2,0,0,2,+12,26

3 Answers:

Answer 0 (score: 3)

BeautifulSoup also truncated part of the results for me, so I replaced soup.find_all with re.findall:

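import re
from bs4 import BeautifulSoup

# `br` is a browser object that is not defined in this snippet
# (presumably a mechanize.Browser(); that is an assumption).
r = br.open('http://espn.go.com/nba/boxscore?gameId=400277722')
html = r.read()
soup = BeautifulSoup(html)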

statnames = re.search('STARTERS</th>.*?PTS</th>',html, re.DOTALL).group()
th = re.findall('th.*</th', statnames) # each th tag contains a statname
names = ['Name', 'Team']
for t in th:
   t = re.sub('.*>','',t)
   t = t.replace('</th','')
   names.append(t)
print(names)

celts = re.search('Boston Celtics.*?Total Team Turnovers',html,re.DOTALL).group()
heat = re.search('nba-small-mia floatleft.*?Total Team Turnovers',html,re.DOTALL).group()

players = str(soup).split('td nowrap')
for player in players[1:]:
    try:
        stats = [re.search(r'[A-Z]?[a-z]?[A-Z][a-z]{1,} [A-Z][a-z]{1,}', player).group()]
    except AttributeError:  # no match, e.g. a name with initials
        stats = [re.search(r'[A-Z]\.?[A-Z]?\.? [A-Z][a-z]{1,}', player).group()]  # player name
    # assign the team regardless of which pattern matched the name
    if stats[0] in celts:
        stats.append('Boston Celtics')
    elif stats[0] in heat:
        stats.append('Miami Heat')
    td = re.findall('td.*?/td', player)  # each td tag contains a stat
    for t in td:
        t = re.findall('>.*<', t)
        t = re.sub('.*>', '', t[0])
        t = t.replace('<', '')
        # skip empty cells and non-breaking spaces (Python 2 bytes or Python 3 text)
        if t not in ('', '\xc2\xa0', '\xa0'):
            stats.append(t)
    print(stats)

Output:

['Name', 'Team', 'MIN', 'FGM-A', '3PM-A', 'FTM-A', 'OREB', 'DREB', 'REB', 'AST', 'STL', 'BLK', 'TO', 'PF', '+/-', 'PTS']
['Kevin Garnett', 'Boston Celtics', '32', '4-8', '0-0', '1-1', '1', '11', '12', '2', '0', '2', '5', '4', '-4', '9']
['Brandon Bass', 'Boston Celtics', '28', '6-11', '0-0', '3-4', '6', '5', '11', '1', '0', '0', '1', '2', '-8', '15']
['Paul Pierce', 'Boston Celtics', '41', '6-15', '2-4', '9-9', '0', '5', '5', '5', '2', '0', '0', '3', '-17', '23']
['Rajon Rondo', 'Boston Celtics', '44', '9-14', '0-2', '2-4', '0', '7', '7', '13', '0', '0', '4', '4', '-13', '20']
['Courtney Lee', 'Boston Celtics', '24', '5-6', '1-1', '0-0', '0', '1', '1', '1', '0', '0', '1', '5', '-7', '11']
['Jared Sullinger', 'Boston Celtics', '8', '1-2', '0-0', '0-0', '0', '1', '1', '0', '0', '0', '0', '1', '-3', '2']
['Jeff Green', 'Boston Celtics', '23', '0-4', '0-0', '3-4', '0', '3', '3', '0', '1', '0', '1', '0', '-7', '3']
['Jason Terry', 'Boston Celtics', '25', '2-7', '0-3', '4-4', '0', '0', '0', '1', '1', '0', '3', '3', '-10', '8']
['Leandro Barbosa', 'Boston Celtics', '16', '6-8', '3-3', '1-2', '0', '1', '1', '1', '0', '0', '0', '1', '+4', '16']
['Chris Wilcox', 'Boston Celtics', "DNP COACH'S DECISION"]
['Kris Joseph', 'Boston Celtics', "DNP COACH'S DECISION"]
['Jason Collins', 'Boston Celtics', "DNP COACH'S DECISION"]
['Darko Milicic', 'Boston Celtics', "DNP COACH'S DECISION"]
['Shane Battier', 'Miami Heat', '29', '2-4', '2-3', '0-0', '0', '2', '2', '1', '1', '0', '0', '3', '+12', '6']
['LeBron James', 'Miami Heat', '29', '10-16', '2-4', '4-5', '1', '9', '10', '3', '2', '0', '0', '2', '+12', '26']
['Chris Bosh', 'Miami Heat', '37', '8-15', '0-1', '3-4', '2', '8', '10', '1', '0', '3', '1', '3', '+15', '19']
['Mario Chalmers', 'Miami Heat', '36', '3-7', '0-1', '2-2', '0', '1', '1', '11', '3', '0', '1', '3', '+11', '8']
['Dwyane Wade', 'Miami Heat', '35', '10-22', '0-0', '9-11', '2', '1', '3', '4', '2', '1', '4', '3', '-6', '29']
['Udonis Haslem', 'Miami Heat', '11', '0-1', '0-0', '0-0', '0', '3', '3', '0', '0', '0', '1', '1', '-2', '0']
['Rashard Lewis', 'Miami Heat', '19', '4-5', '1-2', '1-2', '0', '5', '5', '1', '0', '1', '0', '1', '+1', '10']
['Norris Cole', 'Miami Heat', '6', '1-2', '1-2', '0-0', '0', '0', '0', '1', '0', '0', '1', '2', '+5', '3']
['Ray Allen', 'Miami Heat', '31', '5-7', '2-3', '7-8', '0', '2', '2', '2', '0', '0', '0', '1', '+9', '19']
['Mike Miller', 'Miami Heat', '7', '0-0', '0-0', '0-0', '0', '0', '0', '1', '0', '0', '0', '1', '+8', '0']
['Josh Harrellson', 'Miami Heat', "DNP COACH'S DECISION"]
['James Jones', 'Miami Heat', "DNP COACH'S DECISION"]
['Terrel Harris', 'Miami Heat', "DNP COACH'S DECISION"]
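To get from lists like these to the CSV layout requested in the question, the rows can be brought to a fixed width and written in one pass. This is a minimal sketch using only the standard library. `rows_to_csv`, the `game_id` argument and the choice to leave the stat cells empty for players who did not play are assumptions made for illustration, not part of the original answer.

```python
import csv
import io

HEADER = ['Name', 'Team', 'MIN', 'FGM-A', '3PM-A', 'FTM-A', 'OREB', 'DREB',
          'REB', 'AST', 'STL', 'BLK', 'TO', 'PF', '+/-', 'PTS']


def rows_to_csv(game_id, rows):
    """Return CSV text with columns id, team, player, then the stat columns."""
    stat_names = HEADER[2:]
    out = io.StringIO()
    writer = csv.writer(out, lineterminator='\n')
    writer.writerow(['id', 'team', 'player'] + stat_names)
    for row in rows:
        name, team, stats = row[0], row[1], row[2:]
        if len(stats) != len(stat_names):
            # DNP rows carry a single note instead of stats; leave cells empty.
            stats = [''] * len(stat_names)
        writer.writerow([game_id, team, name] + stats)
    return out.getvalue()


rows = [
    ['Brandon Bass', 'Boston Celtics', '28', '6-11', '0-0', '3-4', '6', '5',
     '11', '1', '0', '0', '1', '2', '-8', '15'],
    ['Chris Wilcox', 'Boston Celtics', "DNP COACH'S DECISION"],
]
print(rows_to_csv(400277722, rows))
```

The same list of rows could also be passed to `pd.DataFrame` if a DataFrame is preferred over CSV text.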

To catch names such as D.J. Augustin, the simplest (though not the most concise) code is:

try:
    stats = [re.search(r'[A-Z]?[a-z]?[A-Z][a-z]{1,} [A-Z][a-z]{1,}', player).group()]
except AttributeError:  # re.search returned None, so fall back to the initials pattern
    stats = [re.search(r'[A-Z]\.?[A-Z]?\.? [A-Z][a-z]{1,}', player).group()]
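The same fallback can be written without exception handling by trying the patterns in order. This is only a sketch: `extract_name` and `NAME_PATTERNS` are names introduced here for illustration, and the sample strings are simplified stand-ins for the HTML fragments the answer actually searches.

```python
import re

# The two patterns from the answer, tried in order.
NAME_PATTERNS = [
    re.compile(r'[A-Z]?[a-z]?[A-Z][a-z]{1,} [A-Z][a-z]{1,}'),  # "LeBron James"
    re.compile(r'[A-Z]\.?[A-Z]?\.? [A-Z][a-z]{1,}'),           # "D.J. Augustin"
]


def extract_name(text):
    """Return the first player name found in `text`, or None if nothing matches."""
    for pattern in NAME_PATTERNS:
        match = pattern.search(text)
        if match:
            return match.group()
    return None


for sample in ('Kevin Garnett, PF', 'LeBron James, SF', 'D.J. Augustin, PG'):
    print(extract_name(sample))
```

Returning `None` instead of raising lets the caller decide how to handle a row whose name matches neither pattern.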

Answer 1 (score: 0)

Try using a different parser (lxml):

soup = BeautifulSoup(request.text,'lxml')
tables = soup.find_all('table')

for t in tables:
    print(t.text)

It will detect the page structure better.
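Whether a particular parser copes better with a given page can be checked empirically by running the same markup through every parser that is installed. The sketch below assumes Python 3. `count_tables` and the `sample` string are hypothetical, and the import guard is there only so the snippet also runs where bs4 is missing.

```python
try:
    from bs4 import BeautifulSoup, FeatureNotFound
except ImportError:  # bs4 is not installed, so there is nothing to compare
    BeautifulSoup = None


def count_tables(html, parsers=('html.parser', 'lxml', 'html5lib')):
    """Map each installed parser name to the number of <table> tags it finds."""
    counts = {}
    if BeautifulSoup is None:
        return counts
    for name in parsers:
        try:
            soup = BeautifulSoup(html, name)
        except FeatureNotFound:
            continue  # the library behind this parser is not installed
        counts[name] = len(soup.find_all('table'))
    return counts


# Hypothetical, well-formed markup; every parser should agree on it.
sample = ('<table><tr><td>BOS</td></tr></table>'
          '<table><tr><td>MIA</td></tr></table>')
print(count_tables(sample))
```

Calling `count_tables(request.text)` on the real page would show directly which parsers disagree about its structure.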

Answer 2 (score: 0)

The code returns the correct data with the default parser, which will probably be lxml if you have it installed:

req = requests.get('http://espn.go.com/nba/boxscore?gameId=400277722')
soup = BeautifulSoup(req.content)
table = soup.find_all('table')
print(table)

....................
<td nowrap="" style="text-align:left"><a href="http://espn.go.com/nba/player/_/id/2009/james-jones">James Jones</a>, SF</td><td colspan="14" style="text-align:center">DNP COACH'S DECISION</td></tr><tr align="right" class="odd player-46-6490" valign="middle">
<td nowrap="" style="text-align:left"><a href="http://espn.go.com/nba/player/_/id/6490/terrel-harris">Terrel Harris</a>, SG</td><td colspan="14" style="text-align:center">DNP COACH'S DECISION</td></tr></tbody><thead><tr align="right"><th style="text-align:left;">TOTALS</th><th></th>
<th nowrap="">FGM-A</th>
<th>3PM-A</th>
<th>FTM-A</th>
<th>OREB
</th><th>DREB</th>
<th>REB</th>
<th>AST</th>
<th>STL</th>
<th>BLK</th>
<th>TO</th>
<th>PF</th>
<th> </th>
<th>PTS</th>
</tr></thead><tbody><tr align="right" class="even"><td colspan="2" style="text-align:left"></td><td><strong>43-79</strong></td><td><strong>8-16</strong></td><td><strong>26-32</strong></td><td><strong>5</strong></td><td><strong>31</strong></td><td><strong>36</strong></td><td><strong>25</strong></td><td><strong>8</strong></td><td><strong>5</strong></td><td><strong>8</strong></td><td><strong>20</strong></td><td> </td><td><strong>120</strong></td></tr><tr align="right" class="odd"><td colspan="2" style="text-align:left"><strong></strong></td><td><strong>54.4%</strong></td><td><strong>50.0%</strong></td><td><strong>81.3%</strong></td><td colspan="13"></td></tr><tr bgcolor="#ffffff"><td align="right" colspan="15" style="padding:10px;"><div style="float: right;"><strong>Fast break points:</strong>   12<br/><strong>Points in the paint:</strong>   46<br/><strong>Total Team Turnovers (Points off turnovers):</strong>   8 (6)</div><div style="float: left;">+/- denotes team's net points while the player is on the court.</div></td></tr></tbody></table>]

Using "html.parser" gives the same truncated output as in the question, but as you can see above, it works fine when no parser is specified.

This was tested with bs4 4.3.2 on Python 2.7 and 3.4; my lxml version is 3.3.3.0.

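If you do not have the latest bs4, you should update. You can also use the diagnose function to print a report that shows how the different parsers handle the document and tells you whether you are missing a parser that Beautiful Soup could use.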

So, to get a report for your HTML, use the following:

from bs4.diagnose import diagnose
diagnose(request.text)

It is well documented that using regular expressions to parse HTML is not a very good approach: a trivial change to the HTML can break the regex.
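As an illustration of the parser-based alternative, the cells of a player row can be collected by tag name, so that changes to attributes or their order do not matter. This sketch uses the standard library tokenizer instead of BeautifulSoup purely to stay dependency-free. `RowCellCollector` is a made-up helper, and `row_html` is one row copied from the output shown above.

```python
from html.parser import HTMLParser


class RowCellCollector(HTMLParser):
    """Collect the text of each <td> cell, one list per <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == 'tr':
            self.rows.append([])
        elif tag == 'td' and self.rows:
            self.rows[-1].append('')
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag == 'td':
            self._in_cell = False

    def handle_data(self, data):
        # Text may arrive in several chunks (e.g. inside and after <a>).
        if self._in_cell:
            self.rows[-1][-1] += data


row_html = (
    '<tr align="right" class="odd player-46-6490" valign="middle">'
    '<td nowrap="" style="text-align:left">'
    '<a href="http://espn.go.com/nba/player/_/id/6490/terrel-harris">'
    'Terrel Harris</a>, SG</td>'
    '<td colspan="14" style="text-align:center">DNP COACH\'S DECISION</td></tr>'
)

collector = RowCellCollector()
collector.feed(row_html)
collector.close()
print(collector.rows)
```

With BeautifulSoup available, the equivalent would be to iterate over `find_all('tr')` and read the text of each `td`, as the code in the question already does.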